Combining Multiple Datasets in Pandas

Glenn Prince

Rate me:

5.00/5 (9 votes)

29 May 2020CPOL4 min read

11.8K

119

In this fourth part of the Data Cleaning with Python and Pandas series, we look at a few of the simpler methods for combining data

Most analysis on data is not performed on a single system or dataset, so in this step we look at combining multiple data sets to provide a bigger picture of our data.

Introduction

This article is part of the Data Cleaning with Python and Pandas series. It’s aimed at getting developers up and running quickly with data science tools and techniques.

If you’d like to check out the other articles in the series, you can find them here:

Part 1 - Introducing Jupyter and Pandas
Part 2 - Loading CSV and SQL Data into Pandas
Part 3 - Correcting Missing Data in Pandas
Part 4 - Combining Multiple Datasets in Pandas
Part 5 - Cleaning Data in a Pandas DataFrame
Part 6 - Reshaping Data in a Pandas DataFrame
Part 7 - Data Visualization using Seaborn and Pandas

With all the missing values dealt with, let’s combine data from the product, customer, and purchase datasets to get a more complete set of data in a single DataFrame. This will provide a better view of where we’re going with this data set and what overall insights we can leverage.

Note that we've created a complete Jupyter Notebook with the source data files for this series of modules, which you can download and install locally.

Pandas offers a number of methods for combining data with different sets of logic. Before diving into some of the more complex combination sets we might use, let’s take a look at a few of the simpler methods.

Concatenating DataFrames

The core function for combining data is concat(). This function provides simple joining of two DataFrames that can be expanded with the union option or intersection logic similar to traditional relational databases.

The concat method joins DataFrames together when columns match.

By itself, concat() will join two or more DataFrames with the same keys or "column headings," and push the rows together one after the other. For example, two DataFrames with the columns X, Y, Z and 10 rows each will join together into a single DataFrame with the columns X, Y, Z and 20 rows of data.

The concat function has a number of different options for combining data, including, but not limited to:

Outer join joins the data from two or more DataFrames and includes rows that don't have matching keys (and the result may contain no values).
Inner join joins the data from two or more DataFrames only where the frames match keys (and the result may drop rows that don't match).

Merging DataFrames

Pandas also includes options to merge datasets using the rows of one set of data as inputs against keys from another set of data. This form of joining and merging is pretty powerful and it’s what we’re going to do with our datasets.

The merge method joins DataFrames together using left and right joins

At the moment, our dataset includes three separate DataFrames: customers, products, and purchases. For an application, this makes a lot of sense as your products and customers don't change too much, but your purchases may change every day.

For data analysis and visualization, having all the data in one big DataFrame makes more sense. So we’re going to merge our customer and product datasets into our purchases data.

Create a new code block and add the following:

Python

combinedData = pd.merge(purchases, customers, left_on='customer_num', right_on='id', sort=False)
combinedData = pd.merge(combinedData, products, left_on='product_num', right_on='id', sort=False)

We’re using the Pandas merge function to merge the three DataFrames. The first merge takes the purchases DataFrame and merges it with the customers DataFrame.

We’re also using two optional parameters here, left_on and right_on. These parameters merge the table based on the knowledge that the left_on key matches the right_on key even if the key names are different.

Finally, we specify the sort option to be false as we don't need to sort data yet and it makes this method a little quicker to execute.

Add the following lines to examine our new combined DataFrame:

Python

print(combinedData.head(5))
print(combinedData.shape)
print(purchases.shape)

And here’s the result:

As you can see, we now have one big DataFrame with a number of columns combined from all three DataFrames.

You’ll also see that when we compare row counts between the purchases DataFrame and the resultant DataFrame, we’re down to 5069 rows out of 6000. Because merge uses an inner join by default, the rows that couldn't be matched to a customer (as they were removed through the first stage of data cleaning) were dropped from the combined DataFrame.

Summary

We looked at the Pandas function to concatenate data sets, then moved on to merge our three DataFrames into a single, complete DataFrame that we can manipulate further to present data for visualization.

We are now going to look at cleaning up the last of the values and keys that may cause some issues before reshaping our data for visualization.

This article is part of the series 'Data Cleaning with Python and Pandas ◁ Prev View All Next ▷

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Glenn Prince

Architect

United States

Hi! I'm a Solution Architect, planning and designing systems based in Denver, Colorado. I also occasionally develop web applications and games, as well as write. My blog has articles, tutorials and general thoughts based on more than twenty years of misadventures in IT.