GEOG 489
Advanced Python Programming for GIS

3.8.5 Joining data frames

PrintPrint

The following short example demonstrates how pandas can be used to merge two data frames based on a common key, so to perform a join operation in database terms. Let’s say we have two tables, one listing capitals of states in the U.S. and one listing populations for each state. For simplicity we just define data frames for these with just entries for two states, Washington and Oregon:

df1 = pd.DataFrame( {'state': ['Washington', 'Oregon'], 'capital': ['Olympia', 'Salem']} )
print(df1)
df2 = pd.DataFrame( {'name': ['Washington', 'Oregon'], 'population':[7288000, 4093000]} )
print(df2)

The two data frames produced by this code look like this:

Table 3.5 Data frame 1 (df1) listing states and state capitals
capital state
0 Olympia Washington
1 Salem Oregon
Table 3.6 Data frame 2 (df2) listing states and population numbers
name population
0 Washington 7288000
1 Oregon 4093000

We here use a new way of creating a data frame, namely from a dictionary that has entries for each of the columns where the keys are the column names (‘state’ and ‘capital’ in the case of df1, and ‘name’ and ‘population’ in case of df2) and the values stored are lists of the values for that column. We now invoke the merge(...) method on df1 and provide df2 as the first parameter meaning that a new data frame will be produced by merging df1 and df2. We further have to specify which columns should be used as keys for the join operation. Since the two columns containing the state names are called differently, we have to provide the name for df1 through the ‘left_on’ argument and the name for df2 through the ‘right_on’ argument.

merged = df1.merge(df2, left_on='state', right_on='name')
merged

The joined data frame produced by the code will look like this:

Table 3.7 Joined data frame
capital state name population
0 Olympia Washington Washington 7288000
1 Salem Oregon Oregon 4093000

As you see, the data frames have been merged correctly. However, we do not want two columns with the state names, so, as a last step, we remove the column called ‘name’ with the previously mentioned drop(...) method. As explained, we have to use the keyword argument axis=1 to indicate that we want to drop a column, not a row.

newMerged = merged.drop('name', axis=1)
newMerged

Result:

Table 3.8 Joined data frame after dropping the 'name' column
capital state population
0 Olympia Washington 7288000
1 Salem Oregon 4093000

If you print out variable merged, you will see that it still contains the 'name' column. That is because drop(...) doesn't change the original data frame but rather produces a copy with the column/row removed.