How to Remove or Prevent Duplicate Columns From a Pandas Merge

In this post, I’ll show you three methods to remove or prevent duplicate columns when merging two DataFrames.

Why Does This Happen?

Duplicate columns will appear in your merged DataFrame if you have columns in both DataFrames with identical names and they are not used in the join statement. The resulting DataFrame will look something like this:

#Create test data
df1 = pd.DataFrame(np.random.randint(100,size=(1000, 3)),columns=['A','B','C'])
df2 = pd.DataFrame(np.random.randint(100,size=(1000, 3)),columns=['B','C','D'])

#Merge the DataFrames
pd.merge(df1, df2, how='inner', left_index=True, right_index=True)

In this example, columns B and C appear in both DataFrames and that’s why you’ll see the _x and _y appended to those columns.

Method #1 – Join on all Identical Columns

To avoid having the _x and _y, you can merge on all identical columns between both DataFrames. The resulting DataFrame won’t have any duplicate columns.

pd.merge(df1, df2, how='inner', left_on=['B','C'], right_on=['B','C'])

Method #2 – Change the Suffix and Drop the Duplicates

One of the parameters of the merge is to apply your own set of suffixes for duplicate columns. This means you label the second DataFrame columns with a keyword that you will use to identify and remove them from the merged DataFrame. The argument I’ll use below is suffixes=('', '_drop'). Then I’ll find any column with the word _drop and remove them from the DataFrame.

#Merge the DataFrames
df_merged = pd.merge(df1, df2, how='inner', left_index=True, right_index=True, suffixes=('', '_drop'))

#Drop the duplicate columns
df_merged.drop([col for col in df_merged.columns if 'drop' in col], axis=1, inplace=True)

Method #3 – Filter out Duplicate Columns Before Merging

In this method, you first find the columns that are different from the DataFrames then perform the merge operation. If you use this method, then you will have to join using the indexes.

#Find the columns that aren't in the first DataFrame 
diff_cols = df2.columns.difference(df1.columns)

#Filter out the columns that are different. You could pass in the df2[diff_cols] directly into the merge as well.
df3 = df2[diff_cols]

#Merge the DataFrames
df_merged = pd.merge(df1, df3, left_index=True, right_index=True, how='inner')

This method utilizes the difference() function that compares two sets and returns anything that is not in both sets.

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!


Posted

in

by

Tags: