How to Dynamically Format Pandas DataFrame Columns to be Database and Parquet Ready

Pandas allow for almost anything as a column header and I’ll show you how to get your columns parquet and database ready.

Here is an example error you will see if trying to save a DataFrame to parquet with invalid characters for columns. Since databases and Parquet files have more restrictive naming conventions for columns than Pandas, you need to ensure that your columns contain only letters, numbers, and underscores.

Here is what you will learn in this post:

How to keep only letters, numbers, and underscores in column names
How to rename columns that only contain numbers
How to deduplicate column names

Keep only letters, numbers, and underscores in column names

Consider the following DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(100,size=(5, 12)),columns=['Column with @#$@#$ characters','Question #1','Test?','Test','TEST','Hello','hello', 'test.testing', 'comma,testing',0,1,2])
df

As you can see, Pandas let me create the DataFrame with no errors. We can clean these up by leveraging a combination of string methods to remove special characters.

df.columns = (df.columns
              .astype(str) #convert all the elements to a string because Pandas allows you to have numbers for column names
              .str.replace('[.,]', '_') #Replace specific symbols with underscores
              .str.extractall(r'([\w\d]+)') #extract just words and numbers
              .unstack() #bring all the groups back into one row
              .apply(lambda x: '_'.join(x.dropna() #replace empty space with an underscore
              .astype(str)),axis=1) #convert column names to a string
              .str.lower() #lowercase
              .str.strip() #clean up white spaces if applicable
              ) 

df

Here is the resulting DataFrame:

Renaming columns that are numbers only

Numbers as column names only pose a problem if the column name is composed entirely of numbers. My trick is to dynamically convert the numbers to words using the inflect package. So for example, 1 will turn into one. I’ll show you how to apply this using a function in the next section.

!pip install inflect
import inflect
p = inflect.engine()

Deduplicate column names

You can first check for duplicate columns with the following snippet:

#Check for duplicate columns
df.columns.duplicated().any()

In my case, this returns True because some columns are duplicates. I’ve created the following function that loops through the column names and appends 2, 3, etc. depending on the occurrence of the duplicate column.

I am also checking to see if the column name is a straight-up number with the function .isdecimal() . If that returns True, then it runs the inflect module to convert the number to a word.

def create_unique_cols(x):
  col_lst = [] #existing col list
  unique_col_lst = [] #new col list with modified col names for the duplicates

  for i in x.columns:
    col_lst.append(i) #build the list as you go through the elements

    if i.isdecimal(): #convert numbers to a string. In my case, I am just spelling out the number
      i = p.number_to_words(i)
    else:
      pass 

    if col_lst.count(i) > 1:
      unique_col_lst.append(f'{i}_{col_lst.count(i)}') #add the duplicated column with a number
    else:
      unique_col_lst.append(i) #add the column as is to the new list
  
  return unique_col_lst

The Final Output

Now you can safely export to a database or parquet!

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!

Posted

June 3, 2022

Python

Paul

Tags:

data_cleanup