Enhancing Data Accuracy: How to Fill Missing Date Gaps in Analysis with Python

Data gaps can occur when data is organized into time intervals but observations are missing for certain intervals. For example, let’s say you are tracking sales of snow shovels by month. Snow shovels are typically only in demand during winter months, so it is likely that there will be months with no sales at all. When charting sales data for the last 12 months, for example, it is important to include all months in order to get an accurate picture of the trend. Failing to do so and only showing months with sales can be misleading and hinder accurate interpretation of the results. In this post, we will explore a few techniques on to fill in date gaps.

Here’s an example of a SQL query that might return data with gaps:

SELECT
	MONTH(date_column) AS month,
    YEAR(date_column) AS year,
    SUM(sales) AS total_sales
FROM sales_table
GROUP BY 
	MONTH(date_column),
    YEAR(date_column)

The query could return the results below. At first glance, the data looks continuous which can be misleading if your analysis stops here. For this type of data, it’s better to show the months where there are 0 sales to get a true picture.

monthyeartotal_sales
12022470
22022200
3202285
7202250
92022100
102022250
112022410
122022550

Tips for Dealing with Data Gaps

  • Create a date range list then left-join the original data
  • Resampling data using Python
  • The visualization tool does it for you

Create a date range list then left-join original data

There are many ways to perform this action but it depends on the tool you are using. Creating a range of dates can be done once and placed in a permanent table or dynamically every time you run the data. Python Pandas has a date_range function that allows you to create a date range at your desired frequency.

#Create the sample data
df = pd.DataFrame(
    {'dates': ['1/31/2022', '2/28/2022', '3/31/2022', '7/31/2022', '9/30/2022', '10/31/2022', '11/30/2022', '12/31/2022'], 
     'sales': [470,200,85,50,100,250,410,550]}
df['dates'] = pd.to_datetime(df['dates'])

#Create a dataset with the dates
date_range = pd.date_range('1/1/2022','12/31/2022', freq='M')
date_df = pd.DataFrame(date_range, columns=['dates'])

#Left join the data
df2 = pd.merge(
    date_df,
    df,
    how='left',
    left_on=['dates'],
    right_on=['dates']
)

#Fill in the gap dates with 0
df2.fillna(0, inplace=True)

df2

Some databases allow you to dynamically generate a date range. Here’s how you could do it in BigQuery:

WITH date_range AS (
  SELECT 
    DATE_TRUNC(DATE('2022-02-01'), MONTH) AS start_date,
    DATE_TRUNC(DATE('2023-01-01'), MONTH) AS end_date
)

SELECT
  DATE_SUB(DATE_TRUNC(month_date, MONTH), INTERVAL 1 DAY) AS month_end_date
FROM
  date_range,
  UNNEST(GENERATE_DATE_ARRAY(start_date, end_date, INTERVAL 1 MONTH)) AS month_date

Resampling Data Using Python

Using Python Pandas, you can quickly and efficiently fill in the gaps by using Re Sampling. Here’s an example of a data table with date gaps:

week_df = pd.DataFrame(data=[1,2,3,4], index=pd.date_range(start='1/1/2020', end='1/31/2020', freq = 'W-MON'), columns=['number'])
week_df

Now that we have some data, we can resample the data to a day frequency:

week_df.resample("D").sum()

As you can see, the resample function added the missing days from the start and end of the dataset.

The Visualization Tool Does It for You

The above methods to fix the gaps in the data are not necessary if your visualization tool will fill in the gaps. Here’s an example of how Plotly correctly displays the data.

First I’ll create a dataset:

df = pd.DataFrame(
    {'dates': pd.date_range('1/1/2022', '12/31/2022', freq="M"), 
     'sales': [470,200,85,None,None,None,50,None,100,250,410,550]})

df.dropna(inplace=True)

df

Next, I’ll plot it.

import plotly.express as px

fig = px.bar(df, x='dates', y="sales")
fig.update_layout(template='simple_white')

fig.show()

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!


Posted

in

by

Tags: