A Short Practical Guide to Window Functions in SQL

A window function allows you to concisely compare rows in a single table.

Some practical examples of a window function include:

  • Ranking data overall or ranking subsets of data
  • Adding a row number to the table
  • Calculating the time difference from the previous row
  • Calculating a 7-day moving average

Are Window Functions the same as GROUP BY Statements?

Window functions that aggregate data are a lot like GROUP BY statements. However, the main difference is that your result is not aggregated and you maintain the row-level detail.

With a GROUP BY statement, oftentimes your goal is to summarize the data. Whereas with a window function, your goal is to add an additional column to filter the original data.

Consider the following example:

To find the highest score, you could run a SELECT MAX(score) FROM table GROUP BY type. But what if you wanted to find the date that corresponds to that type? Well, with a GROUP BY you cannot simply add the date or do MAX(date) in your SELECT statement because the column operations are independent of each other. You will need to add a window function and then filter your results to get all the corresponding rows that match the column with the highest score.

Constructing a Window Function

There are three parts to structuring a window function:

  1. The desired calculation (eg. sum, max, lag, etc.)
  2. How the data should be partitioned (optional)
  3. The column(s) that should be used for sorting and how it should be sorted (ie ASC or DESC)

Example

  SELECT 
	  *,
	  SUM(score) OVER (PARTITION by type ORDER BY date) as score_cumulative
  FROM window_test

Filtering Results using a CTE Statement

Oftentimes, my goal with a window function is to add a column I can use to filter the data. To accomplish this, I typically leverage a CTE statement which is another way of creating a temporary table. You will see examples of this in the next section.

More Examples of Window Functions

Here are some code snippets of window functions I use and reference often.

De-duplicate data / Find the latest updated row by partition

Isolating and removing duplicate rows can be a tricky operation but window functions make it really easy. In this example, I am partitioning my data based on the id and sorting by date. The id is the primary key in the original database table and the date is representing the last time the row data was updated. Lastly, I am using ROW_NUMBER() to rank the rows.

WITH T AS (
  SELECT 
  *,
  ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS version_number
  FROM window_test
)

SELECT * FROM T WHERE version_number = 1;

Find the rank

The ROW_NUMBER() operation is very similar to the two other ranking operations in window functions. However, there is a difference in how it handles duplicate rankings:

  • ROW_NUMBER() gives every row a unique rank and is incremented by 1 (eg. 1, 2, 3, 4, 5)
  • RANK() will give identical rows the same rank but then skips ranks from the identical values (eg. 1, 2, 3, 3, 5)
  • DENSE_RANK() gives identical rows the same rank and then uses the next numerical rank on the next value (eg. 1, 2, 3, 3, 4)

Shift values up or down within a particular grouping / Find the time difference between related rows using a window function

In this example, we can shift values from rows up or down. This is helpful when you are trying to compare times between events.

Here is an example data set:

SELECT 
  id,
  datetime,
  lead(datetime) OVER (PARTITION BY id ORDER BY datetime) as next_datetime,
  DATEDIFF(lead(datetime) OVER (PARTITION BY id ORDER BY datetime),datetime) as duration_in_days

FROM window_test

Calculate a 7-day Moving Average

Calculating a 7-day moving average can be achieved by adding this extra snippet to your window function: RANGE BETWEEN INTERVAL 6 DAYS PRECEDING AND CURRENT ROW

SELECT 
  date,
  company_id,
  number,
  AVG(number) OVER (PARTITION BY company_id ORDER BY date ASC RANGE BETWEEN INTERVAL 6 DAYS PRECEDING AND CURRENT ROW) as last_7_day_avg
FROM window_data

Limit the number of results per group

SQL has a LIMIT function where you can limit your overall results, but sometimes you need to limit the results per grouping. For example, you may want to return the latest 100 orders per company. In Python, you could loop through your results but you do not have that same exact functionality in SQL. Instead, you can apply the ROW_NUMBER() function per company and then sort by the order date. Finally, you can filter out the calculated column where it is less than 100. You will have to break up the operation into two queries and you can use a CTE to help.

WITH T AS (
  SELECT 
    company_id,
    number,
    ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY number) AS row_number
  FROM window_test_limits
    )

SELECT * FROM T WHERE row_number <= 100

A Quick Caveat

Window functions are a newer capability and are not available in every SQL application. If you are using MySQL you will not be able to use them unless you are running MySQL 8.0 or later.

The examples above are being run using Spark SQL. To run these exaples and learn more about using PySpark, check out this Colab Notebook.

Final Thoughts

Check out more Python tricks in this Colab Notebook or in my recent Python Posts.

Thanks for reading!


Posted

in

,

by

Tags: