[Cover image: a traveler at a crossroads between a cozy suburb (Pandas), a modern city under construction (Apache Spark), and a sleek futuristic city (DuckDB).]

Solving Pandas Memory Issues: When to Switch to Apache Spark or DuckDB

Data Engineers often face the challenge of Jupyter Notebooks crashing when loading large datasets into Pandas DataFrames. This problem signals a need to explore alternatives to Pandas for data processing. While common solutions like processing data in chunks or using Apache Spark exist, they come with their own complexities. In this post, we’ll examine these challenges and introduce DuckDB, a streamlined approach for managing large datasets in Python. We’ll compare Pandas and Apache Spark, highlighting their strengths and limitations, and then delve into why DuckDB might be a better choice for your data needs.

Pandas: Intuitive but Limited for Large Data

Pandas is a go-to library for data manipulation using Python. It’s known for its fast in-memory processing, easy setup, and an extensive range of APIs. Its popularity is well-deserved, thanks to its straightforward approach and comprehensive documentation that makes data tasks feel more manageable.

However, when we start to handle larger datasets, Pandas begins to show its limitations. The core issue lies in how Pandas handles data – it keeps everything in-memory. This means the size of the data you can process is directly tied to the memory capacity of your system. So, when you’re dealing with particularly large datasets or complex transformations, you might hit a wall with memory constraints, leading to your Jupyter Notebook crashing or those frustrating ‘out of memory’ errors.
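
If you want to see how close you are to that wall, Pandas itself can report a DataFrame’s in-memory footprint. A minimal sketch (the file path here is just a placeholder):

import pandas as pd

df = pd.read_csv("some_large_file.csv")  # placeholder path, assumes a local file

# deep=True accounts for the true size of object (string) columns
memory_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print(f"DataFrame occupies roughly {memory_mb:.1f} MB of RAM")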

Another aspect to consider is Pandas’ distinct approach to data manipulation. SQL is a universally recognized language for working with data, and moving to Pandas means learning an entirely different way of expressing transformations. While this isn’t inherently negative, it introduces a significant learning curve, particularly for professionals who are deeply accustomed to SQL for their data processing tasks.

Apache Spark: Powerful, But Not Without Its Headaches

Let’s delve into Apache Spark. It’s a familiar recommendation in large-scale data handling, renowned for its robust processing capabilities. What sets Spark apart is its strategic use of lazy evaluation. This method cleverly postpones computations until they’re absolutely necessary, optimizing resource usage in handling extensive datasets.
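
To make the lazy evaluation idea concrete, here is a minimal PySpark sketch; the file path is a placeholder, and it assumes the COVID CSV used later in this post has been downloaded locally:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
df = spark.read.csv("owid-covid-data.csv", header=True, inferSchema=True)

# These transformations only build a query plan; nothing is computed yet
totals = (
    df.groupBy("location")
      .agg(F.sum("new_cases").alias("total_cases"))
      .orderBy(F.desc("total_cases"))
)

# An action such as show() (or collect(), count()) triggers the actual work
totals.show(10)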

However, integrating Spark into your workflow is not without its challenges. For those who’ve grown accustomed to the more straightforward environment of Pandas, adapting to Spark can be akin to learning a new language. The setup process is intricate, and sometimes the documentation might leave you wanting clearer guidance.

Spark does offer a Pandas API, attempting to bridge the familiarity gap. But it’s not just about mastering a different set of functions – it’s about acclimating to a distinct processing paradigm. This transition requires a considerable amount of time and effort, a significant factor for professionals seeking efficient and straightforward data processing solutions.
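
To get a feel for that bridge, here is a small sketch using the pandas-on-Spark module (pyspark.pandas, bundled with Spark 3.2 and later); the path is again a placeholder:

import pyspark.pandas as ps

# Familiar Pandas syntax, but the work is executed by Spark under the hood
psdf = ps.read_csv("owid-covid-data.csv")
totals = psdf.groupby("location")["new_cases"].sum().sort_values(ascending=False)
print(totals.head(10))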

Pandas and Apache Spark – A Comparison

| Feature | Pandas | Apache Spark |
| --- | --- | --- |
| Primary Language | Python | Scala (also supports Python, Java, R) |
| Processing | In-memory, single-node | In-memory, distributed over multiple nodes |
| Data Size | Best for small to medium datasets (up to GBs) | Designed for large datasets (TBs or more) |
| Ease of Use | Simple API, great for Python users | More complex, especially for large-scale data processing |
| Performance | Fast on smaller datasets, but can struggle with very large data | Optimized for distributed computing, handles large datasets efficiently |
| Scalability | Limited to a single machine's memory and CPU | Highly scalable, can process data across a cluster |
| Real-time Processing | Not designed for real-time processing | Supports real-time data processing (with Spark Streaming) |
| Ecosystem | Rich set of libraries for data analysis in Python | More focused ecosystem for big data processing |
| Documentation | Extensive documentation | Adequate documentation with room for expansion |

DuckDB: A Streamlined Alternative for Large Datasets

Now, let’s turn our attention to DuckDB, a solution that might not be as widely known as Pandas or Spark but is certainly deserving of attention. DuckDB brings a refreshing perspective to data handling, particularly when dealing with large datasets.

What makes DuckDB stand out? First, it offers flexibility in how you work with data. Whether it’s in-memory processing or working directly with data loaded into a DuckDB database, DuckDB adapts to your needs. This versatility is a significant advantage, especially when comparing it with the in-memory constraints of Pandas.
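
A minimal sketch of the two modes (the database file name is arbitrary):

import duckdb

# Purely in-memory database: nothing is written to disk
mem_con = duckdb.connect()  # same as duckdb.connect(':memory:')

# Persistent database: tables live in a single file and survive across sessions
disk_con = duckdb.connect("analytics.duckdb")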

But the real game-changer here is DuckDB’s alignment with SQL. By allowing data manipulation through SQL queries, it provides a familiar ground for many data professionals. This feature is particularly appealing because it eliminates the need to learn an entirely new framework for data manipulation, as is the case with Pandas. You can leverage your existing SQL skills, making the transition smoother and more intuitive.
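
For example, DuckDB can run SQL directly against a Pandas DataFrame that is already in scope, referring to it by its variable name. A small illustration with made-up data:

import duckdb
import pandas as pd

df = pd.DataFrame({"location": ["A", "B", "A"], "new_cases": [10, 5, 7]})

# The DataFrame variable name doubles as a table name in the query
print(duckdb.sql("SELECT location, SUM(new_cases) AS total FROM df GROUP BY location"))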

Performance is another area where DuckDB shines. It’s designed for speed and efficiency, often outperforming Pandas in various tasks. And for those who are fond of Pandas, DuckDB still has something to offer – you can materialize your final results in a Pandas DataFrame if needed. Plus, like Spark, DuckDB employs lazy evaluation, ensuring that you don’t overload your system’s memory with intermediate results.
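
The lazy behavior is easy to see in the Python API: duckdb.sql() returns a relation object rather than results, and nothing is executed until you materialize it. A small sketch, again assuming the CSV is available locally:

import duckdb

# Building the relation does not run the query yet
rel = duckdb.sql("SELECT * FROM 'owid-covid-data.csv' WHERE new_cases > 0")

# Execution happens only when the result is materialized, e.g. into a Pandas DataFrame
result_df = rel.df()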

It’s also worth noting that DuckDB operates on a single node, and today’s single-node systems have advanced significantly in CPU power, memory, and storage. As a result, they can efficiently handle medium to large datasets. This enhanced capability comes without the complexity and high resource demands of distributed systems like Apache Spark, making single-node systems a simpler, more cost-effective option for such data processing tasks.

In essence, DuckDB presents a compelling option for Data Engineers looking for a balance between performance, ease of use, and familiarity. Its SQL-friendly approach, combined with efficient data handling capabilities, makes it a noteworthy contender in the realm of large-scale data processing.

Practical Comparison: DuckDB vs. Pandas

Since DuckDB is likely new to most readers, I wanted to provide a quick comparison with Pandas to show that its learning curve is less steep. Not only is DuckDB more performant, but you don’t have to learn a new way to manipulate data if you already know SQL. I’ll focus on three common data tasks: importing a CSV file, performing a simple manipulation, and executing a join operation.

Importing a CSV File

For the examples, I am going to use this public COVID data set – https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv

Pandas: Importing a CSV in Pandas is straightforward. You typically use the read_csv function, which is efficient for small to medium-sized files. However, when dealing with very large files, you might encounter memory issues.

import pandas as pd 
path_to_csv = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"

df = pd.read_csv(path_to_csv)
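
If the file is too large to read in one go, the chunked approach mentioned in the introduction looks roughly like this (the chunk size is arbitrary):

# Read the file in pieces and aggregate incrementally instead of all at once
chunk_totals = []
for chunk in pd.read_csv(path_to_csv, chunksize=100_000):
    chunk_totals.append(chunk.groupby("location")["new_cases"].sum())

total_cases = pd.concat(chunk_totals).groupby(level=0).sum()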

DuckDB: DuckDB also makes importing CSV files simple. The advantage here is its ability to handle larger files more efficiently, without the memory constraints you might face with Pandas.

import duckdb
covid_data = duckdb.sql(f"SELECT * FROM read_csv_auto('{path_to_csv}')")  # the Python variable name can now be used as a table name in SQL
# Alternatively, load the data into a DuckDB database file for persistent storage
con = duckdb.connect('covid_data.duckdb')
con.execute(f"CREATE TABLE covid_data AS SELECT * FROM read_csv_auto('{path_to_csv}')")

Simple Data Manipulation

Pandas: Manipulating data with Pandas involves chaining various functions. Alternatively, you could create intermediate DataFrames to store results, which might improve readability but would use more RAM. Although the function names are descriptive, following the process can sometimes be challenging.

grouped_results_df = df.groupby('location')['new_cases'].sum().reset_index(name='total_cases')
final_df = grouped_results_df.sort_values('total_cases', ascending=False)

DuckDB: In DuckDB, you can express all operations in SQL, a standard and readable data manipulation language. The .df() method converts the results into a Pandas DataFrame, enabling the use of Pandas’ extensive API for tasks like charting. Importantly, this approach doesn’t require loading the entire dataset into RAM—only the results.

duckdb.sql("""
SELECT
  location,
  SUM(new_cases) as total_cases
  FROM covid_data
  GROUP BY 1
  ORDER BY SUM(new_cases) DESC
""").df()

Executing a Join Operation

Pandas: Performing a join between two DataFrames in Pandas is intuitive, using functions like merge:

merged_df = pd.merge(df1, df2, on='common_column')

DuckDB: DuckDB handles joins using SQL-style syntax, which can be more efficient, especially with larger datasets:

join_query = "SELECT * FROM table1 JOIN table2 ON table1.common_column = table2.common_column"
joined_df = conn.execute(join_query).df() #materialize results as a Pandas DataFrame

Observations

In short, DuckDB not only makes data import simple but also gives you the ease of using SQL for data manipulation. This means less time spent learning new techniques and more time efficiently handling your data, making DuckDB a practical choice for many users.

If you want to check out the examples for yourself, you can run them in this Google Colab Notebook.

Conclusion: Matching Tools to Tasks in Data Engineering

As we wrap up our exploration of Pandas, Spark, and DuckDB, imagine yourself as a traveler standing at a crucial crossroads. Behind you is the familiar suburb of Pandas — cozy, well-trodden, bustling with the activities of everyday data tasks. It’s comfortable and familiar, but you know its streets and corners too well.

To your left, the path leads to a modern city representing Spark. This city, with its towering skyscrapers, is dynamic and powerful, designed for handling heavy loads and large-scale operations. However, its landscape is dotted with cranes, symbolizing the ongoing maintenance and learning that the city demands. It’s a place of great potential, but navigating its complex streets requires time and effort.

On the right, there’s a path to a futuristic city, epitomizing DuckDB. This city is sleek, efficient, and streamlined, blending the charm of the suburb with the capacity of the modern city. It’s an innovative space where traditional structures are enhanced by advanced technology, offering a balance between ease of use and high performance.

This moment at the crossroads isn’t just about choosing a direction — it’s about understanding which path aligns best with your journey in data engineering. Each route — Pandas, Spark, and DuckDB — offers different experiences and opportunities. Your choice depends on the size of the data load you’re carrying and the speed and ease with which you want to travel. Bright and inviting, each path holds the promise of new challenges and achievements in the diverse landscape of data engineering.

