
How to Export Data from MySQL to Parquet with DuckDB

In this post, I will guide you through the process of using DuckDB to seamlessly transfer data from a MySQL database to a Parquet file, highlighting its advantages over the traditional Pandas-based approach.

A Moving Analogy

Imagine your data is a collection of belongings in an old house (MySQL). This old house (MySQL) has been a cozy home for your data, but it’s time to relocate your belongings to a modern storage facility (Parquet file). The new place isn’t just a shelter; it’s a state-of-the-art warehouse designed for efficiency. Here, your data isn’t just stored; it’s optimized for faster retrieval (improved query performance), arranged in a way that takes up less space (efficient data storage), and is in a prime location that many other analytical tools find easy to visit and work with (a better ecosystem for analysis). This transition ensures your data is not only safer but also primed for insights and discovery in the realm of analytics and data science.

Enter DuckDB, which acts as a highly efficient moving service. Instead of haphazardly packing and moving your belongings piece by piece on your own (the traditional Pandas-based approach), DuckDB offers a streamlined process. It’s like having a professional team of movers: they pack all your belongings into specialized containers (exporting data) and transport them directly to the new storage facility (Parquet), ensuring that everything from your fragile glassware (sensitive data) to your bulky furniture (large datasets) arrives safely and is placed exactly where it needs to be (enhanced data type support), ready for use (analysis). This service is not only faster but also minimizes the risk of damaging your belongings during the move (data loss or corruption). It handles the heavy lifting, making the transition smooth and efficient.

By the end of the moving process, you’ll find that accessing and using your belongings in the new facility (Parquet file) is much more convenient and efficient, thanks to the expert help of DuckDB, making your decision to move a truly beneficial one for your analytical and data science needs.

Challenges with Exporting Data to Parquet Using Pandas

Many guides recommend using Pandas for extracting data from MySQL and exporting it to Parquet. While the process might seem straightforward, ensuring a one-to-one data match poses significant challenges due to several limitations inherent in Pandas:

  1. Type Inference: Pandas automatically infers data types during import, which can lead to mismatches with the original MySQL types, especially for numeric and date/time columns.
  2. Handling Missing Values: Pandas uses NaN (Not a Number) and NaT (Not a Time) for missing data, which may not align with SQL’s NULL values, causing inconsistencies.
  3. Indexing: The difference in indexing systems between MySQL and Pandas can disrupt database constraints and relationships, as Pandas uses a default integer-based index.
  4. Text Data Compatibility: The wide range of MySQL character sets may not directly align with Python’s string representation, potentially causing encoding issues or loss of data fidelity.
  5. Large Data Sets: Pandas processes data in memory, limiting its efficiency with large datasets and possibly necessitating data sampling or chunking.
  6. Numerical Precision: Subtle discrepancies can arise due to differences in handling numerical precision and floating-point representation between MySQL and Pandas.
  7. Boolean Data: Pandas may import MySQL boolean values (tinyint(1)) as plain integers unless they are converted explicitly, which can lead to errors downstream (see the sketch after this list).
  8. Datetime Formats: Variations in datetime handling, especially regarding time zones, between Pandas and MySQL could result in discrepancies needing extra manipulation.
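
To make challenges 2 and 7 concrete, here is a minimal sketch of the traditional Pandas route. The connection details and the accounts table with its is_active tinyint(1) column are hypothetical, used only for illustration:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection, for illustration only
engine = create_engine("mysql+pymysql://username:password123@test.com:3306/my_database")

# Read the table the traditional way
df = pd.read_sql("SELECT * FROM accounts", engine)

# Challenge 2: an integer column containing NULLs is promoted to float64,
# because missing values become NaN (a float), so 42 silently turns into 42.0
# Challenge 7: a tinyint(1) column such as is_active arrives as 0/1 integers
# rather than booleans and needs an explicit cast before export
df["is_active"] = df["is_active"].astype("boolean")

df.to_parquet("accounts.parquet", index=False)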

In an earlier post – Exporting Database Tables to Parquet Files Using Python and Pandas – I showed code examples of how Pandas can be used for the job. That was before I discovered how DuckDB streamlines the process; in hindsight, that post illustrates just how verbose and error-prone the Pandas approach can be.

Streamlining Data Export with DuckDB

DuckDB streamlines the data export process by preserving data types directly from the database, leveraging the table schema for error-free exports. This is a significant improvement over Pandas, which can require complex type conversions and additional steps to handle discrepancies. With DuckDB, the transition to Parquet format comes down to three clear steps:

  1. Set Up Connection to Database and DuckDB: Establish a secure link between your MySQL database and DuckDB.
  2. Read Data into DuckDB (optional): Import your data from MySQL into DuckDB to inspect or run queries on it before step 3.
  3. Export Data from DuckDB: Exporting a table, whether the DuckDB copy from step 2 or the attached MySQL table directly, is a one-line statement: COPY mysql_db.tbl TO 'data.parquet';

To start this process, I recommend storing your database connection string in a separate JSON file. Here’s an example of such a file:

{
  "database_string":"host=test.com user=username password=password123 port=3306 database=database_name"
}

The next code block sets up the DuckDB connection and attaches the MySQL database:

import duckdb
import json

# Specify the path to your JSON file
file_path = '/your_path/connection_string.json'

# Open the file and load the JSON data
with open(file_path, 'r') as file:
    db_creds = json.load(file)

# Retrieve the connection string
connection_string = db_creds['database_string']

# Connect to a DuckDB database (if it doesn't exist, a new one will be created)
con = duckdb.connect('/path_to_new_or_existing_duck_db/test.db')

# Install and load the MySQL extension
con.install_extension("mysql")
con.load_extension("mysql")

# Add MySQL database
con.sql(f"""
ATTACH '{connection_string}' AS mysql_db (TYPE mysql_scanner, READ_ONLY);
""")

Now, with the connection set up, you can read the data from MySQL into DuckDB and export it to Parquet:

# Set the target table
db_name = 'my_database'  # replace with the name of the MySQL database
table = 'accounts'  # replace with the name of the target table

# Read data from MySQL and replicate it in a DuckDB table
con.sql(f"CREATE OR REPLACE TABLE test.{table} AS FROM mysql_db.{db_name}.{table};")

# Export the DuckDB table to Parquet at the path specified
con.sql(f"COPY test.{table} TO '{table}.parquet';")

That’s it! The single COPY statement that writes a table to a Parquet file shows just how simple this approach is.
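
If you want to confirm that the schema survived the move, DuckDB can describe the Parquet file directly. Here is a quick check, assuming the accounts.parquet file written above:

# Inspect the column names and types stored in the Parquet file
con.sql("DESCRIBE SELECT * FROM 'accounts.parquet';").show()

# Run a quick row count to compare against the source table
con.sql("SELECT count(*) FROM 'accounts.parquet';").show()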

A notable feature in DuckDB that enhances efficiency is the mysql_bit1_as_boolean setting, which is enabled by default. This setting automatically interprets MySQL BIT(1) columns as boolean values. This contrasts with Pandas, where these values are imported as binary strings (b'\x00' and b'\x01'), requiring cumbersome conversions, particularly when dealing with databases that contain many such columns. For further details and examples of this feature, DuckDB’s documentation offers comprehensive insights.
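
For illustration, the setting can be toggled per session if you ever want the raw BIT(1) values instead (it is already enabled by default, so this is only needed to change the behavior):

# Keep BIT(1) columns as raw bytes instead of booleans
con.sql("SET mysql_bit1_as_boolean = false;")

# Restore the default: interpret BIT(1) columns as BOOLEAN
con.sql("SET mysql_bit1_as_boolean = true;")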

The Advantages of Exporting to Parquet Format

Exporting data to Parquet format is a strategic choice for data engineers and analysts aiming to optimize data storage and query performance. Here’s why Parquet stands out as a preferred format for data-driven initiatives:

  1. Efficient Data Compression and Storage: Parquet is a columnar storage format, enabling it to compress data very efficiently, significantly reducing the storage space required for large datasets. This efficiency does not compromise the data’s fidelity, making Parquet ideal for archival purposes and reducing infrastructure costs.
  2. Improved Query Performance: By storing data by columns instead of rows, Parquet allows for more efficient data retrieval. Analytics and reporting queries often require only a subset of columns; Parquet can read just the needed columns without loading the entire dataset into memory, enhancing performance and reducing I/O (a short query sketch follows this list).
  3. Enhanced Data Analysis with Big Data Technologies: Parquet is widely supported by many data processing frameworks. Its compatibility facilitates seamless integration into big data pipelines and ecosystems, allowing for flexible data analysis and processing at scale.
  4. Schema Evolution: Parquet supports schema evolution, allowing you to add new columns to your data without modifying existing data. This feature enables backward compatibility and simplifies data management over time, as your datasets evolve.
  5. Optimized for Complex Data Structures: Parquet is designed to efficiently store nested data structures, such as those found in JSON and XML documents. This capability makes it an excellent choice for modern applications that often involve complex data types and hierarchical data.
  6. Compatibility with Data Lakes and Cloud Storage: Parquet’s efficient storage and performance characteristics make it compatible with data lakes and cloud storage solutions, facilitating cost-effective data storage and analysis in the cloud.
  7. Cross-platform Data Sharing: Given its open standard format and broad support across various tools and platforms, Parquet enables seamless data sharing between different systems and teams, promoting collaboration and data interoperability.
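
To make the column-pruning benefit from point 2 tangible, here is a small sketch that queries a single column of the exported file; the column name is hypothetical, and DuckDB reads only that column from disk:

import duckdb

# Only the account_id column is read from the Parquet file, not the whole table
duckdb.sql("SELECT account_id FROM 'accounts.parquet' LIMIT 10;").show()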

By exporting data to Parquet, organizations can leverage these advantages to enhance their data analytics capabilities, achieve cost efficiencies in data management, and ensure their data infrastructure is scalable, performant, and future-proof.

Conclusion: Elevating Data Engineering with DuckDB

Navigating the complexities of data extraction and format conversion demands not just skill but the right tools. Through this exploration, we’ve seen how DuckDB simplifies the data export process, providing a seamless bridge from MySQL to Parquet. By preserving data integrity, automatically handling data types, and eliminating the cumbersome data type conversion required by Pandas, DuckDB presents a compelling solution for data engineers seeking efficiency and reliability. Embracing DuckDB not only streamlines your data workflows but also empowers you to unlock new levels of performance and insight from your data, marking a significant leap forward in the pursuit of advanced data engineering.

Thanks for reading!

