How to query Parquet files
A lot of the world's data lives in Amazon S3 buckets. In this guide, we'll learn how to query that data using chDB.
Setup
Let's first create a virtual environment:
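For example, using Python's built-in `venv` module (the `.venv` directory name is just a convention):

```bash
# create a virtual environment in .venv and activate it
python -m venv .venv
source .venv/bin/activate
```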
And now we'll install chDB. Make sure you have version 2.0.2 or higher:
And now we're going to install IPython:
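A plain pip install is enough:

```bash
# install the IPython interactive shell
pip install ipython
```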
We're going to use `ipython` to run the commands in the rest of the guide, which you can launch by running:
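From the activated virtual environment:

```bash
# start an interactive IPython session
ipython
```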
You can also use the code in a Python script or in your favorite notebook.
Exploring Parquet metadata
We're going to explore a Parquet file from the Amazon reviews dataset.
But first, let's import chDB:
When querying Parquet files, we can use the `ParquetMetadata` input format to have it return Parquet metadata rather than the contents of the file.
Let's use the `DESCRIBE` clause to see the fields returned when we use this format:
Now let's have a look at the metadata for this file. The `columns` and `row_groups` fields both contain arrays of tuples with many properties, so we'll exclude those for now.
From this output, we learn that this Parquet file has over 40 million rows, split across 42 row groups, with 15 columns of data per row. A row group is a logical horizontal partitioning of the data into rows. Each row group has associated metadata and querying tools can make use of that metadata to efficiently query the file.
Let's take a look at one of the row groups:
Querying Parquet files
Next, let's query the contents of the file.
We can do this by adjusting the above query to remove ParquetMetadata
and then, say, compute the most popular star_rating
across all reviews:
Interestingly, there are more 5-star reviews than all the other ratings combined! It looks like people either like the products on Amazon or, if they don't, they simply don't submit a rating.