How to query big CSV files

You have several big (1 GB to 1 TB) CSV/TSV files, compressed with gzip. You’d like to query and join them using SQL.

What solutions are there?

Just to keep expectations realistic, this is a hard problem. There’s no magic tool, but here are the best alternatives.

Setup

For testing, I used Yellow Taxi Trip Data, a 3.1GB gzipped CSV. My laptop has an Intel Core i7 CPU @ 2.60GHz, 16GB RAM and runs Fedora Linux.

xsv

While not a solution per se, xsv is a command-line program for indexing, slicing, analyzing, splitting and joining CSV files. Use it to trim CSV files down to only the data you need before feeding them to other tools. Note that it does NOT support gzipped files.
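For example, a typical trimming pass might look like this (a sketch; the selected column names are illustrative, and the file has to be decompressed first since xsv can’t read gzip):

# decompress, keeping the original archive
gunzip -k yellow_tripdata_2015-01-06.csv.gz

# keep only the columns needed for the analysis
xsv select passenger_count,total_amount yellow_tripdata_2015-01-06.csv > trips_trimmed.csv

# sanity check: count the remaining rows
xsv count trips_trimmed.csv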

ClickHouse

ClickHouse is a high performance, open source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).

ClickHouse is meant solely for analytic workloads: as long as your queries are mostly GROUP BY with aggregate functions such as COUNT, SUM and DISTINCT, it will be blazing fast. But if you want to do a lot of large-scale joins and point queries, a typical RDBMS is still going to be the better choice. ClickHouse is not built for DELETE and UPDATE either; where such operations are possible at all, they are slow due to its compression mechanism. Again, this is not a database meant for transactional workloads.

Setup on Linux is straightforward, as they provide Ubuntu/Debian packages. RHEL packages are also available; for Fedora, you need to build the packages yourself.
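On a Debian/Ubuntu system, the install is roughly the following (a sketch, assuming the ClickHouse repository has already been added as described in their install docs):

# install the server and the command line client, then start the server
sudo apt-get install -y clickhouse-server clickhouse-client
sudo systemctl start clickhouse-server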

TODO: import CSV, run queries.
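In the meantime, here is the general shape a ClickHouse import would take (a sketch only; the table definition reuses the two illustrative columns from the xsv example above, not the full taxi schema):

# create a MergeTree table matching the trimmed CSV
clickhouse-client --query "CREATE TABLE trips (passenger_count UInt8, total_amount Float64) ENGINE = MergeTree ORDER BY passenger_count"

# load the CSV (the CSVWithNames format skips the header row), then run an aggregation
clickhouse-client --query "INSERT INTO trips FORMAT CSVWithNames" < trips_trimmed.csv
clickhouse-client --query "SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count"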

Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Performance-wise it is much weaker than ClickHouse, but it can still be useful in some cases, and it’s quite simple to get started!

Install dependencies:

dnf -y install java-1.8.0-openjdk-devel

Unpack the archive and run the embedded shell:

$ apache-drill/bin/drill-embedded

-- test listing some files. Note that full path is required.
SHOW FILES IN dfs.`/home/data/query/apache-drill/sample-data`;

SELECT * FROM dfs.`/home/data/query/apache-drill/sample-data/nation.parquet` limit 3;

-- it recognizes headers in .csvh files
SELECT * FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz` limit 3;

-- the only compression codecs that work with Drill out of the box are gz and bz2. Additional codecs can be added by including the relevant libraries in the Drill classpath.


SELECT count(1) FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`;
+----------+
|  EXPR$0  |
+----------+
| 77080575 |
+----------+
1 row selected (196.048 seconds)


SELECT passenger_count, avg(CAST(total_amount as FLOAT))
FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`
GROUP BY passenger_count;
 
+-----------------+--------------------+
| passenger_count |       EXPR$1       |
+-----------------+--------------------+
| 3               | 16.0610506037027   |
| 5               | 16.066410538884224 |
| 7               | 46.41927935505651  |
| 9               | 63.1076127344103   |
| 1               | 15.71317216447603  |
| 6               | 15.6777531727075   |
| 2               | 16.523870934969587 |
| 8               | 58.093379282870806 |
| 4               | 16.16237197893557  |
| 0               | 14.743438223497185 |
+-----------------+--------------------+
10 rows selected (294.822 seconds)

Here’s how to do JOINs.
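A minimal sketch (the second file and all column names are illustrative, assuming both files carry headers):

-- join two CSV files on a shared column
SELECT t.passenger_count, z.zone_name
FROM dfs.`/home/data/query/trips.csvh` t
JOIN dfs.`/home/data/query/zones.csvh` z
ON t.zone_id = z.zone_id
LIMIT 3;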

You can also view queries at http://localhost:8047/

Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Let’s set it up!

Install dependencies:

dnf -y install java-1.8.0-openjdk-devel maven

Next, complete these setup steps.

mkdir -p /var/presto/data
chown marius /var/presto/data

# in etc/config.properties, use localhost instead of example.net
discovery.uri=http://localhost:8080
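With the data directory and config in place, starting the server and connecting looks roughly like this (a sketch based on the standard Presto tarball layout; the CLI is a separate executable jar, renamed to presto):

# start the coordinator from the unpacked presto-server directory
bin/launcher start

# connect with the command line client
./presto --server localhost:8080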

Presto is not a database, but a query engine that sits on top of existing data stores. Unlike ClickHouse, it requires the data to already live in a separate system. Because Presto cannot connect to plain CSV files directly, we first need to load them into something else: hadoop-hdfs.

hadoop-hdfs lets us store the CSV data and expose it to Presto through the Hive connector:

dnf -y install hadoop-hdfs
hdfs-create-dirs
systemctl start hadoop-namenode hadoop-datanode hadoop-resourcemanager
hdfs dfs -ls /

# create a directory in HDFS to store the data and change its owner to your user:
sudo -u hdfs hdfs dfs -mkdir /user/data
sudo -u hdfs hdfs dfs -chown marius /user/data
hdfs dfs -ls /user

# finally, copy the data file(s)
hdfs dfs -copyFromLocal *.csv.gz /user/data
hdfs dfs -ls /user/data

Next, we still need to set up Apache Hive. After looking at what’s involved, I gave up on this approach.

To sum up, querying a CSV with Presto requires first loading it into hadoop-hdfs, then using Hive to create a table on top of it. Way too complex. Here’s how somebody else did it.
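For reference, the Hive step I skipped would look roughly like this (a sketch; the column names are illustrative, and Hive itself still has to be installed and configured):

-- an external Hive table pointing at the CSV files already copied into HDFS
CREATE EXTERNAL TABLE trips (passenger_count INT, total_amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/data'
TBLPROPERTIES ("skip.header.line.count"="1");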

Hopefully the Presto team will consider supporting the Parquet format directly, without any Hadoop/Hive mess.

Another approach might be to import the CSV into Postgres, then use the Presto PostgreSQL connector.

Traditional DBs

Even with a columnar store extension, importing the data into Postgres and then querying it takes quite a long time.
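For reference, the plain Postgres import goes roughly like this (a sketch; the table and columns are illustrative and assume the CSV was pre-trimmed with xsv as above):

-- run inside psql
CREATE TABLE trips (passenger_count int, total_amount numeric);

-- client-side copy of the trimmed CSV
\copy trips FROM 'trips_trimmed.csv' WITH (FORMAT csv, HEADER true)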

SQLite takes even longer.
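The SQLite route is just as easy to express (a sketch, run inside the sqlite3 shell, again on the trimmed CSV):

-- if the trips table does not exist yet, .import creates it using the CSV header for column names
.mode csv
.import trips_trimmed.csv trips
SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;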

Commercial cloud services

If you can afford the cost and there are no legal/privacy issues with uploading your data to a commercial cloud provider, see Amazon’s Athena and Google’s BigQuery.

Sidenote: Consider Parquet file format instead of CSV

Apache Parquet is a self-describing data format which embeds the schema, or structure, within the data itself. The result is a file optimized for query performance and minimal I/O, and one that is also much smaller than the equivalent CSV.
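If you already have Drill running, converting a CSV to Parquet can be a single CTAS query (a sketch; dfs.tmp is the writable workspace in the default embedded setup, and CTAS writes Parquet by default):

-- write a Parquet copy of the CSV into the writable dfs.tmp workspace
CREATE TABLE dfs.tmp.`yellow_tripdata_parquet` AS
SELECT * FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`;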

If you intend to query CSV files in the cloud, here’s how Parquet saves costs.

Appendix: Benchmarks

Here’s a ranking of how various solutions compare performance-wise on 1.1 billion records. ClickHouse, which we covered earlier, is among the best.