You have several big (1GB - 1TB) CSV/TSV files, zipped. You’d like to query and join those CSV files using SQL commands.
What solutions are there?
Just to keep expectations realistic, this is a hard problem. There’s no magic tool, but here are the best alternatives.
For testing, I used Yellow Taxi Trip Data, a 3.1GB gzipped CSV. My laptop has an Intel Core i7 CPU @ 2.60GHz, 16GB RAM and runs Fedora Linux.
While not a solution per se, xsv is a command-line program for indexing, slicing, analyzing, splitting and joining CSV files. Use it to trim CSV files down to just the columns and rows you need before handing them to other tools. It does NOT support gzipped files.
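For example, a quick way to cut the taxi data down to two columns (the column names come from that dataset; decompress first, since xsv cannot read .gz):

zcat yellow_tripdata_2015-01-06.csv.gz > yellow_tripdata_2015-01-06.csv
# list the column names, then keep only two of them
xsv headers yellow_tripdata_2015-01-06.csv
xsv select passenger_count,total_amount yellow_tripdata_2015-01-06.csv > trimmed.csv
# sanity check: row count of the trimmed file
xsv count trimmed.csv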
ClickHouse is a high performance, open source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).
ClickHouse is meant solely for analytic workloads, so as long as you have a lot of GROUP BY and aggregate functions such as COUNT, SUM and DISTINCT it will be blazing fast. But if you want to do a lot of large-scale joins and point queries, a typical RDBMS is still going to be the better choice. ClickHouse also has no conventional DELETE and UPDATE statements, and modifying existing rows is slow because of its compression mechanism. Again, this is not a database meant for transactional workloads.
TODO: import CSV, run queries.
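Until then, here is a minimal sketch of what the import would look like with clickhouse-client. The table name, the (heavily trimmed) column list and the types are my assumptions; a real table needs one column per CSV field.

# sketch only: in practice, declare one column per field of the CSV
clickhouse-client --query="
  CREATE TABLE trips (
    passenger_count UInt8,
    trip_distance   Float32,
    total_amount    Float32
  ) ENGINE = MergeTree() ORDER BY tuple()"

# stream the gzipped CSV straight in; CSVWithNames consumes the header row
zcat yellow_tripdata_2015-01-06.csv.gz | \
  clickhouse-client --query="INSERT INTO trips FORMAT CSVWithNames"

clickhouse-client --query="
  SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count"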
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Performance-wise it is much weaker than ClickHouse, but it can still be useful in some cases, and it's quite simple to get started.
dnf -y install java-1.8.0-openjdk-devel
Download Apache Drill, unzip it and run the embedded shell.
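The release number below is only an example; substitute whatever is current:

wget https://archive.apache.org/dist/drill/drill-1.16.0/apache-drill-1.16.0.tar.gz
tar xzf apache-drill-1.16.0.tar.gz
cd apache-drill-1.16.0
bin/drill-embedded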
-- test listing some files. Note that full path is required.
SHOW FILES IN dfs.`/home/data/query/apache-drill/sample-data`;
SELECT * FROM dfs.`/home/data/query/apache-drill/sample-data/nation.parquet` limit 3;

-- it recognizes headers in .csvh files
SELECT * FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz` limit 3;

-- the only compression codecs that work with Drill out of the box are gz and bz2.
-- Additional codecs can be added by including the relevant libraries in the Drill classpath.
SELECT count(1) FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`;
+----------+
|  EXPR$0  |
+----------+
| 77080575 |
+----------+
1 row selected (196.048 seconds)

SELECT passenger_count, avg(CAST(total_amount as FLOAT))
FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`
GROUP BY passenger_count;
+-----------------+--------------------+
| passenger_count |       EXPR$1       |
+-----------------+--------------------+
| 3               | 16.0610506037027   |
| 5               | 16.066410538884224 |
| 7               | 46.41927935505651  |
| 9               | 63.1076127344103   |
| 1               | 15.71317216447603  |
| 6               | 15.6777531727075   |
| 2               | 16.523870934969587 |
| 8               | 58.093379282870806 |
| 4               | 16.16237197893557  |
| 0               | 14.743438223497185 |
+-----------------+--------------------+
10 rows selected (294.822 seconds)
You can also view queries at http://localhost:8047/
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Let’s set it up!
dnf -y install java-1.8.0-openjdk-devel maven
Next, complete the setup steps from Presto's deployment documentation:
mkdir -p /var/presto/data
chown marius /var/presto/data

# use localhost instead of example.net
discovery.uri=http://localhost:8080
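For reference, a minimal single-node configuration looks roughly like this (the node id and memory sizes are just example values); the server is then started with bin/launcher run:

# etc/node.properties
node.environment=production
node.id=presto-node-1
node.data-dir=/var/presto/data

# etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=4GB
query.max-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080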
Presto is not a database, but a query engine on top of data stored elsewhere. Unlike ClickHouse, it requires the data to be available in a separate storage system. Because Presto does not support querying CSV files directly, we need to load them into something else first: Hadoop HDFS.
hadoop-hdfs enables us to load CSV data into Presto through the Hive connector:
dnf -y install hadoop-hdfs
hdfs-create-dirs
systemctl start hadoop-namenode hadoop-datanode hadoop-resourcemanager
hdfs dfs -ls /

# create a directory to store the data in and change its owner to your user:
sudo -u hdfs hdfs dfs -mkdir /user/data
sudo -u hdfs hdfs dfs -chown marius /user/data
hdfs dfs -ls /user

# finally, copy the data file(s)
hdfs dfs -copyFromLocal *.csv.gz /user/data
hdfs dfs -ls /user/data
Next, we would still need to set up Apache Hive. After looking at what's involved, I gave up on this approach.
Hopefully the Presto team will consider supporting the Parquet format directly, without any Hadoop/Hive mess.
Another approach might be to import the CSV into Postgres, then use the Presto PostgreSQL connector.
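A rough sketch of that route; the table and database names, the credentials and the trimmed column list are placeholders, and a real table needs one column per CSV field:

# load the gzipped CSV into Postgres
psql taxi -c "CREATE TABLE trips (passenger_count int, trip_distance real, total_amount numeric)"
zcat yellow_tripdata_2015-01-06.csv.gz | \
  psql taxi -c "\copy trips FROM STDIN WITH (FORMAT csv, HEADER true)"

# then point Presto at it via etc/catalog/postgresql.properties
connector.name=postgresql
connection-url=jdbc:postgresql://localhost:5432/taxi
connection-user=marius
connection-password=secret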
Importing into SQLite takes even longer.
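If you want to try it anyway, sqlite3's .import creates the table from the CSV header row (every column ends up as TEXT); it cannot read gzip directly, so decompress first:

zcat yellow_tripdata_2015-01-06.csv.gz > yellow_tripdata_2015-01-06.csv
sqlite3 taxi.db <<'EOF'
.mode csv
.import yellow_tripdata_2015-01-06.csv trips
SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
EOF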
Apache Parquet is a self-describing data format which embeds the schema, or structure, within the data itself. The result is a file optimized for query performance and minimal I/O (smaller than the equivalent CSV).
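If you already have the Drill setup from above, one way to produce Parquet is a one-off CREATE TABLE AS (paths as in the earlier queries; dfs.tmp is Drill's writable workspace by default):

ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`yellow_tripdata_parquet` AS
  SELECT * FROM dfs.`/home/data/query/yellow_tripdata_2015-01-06.csvh.gz`;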
If you intend to query CSV files in the cloud, here’s how Parquet saves costs.
Here’s a ranking of how various solutions compare performance-wise on 1.1 billion records. ClickHouse, which you learned how to use earlier, is among the best.