csvsql: A script to perform database queries on csv files.

##About csvsql is a command line script that treates csv and tsv files as database tables. csvsql allows you to perform arbitrary queries on those files, including multi-file joins. The core functionality is provided by the h2 database engine. csvsql itself is written in Groovy, but is simply run like like any command-line program.

##Installation You will need to have Groovy installed, which in turn requires that you have a fairly recent version of Java installed. Visit the Groovy website for instructions. Once Groovy is installed, just clone durbinlib, build it, and point your path to durbinlib/scripts:

 
git clone git://github.com/jdurbin/durbinlib.git
cd durbinlib
ant install
export PATH=$PATH:durbinlib/scripts/

durbinlib will give you many other useful tools in addition to csvsql, but csvsql makes it worth the effort alone.

It is suggested that you set your environment variable to tell Groovy to start up Java with a lot of RAM and to run in server mode which encourages early JIT which improves performance of long running jobs at the expense of startup time (or use -client if startup is more important to you and your queries are small):

export JAVA_OPTS='-Xms500m -Xmx3000m -server'

##Usage Any query that is valid sql for the h2 database engine should work.

csvsql "select * from people.csv where age > 40"
csvsql "select name,score from people.csv where age >40"
csvsql "select name,score from people.csv where age <50 and score > 100"

Full path names should work fine:

csvsql "select * from /users/data/people.csv where age > 40"
csvsql "select people.name from /users/data/people.csv where age > 40"
csvsql "select name from /users/data/people.csv where age > 40"
csvsql "select name,age from /users/data/people.csv where age > 40"

You can even do queries with sum and average and so on like:

csvsql "select sum(score) from people.csv where age < 40"
csvsql "select alcohol,count(*) from beer.tab group by alcohol"

If children.csv is a file with same key name as people, then you can join query like:

csvsql "select people.name,children.child from people.csv, children.csv where people.name=children.name"

You can also enter the query on multiple lines like:

csvsql "
> select people.name,children.child
> from people.csv,children.csv
> where people.name=children.name and people.age < 40"

get distribution of values:

csvsql "select alcohol,count(*) from beer.tab group by alcohol"

rename an output column:

csvsql "select brand,alcohol,price as SalesPrice from beer.tab order by price"

alias names for convenience:

 
csvsql "select p.name,c.child from people.csv as p, children.csv as c where p.name=c.name"

And so on…

csvsql "select brand,count(*) from beer.tab group by brand having count(*) > 2"
csvsql "select * from beer.tab where brand like 'Red%'"
csvsql "select distinct(type) from beer.tab"

##Caveats

Files can be either tab delimited or comma delimited and must end in one of .csv, .tsv, or .tab. It is necessary refer to the file by file name at least once in the query, but thereafter it can be referred to either with the file name or the root name (e.g. beer for beer.tab).

The file name can only have one “.” in it. So vijver2002.clinical.tab won’t work, but vijver2002_clinical.tab will. This restriction comes from h2 database functions and is not easily fixed given the exposed API. Update: Actually, this may be fixable… not trivial, but not too hard.

In the query there must be a space before each file name, so in this query:

 
 csvsql "select p.name,c.child from people.csv as p, children.csv as c where p.name=c.name"

the space between the “,” and children.csv is obligatory.

Currently all files referenced in a given query must be in same format (tab or comma). The output format of the query will be in the same format as the input files. csvsql infers file type by looking at extension so tab files are required to end in either .tsv or .tab and comma separated files are required to end in .csv

##Implementation Details All of the real power for the sql queries comes from the excellent h2 database engine. The h2 engine allows full SQL and very fast execution.

##Performance The main reason for writing csvsql was to work with tables that are too large and cumbersome to explore in a spreadsheet. I typically process csv files containing between 100k and 1M records, with usually a few hundred to a few thousand columns. csvsql can perform a select on a file 100k rows in length and output 10k results to a file in 5-8 seconds on a 2007 MacBook Pro. That includes the time to read and create the in-RAM database from the CSV file. That seems pretty quick to me, quick enough that it isn’t a problem creating the in-RAM database with each invocation. Nonetheless, I may add an option to this script that creates a daemon that loads the tables and serves them as long as the daemon is running, and when run with a different option will look to an existing daemon for the tables. The idea is to still be lighter-weight than most database creation options, for quick and dirty file exploration, but to get a speed boost for repeated queries to the same large file.

Bayesian Sleeper Cell

A site about programming, biology, and math.

csvsql: A script to perform database queries on csv files.