There's an old story about two senior architects that were friends in college, and met again thirty years later. After a few minutes they started talking about their favorite achievements. The first described office towers, airports, and universities he was quite proud of. The second didn't have any monuments to talk about, but shared that he thought he may have designed the perfect chair.t chair. Clearly trumped, his friend congratulated him, and asked to hear more - since the perfect chair is far more significant than yet another monument.
Sometimes, I feel that small unix utilities are to a programmer what a chair is to an architect: they continue to be essential, and are typically small, spare, do just a single thing and can clearly show elegance.
I've written quite a number of them, and have recently started packaging those related to data analysis into a project called DataGristle. My favorite utility of the set is gristle_slicer - a tool similar to the Unix program cut. While cut allows the user to select columns out of a file, gristle_slicer selects columns and rows - and uses the more functional Python string slicing syntax to do it.
It's no perfect chair but it might be a good utility.
How it worksThe user provides 4 optional specifications:
- -r or --records (record inclusion criteria)
- -R or --exrecords (record exclusion criteria)
- -c or --columns (column inclusion criteria)
- -C or --excolumns (column exclusion criteria)
- -r "3,5,20:30" - will include records 3,5 and those from 20 to 29 (based on 0 offset)
- -r ":15" - will include all records from the beginning to 14
- -c ":-1" - will include all columns except for the last one
Beside the specifications there's other options as well. Because it uses the internal DataGristle libraries - it will figure out the csv file delimiter and quoting by itself. And it's right about 95% of the time. In case it isn't, there are options to manually override it:
- -d '|' - allows user to explicitly identify a single-column delimiter
- -q <quotetype> - allows user to explicitly identify quoting type - using one of quote_none, quote_all, quote_minimal, or quote_nonnumeric
Lets look at some examples
Lets assume a seven-row, seven-column, pipe-delimited csv file called "sample.csv" that looks like this:
But from here on forward we'll depict it as a table.
If we want to see the entire table we can simply cat it, or we could use gristle_slicer like this:
$ gristle_slicer sample.csv
Which defaults to print all columns & rows:
1. Select just the first row$ gristle_slicer sample.csv -r 0 -d'|'
We'll provide the delimiter since this small file with consistent dashes can throw off the automatic detection. And it will write the high-lighted line below to stdout:
2. Select just the first column$ gristle_slicer sample.csv -c 0 -d'|'
Again, this will write the high-lighted column below to stdout:
3. Select the intersection of the first row and first column:$ gristle_slicer sample.csv -r 0 -c 0 -d'|'
4. Select the first 4 rows:$ gristle_slicer sample.csv -r 0:3 -d'|'
5. Select the first 4 rows but exclude the 2nd$ gristle_slicer sample.csv -r 0:4 -R 1 -d'|'
6. Select first and last column on the first and last rows
This can be done by including the first & last row and the first & last column:
$ gristle_slicer sample.csv -r 0,-1 -c 0, -1 -d'|'
Or by including the first & last rows and excluding the column range between the 2nd and 2nd to the last column:
$ gristle_slicer sample.csv -r 0, -1 -C 1:-1 -d'|'
7. Select the middleThere's lots of different ways to do this one. The first is to just include the 2nd to 2nd-to-last rows and 2nd to 2nd-to-last columns:
$ gristle_slicer sample.csv -r 1:-1 -c 1:-1 -d'|'
Another way is to include all rows and columns (it's implied if you leave off -r & -c) and just exclude the two you don't want:
$ gristle_slicer sample.csv -R 0,-1 -C 0,-1 -d '|'
8. Select rows 2 & 4, columns 2, 4 through 6One way to do this is to just include rows 2-4, exclude row 3, then include columns 2 and 4 through 6:
$ gristle_slicer -r 2:5 -R 4 -c2,4:7 -d '|'
Another way is to just specifically include rows and columns you need:
$ gristle_slicer -r 2,4 -c 2,4,5,6 -d '|'
Another way is to combine multiple commands:
$ gristle_slicer -r 2,4 -d'|' | gristle_slicer -c 2,4,5,6 -d '|'
I'm probably done with this for a while. But eventually, I'll probably return and add Python slicing step logic to allow the user to pick every N item. I might modify it to work on files without delimiters. And I might upgrade it to allow columns to be referenced by their names in the headers.