Using AWK with CSV Files
In this Series
Table of Contents
This article discusses the challenges of handling AWK CSV in data processing. Earthly integrates tools to improve your build process. Check it out.
When the AWK tutorial came out, one of the questions on hacker news was how you use AWK with CSV (comma-separated value) files.
The thing that prevents awk from being a major part of my daily routine is that it (amazingly) has poor CSV support.
I would love to see awk with “CSV mode”, where it intelligently handles formats like this if you just pass a flag. -
jrumbut
This was a common complaint about AWK, and it seems like it should be simple to handle CSV files: tell AWK to use a comma as the separator:
> echo one,two,three | awk -F "," '{ print $2}'
two
The comma separator also works if your CSV files use quotes to wrap some fields:
> echo one,\"two\",three | awk -F "," '{ print $2}'
"two"
Unfortunately, things get more complex from there.
CSV files can contain commas, line-breaks, and delimited quotes within the quoted values, which is great for storing data in a CSV file, but is something that AWK is just not well suited to handle:
> echo 1,\"2a,2b\", 3 | awk -F "," '{ print $3}'
2b"
3
but AWK gets 2b"
A Solution
A simple solution to this problem is to use csvquote
1.
Are you looking for a way to process CSV data with standard UNIX shell commands?
Are you running into problems with embedded commas and newlines that mess everything up?
Let me show you how it works.
If I take a messy CSV row (1,"2a""2b", 3
) and pass it to csvquote
the tricky values disappear:
> echo 1,\"2a,\"\"2b\", 3 | csvquote
1,"2a2b", 3
But, when I pass it back through csvquote
again with the undo flag (-u
) the values are back:
> echo 1,\"2a,\"\"2b\", 3 | csvquote | csvquote -u
1,"2a,""2b", 3
How does it do this? csvquote
works by temporarily replacing the problematic characters inside quoted fields with harmless non-printing characters. These non-printing characters are rarely used ASCII control codes. Later these non-printing characters are restored to their previous values.
This makes dealing with CSV files as simple as any other format:
> echo 1,\"2a,2b\", 3 | csvquote | awk -F "," '{ print $2 }' | csvquote -u
"2a,2b"
Installing csvquote
To install cvsquote
, you need to grab the source and build it:
> git clone https://github.com/adamgordonbell/csvquote
> cd csvquote
> go build -o csvquote cmd/cvsquote/main.go
> cp ./csvquote /usr/local/bin
Or alternatively use this earthly
one-liner.
> earthly github.com/adamgordonbell/csvquote+build
> cp ./csvquote /usr/local/bin
> earthly github.com/adamgordonbell/csvquote+for-darwin-amd64
> cp ./csvquote /usr/local/bin
More installation instructions found in the readme.
Once you have csvquote
, wrap your AWK calls like this:
csvquote file.csv | awk ... | csvquote -u
And you can pretend that AWK natively supports CSV files. ( You can use this same trick with other UNIX line-oriented tools. head
, tail
and sort
don’t understand CSV either, but if you wrap them in csvquote
you will be able to handle delimited line breaks correctly.)
Other Ways to Handle CSVs
You can find lots of other potential solutions on Stack Overflow for dealing with CSV files in AWK. For completion’s sake, here are two approaches:
FPAT
FPAT
allows you to describe the contents of a row, rather than the delimiter. You can use this to handle many CSV files. The gawk manual has good coverage of this approach. Unfortunately, newlines inside of a row break any FPAT
solution.
gawkextlib
gawkextlib
is a collection of extensions for gawk
, one of which is specifically for handling CSV files. After building gawkextlib
, you should be able to parse CSV files by including the csv
extension:
@include "csv"
BEGIN { CSVMODE = 1 }
print $2 } {
However, you might run into trouble building gawkextlib
. And you’ll also have to rebuild gawk
to include it:
To use this library, you must first build the new version of gawk containing shared extension library support. Please use gawk 4.1 or later, or use this recipe to build from the git sources.
So I’m going to stick with csvquote
for now.
(Also, if you’re the type of person who’s not afraid to do things on the command line then you might like Earthly:)
Earthly Cloud: Consistent, Fast Builds, Any CI
Consistent, repeatable builds across all environments. Advanced caching for faster builds. Easy integration with any CI. 6,000 build minutes per month included.