Using AWK with CSV Files

9 minute read     Updated:

Adam Gordon Bell %
Adam Gordon Bell

This article discusses the challenges of handling AWK CSV in data processing. Earthly integrates tools to improve your build process. Check it out.

When the AWK tutorial came out, one of the questions on hacker news was how you use AWK with CSV (comma-separated value) files.

The thing that prevents awk from being a major part of my daily routine is that it (amazingly) has poor CSV support.

I would love to see awk with “CSV mode”, where it intelligently handles formats like this if you just pass a flag. - jrumbut

This was a common complaint about AWK, and it seems like it should be simple to handle CSV files: tell AWK to use a comma as the separator:

> echo one,two,three | awk -F "," '{ print $2}'
two

The comma separator also works if your CSV files use quotes to wrap some fields:

> echo one,\"two\",three | awk -F "," '{ print $2}'
"two"

Unfortunately, things get more complex from there.

CSV files can contain commas, line-breaks, and delimited quotes within the quoted values, which is great for storing data in a CSV file, but is something that AWK is just not well suited to handle:

> echo 1,\"2a,2b\", 3 | awk -F "," '{ print $3}'
2b"
The third CSV value is 3 but AWK gets 2b"

A Solution

A simple solution to this problem is to use csvquote1.

Are you looking for a way to process CSV data with standard UNIX shell commands?

Are you running into problems with embedded commas and newlines that mess everything up?

Let me show you how it works.

If I take a messy CSV row (1,"2a""2b", 3) and pass it to csvquote the tricky values disappear:

> echo 1,\"2a,\"\"2b\", 3 | csvquote
1,"2a2b", 3

But, when I pass it back through csvquote again with the undo flag (-u) the values are back:

> echo 1,\"2a,\"\"2b\", 3 | csvquote | csvquote -u
1,"2a,""2b", 3

How does it do this? csvquote works by temporarily replacing the problematic characters inside quoted fields with harmless non-printing characters. These non-printing characters are rarely used ASCII control codes. Later these non-printing characters are restored to their previous values.

This makes dealing with CSV files as simple as any other format:

> echo 1,\"2a,2b\", 3 | csvquote | awk -F "," '{ print $2 }' | csvquote -u 
"2a,2b" 

Installing csvquote

To install cvsquote, you need to grab the source and build it:

> git clone https://github.com/adamgordonbell/csvquote
> cd csvquote
> go build -o csvquote cmd/cvsquote/main.go
> cp ./csvquote /usr/local/bin

Or alternatively use this earthly one-liner.

> earthly github.com/adamgordonbell/csvquote+build
> cp ./csvquote /usr/local/bin
> earthly github.com/adamgordonbell/csvquote+for-darwin-amd64
> cp ./csvquote /usr/local/bin

More installation instructions found in the readme.

Once you have csvquote, wrap your AWK calls like this:

csvquote file.csv | awk ... | csvquote -u

And you can pretend that AWK natively supports CSV files. ( You can use this same trick with other UNIX line-oriented tools. head, tail and sort don’t understand CSV either, but if you wrap them in csvquote you will be able to handle delimited line breaks correctly.)

Other Ways to Handle CSVs

You can find lots of other potential solutions on Stack Overflow for dealing with CSV files in AWK. For completion’s sake, here are two approaches:

FPAT

FPAT allows you to describe the contents of a row, rather than the delimiter. You can use this to handle many CSV files. The gawk manual has good coverage of this approach. Unfortunately, newlines inside of a row break any FPAT solution.

gawkextlib

gawkextlib is a collection of extensions for gawk, one of which is specifically for handling CSV files. After building gawkextlib, you should be able to parse CSV files by including the csv extension:

@include "csv"
BEGIN { CSVMODE = 1 }
{ print $2 }

However, you might run into trouble building gawkextlib. And you’ll also have to rebuild gawk to include it:

To use this library, you must first build the new version of gawk containing shared extension library support. Please use gawk 4.1 or later, or use this recipe to build from the git sources.

So I’m going to stick with csvquote for now.

(Also, if you’re the type of person who’s not afraid to do things on the command line then you might like Earthly:)

Earthly Cloud: Consistent, Fast Builds, Any CI
Consistent, repeatable builds across all environments. Advanced caching for faster builds. Easy integration with any CI. 6,000 build minutes per month included.

Get Started Free


  1. Dan Brown created the original CSVquote. I’ve forked it, using his go implementation and added some installation helpers, but all credit should go to Dan for this technique and the initial implementation.↩︎

Adam Gordon Bell %
Spreading the word about Earthly. Host of CoRecursive podcast. Physical Embodiment of Cunningham's Law.
@adamgordonbell
✉Email Adam✉

Updated:

Published:

Get notified about new articles!
We won't send you spam. Unsubscribe at any time.