Skip to content
All posts

CSV parsing is harder than you think (and why `split(',')` isn't enough)

Quoted fields, escaped quotes, embedded newlines, UTF-8 BOMs, Excel's opinions. A tour of every CSV footgun and the libraries that handle them.

DDDev DeskDeveloper Tools EditorPublished April 27, 20262 min readintermediate

# The naive version


row.split(",")

This works until it doesn't. Here's the data that breaks it:


"Smith, John",42,"Software engineer at ""Acme, Inc."""

A naive split produces six fields. It should produce three. Let's unpack.

# The five things CSVs can do

# 1. Quoted fields with commas inside


"Smith, John",42

Two fields. Split on comma produces three. You need a parser that respects quotes.

# 2. Escaped quotes inside quoted fields

The standard is "double the quote":


"She said ""hi""",answered

Two fields: She said "hi" and answered. Some ancient dialects use backslash-escape ("She said \"hi\"") — always respect what your tool produces, not what you wish it produced.

# 3. Embedded newlines inside quoted fields


"line one
line two",next-field

This is a single row with two fields. The first field contains a newline. If your CSV parser reads line-by-line, you'll read half a row and break.

<div class="callout callout-warning" role="note"><div class="callout-title">Warning</div><div class="callout-body"><p>You cannot parse CSV line-by-line unless you have a guarantee that no field contains a newline. Spreadsheet exports routinely produce multi-line quoted fields.</p></div></div>

# 4. UTF-8 BOM at the start

Excel saves CSVs with a UTF-8 BOM (bytes EF BB BF). Your parser sees the first column header as \ufeffname instead of name, and no field lookups match. Strip the BOM before parsing, or use a parser that does.

# 5. Delimiter ambiguity

CSV stands for Comma-Separated Values, but:

  • European locales often use ; because , is the decimal separator
  • Tab-separated (\t) exists for the same reason
  • Pipe-separated (|) shows up in data exports
  • Excel looks at locale settings and picks

Your parser needs to accept or detect the delimiter. Auto-detection works ~95% of the time — sniff the header line for the most common delimiter character.

# The quick rule

Do not write a CSV parser. Use one of these:

| Language | Package / stdlib |

|------------|------------------------|

| JavaScript | papaparse or csv-parse |

| Python | csv (stdlib) |

| Go | encoding/csv (stdlib)|

| Rust | csv crate |

| Java | opencsv, commons-csv |

| PostgreSQL | COPY command |

Every one of these handles the five cases above correctly by default.

# Common bugs even with a library

# 1. Not reading in binary / bytes mode


# Wrong on Windows — converts \r\n to \n silently, then quoted-newlines break
with open("data.csv", "r") as f:
    reader = csv.reader(f)
# Right
with open("data.csv", "r", newline="") as f:
    reader = csv.reader(f)

The newline="" prevents Python from normalising line endings.

# 2. Forgetting to handle the BOM


# Right: 'utf-8-sig' strips the BOM
with open("data.csv", "r", encoding="utf-8-sig", newline="") as f:
    reader = csv.reader(f)

# 3. Excel's special leading characters

Excel treats these as formulas, not text, and will execute them:


=SUM(...)
+, -, @ at start of a field

If your CSV contains user input that starts with these characters, Excel will try to evaluate them. This is CVE-worthy when combined with =HYPERLINK(...). Escape leading =, +, -, @ with a leading apostrophe or tab.

# 4. Trailing commas


col1,col2,col3,
value,value,value,

Some exporters include a trailing comma. Good parsers interpret this as a 4th empty field. Bad parsers error. Test both.

# Excel's opinions

Excel is the biggest consumer of CSV files in the world and has strong opinions:

  • It auto-detects data types. 123e5 becomes scientific notation. Product SKUs get mangled.
  • Zip codes starting with 0 lose the leading zero.
  • Dates get reformatted based on the user's locale.
  • UTF-8 without BOM is interpreted as Windows-1252 — non-ASCII characters become gibberish.

<div class="callout callout-tip" role="note"><div class="callout-title">Tip</div><div class="callout-body"><p>If your CSV is meant to be opened in Excel: include the UTF-8 BOM, quote every field that contains special characters, and prefix number-looking strings with tab (<code>\t</code>) to prevent auto-conversion. Or generate .xlsx directly — our <a href="/json-to-excel">JSON to Excel</a> converter does this in-browser.</p></div></div>

# Convert with confidence

Our CSV to JSON tool handles all five edge cases — quoted fields, escaped quotes, embedded newlines, BOM, custom delimiters. Drop a messy CSV and get clean JSON. Everything client-side; your data doesn't upload anywhere.

Common questions

Frequently asked.

Is there actually a CSV spec?

RFC 4180 exists, published in 2005, but every implementation treats it as advisory. Excel, Google Sheets, Postgres COPY, Python's `csv` module all interpret edge cases differently. The spec is the closest thing to a lingua franca, not a strict law.

What's a BOM and why does it matter?

UTF-8 BOM is three bytes (EF BB BF) that Excel puts at the start of CSV files it saves. Most parsers treat the BOM as part of the first field value — so your first column header becomes `\ufeffname` instead of `name`. Strip it before parsing.

Should I just use TSV instead?

It avoids comma-in-field collision but not the other issues (quoting, newlines, escaping). And Excel won't open a `.tsv` as nicely. CSV with a good parser is still usually the right answer.

Postingan baru, sekali seminggu.

Panduan pengembang praktis. Tanpa spam. Berhenti berlangganan kapan saja.

Tools mentioned

Pick up where the post leaves off.

Keep reading

More from the field notes.