
CSV parsing is harder than you think (and why `split(',')` isn't enough)

Quoted fields, escaped quotes, embedded newlines, UTF-8 BOMs, Excel's opinions. A tour of every CSV footgun and the libraries that handle them.

Dev Desk · Developer Tools Editor · Published April 27, 2026 · 2 min read · intermediate

# The naive version


```python
row.split(",")
```

This works until it doesn't. Here's the data that breaks it:


```csv
"Smith, John",42,"Software engineer at ""Acme, Inc."""
```

A naive split produces five fields. It should produce three. Let's unpack.
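To make the failure concrete, here's a quick comparison; the stdlib `csv` module handles the quoting:

```python
import csv

line = '"Smith, John",42,"Software engineer at ""Acme, Inc."""'

# Naive split ignores quoting, so commas inside quoted fields split too.
naive = line.split(",")

# csv.reader respects quotes and collapses doubled quotes.
parsed = next(csv.reader([line]))

print(len(naive))   # 5 pieces, none of them a clean field
print(parsed)       # ['Smith, John', '42', 'Software engineer at "Acme, Inc."']
```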

# The five things CSVs can do

## 1. Quoted fields with commas inside

```csv
"Smith, John",42
```

Two fields. Split on comma produces three. You need a parser that respects quotes.

## 2. Escaped quotes inside quoted fields

The standard is "double the quote":

```csv
"She said ""hi""",answered
```

Two fields: `She said "hi"` and `answered`. Some ancient dialects use backslash escaping (`"She said \"hi\""`); always respect what your tool produces, not what you wish it produced.
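Python's `csv` module applies the doubling rule automatically:

```python
import csv

# Doubled quotes collapse to a single quote inside the field.
row = next(csv.reader(['"She said ""hi""",answered']))
print(row)  # ['She said "hi"', 'answered']
```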

## 3. Embedded newlines inside quoted fields

```csv
"line one
line two",next-field
```

This is a single row with two fields. The first field contains a newline. If your CSV parser reads line by line, you'll read half a row and break.
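This is why you hand the parser a file object rather than pre-split lines. A sketch using `io.StringIO` in place of a real file:

```python
import csv
import io

data = '"line one\nline two",next-field\n'

# Passing the whole stream lets the parser read past the embedded newline.
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['line one\nline two', 'next-field']]
```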

<div class="callout callout-warning" role="note"><div class="callout-title">Warning</div><div class="callout-body"><p>You cannot parse CSV line-by-line unless you have a guarantee that no field contains a newline. Spreadsheet exports routinely produce multi-line quoted fields.</p></div></div>

## 4. UTF-8 BOM at the start

Excel saves CSVs with a UTF-8 BOM (bytes `EF BB BF`). Your parser sees the first column header as `\ufeffname` instead of `name`, and no field lookups match. Strip the BOM before parsing, or use a parser that does.

## 5. Delimiter ambiguity

CSV stands for Comma-Separated Values, but:

  • European locales often use `;` because `,` is the decimal separator
  • Tab-separated (`\t`) files exist for the same reason
  • Pipe-separated (`|`) shows up in data exports
  • Excel picks a delimiter based on the user's locale settings

Your parser needs to accept or detect the delimiter. Auto-detection covers most real-world files: sniff a sample for the most common candidate delimiter, and let callers override the guess.
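Python's stdlib ships exactly this kind of sniffer; a minimal sketch (the sample data is invented):

```python
import csv
import io

sample = "name;age;city\nSmith;42;Berlin\n"

# Sniffer guesses the dialect from a sample of the file.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter)  # ';'
print(rows[0])            # ['name', 'age', 'city']
```

Restricting the candidate set with `delimiters=` makes the guess far more reliable than letting the sniffer consider every character.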

# The quick rule

Do not write a CSV parser. Use one of these:

| Language   | Package / stdlib       |
|------------|------------------------|
| JavaScript | papaparse or csv-parse |
| Python     | csv (stdlib)           |
| Go         | encoding/csv (stdlib)  |
| Rust       | csv crate              |
| Java       | opencsv, commons-csv   |
| PostgreSQL | COPY command           |

Every one of these handles the five cases above correctly by default.

# Common bugs even with a library

## 1. Not disabling newline translation


```python
import csv

# Wrong on Windows: Python silently converts \r\n to \n,
# and quoted fields containing newlines break.
with open("data.csv", "r") as f:
    reader = csv.reader(f)

# Right
with open("data.csv", "r", newline="") as f:
    reader = csv.reader(f)
```

The `newline=""` argument prevents Python from normalising line endings; the `csv` module handles them itself.

## 2. Forgetting to handle the BOM

```python
import csv

# Right: the 'utf-8-sig' codec strips a leading BOM if present
with open("data.csv", "r", encoding="utf-8-sig", newline="") as f:
    reader = csv.reader(f)
```

## 3. Excel's special leading characters

Excel treats fields that start with `=`, `+`, `-`, or `@` as formulas rather than text, and will evaluate them:

```csv
=SUM(...)
```

If your CSV contains user input that starts with these characters, Excel will try to evaluate it. Combined with `=HYPERLINK(...)`, this is a known vulnerability class (CSV injection). Escape a leading `=`, `+`, `-`, or `@` with an apostrophe or a tab.
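A minimal sanitizer along those lines (`defuse_formula` is a name invented here, not a library API):

```python
def defuse_formula(field: str) -> str:
    """Prefix fields Excel would treat as formulas with an apostrophe."""
    if field and field[0] in "=+-@":
        return "'" + field
    return field

print(defuse_formula("=HYPERLINK(...)"))  # '=HYPERLINK(...)
print(defuse_formula("plain text"))       # plain text
```

Note that legitimate negative numbers gain an apostrophe too; skip known-numeric columns if that matters for your data.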

## 4. Trailing commas


```csv
col1,col2,col3,
value,value,value,
```

Some exporters emit a trailing comma. Good parsers read it as a fourth, empty field; bad parsers error. Test both.
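Python's `csv` module takes the lenient route and yields the empty fourth field:

```python
import csv

# Each trailing comma becomes an empty string at the end of the row.
rows = list(csv.reader(["col1,col2,col3,", "value,value,value,"]))
print(rows[0])  # ['col1', 'col2', 'col3', '']
```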

# Excel's opinions

Excel is the biggest consumer of CSV files in the world and has strong opinions:

  • It auto-detects data types. `123e5` becomes scientific notation. Product SKUs get mangled.
  • Zip codes starting with 0 lose the leading zero.
  • Dates get reformatted based on the user's locale.
  • UTF-8 without BOM is interpreted as Windows-1252 — non-ASCII characters become gibberish.

<div class="callout callout-tip" role="note"><div class="callout-title">Tip</div><div class="callout-body"><p>If your CSV is meant to be opened in Excel: include the UTF-8 BOM, quote every field that contains special characters, and prefix number-looking strings with tab (<code>\t</code>) to prevent auto-conversion. Or generate .xlsx directly — our <a href="/json-to-excel">JSON to Excel</a> converter does this in-browser.</p></div></div>
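In Python, the first and third points of that tip come almost for free: the `utf-8-sig` codec writes the BOM, and a tab prefix survives `csv.writer`. A sketch (`excel_safe` is a hypothetical helper of my own; tune the heuristic to your data):

```python
import csv

def excel_safe(value: str) -> str:
    # Tab prefix stops Excel auto-converting things like zip codes
    # and SKUs into numbers. (Hypothetical helper, not a library API.)
    return "\t" + value if value.isdigit() else value

# 'utf-8-sig' writes the EF BB BF BOM at the start of the file.
with open("out.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sku", "zip"])
    writer.writerow([excel_safe("00123"), excel_safe("07030")])
```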

# Convert with confidence

Our CSV to JSON tool handles all five edge cases — quoted fields, escaped quotes, embedded newlines, BOM, custom delimiters. Drop a messy CSV and get clean JSON. Everything client-side; your data doesn't upload anywhere.

# Frequently asked questions

## Is there actually a CSV spec?

RFC 4180 exists, published in 2005, but every implementation treats it as advisory. Excel, Google Sheets, Postgres COPY, Python's `csv` module all interpret edge cases differently. The spec is the closest thing to a lingua franca, not a strict law.

## What's a BOM and why does it matter?

The UTF-8 BOM is three bytes (`EF BB BF`) that Excel puts at the start of CSV files it saves. Most parsers treat the BOM as part of the first field value, so your first column header becomes `\ufeffname` instead of `name`. Strip it before parsing.

## Should I just use TSV instead?

It avoids comma-in-field collision but not the other issues (quoting, newlines, escaping). And Excel won't open a `.tsv` as nicely. CSV with a good parser is still usually the right answer.

New posts, once a week.

Practical developer guides. No spam. Unsubscribe anytime.
