
CSV and JSONL — the two formats AI moves data in — step 3 of 9

JSONL — the format AI training data ships in

CSV is fine until your rows have nested data. The moment a column needs to hold a list, a dict, or anything richer than a flat string, CSV gets awkward: you end up stuffing encoded blobs into cells. The replacement you'll see everywhere in modern AI work is JSONL: one JSON object per line, separated by newlines. OpenAI's fine-tuning format is JSONL. Anthropic's batch API returns its results as JSONL. Many LLM eval frameworks write their logs as JSONL. Read one, you've read them all.

The mental model

A JSONL file looks like this:

{"id": 1, "prompt": "summarize", "tokens": 240}
{"id": 2, "prompt": "translate", "tokens": 180}
{"id": 3, "prompt": "rewrite", "tokens": 95}

Three lines, three independent JSON objects. Not a JSON array of objects (no surrounding [ ], no commas between lines). This is the key shape: each line stands on its own.

Why that matters: you can read the file one line at a time. A 5GB JSONL file streams through Python without ever holding the whole thing in memory. A 5GB JSON array can't: json.load(f) reads and parses the entire document before it returns anything. JSONL is the streaming-friendly cousin of JSON.
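A minimal sketch of that streaming claim, assuming a helper named iter_jsonl (a made-up name, not a library function) and a small throwaway file built just for the demo:

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed record at a time; only the current line is in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Build a tiny sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write('{"id": 1, "tokens": 240}\n')
    f.write('{"id": 2, "tokens": 180}\n')
    f.write('{"id": 3, "tokens": 95}\n')

# Aggregate over the rows without ever building a list of all of them.
total_tokens = sum(row["tokens"] for row in iter_jsonl(sample_path))
print(total_tokens)  # 515
```

The same loop would work unchanged on a file far larger than memory, because the generator hands back one row and then forgets it.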

The shape AI expects you to write

Almost every JSONL reader in Python looks like this:

import json

with open("data.jsonl") as f:
    for line in f:
        row = json.loads(line)
        # row is a dict, do something with it

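Iterating a file handle yields one line at a time, including the trailing \n. json.loads ignores leading and trailing whitespace, so you don't need to .strip(). The one exception is a completely blank line: that is not valid JSON, and json.loads raises JSONDecodeError on it. Each row is a dict you can index by key.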
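If a file might contain blank lines or an occasional bad record, a slightly more defensive reader helps. This is one possible sketch, not a standard API: the name iter_rows and the choice to raise ValueError with a line number are assumptions made for illustration.

```python
import json

def iter_rows(lines):
    """Yield parsed rows from an iterable of text lines, skipping blank ones.

    Raises ValueError naming the 1-based line number of the first bad line.
    """
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines instead of crashing on them
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: {exc.msg}") from exc

# A list of strings stands in for an open file handle here.
sample = ['{"id": 1}\n', "\n", '{"id": 2}\n']
rows = list(iter_rows(sample))
print(rows)  # [{'id': 1}, {'id': 2}]
```

Passing an open file handle instead of the list works the same way, since both yield one line at a time.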

Writing JSONL is the symmetric shape:

import json

with open("data.jsonl", "w") as f:
    for row in records:
        f.write(json.dumps(row) + "\n")

json.dumps is the inverse of json.loads: dict in, JSON string out. With its default settings it emits no line breaks, so each record stays on a single line. Append "\n" so each record gets its own line.
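This is also where the nested data that CSV struggles with fits naturally. A small sketch with a made-up record (the tags and usage fields are hypothetical) showing a list and a dict surviving a round trip through one line:

```python
import json

# Hypothetical record: "tags" is a list and "usage" is a dict,
# the kind of values a flat CSV cell cannot hold directly.
record = {"id": 1, "tags": ["news", "short"], "usage": {"input": 200, "output": 40}}

line = json.dumps(record)    # one JSON object, serialized onto a single line
restored = json.loads(line)  # parsed back into Python lists and dicts

print(line)
print(restored["tags"][0], restored["usage"]["output"])  # news 40
```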

A worked example

The editor on the right writes three JSONL records and reads them back:

import json
from pathlib import Path

data = [
    {"id": 1, "prompt": "summarize", "tokens": 240},
    {"id": 2, "prompt": "translate", "tokens": 180},
    {"id": 3, "prompt": "rewrite", "tokens": 95},
]

path = Path("/tmp/log.jsonl")
path.write_text("\n".join(json.dumps(d) for d in data) + "\n")

with open(path) as f:
    for line in f:
        row = json.loads(line)
        print(row["id"], row["prompt"], row["tokens"])

The write step uses a "\n".join(...) trick to format the file in one go: dump each dict, glue them with newlines, and append a final "\n" so the last line is properly terminated.

The read step iterates lines, parses each, and prints three columns of structured output:

1 summarize 240
2 translate 180
3 rewrite 95

Where AI specifically gets this wrong

Three patterns to watch for in code Cursor writes you.

One: confusing JSON and JSONL. Cursor will sometimes write json.load(f) against a JSONL file and get a JSONDecodeError: Extra data, pointing at line 2, column 1. The message is cryptic, but it means the parser finished the first object and then found a second one where the document should have ended. The fix is the line-by-line shape above. If a file ends in .jsonl (or even just has multiple lines that each end in }), it's not one JSON document.
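To see the failure and the fix side by side, here is a short sketch. It uses an in-memory string and io.StringIO as a stand-in for a real file, purely so it runs without touching disk:

```python
import io
import json

jsonl_text = '{"id": 1}\n{"id": 2}\n'

# Treating the whole text as one JSON document fails after the first object.
try:
    json.loads(jsonl_text)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(error_message)  # Extra data: line 2 column 1 (char 10)

# Parsing line by line, the way a file handle is iterated, works.
rows = [json.loads(line) for line in io.StringIO(jsonl_text)]
print(rows)  # [{'id': 1}, {'id': 2}]
```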

Two: forgetting the trailing \n. Some tools strip blank lines, some don't. Some readers tolerate a missing final newline, some don't. Always end JSONL files with \n — it's free safety. The "\n".join + "\n" shape in the example handles it.
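One concrete way a missing final newline bites is appending. In this sketch (a throwaway temp file, made-up records) the first write leaves off the "\n", and the next append lands on the same line:

```python
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.jsonl")

# Write one record WITHOUT a trailing newline (the mistake).
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": 1}))

# A later append is glued onto the end of the first record.
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"id": 2}) + "\n")

with open(path, encoding="utf-8") as f:
    lines = f.readlines()

print(lines)  # ['{"id": 1}{"id": 2}\n']
```

That single line holds two objects back to back, so json.loads rejects it and both records are effectively lost to a line-by-line reader.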

Three: hand-rolling the parser. When AI doesn't know the file is JSONL, it sometimes writes f.read().split("\n") and tries to json.loads each piece. That has two problems. It reads the whole file into memory, which throws away the streaming benefit. And a properly terminated file ends in \n, so the split produces a final empty string, and json.loads("") raises JSONDecodeError. Embedded newlines are not the danger here: JSON does not allow a raw newline inside a string, and json.dumps writes one as the two-character escape \n, so a record written with json.dumps always stays on one line. Iterating the file handle directly gives you one record per line, one at a time.
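A quick check of what json.dumps does with a newline inside a value, and of what split("\n") leaves behind after the final newline. The record is made up for the demo:

```python
import json

record = {"id": 1, "text": "first line\nsecond line"}
line = json.dumps(record)

# The real newline in the value is written as the two characters \ and n.
print("\n" in line)   # False
print("\\n" in line)  # True

file_text = line + "\n"         # a one-record JSONL file, properly terminated
pieces = file_text.split("\n")  # the hand-rolled approach
print(len(pieces), repr(pieces[-1]))  # 2 ''

# The trailing empty piece is not valid JSON.
try:
    json.loads(pieces[-1])
    failed = False
except json.JSONDecodeError:
    failed = True
print(failed)  # True
```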

Run the editor. Three records in, three records out, all streamed.