Phase 1: Transcript Attribution Pipeline
The first challenge: raw transcripts from subtitle sites have no speaker labels. Every line is just text with a timestamp. To build a social graph, I need to know who said what.
The pipeline
Built two Python scripts:
- scrape.py - Pulls raw transcripts from subslikescript.com
- attribute.py - Uses AI to attribute each line to a speaker
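The scraping half can be sketched with the standard library alone. The `full-script` container class below is an assumption about subslikescript.com's markup, not something verified against the live site:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TranscriptExtractor(HTMLParser):
    """Collects text nodes inside the transcript container.

    The "full-script" class name is a guess at subslikescript.com's
    markup; adjust it after inspecting the real page.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth once inside the container
        self.chunks = []    # cleaned dialogue lines

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "") or ""
        if self.depth or "full-script" in classes.split():
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_lines(html: str) -> list[str]:
    """Return the cleaned transcript lines found in a page's HTML."""
    parser = TranscriptExtractor()
    parser.feed(html)
    return parser.chunks


def scrape(url: str) -> list[str]:
    """Fetch a transcript page and extract its lines."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req, timeout=30).read().decode("utf-8", "replace")
    return extract_lines(html)
```

Keeping `extract_lines` separate from the network fetch makes the parsing logic testable against saved HTML fixtures.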
The attribution works surprisingly well because:
- The AI knows The Wire extremely well from training data
- Speech patterns are distinctive per character (Omar’s poetic threats vs Bunk’s exasperated sighs)
- Episode context helps disambiguate similar-sounding characters
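The attribution step boils down to building a prompt and parsing the reply. A minimal sketch, where `build_prompt` and `parse_response` are hypothetical helpers and the actual model call (e.g. via the Anthropic messages API) is deliberately left out:

```python
import json


def build_prompt(episode_id: str, lines: list[str]) -> str:
    """Number each transcript line and ask for speaker attributions."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(lines))
    return (
        f"These are unattributed dialogue lines from The Wire, {episode_id}.\n"
        "Identify the speaker of each line. Respond with a JSON array of\n"
        'objects: {"line_num": int, "speaker": str, "text": str}. Use\n'
        '"Unknown" when you cannot tell.\n\n'
        f"{numbered}"
    )


def parse_response(raw: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array in model response")
    return json.loads(raw[start : end + 1])
```

Asking for an explicit `"Unknown"` label keeps failed attributions countable, which is where the per-episode attribution rate comes from.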
Sample output
```json
{
  "episode": "S01E01",
  "title": "The Target",
  "total_lines": 571,
  "attributed": 568,
  "attribution_rate": "99.5%",
  "lines": [
    {"line_num": 1, "speaker": "McNulty", "text": "So, your boy's name is what?"},
    {"line_num": 2, "speaker": "Witness", "text": "Snot."},
    {"line_num": 3, "speaker": "McNulty", "text": "You called the guy Snot?"}
  ]
}
```
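The summary fields can be recomputed from the attributed lines themselves. A sketch, assuming failed attributions are marked with the speaker `"Unknown"` (an assumption about how the pipeline flags them):

```python
def summarize(episode: dict) -> dict:
    """Recompute the summary fields from an episode's attributed lines.

    Assumes lines the model could not attribute carry the speaker
    "Unknown"; everything else counts as attributed.
    """
    lines = episode["lines"]
    attributed = sum(1 for line in lines if line["speaker"] != "Unknown")
    return {
        "total_lines": len(lines),
        "attributed": attributed,
        "attribution_rate": f"{100 * attributed / len(lines):.1f}%",
    }
```

With 568 of 571 lines attributed, this yields the 99.5% shown above (568/571 is 99.47%, rounded to one decimal).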
Cost
Running Sonnet across all 60 episodes would cost roughly $15-25 in API fees. Switching to Haiku brings that down to $2-3.
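Those figures come from a back-of-envelope calculation along these lines. The per-million-token prices, tokens-per-line count, and the 2x output-size ratio below are all illustrative assumptions, not current list prices:

```python
def estimate_cost(
    episodes: int,
    lines_per_episode: int,
    tokens_per_line: int,       # assumed average, not measured
    price_in_per_mtok: float,   # assumed input price per million tokens
    price_out_per_mtok: float,  # assumed output price per million tokens
) -> float:
    """Rough API cost in dollars for attributing every episode.

    Input tokens cover the raw transcript; output is assumed to be
    about 2x the input, since the JSON reply repeats each line plus
    a speaker label.
    """
    in_tokens = episodes * lines_per_episode * tokens_per_line
    out_tokens = 2 * in_tokens
    return (
        in_tokens * price_in_per_mtok + out_tokens * price_out_per_mtok
    ) / 1_000_000
```

For example, 60 episodes of ~571 lines at ~15 tokens per line lands in the tens-of-dollars range at Sonnet-class prices, and roughly an order of magnitude lower at Haiku-class prices.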
Next steps
- Batch process all 60 episodes
- Design the Neo4j schema
- Start loading data