Phase 1: Transcript Attribution Pipeline
The first challenge: raw transcripts from subtitle sites have no speaker labels. Every line is just text with a timestamp. To build a social graph, I need to know who said what.
The pipeline
Built two Python scripts:
- scrape.py - Pulls raw transcripts from subslikescript.com
- attribute.py - Uses AI to attribute each line to a speaker
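The scraping half can be sketched with the standard library alone. The `full-script` container class below is an assumption about subslikescript.com's markup, not something verified against the live site:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TranscriptExtractor(HTMLParser):
    """Collects text nodes inside the transcript container.

    The "full-script" class name is a guess at subslikescript.com's
    markup; adjust it after inspecting the real page.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth once inside the container
        self.chunks = []    # cleaned dialogue lines

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "") or ""
        if self.depth or "full-script" in classes.split():
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_lines(html: str) -> list[str]:
    """Return the cleaned transcript lines found in a page's HTML."""
    parser = TranscriptExtractor()
    parser.feed(html)
    return parser.chunks


def scrape(url: str) -> list[str]:
    """Fetch a transcript page and extract its lines."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urlopen(req, timeout=30).read().decode("utf-8", "replace")
    return extract_lines(html)
```

Keeping `extract_lines` separate from the network fetch makes the parsing logic testable against saved HTML fixtures.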
The attribution works surprisingly well because:
- The AI knows The Wire extremely well from training data
- Speech patterns are distinctive per character (Omar’s poetic threats vs Bunk’s exasperated sighs)
- Episode context helps disambiguate similar-sounding characters
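The attribution step boils down to building a prompt and parsing the reply. A minimal sketch, where `build_prompt` and `parse_response` are hypothetical helpers and the actual model call (e.g. via the Anthropic messages API) is deliberately left out:

```python
import json


def build_prompt(episode_id: str, lines: list[str]) -> str:
    """Number each transcript line and ask for speaker attributions."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(lines))
    return (
        f"These are unattributed dialogue lines from The Wire, {episode_id}.\n"
        "Identify the speaker of each line. Respond with a JSON array of\n"
        'objects: {"line_num": int, "speaker": str, "text": str}. Use\n'
        '"Unknown" when you cannot tell.\n\n'
        f"{numbered}"
    )


def parse_response(raw: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating surrounding prose."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array in model response")
    return json.loads(raw[start : end + 1])
```

Asking for an explicit `"Unknown"` label keeps failed attributions countable, which is where the per-episode attribution rate comes from.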
Sample output
```json
{
  "episode": "S01E01",
  "title": "The Target",
  "total_lines": 571,
  "attributed": 568,
  "attribution_rate": "99.5%",
  "lines": [
    {"line_num": 1, "speaker": "McNulty", "text": "So, your boy's name is what?"},
    {"line_num": 2, "speaker": "Witness", "text": "Snot."},
    {"line_num": 3, "speaker": "McNulty", "text": "You called the guy Snot?"}
  ]
}
```
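The summary fields can be recomputed from the attributed lines themselves. A sketch, assuming failed attributions are marked with the speaker `"Unknown"` (an assumption about how the pipeline flags them):

```python
def summarize(episode: dict) -> dict:
    """Recompute the summary fields from an episode's attributed lines.

    Assumes lines the model could not attribute carry the speaker
    "Unknown"; everything else counts as attributed.
    """
    lines = episode["lines"]
    attributed = sum(1 for line in lines if line["speaker"] != "Unknown")
    return {
        "total_lines": len(lines),
        "attributed": attributed,
        "attribution_rate": f"{100 * attributed / len(lines):.1f}%",
    }
```

With 568 of 571 lines attributed, this yields the 99.5% shown above (568/571 is 99.47%, rounded to one decimal).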
Cost
Running Sonnet across all 60 episodes would cost roughly $15-25 in API fees. Switching to Haiku brings that down to $2-3.
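Those figures come from a back-of-envelope calculation along these lines. The per-million-token prices, tokens-per-line count, and the 2x output-size ratio below are all illustrative assumptions, not current list prices:

```python
def estimate_cost(
    episodes: int,
    lines_per_episode: int,
    tokens_per_line: int,       # assumed average, not measured
    price_in_per_mtok: float,   # assumed input price per million tokens
    price_out_per_mtok: float,  # assumed output price per million tokens
) -> float:
    """Rough API cost in dollars for attributing every episode.

    Input tokens cover the raw transcript; output is assumed to be
    about 2x the input, since the JSON reply repeats each line plus
    a speaker label.
    """
    in_tokens = episodes * lines_per_episode * tokens_per_line
    out_tokens = 2 * in_tokens
    return (
        in_tokens * price_in_per_mtok + out_tokens * price_out_per_mtok
    ) / 1_000_000
```

For example, 60 episodes of ~571 lines at ~15 tokens per line lands in the tens-of-dollars range at Sonnet-class prices, and roughly an order of magnitude lower at Haiku-class prices.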
Next steps
- Batch process all 60 episodes
- Design the Neo4j schema
- Start loading data