Methodology — BOTWAVEBOMBA

What BOTWAVEBOMBA Is

BOTWAVEBOMBA is the public face of a private journalism research infrastructure called TELOS+PAI. Every article shown here was ingested by global_ingestor.py, clustered across information blocs, and tagged with a per-source five-axis bias fingerprint. The pipeline runs every six hours on a systemd timer. (bomba_pipeline.sh, botwave-bomba-pipeline.timer: OnCalendar=*-*-* 00,06,12,18:00:00)

Of the 492 sources in the registry, 244 are deep-fingerprinted across the five bias axes (api/pipeline_state.json: .source_counts.fingerprinted) and 248 are awaiting fingerprinting (api/pipeline_state.json: .source_counts.awaiting_fingerprinting). The substrate refreshes every six hours, while fingerprinting proceeds at 20 sources per month; full parity (492/492) is targeted for Q4 2026 (see _index/receipts/ for the per-month cadence receipts). Every source is classified by bloc (Western / Adversarial / Non-Aligned), factuality (high / mixed / low), and primary-vs-launderer status.

The five-axis fingerprints are per-source, computed at fingerprinting time and looked up at render time. Story-level scores (the bloc coverage percentages you see in the feed) are aggregations across the source mix, not fresh per-article inference. The substrate is a coverage-and-bias comparison engine — the per-article sentence-level framing pass is on the decomposition roadmap (see api/pipeline_state.json: .decomposition.modules_extracted) and is not in the live pipeline yet.
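To make the lookup-then-aggregate idea concrete, here is a minimal sketch. It is not the code in generate_feed.py: the registry shape (a "bloc" field plus an "axes" block), the source ids, and every score below are assumptions invented for illustration.

```python
from statistics import mean

AXES = ("atlanticist", "interventionist", "zionist", "statist", "financialized")

# Hypothetical registry excerpt. Real entries live in data/source_registry.json;
# the field names and every value here are illustrative assumptions.
REGISTRY = {
    "source_a": {"bloc": "western", "axes": {
        "atlanticist": 0.9, "interventionist": 0.5, "zionist": 0.4,
        "statist": 0.2, "financialized": 0.6}},
    "source_b": {"bloc": "western", "axes": {
        "atlanticist": 0.6, "interventionist": 0.1, "zionist": 0.0,
        "statist": 0.3, "financialized": 0.8}},
    "source_c": {"bloc": "adversarial", "axes": {
        "atlanticist": -0.6, "interventionist": -0.4, "zionist": -0.7,
        "statist": 0.7, "financialized": -0.2}},
}


def aggregate_story(source_ids, registry):
    """Aggregate per-source fingerprints for one story by lookup only.

    Nothing is inferred from article text; sources missing from the
    registry are skipped.
    """
    known = [registry[s] for s in source_ids if s in registry]
    if not known:
        return {"bloc_pct": {}, "axis_avg": {}}
    counts = {}
    for entry in known:
        counts[entry["bloc"]] = counts.get(entry["bloc"], 0) + 1
    bloc_pct = {b: round(100 * n / len(known), 1) for b, n in counts.items()}
    axis_avg = {a: round(mean(e["axes"][a] for e in known), 3) for a in AXES}
    return {"bloc_pct": bloc_pct, "axis_avg": axis_avg}
```

A story covered by two western sources and one adversarial source gets its percentages and axis averages purely from the registry, so re-running the aggregation never changes a source's fingerprint.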

Why Not Left / Center / Right?

Ground News, AllSides, and similar services organize sources on a left–center–right axis calibrated to American domestic politics. That axis is useful in a country where the main conflict is between two parties managing the same system. It is useless when you are trying to understand what the Iranian press says about the Strait of Hormuz, what the Swedish press says about NATO, or what the Chinese press says about Taiwan.

BOTWAVEBOMBA uses five independent axes, each scored from -1.0 to +1.0 (per-source, in data/source_registry.json under each entry's axis block):

Atlanticist — does this outlet assume US-led world order is legitimate?
Interventionist — does this outlet support military intervention?
Zionist — how does this outlet frame Israeli state action?
Statist — does this outlet reinforce or challenge state authority?
Financialized — does this outlet treat financial capitalism as natural?

These axes produce a five-dimensional fingerprint per source — not a simple left/right label. A source can be anti-interventionist and pro-state simultaneously (much of the RT editorial line). A source can be pro-market and anti-zionist simultaneously (some European financial press on Israel/Palestine coverage). The fingerprint captures that complexity. A label cannot.
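A five-axis fingerprint can be sketched as a small validated record. The class name, the sign convention (positive means the outlet leans toward the axis name), and the example values are assumptions for illustration, not the registry's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical per-source fingerprint; each axis is scored in [-1.0, +1.0]."""
    atlanticist: float
    interventionist: float
    zionist: float
    statist: float
    financialized: float

    def __post_init__(self):
        # Reject any axis value outside the documented range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{f.name}={value} is outside [-1.0, +1.0]")


# Illustrative values only, not real registry scores: an outlet that is
# anti-interventionist and pro-state at the same time.
example = Fingerprint(
    atlanticist=-0.8,
    interventionist=-0.6,
    zionist=-0.3,
    statist=0.7,
    financialized=0.1,
)
```

No single left/right number can carry both the negative interventionist score and the positive statist score in this record, which is the point of keeping the axes independent.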

What The Pipeline Does Today

The pipeline currently runs in two analytical stages, plus deployment plumbing. The full seven-stage decomposition described historically on this page is in progress — see the roadmap below.

Stage 1 — Ingest. global_ingestor.py pulls RSS feeds and full article text (via httpx + readability-lxml) from the 1,651-source ingest set, deduplicates by content hash, writes to news_cache.jsonl. (book_arm/pai_modules/global_ingestor.py:46-55)
Stage 2 — Generate feed (monolith). generate_feed.py reads news_cache.jsonl + data/source_registry.json, clusters articles by multi-entity overlap, attaches per-source five-axis bias data via lookup, computes per-cluster bloc-coverage percentages (Western / Neutral / Adversarial), flags blindspots (Blackout / Spotlight thresholds), and serializes the result to api/latest.json — the JSON this site reads. Story-card PNGs are rendered by generate_cards.py for the top stories. After that the staging Pages repo is committed and pushed. Discord and Telegram digests follow. (zombie760.github.io/scripts/generate_feed.py + scripts/bomba_pipeline.sh)
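The content-hash deduplication in Stage 1 can be sketched as follows. This is not the code in global_ingestor.py: the choice of SHA-256, the whitespace and case normalization, and the "text" and "content_hash" field names are all assumptions.

```python
import hashlib
import json


def content_hash(text: str) -> str:
    """Hash article text after collapsing whitespace and lowercasing (assumed normalization)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(articles):
    """Keep the first article seen for each content hash, preserving input order."""
    seen = set()
    unique = []
    for article in articles:
        digest = content_hash(article["text"])
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**article, "content_hash": digest})
    return unique


def to_jsonl(articles) -> str:
    """Serialize one JSON object per line, the layout a .jsonl cache file uses."""
    return "\n".join(json.dumps(a, ensure_ascii=False) for a in articles)
```

Hashing the normalized body rather than the URL means the same wire story republished under two URLs collapses to one cache entry, at the cost of treating lightly edited copies as distinct.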

The monolith does inline what the seven-stage decomposition will do as discrete modules. Behaviors present (entity clustering, per-source bias lookup, bloc-coverage computation, blindspot flagging) are all in generate_feed.py. Behaviors named on the historical version of this page but not present anywhere in the live pipeline (per-article propaganda-lexicon scoring, agency-verb analysis, sentence-level verb choice diff) are roadmap, not current.
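Entity clustering, the first of the behaviors listed above, can be sketched like this. The thresholds (at least 3 articles from at least 2 sources) come from the event_clusterer.py roadmap entry; the function name, the "entities" and "source" fields, and the omission of the single-entity fallback are assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

MIN_ARTICLES = 3  # threshold quoted in the event_clusterer.py roadmap entry
MIN_SOURCES = 2   # threshold quoted in the event_clusterer.py roadmap entry


def cluster_by_entity_pairs(articles):
    """Group articles that share a pair of named entities.

    Each article is a dict with hypothetical "source" and "entities" keys.
    An article may land in more than one cluster; resolving that overlap
    is left out of this sketch.
    """
    buckets = defaultdict(list)
    for article in articles:
        # Sort so ("Iran", "Hormuz") and ("Hormuz", "Iran") share one key.
        for pair in combinations(sorted(set(article["entities"])), 2):
            buckets[pair].append(article)

    clusters = {}
    for pair, members in buckets.items():
        sources = {m["source"] for m in members}
        if len(members) >= MIN_ARTICLES and len(sources) >= MIN_SOURCES:
            clusters[pair] = members
    return clusters
```

Requiring two distinct sources keeps a single outlet that publishes three updates on one event from forming a cluster on its own.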

Decomposition Roadmap

The plan: extract one analytical stage from generate_feed.py per calendar week, with fixture-equivalence to the monolith as the acceptance test. After six weeks the monolith is six discrete modules; a seventh week replaces the glue with a proper orchestrator. Full plan: PLAN_HYBRID.md. Live state: pipeline_state.json.
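Fixture-equivalence as an acceptance test can be sketched as a field-by-field comparison between the monolith's output and the extracted module's output on the same fixture. The function name and the "id" key are hypothetical; the real tests live under tests/ and may be structured differently.

```python
def diverging_story_ids(baseline, candidate, fields):
    """Return the ids of stories whose compared fields differ between two runs.

    baseline and candidate are lists of story dicts keyed by a hypothetical
    "id" field. A story present in only one run counts as diverged.
    """
    base_by_id = {s["id"]: s for s in baseline}
    cand_by_id = {s["id"]: s for s in candidate}
    diverged = []
    for story_id in sorted(base_by_id.keys() | cand_by_id.keys()):
        a = base_by_id.get(story_id)
        b = cand_by_id.get(story_id)
        if a is None or b is None or any(a.get(f) != b.get(f) for f in fields):
            diverged.append(story_id)
    return diverged
```

An extraction passes when the returned list is empty for the fields that module owns, which is what a result such as "0/160 stories diverged" reports.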

The roadmap entries below reflect pipeline_state.json at page-render time. The status badges flip as each weekly extraction lands. The status page shows the same data live and refreshes automatically.

[discrete] global_ingestor.py: pulls RSS + article full-text from 1,651 sources, dedup by hash, writes news_cache.jsonl. Pre-existing; not part of the decomposition. (book_arm/pai_modules/global_ingestor.py)
[extracted] event_clusterer.py: groups articles about the same event using entity-pair overlap (primary path, ≥3 articles, ≥2 sources) and single-entity fallback. Extracted 2026-05-10 from generate_feed.py:538-623; fixture-equivalence verified (160/160 stories identical between pre- and post-extraction outputs). Module lives at book_arm/pai_modules/event_clusterer.py; schemas at schemas/event_clusterer_{input,output}.schema.json; fixture-equivalence test at tests/test_event_clusterer.py.
[extracted] bias_scorer.py: attaches per-source bias enrichment to each cluster (bias_tier, bias_bucket, bloc, geo_cluster, atlanticist_norm) and computes per-cluster bias_variance (stdev of atlanticist scores) + five-axis averages (interventionist, zionist, atlanticist, statist, financialized). Per-source LOOKUP from data/source_registry.json, not per-article propaganda-lexicon scoring. Extracted 2026-05-10 from generate_feed.py:589-622 + :667-677; fixture-equivalence verified (0/160 stories diverged in bias data). Original single-axis -6..+6 propaganda-lexicon standalone reference preserved at book_arm/pai_modules/_reference/bias_scorer.py.original; sentence-level lexicon scoring remains a methodology-roadmap item, separate from this extraction.
[extracted] framing_differ.py: for each cluster, picks the consensus framing (article whose headline shares the most meaningful words with the rest of the cluster), builds per-article framing cards (headline + lede via get_snippet + source metadata), and selects the cluster's hero image and video. Extracted 2026-05-10 from generate_feed.py:474-513 + :29-44 + :644-657. Module lives at book_arm/pai_modules/framing_differ.py; tests at tests/test_framing_differ.py. Sentence-level verb-choice and named-entity diff (richer analysis described historically) remain a methodology-roadmap item, not part of this extraction. Fixture-equivalence verified (0/160 stories diverged).
[extracted] blindspot_analyzer.py: applies the Ground News-style left/center/right coverage breakdown and the Ground News blindspot formula (one side <17% AND other side ≥33%) per cluster. Also runs the Western Mono-Frame / Blackout geo-frame detection over geopolitical watchlists (Middle East, cartel/intel, US foreign policy, Africa-suppressed). Extracted 2026-05-10 from generate_feed.py:291-401 + :51-101 + :606-614 + :681-689. Writes blindspots.jsonl. Fixture-equivalence verified (0/160 diff on is_blindspot/blindspot_score/coverage/geo_frame).
[extracted] coverage_mapper.py: pure projection of the heatmap-distribution fields (left_pct, center_pct, right_pct, state_count, dominant_bucket, dominant_pct) from blindspot_analyzer's combined output into a standalone coverage.jsonl artifact. Sharing compute_coverage across the two modules avoids duplicate code paths; coverage_mapper depends on blindspot_analyzer's output, not its own independent computation. Extracted 2026-05-10 from generate_feed.py:702-709.
[extracted] broadcast.py: serializes pipeline output to api/latest.json and api/blindspots.json. Extracted 2026-05-11 from generate_feed.py:770-778 (output dict build + JSON write). Thin wrapper: all analytical computation happens upstream; this module only writes. Fixture-equivalence verified: JSON serialization output identical between baseline and extracted module. (book_arm/pai_modules/broadcast.py)
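The blindspot threshold quoted in the blindspot_analyzer.py entry (one side <17% AND other side ≥33%) can be written out as a short sketch. Treating "sides" as left and right and ignoring the center share is an assumption, as are the function names; coverage_pct below is a stand-in, not the module's compute_coverage.

```python
LOW_THRESHOLD = 17.0   # a side below this share is under-covering the story
HIGH_THRESHOLD = 33.0  # a side at or above this share is covering it heavily


def coverage_pct(bucket_counts):
    """Turn left/center/right article counts into percentages (hypothetical helper)."""
    total = sum(bucket_counts.get(b, 0) for b in ("left", "center", "right"))
    if total == 0:
        return {"left_pct": 0.0, "center_pct": 0.0, "right_pct": 0.0}
    return {
        f"{b}_pct": round(100 * bucket_counts.get(b, 0) / total, 1)
        for b in ("left", "center", "right")
    }


def is_blindspot(left_pct, right_pct):
    """True when one side is below 17% while the other is at or above 33%."""
    left_blind = left_pct < LOW_THRESHOLD and right_pct >= HIGH_THRESHOLD
    right_blind = right_pct < LOW_THRESHOLD and left_pct >= HIGH_THRESHOLD
    return left_blind or right_blind
```

For a cluster with 1 left, 3 center, and 4 right articles, the left share is 12.5% and the right share is 50.0%, so the sketch flags a blindspot on the left.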

All six modules have flipped to extracted. Next: wire the modules behind a proper orchestrator (run_pipeline.py) that emits a per-cycle pipeline_run.json the status page reads for live per-stage health. The orchestrator schema is already drafted: schemas/pipeline_run.schema.json.
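One possible shape for such an orchestrator is sketched below. It is an assumption throughout: the report fields ("started_at", "stages", "ok") are invented for illustration and do not reflect schemas/pipeline_run.schema.json, and the stage callables are placeholders rather than the extracted modules.

```python
import json
import time
from datetime import datetime, timezone


def run_pipeline(stages, payload):
    """Run (name, callable) stages in order and record per-stage health.

    Stops at the first failing stage and returns the last good payload
    together with a report dict. All field names are hypothetical.
    """
    report = {"started_at": datetime.now(timezone.utc).isoformat(), "stages": []}
    for name, stage in stages:
        start = time.monotonic()
        try:
            payload = stage(payload)
            status, error = "ok", None
        except Exception as exc:  # record the failure instead of crashing the cycle
            status, error = "failed", repr(exc)
        report["stages"].append({
            "name": name,
            "status": status,
            "error": error,
            "seconds": round(time.monotonic() - start, 3),
        })
        if status == "failed":
            break
    report["ok"] = all(s["status"] == "ok" for s in report["stages"])
    return payload, report


def report_json(report) -> str:
    """Serialize the report as it might be written to a per-cycle JSON file."""
    return json.dumps(report, indent=2)
```

Recording a failed stage and stopping, rather than raising, would let a status page show which stage broke in the latest cycle while the previous api/latest.json stays in place.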

What We Do Not Do

We do not editorialize. The bloc-coverage comparison shows the delta; you read it.
We do not write summaries. Every summary field is the article's own description, truncated. (generate_feed.py:631: summary = headline_art.get('description', ''))
We do not rate factuality of individual claims. Source-level factuality labels appear in source_registry.json under each source's factuality field (values: high / mixed / low). The labels draw on MBFC where available, supplemented by hand-curation. Per-source provenance trace is on the methodology-audit roadmap.
We do not suppress adversarial sources. rt, rt_arabic, sputnik, tass, and other adversarial-bloc outlets are in source_registry.json with bloc: "adversarial" labels visible.
We do not invent. Every claim traces to an article URL surfaced under sources[].url in api/latest.json.

The Journalism Connection

BOTWAVEBOMBA is the public surface of a private investigative research substrate that supports long-form journalism. The same ingest layer (global_ingestor.py → book_arm/memory/news_cache.jsonl) feeds both the public site and the book-arm's primary-source discovery for book-length investigation — the kind that requires knowing not just what AP reported, but what IRNA, Lenta.ru, and SCMP said on the same day about the same event, in their original framing, before any translation layer added editorial distance. Both sides consume from the 1,651-source ingest set.

The methodology is the same as the journalism: primary sources or nothing. Every source URL is live. Every score is computed or looked up from a versioned registry, not assigned post-hoc. Every blindspot is measured against the cache, not asserted.

BE UNDENIABLE. Every claim filed. Every source named. A single unanchored assertion is the lever a critic uses to dismiss the whole work. We do not write one.

Bias Scoring Baseline

The five-axis fingerprints in source_registry.json were hand-curated, drawing on AllSides and MBFC bias ratings where available. Per-source provenance (which axis value for which source came from which input) is not yet machine-traceable as a single artifact — documenting it is on the methodology-audit roadmap.

The current registry ships 244 deep-fingerprinted sources with full five-axis bias data, bloc classification, factuality ratings, and primary-vs-launderer tags (api/pipeline_state.json: .source_counts.fingerprinted). The original 244-source registry was introduced in commit c6a088bf; the next 248 sources are in the fingerprinting queue at a cadence of 20 sources/month, with full parity targeted for Q4 2026. Per-source provenance is published on /sources.html with each source card; the underlying machine-traceable artifact is on the methodology-audit roadmap.

The hand-curation methodology (editorial stance analysis, entity framing patterns, named-state-alignment patterns, institutional-alignment signals) is summarized in the methodology audit (audit/about_audit_2026-05-10.md). It is not duplicated here, to avoid drift between the page and the audit.