LAST METHODOLOGY AUDIT — 2026-05-10
Pipeline decomposition: 6 of 6 modules extracted (event_clusterer + bias_scorer + framing_differ + blindspot_analyzer + coverage_mapper + broadcast — all extracted, all fixture-equivalence verified). The monolith is now six discrete modules plus the pre-existing global_ingestor. Live pipeline state: /status.html. Audit trail: audit/about_audit_2026-05-10.md · plan: PLAN_HYBRID.md. Cannot-verify is a first-class output — every factual claim below references the file and line that backs it.

THE METHOD

Every system that tells you what to read has a hidden architecture. Here is ours. Every component named. Every assumption declared. Every claim sourced.

What BOTWAVEBOMBA Is

BOTWAVEBOMBA is the public face of a private journalism research infrastructure called TELOS+PAI. Every article shown here was ingested by global_ingestor.py, clustered across information blocs, and tagged with a per-source five-axis bias fingerprint. The pipeline runs every six hours on a systemd timer. (bomba_pipeline.sh, botwave-bomba-pipeline.timer: OnCalendar=*-*-* 00,06,12,18:00:00)

492 sources are ingested every six hours (book_arm/memory/sources_global.json). 496 carry full five-axis bias fingerprints (data/source_registry.json). Every source is classified by bloc (Western / Adversarial / Non-Aligned), factuality (high / mixed / low), and primary-vs-launderer status.

It is not a news aggregator in the Google News sense. It is not a fact-checker. It is a bloc-level coverage-and-bias comparison engine: which information bloc covered each story, where the volume gaps are, and how each source's five-axis bias profile maps across blocs. Sentence-level framing analysis (verb-choice, propaganda lexicon) is on the decomposition roadmap below — it is not in the live pipeline yet.

Why Not Left / Center / Right?

Ground News, AllSides, and similar services organize sources on a left–center–right axis calibrated to American domestic politics. That axis is useful in a country where the main conflict is between two parties managing the same system. It is useless when you are trying to understand what the Iranian press says about the Strait of Hormuz, what the Swedish press says about NATO, or what the Chinese press says about Taiwan.

BOTWAVEBOMBA uses five independent axes, each scored from -1.0 to +1.0 (per-source, in data/source_registry.json under each entry's axis block):

  • Atlanticist — does this outlet assume US-led world order is legitimate?
  • Interventionist — does this outlet support military intervention?
  • Zionist — how does this outlet frame Israeli state action?
  • Statist — does this outlet reinforce or challenge state authority?
  • Financialized — does this outlet treat financial capitalism as natural?

These axes produce a five-dimensional fingerprint per source — not a simple left/right label. A source can be anti-interventionist and pro-state simultaneously (much of the RT editorial line). A source can be pro-market and anti-zionist simultaneously (some European financial press on Israel/Palestine coverage). The fingerprint captures that complexity. A label cannot.
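
As an illustration, a fingerprint of this shape could be represented and range-checked as follows. This is a minimal sketch, not the registry's actual loader: the class name Fingerprint, the as_vector helper, and the example values are hypothetical. Only the five axis names and the -1.0 to +1.0 range come from the description above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Fingerprint:
    """Five independent axes, each expected in [-1.0, +1.0]."""
    atlanticist: float
    interventionist: float
    zionist: float
    statist: float
    financialized: float

    def __post_init__(self):
        # Reject out-of-range scores early so a bad registry row fails loudly.
        for f in fields(self):
            value = getattr(self, f.name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{f.name}={value} is outside [-1.0, 1.0]")

    def as_vector(self):
        """Return the axes in a fixed order for comparison across sources."""
        return tuple(getattr(self, f.name) for f in fields(self))


# Illustrative values only, not taken from data/source_registry.json.
example = Fingerprint(
    atlanticist=-0.8,
    interventionist=-0.6,
    zionist=-0.4,
    statist=0.7,
    financialized=0.1,
)
print(example.as_vector())
```

Keeping the axes as separate named fields, rather than collapsing them into one score, is what lets a source be anti-interventionist and pro-state at the same time.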

What The Pipeline Does Today

The pipeline currently runs in two analytical stages, plus deployment plumbing. The full seven-stage decomposition described historically on this page is in progress — see the roadmap below.

  1. Stage 1 — Ingest. global_ingestor.py pulls RSS feeds and full article text (via httpx + readability-lxml) from the 492-source ingest set, deduplicates by content hash, writes to news_cache.jsonl. (book_arm/pai_modules/global_ingestor.py:46-55)
  2. Stage 2 — Generate feed (monolith). generate_feed.py reads news_cache.jsonl + data/source_registry.json, clusters articles by multi-entity overlap, attaches per-source five-axis bias data via lookup, computes per-cluster bloc-coverage percentages (Western / Neutral / Adversarial), flags blindspots (Blackout / Spotlight thresholds), and serializes the result to api/latest.json — the JSON this site reads. Story-card PNGs are rendered by generate_cards.py for the top stories. After that the staging Pages repo is committed and pushed. Discord and Telegram digests follow. (zombie760.github.io/scripts/generate_feed.py + scripts/bomba_pipeline.sh)
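
Stage 1's deduplication by content hash could look roughly like the sketch below. The hash function (SHA-256), the whitespace and case normalization, the "text" field name, and both function names are assumptions for illustration. The text above says only that global_ingestor.py deduplicates by content hash.

```python
import hashlib


def content_hash(text: str) -> str:
    # Assumption: collapse whitespace and lowercase so trivial
    # formatting differences produce the same digest.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_by_content_hash(articles):
    """Keep the first article seen for each distinct content hash."""
    seen = set()
    unique = []
    for article in articles:
        digest = content_hash(article["text"])
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(article)
    return unique


# Hypothetical articles; the second differs from the first only in whitespace.
articles = [
    {"source": "a", "text": "Tanker seized in the strait."},
    {"source": "b", "text": "Tanker  seized in the strait.\n"},
    {"source": "c", "text": "Talks resume on Monday."},
]
unique_articles = dedupe_by_content_hash(articles)
print([a["source"] for a in unique_articles])
```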

The monolith does inline what the seven-stage decomposition will do as discrete modules. Behaviors present (entity clustering, per-source bias lookup, bloc-coverage computation, blindspot flagging) are all in generate_feed.py. Behaviors named on the historical version of this page but not present anywhere in the live pipeline (per-article propaganda-lexicon scoring, agency-verb analysis, sentence-level verb choice diff) are roadmap, not current.
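
In outline, the bloc-coverage computation named above is a per-cluster count of articles by source bloc, expressed as percentages. The sketch below assumes the registry maps a source id to a record with a "bloc" field and that percentages are rounded to one decimal. The function name, the sample registry rows, and the rounding are illustrative and may not match generate_feed.py.

```python
from collections import Counter

BLOCS = ("western", "neutral", "adversarial")


def bloc_coverage(cluster_source_ids, registry):
    """Return the share of a cluster's articles from each bloc, in percent."""
    counts = Counter(
        registry[source_id]["bloc"]
        for source_id in cluster_source_ids
        if source_id in registry  # unknown sources are skipped, not guessed
    )
    total = sum(counts.values())
    if total == 0:
        return {bloc: 0.0 for bloc in BLOCS}
    return {bloc: round(100.0 * counts[bloc] / total, 1) for bloc in BLOCS}


# Hypothetical registry rows and cluster membership.
registry = {
    "ap": {"bloc": "western"},
    "bbc": {"bloc": "western"},
    "tass": {"bloc": "adversarial"},
    "example_neutral": {"bloc": "neutral"},
}
coverage = bloc_coverage(["ap", "bbc", "tass", "example_neutral"], registry)
print(coverage)
```

Because the bloc comes from a per-source lookup, one article counts once toward its source's bloc. Nothing in the article text changes the percentages.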

Decomposition Roadmap

The plan: extract one analytical stage from generate_feed.py per calendar week, with fixture-equivalence to the monolith as the acceptance test. After six weeks the monolith is six discrete modules; a seventh week replaces the glue with a proper orchestrator. Full plan: PLAN_HYBRID.md. Live state: pipeline_state.json.
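
Fixture-equivalence as an acceptance test can be sketched as follows: run the monolith and the extracted module on the same fixture, then compare the outputs story by story. The sketch assumes each output is a list of story dicts keyed by an "id" field. The function name and sample data are hypothetical, and the real tests in the repository may compare different fields.

```python
def diverging_story_ids(baseline, candidate, key="id"):
    """Return ids of stories that differ between two pipeline outputs."""
    base_by_id = {story[key]: story for story in baseline}
    cand_by_id = {story[key]: story for story in candidate}
    diverged = []
    # Walk the union of ids so missing or extra stories also count as divergence.
    for story_id in sorted(base_by_id.keys() | cand_by_id.keys()):
        if base_by_id.get(story_id) != cand_by_id.get(story_id):
            diverged.append(story_id)
    return diverged


# Hypothetical fixture outputs: baseline from the monolith,
# candidates from an extracted module.
baseline = [{"id": "s1", "sources": 4}, {"id": "s2", "sources": 3}]
candidate_ok = [{"id": "s2", "sources": 3}, {"id": "s1", "sources": 4}]
candidate_bad = [{"id": "s1", "sources": 5}, {"id": "s2", "sources": 3}]

print(diverging_story_ids(baseline, candidate_ok))
print(diverging_story_ids(baseline, candidate_bad))
```

An extraction passes only when the returned list is empty, which corresponds to a "0 stories diverged" result.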

The roadmap entries below reflect pipeline_state.json at page-render time. The status badges flip as each weekly extraction lands. The status page (/status.html) shows the same data live and refreshes automatically.

  • discrete global_ingestor.py — pulls RSS + article full-text from 492 sources, dedup by hash, writes news_cache.jsonl. Pre-existing; not part of the decomposition. (book_arm/pai_modules/global_ingestor.py)
  • extracted event_clusterer.py — groups articles about the same event using entity-pair overlap (primary path, ≥3 articles ≥2 sources) and single-entity fallback. Extracted 2026-05-10 from generate_feed.py:538-623; fixture-equivalence verified (160/160 stories identical between pre- and post-extraction outputs). Module lives at book_arm/pai_modules/event_clusterer.py; schemas at schemas/event_clusterer_{input,output}.schema.json; fixture-equivalence test at tests/test_event_clusterer.py.
  • extracted bias_scorer.py — attaches per-source bias enrichment to each cluster (bias_tier, bias_bucket, bloc, geo_cluster, atlanticist_norm) and computes per-cluster bias_variance (stdev of atlanticist scores) + five-axis averages (interventionist, zionist, atlanticist, statist, financialized). Per-source LOOKUP from data/source_registry.json, not per-article propaganda-lexicon scoring. Extracted 2026-05-10 from generate_feed.py:589-622 + :667-677; fixture-equivalence verified (0/160 stories diverged in bias data). Original single-axis -6..+6 propaganda-lexicon standalone reference preserved at book_arm/pai_modules/_reference/bias_scorer.py.original — sentence-level lexicon scoring remains a methodology-roadmap item, separate from this extraction.
  • extracted framing_differ.py — for each cluster, picks the consensus framing (article whose headline shares the most meaningful words with the rest of the cluster), builds per-article framing cards (headline + lede via get_snippet + source metadata), and selects the cluster's hero image and video. Extracted 2026-05-10 from generate_feed.py:474-513 + :29-44 + :644-657. Module lives at book_arm/pai_modules/framing_differ.py; tests at tests/test_framing_differ.py. Sentence-level verb-choice and named-entity diff (richer analysis described historically) remain a methodology-roadmap item — not in this extraction. Fixture-equivalence verified (0/160 stories diverged).
  • extracted blindspot_analyzer.py — applies the Ground News-style left/center/right coverage breakdown and the Ground News blindspot formula (one side <17% AND other side ≥33%) per cluster. Also runs the Western Mono-Frame / Blackout geo-frame detection over geopolitical watchlists (Middle East, cartel/intel, US foreign policy, Africa-suppressed). Extracted 2026-05-10 from generate_feed.py:291-401 + :51-101 + :606-614 + :681-689. Writes blindspots.jsonl. Fixture-equivalence verified (0/160 diff on is_blindspot/blindspot_score/coverage/geo_frame).
  • extracted coverage_mapper.py — pure projection of the heatmap-distribution fields (left_pct, center_pct, right_pct, state_count, dominant_bucket, dominant_pct) from blindspot_analyzer's combined output into a standalone coverage.jsonl artifact. Sharing compute_coverage across the two modules avoids duplicate code paths; coverage_mapper depends on blindspot_analyzer's output, not its own independent computation. Extracted 2026-05-10 from generate_feed.py:702-709.
  • extracted broadcast.py — serializes pipeline output to api/latest.json and api/blindspots.json. Extracted 2026-05-11 from generate_feed.py:770-778 (output dict build + JSON write). Thin wrapper — all analytical computation happens upstream; this module only writes. Fixture-equivalence verified: JSON serialization output identical between baseline and extracted module. (book_arm/pai_modules/broadcast.py)
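
Two of the rules in the list above lend themselves to a short sketch: the entity-pair clustering threshold (≥3 articles from ≥2 sources) and the blindspot formula (one side <17% and the other side ≥33%). The code assumes articles arrive with a pre-extracted "entities" list and that the two "sides" are the left and right percentages. Function names, field names, and sample data are hypothetical, and the single-entity fallback is omitted.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_entity_pair(articles, min_articles=3, min_sources=2):
    """Group articles that share the same pair of entities (primary path only)."""
    by_pair = defaultdict(list)
    for article in articles:
        # Sort so ("Iran", "Hormuz") and ("Hormuz", "Iran") map to one key.
        for pair in combinations(sorted(set(article["entities"])), 2):
            by_pair[pair].append(article)
    clusters = {}
    for pair, members in by_pair.items():
        sources = {a["source"] for a in members}
        if len(members) >= min_articles and len(sources) >= min_sources:
            clusters[pair] = members
    return clusters


def is_blindspot(left_pct, right_pct, low=17.0, high=33.0):
    """True when one side is under `low` percent and the other is at or above `high`."""
    return (left_pct < low and right_pct >= high) or (
        right_pct < low and left_pct >= high
    )


# Hypothetical articles with pre-extracted entities.
articles = [
    {"source": "ap", "entities": ["Iran", "Hormuz"]},
    {"source": "tass", "entities": ["Hormuz", "Iran", "Oman"]},
    {"source": "ap", "entities": ["Iran", "Hormuz"]},
    {"source": "scmp", "entities": ["Taiwan", "China"]},
]
clusters = cluster_by_entity_pair(articles)
print(list(clusters))
print(is_blindspot(left_pct=10.0, right_pct=40.0))
```

In this simplified form an article can belong to more than one cluster if it shares several qualifying entity pairs. How the real module resolves such overlaps is not described above.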

All six modules have flipped to extracted. Next: wire the modules behind a proper orchestrator (run_pipeline.py) that emits a per-cycle pipeline_run.json the status page reads for live per-stage health. The orchestrator schema is already drafted: schemas/pipeline_run.schema.json.
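
A minimal sketch of what such an orchestrator might do: run the stages in order, record per-stage status and duration, and stop at the first failure. Every field name in the report is hypothetical, since schemas/pipeline_run.schema.json is not reproduced on this page, and the stage functions are stand-ins rather than the real modules.

```python
import json
import time


def run_pipeline(stages, payload):
    """Run (name, callable) stages in order and collect per-stage health."""
    report = {"stages": [], "ok": True}
    for name, stage in stages:
        started = time.monotonic()
        try:
            payload = stage(payload)
            status, error = "ok", None
        except Exception as exc:  # record the failure instead of crashing the cycle
            status, error = "failed", str(exc)
        report["stages"].append({
            "name": name,
            "status": status,
            "error": error,
            "seconds": round(time.monotonic() - started, 3),
        })
        if status == "failed":
            report["ok"] = False
            break  # downstream stages depend on this stage's output
    return payload, report


def failing_stage(_payload):
    raise RuntimeError("registry not found")


# Stand-in stages; the real modules would be imported from book_arm/pai_modules.
stages = [
    ("event_clusterer", lambda articles: [articles]),
    ("bias_scorer", failing_stage),
    ("broadcast", lambda clusters: clusters),
]
_, run_report = run_pipeline(stages, ["article-1", "article-2"])
print(json.dumps(run_report, indent=2))
```

Stopping at the first failed stage keeps a broken intermediate result from reaching the published output, and the report still shows which stage failed and why.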

What We Do Not Do

  • We do not editorialize. The bloc-coverage comparison shows the delta; you read it.
  • We do not write summaries. Every summary field is the article's own description, truncated. (generate_feed.py:631: summary = headline_art.get('description', ''))
  • We do not rate factuality of individual claims. Source-level factuality labels appear in source_registry.json under each source's factuality field (values: high / mixed / low). The labels draw on MBFC where available, supplemented by hand-curation. Per-source provenance trace is on the methodology-audit roadmap.
  • We do not suppress adversarial sources. rt, rt_arabic, sputnik, tass, and other adversarial-bloc outlets are in source_registry.json with bloc: "adversarial" labels visible.
  • We do not invent. Every claim traces to an article URL surfaced under sources[].url in api/latest.json.

The Journalism Connection

BOTWAVEBOMBA is the public surface of a private investigative research substrate that supports long-form journalism. The same ingest layer (global_ingestor.py → book_arm/memory/news_cache.jsonl) feeds both the public site and the book-arm's primary-source discovery for book-length investigation: the kind that requires knowing not just what AP reported, but what IRNA, Lenta.ru, and SCMP said on the same day about the same event, in their original framing, before any translation layer added editorial distance. Both sides consume from the 492-source ingest set.

The methodology is the same as the journalism: primary sources or nothing. Every source URL is live. Every score is computed or looked up from a versioned registry, not assigned post-hoc. Every blindspot is measured against the cache, not asserted.

BE UNDENIABLE. Every claim filed. Every source named. A single unanchored assertion is the lever a critic uses to dismiss the whole work. We do not write one.

Bias Scoring Baseline

The five-axis fingerprints in source_registry.json were hand-curated, drawing on AllSides and MBFC bias ratings where available. Per-source provenance (which axis value for which source came from which input) is not yet machine-traceable as a single artifact — documenting it is on the methodology-audit roadmap.

The current registry ships 496 deep-fingerprinted sources with full 5-axis bias data, bloc classification, factuality ratings, and primary-vs-launderer tags (data/source_registry.json). The original 244-source registry was introduced in commit c6a088bf; the enriched 496-source version merges data from source_fingerprints.json. Subsequent updates are logged in pipeline_state.json.

Hand-curation methodology — editorial stance analysis, entity framing patterns, named-state-alignment patterns, institutional-alignment signals — is summarized in the methodology audit (audit/about_audit_2026-05-10.md). Not duplicated here to avoid drift between the page and the audit.