Methodology

How the outbreak data on this site is sourced, attributed, counted, and corrected. Every choice below has a real-world failure mode we're trying to prevent. Last reviewed May 2026.

TL;DR

Every case count, death count, and country attribution on the map traces back to a verbatim quote from a Tier 1 health authority or a Tier 2 news outlet. Counts come from a MAX over all non-disputed claims, refreshed every 6 hours so in-place source updates (WHO appending an “Update: 12 May” section to an existing DON page) reach the map within the next cycle. Curator overrides exist as an editorial layer; the AI ingest can't silently fabricate.

01 · Sources

What counts as a source

Every pin is backed by one or more claims, each drawn from a single source URL. Sources are tiered:

Tier 1 — Official health authorities

WHO Disease Outbreak News (DON), CDC Health Alert Network (HAN), PAHO, ECDC, ReliefWeb, national health ministries (RIVM, UKHSA, BAG, Salud Argentina, gov.za, etc.)

Tier 2 — Quality news

Reuters, AP, AFP, BBC, NYT, Washington Post, Guardian, ABC, NBC, CNN, NPR, PBS — major outlets with editorial standards and named-byline reporting.

Excluded entirely

Personal blogs, social media, opinion sites, content farms, and any domain not on the allowlist — including URLs the LLM “found” that don't resolve to a real outlet. The discover step rejects anything off-list before extraction even runs.
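As a rough illustration of the off-list rejection described above, the check can be as small as a hostname comparison. This is a minimal sketch, not the site's actual code: the `ALLOWED_DOMAINS` set is a hypothetical, abbreviated stand-in for the real allowlist, and `is_allowed` is an illustrative name.

```python
from urllib.parse import urlsplit

# Hypothetical, abbreviated allowlist; the real list is not reproduced here.
ALLOWED_DOMAINS = {"who.int", "cdc.gov", "ecdc.europa.eu", "reuters.com", "bbc.com"}

def is_allowed(url: str) -> bool:
    """True only if the host is an allowlisted domain or a subdomain of one."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    # Match on a full label boundary so "who.int.example.com" does not pass.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

candidates = [
    "https://www.who.int/emergencies/disease-outbreak-news",
    "https://who.int.example.com/fake",   # look-alike host, rejected
    "https://someblog.example/post",      # off-list, rejected
]
accepted = [u for u in candidates if is_allowed(u)]
print(accepted)
```

Matching on the parsed hostname rather than on a substring of the URL is what keeps look-alike hosts from slipping through.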

02 · Pipeline

How a source becomes a pin

A GitHub Actions cron runs every 6 hours and triggers two phases on the production server:

Discover phase

An online-search-enabled call to Claude Sonnet 4.5 returns URLs of news/official reports published since the last run, constrained to the Tier 1 + Tier 2 allowlist. Each novel URL is fetched, the article body cleaned of boilerplate, and passed to a second (non-search) Sonnet call that extracts structured per-country claim rows.

Refresh phase

For every URL already in our database (published within the last 30 days), the refresh phase re-fetches the page and hashes the cleaned text. If the hash matches what we stored on the last extract, we skip — zero LLM cost. If the hash differs (WHO appended an “Update: 12 May” section, ECDC refreshed a surveillance page, an article was corrected) we re-extract and upsert the claim rows so the rollups reflect the latest source.
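The skip-or-re-extract decision can be sketched as follows. The hash algorithm is not named on this page, so SHA-256 is an assumption, as are the whitespace normalisation and the function names; the sample article text is made up for the example.

```python
import hashlib

def content_hash(cleaned_text: str) -> str:
    """SHA-256 of the cleaned article text (algorithm choice is an assumption)."""
    # Collapse whitespace so a trivial reflow does not look like a change (assumption).
    normalised = " ".join(cleaned_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def refresh_action(stored_hash: str, cleaned_text: str) -> str:
    """Return 'skip' when the page is unchanged, otherwise 're-extract'."""
    return "skip" if content_hash(cleaned_text) == stored_hash else "re-extract"

# Made-up sample text standing in for a cleaned article body.
original = "Two cases reported in the Netherlands."
stored = content_hash(original)
updated = original + " Update: 12 May. A third case has been confirmed."

print(refresh_action(stored, original))  # skip
print(refresh_action(stored, updated))   # re-extract
```

Only the second branch would cost an LLM call, which is the point of hashing first.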

03 · Attribution

How cases and deaths are assigned to countries

A single person who got sick on a ship and was treated in a hospital ashore can be attributed three different ways depending on what question you're asking. We pick one rule and stick to it:

Living cases → the country where the person is currently being treated, not their nationality. A British national hospitalised in Amsterdam is a Netherlands case.
Land-based deaths → the country where the death occurred. A Dutch national who died in a Johannesburg hospital is a South Africa death.
Ship/transit deaths → nationality of the deceased. A German national who died on board the MV Hondius is a Germany death.
Global totals without a per-country breakdown → return nothing. The extractor is forbidden from distributing “3 deaths across the cluster” as 1+1+1 across mentioned countries.

Map pins follow this rule. The banner table's Deaths column is a separate, by-nationality view: it tells you “how many citizens of country X have died anywhere”, regardless of where the death is attributed on the map. Both views always sum to the same total because each death is counted exactly once in each view.
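The attribution rules and the two death views can be expressed in a few lines. This is a sketch under assumptions: the `Person` record and its field names are hypothetical (the site does not track per-person identity), and a missing `death_country` is used here to mean a ship/transit death.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Person:
    nationality: str
    died: bool = False
    treatment_country: Optional[str] = None  # living cases: where they are treated
    death_country: Optional[str] = None      # land-based deaths only; None on ship/transit

def map_country(p: Person) -> Optional[str]:
    """Country the map attributes this person to, per the three rules above."""
    if not p.died:
        return p.treatment_country  # living case: country of treatment
    if p.death_country is not None:
        return p.death_country      # land-based death: where it occurred
    return p.nationality            # ship/transit death: nationality

people = [
    Person("United Kingdom", treatment_country="Netherlands"),
    Person("Netherlands", died=True, death_country="South Africa"),
    Person("Germany", died=True),  # died on board
]

map_deaths = Counter(map_country(p) for p in people if p.died)
nationality_deaths = Counter(p.nationality for p in people if p.died)
print(dict(map_deaths))          # {'South Africa': 1, 'Germany': 1}
print(dict(nationality_deaths))  # {'Netherlands': 1, 'Germany': 1}
```

The per-country numbers differ between the two views, but both total two deaths, because each death appears once in each.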

04 · Evidence

Every death count is backed by a verbatim quote

For every claim that attributes one or more deaths to a country, the LLM is required to return a short verbatim quote from the source naming either the nationality of the deceased (for ship/transit deaths) or the location of death (for land-based deaths). If no such quote exists in the article, deaths must be zero for that country in that claim. The parser enforces this defensively — if the model produces a death count without an evidence string, the count is zeroed before the claim is written.

Example evidence stored on a real claim row: “An elderly Dutch man, 78, died on board the MV Hondius on 11 April.” That quote is what backs the Netherlands transit-death attribution. Tap any pin to see its sources — every claim row links back to the URL the quote came from.
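The defensive parser check described above might look like the following. It is a minimal sketch: the claim is modelled as a plain dict, and the keys `deaths` and `death_evidence` are hypothetical names, not the real schema.

```python
def enforce_death_evidence(claim: dict) -> dict:
    """Return a copy of the claim with deaths zeroed if no evidence quote backs them."""
    cleaned = dict(claim)
    evidence = (cleaned.get("death_evidence") or "").strip()
    deaths = cleaned.get("deaths") or 0
    if deaths > 0 and not evidence:
        cleaned["deaths"] = 0  # no quote, no deaths
    return cleaned

backed = {
    "country": "Netherlands",
    "deaths": 1,
    "death_evidence": "An elderly Dutch man, 78, died on board the MV Hondius on 11 April.",
}
unbacked = {"country": "Netherlands", "deaths": 1, "death_evidence": "  "}

print(enforce_death_evidence(backed)["deaths"])    # 1
print(enforce_death_evidence(unbacked)["deaths"])  # 0
```

Running the check in the parser, after the model responds, means the rule holds even when the model ignores its instructions.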

05 · Counting

How case counts roll up across sources

Each event's case count = MAX(case_count) across all non-disputed claims for that event. Same for confirmed cases and deaths. Three consequences:

A single low-balling source can't pull the count down — the highest credible number wins.
But the count is not additive across sources. If WHO says 2 cases in NL and CDC says 2 cases in NL, we report 2 — not 4 — because those two sources are probably naming the same two people. The system can't UNION distinct people across sources without per-person identity, which we don't track.
Every death is also a case. A country with 1 living lab-confirmed patient plus 1 person who died on the ship has case_count=2, confirmed_cases=1, deaths=1. Public health convention reports “N cases, including M deaths”, never two parallel buckets.

This is the same MAX-over-claims logic the WHO and ECDC use to aggregate per-country reporting — but applied per event rather than per cluster.
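A small sketch of the rollup described in this section, assuming a hypothetical `Claim` record whose fields mirror the names used above. Taking the MAX independently for each field is an assumption about how “same for confirmed cases and deaths” is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Claim:
    source: str
    case_count: int
    confirmed_cases: int
    deaths: int
    disputed: bool = False

def rollup(claims: List[Claim]) -> Dict[str, int]:
    """Per-event totals: MAX of each field over non-disputed claims, 0 if none remain."""
    live = [c for c in claims if not c.disputed]
    return {
        "case_count": max((c.case_count for c in live), default=0),
        "confirmed_cases": max((c.confirmed_cases for c in live), default=0),
        "deaths": max((c.deaths for c in live), default=0),
    }

claims = [
    Claim("WHO", case_count=2, confirmed_cases=1, deaths=1),
    Claim("CDC", case_count=2, confirmed_cases=1, deaths=1),
    Claim("misread article", case_count=5, confirmed_cases=5, deaths=3, disputed=True),
]
print(rollup(claims))  # {'case_count': 2, 'confirmed_cases': 1, 'deaths': 1}
```

The two agreeing sources yield 2 cases rather than 4, and the disputed row stays in the list for audit without affecting the result.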

06 · Editorial layer

When the LLM is wrong, how we fix it

Two mechanisms let a curator correct the data without touching the cron:

Dispute flag. Any individual claim row can be marked disputed. Disputed claims stay in the database for audit but are excluded from the rollup. Used when a source's attribution turns out to be wrong (e.g. the LLM read “death in the cluster” as a country-of-death when the source never named that country).
Curator-authored claims. A small number of claims are seeded by hand from authoritative sources that the cron hasn't reached yet — entered with the same evidence + nationality requirements as LLM-extracted claims. These are visible in each event's drawer as sources alongside the LLM rows.

07 · Discipline

What we explicitly don't do

No URLs the LLM “found” outside the allowlist. Every source domain is pre-vetted.
No silent distribution of cluster totals across mentioned countries. If the source doesn't give a per-country number, we abstain.
No counting monitored contacts as cases. The 20 British nationals isolating at Arrowe Park have no symptoms and no positive tests — they're not in the case count, by design.
No retroactive count-down without explicit source change. The map only decreases when a Tier 1 source corrects itself in a refresh pass, or a curator marks a claim disputed.
No invented coordinates. Geo lookups come from the GeoNames dataset; cities we can't resolve get country-level placement, never a guess.
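The last rule in the list above, country-level fallback for cities that can't be resolved, could be sketched like this. The two lookup tables are tiny hypothetical stand-ins for a GeoNames-derived index, with approximate coordinates; `place` and `Placement` are illustrative names.

```python
from typing import NamedTuple, Optional

class Placement(NamedTuple):
    lat: float
    lon: float
    precision: str  # "city" or "country"

# Hypothetical stand-in tables; approximate coordinates for illustration only.
CITY_COORDS = {("Amsterdam", "NL"): (52.37, 4.90)}
COUNTRY_COORDS = {"NL": (52.13, 5.29)}

def place(city: Optional[str], country_code: str) -> Optional[Placement]:
    """Resolve to city level if known, else country level, else no pin at all."""
    if city and (city, country_code) in CITY_COORDS:
        return Placement(*CITY_COORDS[(city, country_code)], "city")
    if country_code in COUNTRY_COORDS:
        return Placement(*COUNTRY_COORDS[country_code], "country")
    return None  # never guess

print(place("Amsterdam", "NL"))
print(place("Unknown Village", "NL"))
print(place("Anywhere", "ZZ"))
```

Carrying the precision alongside the coordinates lets the map show that a pin is a country-level placement rather than an exact location.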

08 · Caveats

Known limitations

Counts lag source publication by up to 6 hours (the cron cadence), and lag the real-world events those sources report by more. Big-news cycles can move faster than that.
MAX across sources can undercount when distinct people are split across sources (see “Counting” above).
Endemic background cases (not part of a tracked outbreak) are not in scope.
All counts may be incomplete — always verify the primary source links surfaced on each pin's drawer.

09 · Corrections

Spot something wrong?

Open an issue with the source URL and what should change: github.com/krisworkspace141-stack/hantatrack/issues. Curator review is manual to keep the trust chain intact.

See also the exposure checklist for what to do if you may have been exposed, and the hantavirus reference for medical background.