2023-11-13

Acknowledgements (page 1)

  • CANMOD (funding)
  • David Earn (vision, ideas)
  • Jonathan Dushoff, Ben Bolker (ideas)
  • Gabrielle MacKinnon, Samara Manzin (digitization, data prep scripting, data harmonization)
  • Sophie Stelmach, Chyun Shi, Miriam Dushoff, Jeanne Lin (digitization)
  • Jen Freeman (data prep scripting, time series visualization tools)
  • Michael Roswell (R development, data harmonization methodology)

Acknowledgements (page 2)

  • Kevin Zhao (modelling)
  • Bicko Cygu (scripting and pipelining)
  • Frank Jin, Ronald Jin (data API and Shiny app)
  • Steven Lee (data harmonization)
  • Julia Maja (digitization, data quality scripting)
  • Ariel Earn (data resource organization)
  • Arash Shamseddini (data prep scripting)

Data from Past Epidemics

Why Study Past Epidemics?

  • Learn from our mistakes
  • Pandemic preparedness
  • Getting better at modelling

SIR fit to London Scarlet Fever

Inferred Force of Infection

Historical Infectious Disease Data

  • Modelling historical epidemics requires data
  • Good news: There’s lots of it
  • Bad news: It can be difficult to get and/or use

Historical Infectious Disease Data

  • Modelling these data requires digitization
  • Kinds of historical data digitization projects
    • In support of specific research topics as they arise
    • Systematic coverage of a particular time and place
      • For example, https://www.tycho.pitt.edu/

International Infectious Disease Data Archive (IIDDA)

  • Led by David Earn
  • Make historical digitized data public
  • Being rebuilt
  • Today we will have a trial re-release

CANMOD Digitization Project Overview

Straightforward and convenient access to historical and publicly available incidence, mortality, and population data:

  • Notifiable Communicable Disease Incidence (CDI) (1924-2000)
  • Population (1871-present)
  • Mortality (1950-2020)

Canadian Notifiable Communicable Disease Incidence

Canadian Communicable Disease Incidence

Canadian CDI – Scans

Canadian CDI – Digitized Spreadsheet

Canadian CDI – Tidy CSV

Harmonization

  • Users want things like: “Diphtheria incidence in Canada”
    • Longest time-range possible please (1924-2000)
    • Shortest intervals possible please (means weekly)
  • Idiosyncratic historical data
    • Need to stitch data sources together
    • Disease and place names change over time
    • Disease codes change over time
    • Hierarchical data (e.g. diseases within families of diseases)
    • Inconsistent time periods (e.g. weekly shifts to monthly)
    • Age group definitions change over time
      • We have used monotonic splines to interpolate cumulative age distributions, then back-transformed to get counts in any bin
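
The age-group idea above can be sketched in base R. This is a minimal illustration rather than the project's actual code: the bin edges and counts are made up, and `splinefun(method = "hyman")` is one possible choice of monotone interpolant assumed here.

```r
# Toy data (made up for illustration): cases in the age bins a source reported.
breaks <- c(0, 1, 5, 15, 25, 45, 65, 100)    # bin edges in years
counts <- c(40, 120, 200, 90, 60, 30, 10)    # cases per reported bin

# Cumulative counts at each bin edge; increasing by construction.
cum_counts <- c(0, cumsum(counts))

# Monotone cubic interpolant (Hyman filter), so the cumulative curve never
# decreases and re-binned counts cannot be negative.
cum_fun <- splinefun(breaks, cum_counts, method = "hyman")

# Back-transform: difference the cumulative curve at the target bin edges.
rebin <- function(new_breaks) diff(cum_fun(new_breaks))

new_breaks <- c(0, 5, 10, 20, 40, 60, 100)   # hypothetical target bins
new_counts <- rebin(new_breaks)
```

Because the interpolant passes through the cumulative counts at the original edges, the total is preserved whenever the target bins span the same age range.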

Harmonizing Location (easy but useful)
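
One simple way to harmonize location is a lookup table from the names printed in the sources to ISO 3166-2 codes. The sketch below uses a hypothetical lookup table and made-up counts; it is not the IIDDA lookup itself.

```r
# Hypothetical lookup table: place names as printed -> ISO 3166-2 codes.
location_lookup <- data.frame(
  location   = c("Alberta", "Alta.", "Newfoundland", "Nfld.", "Ontario", "Ont."),
  iso_3166_2 = c("CA-AB", "CA-AB", "CA-NL", "CA-NL", "CA-ON", "CA-ON")
)

# Toy records with inconsistent place names (made-up counts).
records <- data.frame(
  location = c("Alta.", "Ontario", "Nfld.", "Alberta"),
  cases    = c(3, 10, 1, 4)
)

# Left join keeps every record; names missing from the lookup would get NA.
harmonized <- merge(records, location_lookup, by = "location", all.x = TRUE)
unmatched <- harmonized$location[is.na(harmonized$iso_3166_2)]

# Totals by harmonized code, regardless of how the place was spelled.
cases_by_code <- tapply(harmonized$cases, harmonized$iso_3166_2, sum)
```

Listing `unmatched` after each join is a cheap check that no spelling variant has been silently dropped.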

Harmonizing Disease Names (very hard … still needs work)

Links

Getting Data

From R – Installation

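# install.packages("remotes")  # if not already installed
remotes::install_github('canmod/rapiclient')
remotes::install_github('canmod/iidda-tools', subdir = 'R/iidda')
remotes::install_github('canmod/iidda-tools', subdir = 'R/iidda.api')
# quick check that the API is reachable: names of the metadata entries
iidda.api::ops_staging$metadata() |> names()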

Example From R

diphtheria_alberta = iidda.api::ops_staging$filter(resource_type = "CANMOD CDI"
  , disease = "diphtheria"
  , iso_3166_2 = "CA-AB"
  , time_scale = "wk"
)

Example From R

library(ggplot2)
# average daily case count within each reporting period
ggplot(diphtheria_alberta) + 
  geom_line(aes(period_end_date, cases_this_period / days_this_period))

Get the Data from the Three Categories in R

canmod_cdi = iidda.api::ops_staging$filter(
    resource_type = "CANMOD CDI"
  , iso_3166 = "CA"
)
canmod_mort = iidda.api::ops_staging$filter(
    resource_type = "Mortality"
  , period_end_date = "1950-01-01/2020-12-31"
)
canmod_pop = iidda.api::ops_staging$filter(
    resource_type = "Population"
  , iso_3166 = "CA"
)

Appendix / Extras

Canadian Population Data

Canadian Population Data Sources

  • Sixth Census of Canada
    • population estimates for 1881-1921
    • every ten years
  • StatCan Report: Revised Annual Estimates of Population
    • population estimates for 1921-1971
    • every year
  • Population Estimates on July 1st
    • current and publicly available on-line
    • population estimates for 1971-current
    • every year

Canadian Population

Newfoundland Population

Canadian Mortality Data

  • 1950-2020
  • Weekly
  • Broken down by:
    • Province
    • Selected cause groups (next slide)
  • Custom tabulation request to StatCan

Canadian Mortality Data (Cause Groups)

  • Total causes
  • Malignant neoplasms
  • Diabetes mellitus
  • Diseases of the heart
  • Cerebrovascular diseases
  • Influenza and pneumonia
  • Chronic lower respiratory diseases
  • Nephritis, nephrotic syndrome and nephrosis
  • Accidents
  • Intentional self-harm (suicide)
  • Ill-defined and unspecified causes of mortality
  • All other causes
  • Alzheimer’s disease

All Cause Mortality (CA-Weekly)

Influenza & Pneumonia Mortality (CA-Weekly)

Influenza & Pneumonia Mortality (ON-Weekly)

Influenza & Pneumonia Mortality (NL-Weekly)

Influenza & Pneumonia Mortality (CA-1956)

CANMOD Digitization Project

  • Systematic coverage of Canadian data
  • Carried out over the last 2.5 years
  • Not tied to particular publications or purposes
  • Broad coverage across diseases and provinces
  • Today’s trial includes only the CANMOD data

International Infectious Disease Data Archive (IIDDA)

  • IIDDA includes or will include
    • London Bills of Mortality
    • Registrar General Weekly Returns
    • Data digitized for
  • Mountains of scans to enter

Two Examples

  • Inferring force of infection
  • Interpolating counts over unequal time periods

London (UK) Scarlet Fever Mortality Data

Work by Kevin Zhao and David Earn. Sorry this is not a Canadian example.

Modelling London Scarlet Fever Mortality Data

\[\Delta S = B - \Lambda(t) \cdot S - \frac{S}{N} \cdot D\]
\[\Delta I = \Lambda(t) \cdot S - \gamma I - \frac{I}{N} \cdot D\]
\[\Delta R = (\gamma - CFP) \cdot I - \frac{R}{N} \cdot D\]
\[\Delta M = CFP \cdot I\]
\[\Lambda(t) = \frac{\beta(t) \cdot I}{N}\]

  • \(B\) and \(D\): weekly observed birth and all cause mortality counts
  • \(N\): population
  • \(CFP\): case-fatality proportion for scarlet fever (assumed to be 0.05)
  • \(\Delta M\): weekly scarlet fever mortality estimates (variable being fitted with maximum likelihood)
  • \(\beta(t)\): time-varying transmission rate
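  • \(\Lambda(t)\): time-varying force of infection
  • \(S\), \(I\), \(R\): susceptible, infectious, and removed individuals
  • \(\gamma\): weekly rate at which individuals leave the infectious class (by recovery or scarlet fever death)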
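
To make the difference equations concrete, here is a minimal forward simulation of them in R. It is a sketch, not the fitting code: all parameter values, the seasonal form of `beta`, and the choice N = S + I + R are assumptions made for illustration, and no likelihood is evaluated.

```r
# Iterate the weekly difference equations forward from initial conditions.
# B, D, and beta are vectors with one value per week.
simulate_sir <- function(n_weeks, S_init, I_init, R_init,
                         B, D, beta, gamma, cfp = 0.05) {
  S <- I <- R <- deaths <- numeric(n_weeks + 1)
  S[1] <- S_init; I[1] <- I_init; R[1] <- R_init
  for (t in seq_len(n_weeks)) {
    N <- S[t] + I[t] + R[t]                 # assumption: N is the sum of states
    lambda <- beta[t] * I[t] / N            # force of infection
    S[t + 1] <- S[t] + B[t] - lambda * S[t] - S[t] / N * D[t]
    I[t + 1] <- I[t] + lambda * S[t] - gamma * I[t] - I[t] / N * D[t]
    R[t + 1] <- R[t] + (gamma - cfp) * I[t] - R[t] / N * D[t]
    deaths[t + 1] <- cfp * I[t]             # Delta M: weekly disease deaths
  }
  data.frame(week = 0:n_weeks, S, I, R, deaths)
}

# Made-up inputs: constant births and deaths, seasonally forced transmission.
n_weeks <- 260
week <- seq_len(n_weeks)
sim <- simulate_sir(
  n_weeks = n_weeks, S_init = 4e5, I_init = 100, R_init = 6e5 - 100,
  B = rep(500, n_weeks), D = rep(400, n_weeks),
  beta = 1.5 * (1 + 0.2 * cos(2 * pi * week / 52)),
  gamma = 0.5
)
```

Fitting would compare `sim$deaths` with the observed weekly scarlet fever deaths under an observation model, which this sketch leaves out.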

Estimating Force of Infection Time Series

Maximum Likelihood Fit (London Scarlet Fever)

Force of Infection Estimate (London Scarlet Fever)

Counts over Unequal Time Periods

  • Idiosyncratic historical data
    • Counts reported over different lengths of time
    • Report periods overlap
    • Gaps in counts
  • OK with custom mechanistic models (e.g. SIR + radial-basis-functions)
  • Problematic for traditional time-series analyses, which assume
    • Evenly spaced
    • No missing values
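
A first step with data like this is to put counts on a common per-day scale and flag gaps and overlaps between consecutive reports. The sketch below uses made-up reports; `period_start_date` is an assumed column name, chosen to match the `period_end_date`, `cases_this_period`, and `days_this_period` columns used earlier.

```r
# Toy reports (made-up counts): weekly, then longer periods, with one gap
# and one overlap.
reports <- data.frame(
  period_start_date = as.Date(c("1950-01-01", "1950-01-08", "1950-01-22",
                                "1950-02-01", "1950-02-15")),
  period_end_date   = as.Date(c("1950-01-07", "1950-01-14", "1950-01-31",
                                "1950-02-28", "1950-03-14")),
  cases_this_period = c(14, 21, 30, 56, 28)
)
reports <- reports[order(reports$period_start_date), ]

# Inclusive period length in days, and the average daily count.
reports$days_this_period <-
  as.integer(reports$period_end_date - reports$period_start_date) + 1L
reports$daily_rate <- reports$cases_this_period / reports$days_this_period

# Days from the end of one period to the start of the next:
# 1 = contiguous, more than 1 = gap, less than 1 = overlap.
step <- as.integer(reports$period_start_date[-1] -
                     head(reports$period_end_date, -1))
reports$status <- c("first",
                    ifelse(step > 1, "gap",
                           ifelse(step < 1, "overlap", "contiguous")))
```

Dividing by period length makes weekly and monthly counts comparable, but the series is still unevenly spaced, which is what motivates a model-based approach.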

Counts over Unequal Time Periods

Work by Michael Roswell

Counts over Unequal Time Periods

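mgcv::gam(
    formula = count
      ~ s(period_mid_day, k = k)         # smooth in time; k must be passed by name
      + offset(log(days_in_period))      # reporting period length
      + offset(log(population / 1e5))    # residents, in units of 100,000
  , data = data
  , method = "REML"                      # "fREML" is only available in mgcv::bam
  , family = mgcv::nb()
)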

  • Generalized Additive Models (GAMs) in R
  • Estimate mean daily cases per 100,000 residents
  • Smooth function of time
  • Control for reporting period length and population size with offsets
  • Negative binomial error structure
  • Issue: choosing the basis dimension, k, which limits how flexible the smooth can be
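
The following sketch runs this kind of model end to end on simulated counts, so it can be tried without the real data. Everything about the simulation (the seasonal rate, the population, the dispersion, and `k = 20`) is an assumption made for illustration, and `method = "REML"` is used because it is available in `gam`.

```r
library(mgcv)

set.seed(1)
# Simulated reports (not real data): two years of weekly periods followed by
# two years of 28-day periods.
days_in_period <- c(rep(7, 104), rep(28, 26))
period_end     <- cumsum(days_in_period)
period_mid_day <- period_end - (days_in_period - 1) / 2
population     <- rep(2e6, length(days_in_period))

# Assumed seasonal daily rate per 100,000 residents.
true_rate <- exp(-1 + 0.8 * sin(2 * pi * period_mid_day / 365))
mu <- true_rate * days_in_period * population / 1e5
sim <- data.frame(
  count = rnbinom(length(mu), mu = mu, size = 20),
  period_mid_day, days_in_period, population
)

fit <- gam(
  count ~ s(period_mid_day, k = 20)
    + offset(log(days_in_period))
    + offset(log(population / 1e5)),
  data = sim, method = "REML", family = nb()
)

# Predict on an evenly spaced daily grid. One-day periods and 100,000
# residents make both offsets zero, so predictions are daily rates per 100,000.
grid <- data.frame(
  period_mid_day = seq(min(sim$period_mid_day), max(sim$period_mid_day), by = 1),
  days_in_period = 1,
  population     = 1e5
)
grid$daily_rate_per_100k <-
  as.numeric(predict(fit, newdata = grid, type = "response"))
```

The evenly spaced `grid` is what downstream time-series tools need. Running `gam.check(fit)` is one way to see whether `k` looks too small.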

Other Harmonization Challenges

  • Shifting disease definitions, names, hierarchies, and codes
  • Age structured data
    • Age group definitions changing over time
    • Hierarchical data
    • Use monotonic splines for interpolating cumulative age distributions, and back transforming to get counts in any bin
  • Connecting to recent data
    • No luck during COVID
    • But Ontario just came through