Disclaimer: I am far from an an expert on infectious disease modelling! My knowledge is mostly from university courses on the topic.

In this post, I explain some background and provide an introduction to the topic of modelling infectious diseases and SIR models. This post expands on that and extends the basic SIR model. The model derived and implemented there lays the basis for the model used here, so you might want to read the two articles first (or just the last one if you already have a good understanding of the basic SIR model).

This article is focused on fitting an extended SIR model with time-dependent R₀-values and resource-dependent death rates to real Coronavirus data, in order to come as close as possible to the real numbers and make informed predictions about possible future developments. But before we jump right into fitting the data to our model, let’s do something that is often overlooked — let’s have a short look at what our model cannot do.

You can find the python notebook for the whole article here.

Caveats, Pitfalls, Limitations

Models are always simplifications of the real world. If you want to see a model that perfectly resembles the real world, go outside! Still - that does not mean that they cannot be interesting and generate insights. It’s just important to keep the following things in mind:

Our System of Differential Equations is extremely sensitive to initial parameters; Slight changes can lead to completely different outcomes.
We are extrapolating from incomplete, preliminary data. Some countries might only count deaths directly resulting from Coronavirus, others might count all deaths where the individual was infected. Some might (purposely or not) report inaccurate or unreliable data, etc.
We assume homogeneity among the whole population, that is, we do not account for some places being initial hot spots and others implementing restrictions earlier and more strenuously (we’d need much more effort (computationally and mathematically) to take these things into account).

Additionally, specific to our model, we make the following assumptions (these are just some obvious ones, there are certainly more assumptions hidden in the equations):

Deaths do not change the population structure to a meaningful extent (we calculate fatality rates a priori (using the population structure from before the outbreak) and assume that deaths are not so high as to significantly change the population structure (I believe that this is a rather weak, reasonable assumption)).
Only critical cases fill up the hospitals and can lead to a higher fatality rate due to shortage of available care.
All critical patients that do not get treatment die.
Individuals are immune after recovering.
R₀ only decreases or stays constant. It does not increase. (Thus, this model does not allow us to model measures being loosened again; we’d need a different function for R₀ for that)

Now with that in mind, let’s start modelling: We’ll first go over curve fitting really quickly, then derive the model we use, and finally fit our model to Coronavirus data.

(A very short introduction to) Curve Fitting

Later on, we are going to have some data points (the cumulative fatalities per day) and a function (our model) that gives us, dependent on some parameters (fatality rate, ICU beds, …) the cumulative fatalities per day that we predict. We then want to fit the curves, that means that we want to find the parameters for our model that generate the predictions most closely resembling the real data. I won’t go into any mathematical background here, I’ll just present one example in Python (imports are left out here, they’re in the notebook).

First, generate some data (in this case, a normal distribution with some noise) to fit:

np.random.seed(42)
x = np.linspace(0, 20.0, 1001)
data = (gaussian(x, 21, 6.1, 1.2) + np.random.normal(scale=0.1, size=x.size))
plt.plot(x, data);

It looks like this:

Next, we need a function that takes the x-value as first argument and the three parameters we want to fit (we’ll call them a, b, c) as next arguments. This is the function we will use to fit the data; our curve fitting library (don’t worry, we don’t have to do any heavy lifting ourselves) will vary the parameter until it finds a good fit (and hopefully finds the ones we used to generate the data, i.e. a=21, b=6.1, c=1.2). With this function, we generate a curve fitting model (using the lmfit library) and pass some initial guesses for the parameters. We then fit the data. It’s important to note that curve fitting methods in general do not guarantee find a global minimum and that our initial guesses for the parameters are crucial.

def f(x, a, b, c):
    return gaussian(x, a, b, c)

mod = lmfit.Model(f)
# we set the parameters (and some initial parameter guesses)
mod.set_param_hint("a", value=10.0, vary=True)
mod.set_param_hint("b", value=10.0, vary=True)
mod.set_param_hint("c", value=10.0, vary=True)

params = mod.make_params()

result = mod.fit(data, params, method="leastsq", x=x)  # fitting
result.plot_fit(datafmt="-");
result.best_values

Here’s the output - we found our parameters!

That’s all the curve fitting background we need. Now, on to the model we’ll use to fit the data:

Extended SIR Model for Fitting

We build on the model derived in the last article: Individuals are first Susceptible- they can catch Coronavirus (with probability S/N) and become Exposed. They cannot yet spread the virus. After some days (1/δ, to be exact), they are Infected and spread the virus. Over the course of the infection, they can either Die (over 1/ρ days, with probability α) or survive and finally Recover (after 1/γ days):

Here are the variables again:

N: total population
S(t): number of people susceptible on day t
E(t): number of people exposed on day t
I(t): number of people infected on day t
R(t): number of people recovered on day t
D(t): number of people dead on day t
β: expected amount of people an infected person infects per day
D: number of days an infected person has and can spread the disease
γ: the proportion of infected recovering per day (γ = 1/D)
R₀: the total number of people an infected person infects (R₀ = β / γ)
δ: length of incubation period
α: fatality rate
ρ: rate at which people die (= 1/days from infected until death)

I just want to add one new compartment: Critical, for individuals that need intensive care. This will allow us to model overflowing hospitals. Of course, only infected individuals can enter the critical state. From the critical state, they can either die or recover.

We arrive at these transitions:

Well, lots of question marks here, let’s do some thinking! We need the probability that an infected individual becomes critically ill. I am not going to introduce more and more Greek letters here, we’ll just call it p(I→C). Logically, the probability of going from infected to recovered is thus 1-p(I→C).

We need another probability, the probability of dying while critical: p(C→D). Analogously, the probability of going from infected to recovered is 1-p(C→D).

Now we’re only missing the rate of becoming critically ill after being infected, the rate of dying while being critically ill, and the rate of recovery while being critically ill. Reading through current estimates, I get the following numbers (I aggregated these from multiple sources, feel free to use different numbers; there should be more accurate numbers over time):

Number of days from infected to critical: 12 (→rate: 1/12)
Number of days from critical to dead: 7.5 (→rate: 1/7.5)
Number of days from critical to recovered: 6.5 (→rate: 1/6.5)

Filling all this in:

Triage and Limited Resources

There has been lots of coverage on triage taking place in Italy and other highly impacted places, meaning that doctors have to choose who gets treated with the limited resources available. This might be incorporated into a model as follows:

Imagine a country with B ICU beds suitable to treat critically ill Coronavirus cases. If there are more than B critically ill patients (the amount is C, our critical compartment), all those above B cannot be treated and thus die. For example, if B=500 and C=700, then 200 patients die because there are no resources to treat them.

This means that C-B people die due to shortages. Of course, if we have more beds than critically ill patients (e.g. B=500 and C=100), then we do not have C-B=-400 people dying, that wouldn’t make sense. Rather, we have max(0, C-B) people dying because of shortages (think about it: if C<B (more beds than patients), then C-B<0, so max(0, C-B)=0, and 0 people die because of shortages; if C>B (not enough beds), then C-B>0, so max(0, C-B)=C-B, and C-B people die because of shortages).

We thus need to expand our transitions: from C, there are two populations we have to look at: max(0, C-B) people die because of shortages, and the rest get treated like we derived above. What’s the rest? Well, if C<B (enough beds), then C people get treated. If C>B (not enough beds), then B people get treated. That means that “the rest” - the amount of people getting treatment - is min(B, C) (again, if C<B, then min(B, C)=C people get treatment; if C>B, then min(B, C)=B people get treatment; the math checks out).

We finally get this revised model (with deaths happening immediately for all those above the number of beds available; one could change this to taking several days, dying with a probability of 75%, etc.):

And these are the equations that go with it (note that Beds is a function of time, we’ll get to that in just a minute)

Again, take some time to see how these equations directly relate to the state transitions above. They look difficult, but they are just another way of describing the diagram.

Time-Dependent Variables

For this model, we only have two time-dependent variables: R₀(t) (and thus β(t), as R₀ = β / γ) and Beds(t).

For R₀(t), we will again use the following logistic function:

For Beds(t), the idea is that, as the virus spreads, countries react and start building hospitals, freeing up beds, etc. Thus, the number of beds available increases over time. A (very) simple way to do this is to model the amount of beds as

Where Beds₀ is the total amount of ICU beds available and s is some scaling factor. In this formula, the number of beds increases by s times the initial number of beds per day (e.g. if s=0.01, then on day t=100, Beds(t) = 2 ⋅ Beds₀)

Fitting the Model

First of all, let’s think about what we know and what we want to know. This helps us find which parameters we can fix and which we want to fit.

Here are all the parameters our model needs (that’s just all the variables in the equations plus the variables in the functions for R₀(t) and Beds(t)):

N: total population
β(t): expected amount of people an infected person infects per day
γ: the proportion of infected recovering per day (γ = 1/D)
R₀_start (parameter in R₀(t))
R₀_end (parameter in R₀(t))
x₀ (parameter in R₀(t))
k (parameter in R₀(t))
s (parameter in Beds(t))
Beds₀ (parameter in R₀(t))
δ: length of incubation period
p(I→C): probability of going from infected to critical
p(C→D): probability of dying while critical

Let’s go through these: We certainly do not have to fit N, we can just look at the population of the region we want to model the disease in. The same goes for Beds₀, we can easily look up the number of ICU beds in a region (I have prepared all the datasets to do this in Python, we’ll have a look at them in the next section). δ and γ are fixed to δ=1/9 and γ=1/3, these are the best estimates I found reading through papers. Concerning β(t), we’re calculating beta through R₀(t) and γ, so there’s no need to find any separate parameters for beta. The beds scaling factor s can be fitted; Admittedly, it does not play a big role in the outcome as, until now, the amount of people not receiving treatment due to shortages was small compared to the total amount of deaths.

I have collected two estimates for the probabilities p(I→C) and p(C→D), split up by age group (again, we’ll look at these in the next section). This would allow us to calculate the probabilities as we derived in the last article: weighted by the proportion of the population per age group. Here, we will in fact try to fit these parameters - until now, trying to fit these always yielded results extremely close to the collected estimates.

All in all, we’re only left with these parameters to fit:

p(I→C)
p(C→D)
R₀_start (parameter in R₀(t))
R₀_end (parameter in R₀(t))
x₀ (parameter in R₀(t))
k (parameter in R₀(t))
s (parameter in Beds(t))

Supplemental and Coronavirus Data

I have collected and cleaned data for age groups, probabilities, and ICU beds from UN Data. We’ll get the up-to-date case numbers from here. Let’s have a look:

First, load the data …

beds = pd.read_csv("https://raw.githubusercontent.com/henrifroese/infectious_disease_modelling/master/data/beds.csv", header=0)
agegroups = pd.read_csv("https://raw.githubusercontent.com/henrifroese/infectious_disease_modelling/master/data/agegroups.csv")
probabilities = pd.read_csv("https://raw.githubusercontent.com/henrifroese/infectious_disease_modelling/master/data/probabilities.csv")
covid_data = pd.read_csv("https://data.humdata.org/hxlproxy/data/download/time_series_covid19_deaths_global_narrow.csv?dest=data_edit&filter01=merge&merge-url01=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vTglKQRXpkKSErDiWG6ycqEth32MY0reMuVGhaslImLjfuLU0EUgyyu2e-3vKDArjqGX7dXEBV8FJ4f%2Fpub%3Fgid%3D1326629740%26single%3Dtrue%26output%3Dcsv&merge-keys01=%23country%2Bname&merge-tags01=%23country%2Bcode%2C%23region%2Bmain%2Bcode%2C%23region%2Bsub%2Bcode%2C%23region%2Bintermediate%2Bcode&filter02=merge&merge-url02=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2Fe%2F2PACX-1vTglKQRXpkKSErDiWG6ycqEth32MY0reMuVGhaslImLjfuLU0EUgyyu2e-3vKDArjqGX7dXEBV8FJ4f%2Fpub%3Fgid%3D398158223%26single%3Dtrue%26output%3Dcsv&merge-keys02=%23adm1%2Bname&merge-tags02=%23country%2Bcode%2C%23region%2Bmain%2Bcode%2C%23region%2Bsub%2Bcode%2C%23region%2Bintermediate%2Bcode&merge-replace02=on&merge-overwrite02=on&filter03=explode&explode-header-att03=date&explode-value-att03=value&filter04=rename&rename-oldtag04=%23affected%2Bdate&rename-newtag04=%23date&rename-header04=Date&filter05=rename&rename-oldtag05=%23affected%2Bvalue&rename-newtag05=%23affected%2Binfected%2Bvalue%2Bnum&rename-header05=Value&filter06=clean&clean-date-tags06=%23date&filter07=sort&sort-tags07=%23date&sort-reverse07=on&filter08=sort&sort-tags08=%23country%2Bname%2C%23adm1%2Bname&tagger-match-all=on&tagger-default-tag=%23affected%2Blabel&tagger-01-header=province%2Fstate&tagger-01-tag=%23adm1%2Bname&tagger-02-header=country%2Fregion&tagger-02-tag=%23country%2Bname&tagger-03-header=lat&tagger-03-tag=%23geo%2Blat&tagger-04-header=long&tagger-04-tag=%23geo%2Blon&header-row=1&url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_deaths_global.csv", parse_dates=["Date"], skiprows=[1])
covid_data["Location"] = covid_data["Country/Region"]




# create some dicts for fast lookup
# 1. beds
beds_lookup = dict(zip(beds["Country"], beds["ICU_Beds"]))
# 2. agegroups
agegroup_lookup = dict(zip(agegroups['Location'], agegroups[['0_9', '10_19', '20_29', '30_39', '40_49', '50_59', '60_69', '70_79', '80_89', '90_100']].values))

# store the probabilities collected
prob_I_to_C_1 = list(probabilities.prob_I_to_ICU_1.values)
prob_I_to_C_2 = list(probabilities.prob_I_to_ICU_2.values)
prob_C_to_Death_1 = list(probabilities.prob_ICU_to_Death_1.values)
prob_C_to_Death_2 = list(probabilities.prob_ICU_to_Death_2.values)

The beds table has the amount of ICU beds per 100k inhabitants for many countries.
The agegroups table has the amount of people per age group for all countries.
The probabilities table has probabilities I collected for the transitions I→C and C→D per age group (two separate probabilities; only use either _1 or _2) (we will not use these as we will try to fit the transition probabilities)
covid_data is a huge table with the number of fatalities per region per day, from 2020–01–22 onwards.

As an example, here are the overall deaths for the whole world:

Not all countries and regions are included in the table. If you want to model the outbreak in a region that is not included, however, you should be able to find the data you need through a quick google search.

You might have noticed that we’re only using data for the number of deaths and not the number of cases reported. The reason is simple: Reporting of confirmed cases is extremely noisy and strongly depends on the number of tests (and still with enough tests, not everyone that’s infected will be getting tested). For example, the case number might increase from 10000 to 15000 from one day to the next, but that could just be because the number of tests increased by 5000. In general, the number of deaths reported is much more accurate - deaths are quite hard to miss, so the reported numbers are probably quite close to the real numbers.

Coding the Model

Having set up all the data, we can now get to coding our model. Again, here are the equations:

Let’s translate them into code just like in the last articles (one caveat: we are calculating β a little simplified here as calculating it rigorously would be much more complicated and would have negligible impact on the outcome; you can see the extended implementation here if you’re curious):

def deriv(y, t, beta, gamma, sigma, N, p_I_to_C, p_C_to_D, Beds):
    S, E, I, C, R, D = y

    dSdt = -beta(t) * I * S / N
    dEdt = beta(t) * I * S / N - sigma * E
    dIdt = sigma * E - 1/12.0 * p_I_to_C * I - gamma * (1 - p_I_to_C) * I
    dCdt = 1/12.0 * p_I_to_C * I - 1/7.5 * p_C_to_D * min(Beds(t), C) - max(0, C-Beds(t)) - (1 - p_C_to_D) * 1/6.5 * min(Beds(t), C)
    dRdt = gamma * (1 - p_I_to_C) * I + (1 - p_C_to_D) * 1/6.5 * min(Beds(t), C)
    dDdt = 1/7.5 * p_C_to_D * min(Beds(t), C) + max(0, C-Beds(t))
    return dSdt, dEdt, dIdt, dCdt, dRdt, dDdt

That’s really just the equations typed into Python, nothing exciting going on! Now, onto the R0-function and the whole model that takes the parameters to fit (and some we already know) to calculate the curves of S, E, I, C, R, and D:

gamma = 1.0/9.0
sigma = 1.0/3.0

def logistic_R_0(t, R_0_start, k, x0, R_0_end):
    return (R_0_start-R_0_end) / (1 + np.exp(-k*(-t+x0))) + R_0_end

def Model(days, agegroups, beds_per_100k, R_0_start, k, x0, R_0_end, prob_I_to_C, prob_C_to_D, s):

    def beta(t):
        return logistic_R_0(t, R_0_start, k, x0, R_0_end) * gamma

    # agegroups is list with number of people per age group -> sum to get population
    N = sum(agegroups)

    def Beds(t):
        # the table stores beds per 100 k -> get total number
        beds_0 = beds_per_100k / 100_000 * N
        return beds_0 + s*beds_0*t  # 0.003

    y0 = N-1.0, 1.0, 0.0, 0.0, 0.0, 0.0  # one exposed, everyone else susceptible
    t = np.linspace(0, days, days)
    ret = odeint(deriv, y0, t, args=(beta, gamma, sigma, N, prob_I_to_C, prob_C_to_D, Beds))
    S, E, I, C, R, D = ret.T

    R_0_over_time = [beta(i)/gamma for i in range(len(t))]  # get R0 over time for plotting

    return t, S, E, I, C, R, D, R_0_over_time, Beds, prob_I_to_C, prob_C_to_D

Here’s what we get when we simulate a disease in a population with not enough ICU beds (the plotting function is in the notebook):

You can see the spike in deaths attributed to resource shortage (not enough beds) in the graph in the lower right corner.

Right, we now have the model and the data. Let’s put our curve fitting skills to good use!

Curve Fitting

First, we get the data to fit and the parameters we already know, and we define initial guesses and lower and upper bounds for those we do not know (to aid the curve fitter and get good results) - feel free to change the initial guesses and bounds.

data = covid_data[covid_data["Location"] == "Italy"]["Value"].values[::-1]
agegroups = agegroup_lookup["Italy"]
beds_per_100k = beds_lookup["Italy"]
outbreak_shift = 30
# parameters to fit; form: {parameter: (initial guess, minimum value, max value)}
params_init_min_max = {"R_0_start": (3.0, 2.0, 5.0), "k": (2.5, 0.01, 5.0),
                       "x0": (90, 0, 120), "R_0_end": (0.9, 0.3, 3.5),
                       "prob_I_to_C": (0.05, 0.01, 0.1), "prob_C_to_D": (0.5, 0.05, 0.8),
                       "s": (0.003, 0.001, 0.01)}

One parameter that’s very important and that we haven’t talked about yet is outbreak_shift: The case data begins on January 21, so our model will think that the virus started to spread on that day. For many countries, this could have in fact been days or weeks later or earlier, and it has a big impact on the fitting. Of course, we still don’t know when the first person was infected in each country - use your best judgement. For example, if you think that the country you are trying to fit had the first case on January 30, you should set outbreak_shift to -9.

(Sadly, it’s not easy to use outbreak_shift as an additional parameter as only integers (full days) are allowed)

We now fill up the data we want to fit (the fatalities per day) with zeroes at the beginning to account for the outbreak shift. We also define the x values for fitting; that’s just a list [0, 1, 2, …, number of days total].

days = outbreak_shift + len(data)
if outbreak_shift >= 0:
    y_data = np.concatenate((np.zeros(outbreak_shift), data))
else:
    y_data = y_data[-outbreak_shift:]

x_data = np.linspace(0, days - 1, days, dtype=int)  # x_data is just [0, 1, ..., max_days] array

For fitting, we need a function that takes exactly an x-value as first argument (the day) and all the parameters we want to fit, and that returns the deaths predicted by the model for that x-value and the parameters, so that the curve fitter can compare the model prediction to the real data. Here it is:

def fitter(x, R_0_start, k, x0, R_0_end, prob_I_to_C, prob_C_to_D, s):
    ret = Model(days, agegroups, beds_per_100k, R_0_start, k, x0, R_0_end, prob_I_to_C, prob_C_to_D, s)
    # Model returns bit tuple. 7-th value (index=6) is list with deaths per day.
    deaths_predicted = ret[6]
    return deaths_predicted[x]

There’s not much left to do! Just initialize a curve fitting model, set the parameters according to the inits, mins, and maxs we defined, set a fit method (you can try different ones here, differential_evolution might work good, for example), and fit:

mod = lmfit.Model(fitter)

for kwarg, (init, mini, maxi) in params_init_min_max.items():
    mod.set_param_hint(str(kwarg), value=init, min=mini, max=maxi, vary=True)

params = mod.make_params()
fit_method = "leastsq"

result = mod.fit(y_data, params, method=fit_method, x=x_data)

result.plot_fit(datafmt="-");

And - finally - here’s the fit we get for Italy:

Not too bad! Let’s look at the parameters the fitter predicts:

{'R_0_end': 0.6152259605003915,
 'R_0_start': 3.670318365998767,
 'k': 4.894181796946975,
 'prob_C_to_D': 0.4989844934842116,
 'prob_I_to_C': 0.043276277260476996,
 's': 0.004548385254000369,
 'x0': 84.75275642328448}

Great, they look quite realistic and in-line with many data points reported in real life! x0 is 84, so with the data beginning on January 21 and the outbreak shift set to 30 days, day 84 for our model is March 15. x0 is the date of the steepest decline in R0, so our model thinks that the main “lockdown” took place in Italy around March 15, very close to the real date.

Let’s use the best-fitting parameters to get a look at the future our model predicts (you can zoom in when you look at the code in the notebook):

Note the spike in deaths due to filled hospitals around the end of march 2020 that increases the fatality rate, which finally comes out at around 1.4%.

Here’s the prediction zoomed in on March to May - if the model is right, Italy has gone through the worst already and deaths should decrease strongly over the next months. Of course, our model thinks that R0 will stay around 0.6; if it goes up again as lockdowns are reverted, the numbers will start increasing again!

Recap

That’s all! You should now be able to (try to) fit your model to real-world case data and hopefully make accurate predictions!