Before this, Matija had published a good tutorial about using the Jupyter Notebook. In my last article, I wrote and explained a simple regression using coronavirus (COVID-19) data.
This is a follow-up to that post, in which the focus will be on the model used to predict viral outbreaks. As promised, this sequel is intended as the culmination (yes, I know these posts are THAT exciting) of this series of articles. In the meantime, I also received a few suggestions on what to do next.
Someone suggested that we should compare the “ordinary” flu data to COVID-19 data and look for differences or similarities. I can’t find where that comment is or who wrote it, but I remember seeing it on Facebook. Or I went crazy, which is also possible.
Nikola Prpic suggested on Facebook that I should address the modeling of spread in society using network theory, and finding Nash equilibria to combat the spread of viruses. I’ve heard (and only heard) about both, but I’m not sure if I want to fall into that rabbit hole.
My name is Kristijan Šarić and I’m a programmer. I’m not a virus expert and I’m not an epidemiologist. I don’t consider myself an expert at anything, though by some categorization I’m probably an “IT expert” (whatever that is). Nothing I show here is the opinion of someone who is an expert in the coronavirus. It’s my attempt to explain to myself what is happening and how fast the virus is spreading (and to share it with you).
Likewise, people who are convinced that we should let only recognized experts deal with this situation fail to see a key problem in that approach. The media is not perfect. The media is sometimes not objective. The media sometimes conveys unverified information, much like individuals who are not well versed and spread the story that if you put cabbage on your left upper arm you can’t get the coronavirus. The only problem is, you have to walk around with cabbage on your left upper arm.
This is an attempt to explain the situation based on data and facts, and I cannot stress enough how important I think that is. Although there is (always!) an amount of personal interpretation, the focus will be on the graphs and the data behind them. I urge everyone interested to get involved, to use the data and these graphs as a source of information, and to come up with (interesting) ideas to help us improve this. I’m sure some people can do this a lot better, so this may be a good starting point for someone. Everyone can run all of the things I created and they will get the same results. If you find any mistakes, feel free to let me know.
As promised, all content is available here.
In the last blog post, the focus was on regression; in this post, I will try to clarify how growth progresses with a model. The SIR model is often used to model epidemics.
This version of the model is very imprecise and I don’t think it should be taken seriously.
Keep in mind that the data we are using is a bit behind the real data, but because of that, it’s more accurate.
Did I forget to point out that this is just a projection, not the truth?
Data we used
You can see the data we used here.
This information is from the Johns Hopkins University Center for Systems Science and Engineering or Johns Hopkins CSSE for short. They are updated daily.
With the latest version, they changed the data format and file names. So, the data we used here is from 03/27/2020, while the visualization notebook uses older data from 03/25/2020. More information about that here. Every day when data is updated, you can refresh the data and get new visualizations.
The situation in Croatia and its surrounding countries
Let’s first take a look at these recycled graphs with the latest data and look at the regressions we got last time. All this is further clarified in the last article. After that, we will dive into the SIR model and the results I got from it.
But before that, I added interactivity to the Jupyter Notebook. Awesome!
You can now move the slider and get different results. Damnnnn!
So now you can see which countries have more than N infected, where N is the value you can choose by moving a slider.
Joking aside, this is very useful if you want to interactively explore some data or display some graphs. We will use this later in the SIR model to get nice pictures that we can describe. If you used Mathematica, this was one of the things I found useful there, and you can use it in the Jupyter Notebook as well. To do this you need to run this project.
Example (this slider on the top moves). You’ll just have to believe me or run the project yourself to test it:
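For a rough idea of how such a slider works, here is a minimal sketch. The country counts below are made up for illustration, and in the real notebook the function would filter the actual dataset; the slider itself comes from the ipywidgets package.

```python
# Toy data standing in for the real dataset: country -> confirmed cases.
# These numbers are made up for illustration only.
confirmed = {"Croatia": 495, "Slovenia": 632, "Serbia": 528, "Italy": 80589}

def countries_over(n):
    """Return countries with more than n confirmed cases, largest first."""
    return sorted((c for c, v in confirmed.items() if v > n),
                  key=lambda c: -confirmed[c])

# Inside a Jupyter Notebook, one line turns this into a slider
# (uncomment when running in the notebook):
# from ipywidgets import interact
# interact(countries_over, n=(0, 100000, 100))

print(countries_over(500))
```

Moving the slider simply calls the function again with the new value of N and redraws the output.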
But back to the article.
The number of confirmed cases in Croatia:
The number of confirmed cases in the countries bordering Croatia:
We see that Italy dominates. Compared to it, all the other countries look like a straight line at the bottom of the graph. Let’s see the situation of the neighboring countries without Italy:
We see Serbia in the lead, with Croatia and Slovenia just behind.
If we look at the situation with Croatia, we can focus a little more on the days when the coronavirus was more active.
We can also check for Italy:
And let’s look at regressions.
Croatia, a logistic curve showing the number of confirmed cases, the red line is today’s date:
Croatia, Gaussian curve showing the growth (and decline) of new cases:
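For anyone who wants to reproduce a fit like these, here is a sketch of how a logistic curve can be fitted with scipy’s curve_fit. The data below is synthetic (generated from a known curve), not the real Croatian numbers, so the fit simply recovers the parameters we put in.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """L: final size, k: growth rate, t0: day of the inflection point."""
    return L / (1 + np.exp(-k * (t - t0)))

# Synthetic "confirmed cases", generated from known parameters.
days = np.arange(0, 60, dtype=float)
cases = logistic(days, 1200, 0.25, 30)

# Fit the curve; p0 is the initial guess for (L, k, t0).
(L, k, t0), _ = curve_fit(logistic, days, cases, p0=[1000, 0.1, 25])
print(f"L={L:.0f}, k={k:.3f}, t0={t0:.1f}")
```

A Gaussian can be fitted the same way, with the curve function swapped out.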
Now we can start the story around the mathematical modeling of infectious diseases.
In short, the models used to model epidemics are often called SIR or SEIR. There are variations on the same theme, but the principles are simple.
SIR stands for – Susceptible, Infected, Removed.
In the population we have people who are susceptible to infection and can be infected with the coronavirus. We also have those who are already infected, and those who have recovered or died from the virus. There is a sequence to these states. People can only go from susceptible to infected, and people only go from infected to removed, meaning they either recovered or died (in other words, they can’t infect anyone else). There is a lot to say and write about this, as the topic is quite rich. There are many modifications to this model, and we don’t need to go into (too much) detail to try to see what we can get from such models.
What matters is that we can place the relation of such variables (S, I, R) into formulas that can give us the values of such variables at any given moment. So we can say, “now is day 105, what are the values of the uninfected, infected, and removed” and get the values. Time is often denoted by t, so we will also denote it here: S(t), I(t), R(t).
The question is what is this function that returns the values of S, I and R, and what that function looks like. These are differential equations that define the relationships between these variables. Let’s look at how they are written, but let’s start in order. First, let’s look at how we can get S, the value of people not yet infected:

dS/dt = -β · S · I / N
We will ignore the left side, it is only a notation that denotes the differential equation. What do we have on the right? We have one parameter that we do not know, β, and immediately after that parameter we have values that we know. S, which indicates the number of uninfected (susceptible). We have I, which indicates the number of infected. And we have N. What is N? N is the total number of people in the population, and we can often see N = S + I + R. So we know everything except β, and this parameter indicates how infectious the disease is for each individual. How likely is each person to get infected with the virus?
Next, what can we see for R, the people who are removed from the population?

dR/dt = γ · I
We only see the parameter γ and I, the infected. In other words, γ is the rate at which people move from infected to removed (whether recovered or dead).
What can we see in the equation for the infected? Let’s look:

dI/dt = β · S · I / N - γ · I
Similarly, we have the other parameter here, γ, which is used with I; as we said, γ is the rate at which people move from infected to removed. The first term, β · S · I / N, is exactly what left the susceptible group, so the infected equation is simply inflow minus outflow.
And what does this tell us? Is there anything we can do with it? Of course, we can visualize different graphs and see how the situation changes as we change these parameters. Another important thing to take into consideration is the average number of days a person is infectious, which is 1/γ.
Let’s get some “real” numbers to play with. Let’s say we have a population of 10000 to 4000000 people. Let’s say we have β from 0.001 to 0.5, and say we have 1/γ (how many days people are infectious) from 1 to 30. Also, we will say that we don’t have dead and recovered people at the beginning, and that we have 5 people initially infected.
In this graph, let’s take an example of having a population of 2000000, a β of 0.5, and a 1/γ of 15, and look at what we get:
Horizontally (X-axis) are the days of the year, and vertically (Y-axis) is the population. The blue line is the number of infected, red is the number of susceptible, and green is the number of removed.
As time goes on, the values change. I hope that makes sense: the number of people who are newly infected or newly removed is not the same every day.
What’s interesting here is that the blue line representing the infected plummets around the 120th day. If we look at what else is happening around the 120th day, we can see that it is the day when the recovered (green) reach almost the whole population, and those who are not infected (red) are at zero. In other words, in a scenario like this, after 120 days the entire population was infected, and then removed.
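A scenario like this can be reproduced in a few lines. This is a minimal sketch using simple Euler steps, not the notebook’s actual code, with the same assumed parameters: N = 2000000, β = 0.5, 1/γ = 15, and 5 people initially infected.

```python
def simulate_sir(N, beta, gamma, I0=5, days=120, dt=0.1):
    """Integrate the SIR equations with plain Euler steps."""
    S, I, R = N - I0, float(I0), 0.0
    history = [(0.0, S, I, R)]
    for k in range(1, int(days / dt) + 1):
        new_infections = beta * S * I / N   # the beta * S * I / N term
        new_removals = gamma * I            # the gamma * I term
        S -= new_infections * dt
        I += (new_infections - new_removals) * dt
        R += new_removals * dt
        history.append((k * dt, S, I, R))
    return history

hist = simulate_sir(N=2_000_000, beta=0.5, gamma=1 / 15)
t, S, I, R = hist[-1]
print(f"day {t:.0f}: susceptible={S:.0f}, infected={I:.0f}, removed={R:.0f}")
```

By day 120 almost everyone has passed through the infected state, which matches the plummeting blue line.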
What happens when we change these values? Suppose that the spread rate β is twice as low (0.25):
Now we see that the spread is much slower and it takes a lot more time for people to become infected (blue line); also, the entire population is never completely removed (not everyone was infected at some point).
What can we get if, instead of 15 days of infection, we put the infection at 3 days? So suppose that β = 0.5 and that the coronavirus is infectious for only 3 days.
Let’s take a look:
This is a much better situation. About half of the population is moved from infected to removed and after 120 days the whole situation is under control. We can see that this line of infected (blue line) is much more “flat”.
On the Jupyter Notebook, you have all the content, so you can modify these values and watch as they affect the outcome.
Unfortunately, my (limited) experience with SIR models tells me that they are very volatile. Although I tried to get precise values in many different ways, these models are sensitive to small parameter changes, and we also have to pull some values from our a… air. From thin air.
What is the population number? Is it the population of one country, a province, the sum of all cities, or something else? Also, what are the initial values of the β and γ parameters?
When we try to optimize a model to reduce errors between existing data and the model we are trying to make, the starting point is extremely important.
If we start with some values that are not close to the final solution, the function cannot find the optimal parameters (does not converge).
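A small, hypothetical demonstration of this sensitivity, again with a synthetic logistic curve and scipy: the same fit succeeds or goes nowhere depending only on the starting guess p0.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    return L / (1 + np.exp(-k * (t - t0)))

days = np.arange(0, 60, dtype=float)
cases = logistic(days, 1200, 0.25, 30)  # synthetic data with a known answer

def try_fit(p0):
    """Return the sum of squared errors of the fit, or inf if it fails."""
    try:
        params, _ = curve_fit(logistic, days, cases, p0=p0, maxfev=2000)
        return float(np.sum((logistic(days, *params) - cases) ** 2))
    except RuntimeError:  # raised when the optimizer does not converge
        return float("inf")

good_err = try_fit([1000, 0.1, 25])  # start near plausible values
bad_err = try_fit([1, 100, -50])     # start far away: the fit goes nowhere
print(good_err, bad_err)
```

With the distant starting point the curve is numerically flat in two of the three parameters, so the optimizer has no gradient to follow and the error stays enormous.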
All in all, I am sure there are explanations as to why to use these or those values. But maybe these solutions can be misleading and only appear to confirm that this model is accurate. If there is not one and only one set of values, this model is prone to “fitting”.
There are several drawbacks to the model I made, one of which is that the model error is based only on infected people (a problem with this dataset: the death toll was very odd, so it was eliminated).
There are many optimizations that can be made here, but it seems to me that this model is still unstable.
Less typing, more graphing.
Let’s look at Hubei.
We need to zoom in a bit to see the red line and data points.
Let’s look at a situation where we don’t have all the information. I would not take this seriously.
SIR models are useful as simple models to show the dynamics of an epidemic and to model a scenario where we have more known variables.
They paint a really nice picture in imaginary situations of expansion, and we can visually present it to people.
Behind their simplicity lies a layer of complexity that is not for amateurs and on which, I fear, even experts can sometimes slip.
Happy learning people, enjoy!