This will be the last part of our series on the coronavirus (COVID-19). If there is interest in applications of artificial intelligence related to the coronavirus, we may return to the topic, but for now we are setting it aside.
In addition to this article, the series contains:
What we projected (and how)
In the first article, we focused on visualizations of then-available data to help people make sense of what was actually going on.
After that, we had a Eureka moment with X-ray recognition of the coronavirus, but in the end it turned out that the algorithm was only recognizing the difference between viral (atypical) and bacterial (typical, ordinary) pneumonia. Whether such technology will be applicable in healthcare remains to be seen.
In the second article, we approached projections in several ways. The first projection algorithm used linear regression to predict the number of infected.
In addition to that, we touched upon Gaussian curves and estimates without confidence intervals (yes, we served you lessons in statistics through a hot topic).
After that, I tried to get you hooked on the Jupyter Notebook, for you to try working with the data yourself.
In the third part, Kristijan explained the SIR model that could be used to estimate and predict epidemics, but only when we have clearly identified some of the factors required for a projection of virus spread, such as spread and infection factors.
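As a reminder of what such a model looks like, here is a minimal sketch of the classic SIR equations in Python. It is not the code from the third part: the population size, `beta` (transmission rate), `gamma` (recovery rate) and the simple Euler stepping are all assumptions chosen for illustration.

```python
def simulate_sir(population, infected0, beta, gamma, days, steps_per_day=10):
    """Integrate the SIR equations with a simple forward Euler scheme.

    beta  : assumed transmission rate per day (hypothetical value)
    gamma : assumed recovery rate per day (hypothetical value)
    Returns one (susceptible, infected, recovered) tuple per day.
    """
    s = float(population - infected0)
    i = float(infected0)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


# Hypothetical parameters, not estimates for any real country.
history = simulate_sir(population=1_000_000, infected0=10,
                       beta=0.3, gamma=0.1, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"Active infections peak on day {peak_day}")
```

The result depends entirely on `beta` and `gamma`, which is exactly why the model is only useful once those factors have been identified.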
Today we will return to the forecasts made on March 27, 2020 and April 8, 2020 to review how accurate the linear regression estimates were (and smuggle some more statistics inside so you don’t get bored).
First of all, I repeat that these estimates are based on the small set of data that was available at the time, and it is possible that they didn’t age well.
The red vertical line indicates the forecast date and the last day of data.
Blue dots are visualized data up to the forecast date.
The orange line is a linear regression of the data.
The blue vertical line marks the highest number of infected predicted by the linear regression.
Without further ado, the algorithm predictions from March 27, 2020:
Hubei Province, China:
Guangdong Province, China:
Henan Province, China:
Zhejiang Province, China:
Now that we’ve caught up on what we predicted, we’re going to add real data to those graphs (the algorithms didn’t have that data because they don’t see the future).
To anyone who has been following the situation with the coronavirus, these projections above look strange. We will also explain why.
We will mark the actual data with orange dots, while the rest of the graph is identical to the graphs you have already seen.
(the algorithm had access only to the data before the red date line)
For Hubei we got pretty accurate results: the predicted numbers and the actual numbers almost match. Let’s look at other provinces as well.
Here we are already seeing the problem, even on the part of the curve where we had the data. The difference between the predicted and the actual situation is visible.
We’ll make a pit stop here and explain what is happening.
When we use linear regression, we try to find a curve that describes all the data and their trend. Its goal is to turn a large set of data into something we can work with and use for predictions, especially when we are predicting something as “messy” as the number of new infections. The curve will deviate from reality, but it can still give us valuable information. For example, based on the growth so far, we can estimate how it will continue and (more importantly in this case) where it will stop.
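To make the idea concrete, here is a minimal sketch of fitting a sigmoid (logistic) curve using nothing but a straight-line regression. This article does not spell out how the original fit was done, so treat this as one possible approach under our own assumptions: for a guessed plateau, the log-odds of the data lie on a straight line, so we try many plateaus and keep the one whose line reproduces the data best. The function names, the grid of candidate plateaus and the synthetic data are all hypothetical.

```python
import math


def logistic(t, plateau, rate, midpoint):
    """Sigmoid curve: cumulative cases on day t."""
    return plateau / (1.0 + math.exp(-rate * (t - midpoint)))


def least_squares_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_logistic(days, cases, max_ratio=20.0, grid_size=400):
    """Try many candidate plateaus; for each, fit a line to the log-odds
    of the data and keep the candidate with the smallest squared error."""
    top = max(cases)
    best = None
    for step in range(grid_size):
        # Candidates run from just above the largest observation up to
        # max_ratio times it, spaced geometrically (an assumed search range).
        ratio = 1.001 * (max_ratio / 1.001) ** (step / (grid_size - 1))
        plateau = top * ratio
        log_odds = [math.log(c / (plateau - c)) for c in cases]
        rate, intercept = least_squares_line(days, log_odds)
        if rate <= 0:
            continue  # not a growing curve, skip this candidate
        midpoint = -intercept / rate
        error = sum((logistic(d, plateau, rate, midpoint) - c) ** 2
                    for d, c in zip(days, cases))
        if best is None or error < best[0]:
            best = (error, plateau, rate, midpoint)
    if best is None:
        raise ValueError("no growing sigmoid fits this data")
    return best[1:]


# Synthetic, noise-free data (NOT real case counts): a known sigmoid with
# plateau 1000, observed for 30 days, so the fit can be checked.
days = list(range(30))
cases = [logistic(d, 1000.0, 0.25, 20.0) for d in days]
plateau, rate, midpoint = fit_logistic(days, cases)
print(f"estimated plateau: {plateau:.0f}, midpoint: day {midpoint:.1f}")
```

The estimated plateau is the "where it will stop" part of the prediction. On real, noisy data the estimate is far less stable, especially before the curve has started to bend.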
Similar to Hubei, Henan Province followed the curve almost perfectly.
The province of Zhejiang is somewhere in between, because it deviates a little from the sigmoid function, but again we see a certain regularity in its movement.
Now that we know the results can vary, and we know why, let’s take a closer look at Europe.
Is it possible that the prediction is so wrong?
The algorithm’s accuracy depends on how much data there is and how closely that data follows the sigmoid function.
Note that the red vertical lines indicate the last data that was available to the algorithm. So the estimate that had data up to April 8 is much more accurate than the one made on March 27, but we’ll circle back to that after we take a look at Spain, Germany, and of course Croatia.
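"More accurate" can also be put into a number. One common choice, which is our assumption here and not something the graphs were built with, is the mean absolute percentage error between what was predicted and what actually happened after the forecast date. The figures below are made up purely to show the calculation.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual, expressed in percent."""
    if len(actual) != len(predicted):
        raise ValueError("both series must cover the same days")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up numbers for illustration only, NOT real case counts.
actual_cases = [100, 200]
early_forecast = [150, 320]   # hypothetical forecast made with less data
later_forecast = [110, 180]   # hypothetical forecast made with more data

early_error = mean_absolute_percentage_error(actual_cases, early_forecast)
later_error = mean_absolute_percentage_error(actual_cases, later_forecast)
print(f"early forecast error: {early_error:.1f}%")
print(f"later forecast error: {later_error:.1f}%")
```

Comparing two forecast dates this way only makes sense over the same set of days, which is why the function refuses series of different lengths.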
Let’s go in order, starting with Spain.
This is another interesting situation. The first estimate exceeded the real situation because Spain had rapid growth in the number of infected, but the curve was corrected by April 8 as new information became available to the algorithm. As always, the more information the algorithm has, the more accurate its predictions can be.
(We can also notice a problem in data collection: the total number of confirmed cases, a value that should never fall, fell.)
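A check like this is easy to automate before fitting anything. The sketch below flags every day on which a cumulative total went down, and shows one possible repair (carrying the previous maximum forward). The function names and the sample numbers are hypothetical, and whether to repair, drop or keep such points is a judgment call that we do not claim was made in our graphs.

```python
def find_drops(cumulative):
    """Return (index, previous, current) for every day the total decreased."""
    return [
        (i, cumulative[i - 1], cumulative[i])
        for i in range(1, len(cumulative))
        if cumulative[i] < cumulative[i - 1]
    ]


def enforce_non_decreasing(cumulative):
    """One possible repair: never let the running total go below its maximum."""
    cleaned = []
    running_max = float("-inf")
    for value in cumulative:
        running_max = max(running_max, value)
        cleaned.append(running_max)
    return cleaned


# Made-up series for illustration only, NOT real case counts.
reported = [100, 150, 140, 180]
print(find_drops(reported))
print(enforce_non_decreasing(reported))
```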
In Germany, we encountered a situation similar to the one in Italy: the rapid growth of new cases continued after an apparent slowdown, which the algorithm had interpreted as the beginning of the end of the sigmoid function.
Croatia is, in this case, in a similar position to Germany.
Data is important!
I know it’s a lot to digest, especially since some estimates fit and some don’t. I will try to simplify it as much as possible. The goal is not to write a scientific paper but to explain a small part of the way we anticipate trends.
The goal of such a projection (linear regression) is to reduce the data set to a curve. Linear regression predicts what the end of a curve will look like depending on its beginning.
Such an approach is not perfect: it gives good results, but they are sensitive to sudden jumps or changes in the data.
As an example, we will take the data for Germany and make a linear regression for each day in the period from March 27, 2020 (when we made the first regressions) to May 6, 2020 (when the data for this article was downloaded).
Then we will make an animation from it so you can see the daily changes.
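The day-by-day refitting can be sketched as a simple loop: for each cutoff date, keep only the data the algorithm could have seen, fit, and store the result as one frame of the animation. This is a minimal sketch with assumptions: the series is a synthetic placeholder, and a plain straight-line fit stands in for whatever curve fit is actually used, since any function taking `(xs, ys)` can be passed as `fit`.

```python
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def expanding_refits(dates, values, first_cutoff, last_cutoff, fit):
    """Refit once per day, each time using only data up to that day."""
    frames = []
    cutoff = first_cutoff
    while cutoff <= last_cutoff:
        visible = [(d, v) for d, v in zip(dates, values) if d <= cutoff]
        xs = [(d - dates[0]).days for d, _ in visible]
        ys = [v for _, v in visible]
        frames.append((cutoff, fit(xs, ys)))
        cutoff += timedelta(days=1)
    return frames


# Synthetic placeholder series (NOT real German data): one value per day
# from March 1 to May 6, 2020.
start = date(2020, 3, 1)
dates = [start + timedelta(days=i) for i in range(67)]
values = [i * i for i in range(67)]

frames = expanding_refits(dates, values,
                          date(2020, 3, 27), date(2020, 5, 6), fit_line)
print(len(frames), "frames, first on", frames[0][0], "and last on", frames[-1][0])
```

Each entry in `frames` holds the cutoff date and the fitted parameters for that day, so drawing them one after another with any plotting library gives the animation.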
Within a few days, the linear regression “catches up” with the actual data and can describe everything that happened with the number of infected much more accurately.
Such a function can be used in the future to better predict and curb some new world pandemics if they ever occur.
As we are nearing the end of this article, let’s make another set of regressions with “fresh” data.
Experience is important!
We have warned about this kind of development repeatedly in our past articles. We didn’t want to falsely present the predictions as completely valid. They are, after all, based on a limited (small/short) set of historical data.
One should be careful with any statement related to a pandemic, and a lot of checks and tests are needed before something can be determined with certainty.
Because we work with data and projections, we must always be clear about how accurate the results are.
Only when there is enough baseline data can we reliably project something into the future. Anything that is not based on enough data is just speculation.
It’s time to say goodbye to coronavirus analysis and dedicate our time to some new projects.
We hope that you are slowly returning to your normal life after the quarantine.
Be careful how you predict future events, be objective, and make sure the data is solid!
For any questions or if you need to analyze a big dataset, feel free to contact us!