Visualization of COVID-19 (Coronavirus) in Croatia and the World, Part Two

Hello everyone.

In my penultimate article, I wrote and explained something about visualizing coronavirus data.

This is a continuation of that post. But first, a small digression.

X-ray detection of coronavirus

In the meantime, I have published an article describing the potential for the use of x-rays for detecting COVID-19 in patients.

There were both good and bad reactions to that article, although I was very careful with the results and tried to make my doubts about them clear.

It seems to me that people do not read the (whole?) article, but cherry-pick parts and then comment on those parts out of context. In the meantime, similar ideas began to appear on other portals.

Among the positive reactions, I would like to point out one in particular: a radiologist called me via Skype and explained the medical side of the results to me. She also expressed her doubts that the results are being correctly interpreted. We also discussed the article that shares her opinion on the usage of AI to fight coronavirus.

I have to admit that I did not understand everything she said without Googling some terms. My conclusion is that the neural network learned to differentiate viral and bacterial pneumonia. That was what we needed: an expert to clarify for us what we are seeing in the x-rays and what the realistic result of our model is.

I would like to thank her once again, but also allow her to remain anonymous.

After such an evaluation by an expert, we stopped promoting the article. Even so, as we emphasized, we never promoted accurate diagnostics or medicine. We promoted an interesting application of technology.

I would still like to meet with a radiologist and go through the x-rays one more time. After that, I can explain to myself (and you) what the neural network found in the pictures.

We sent emails all over the place asking people to contact us and got almost no response. The situation is a little tricky and people don’t have the time now.

Introduction

My name is Kristijan Saric and I’m a programmer. I’m not an expert on viruses, I’m not an epidemiologist. I don’t consider myself an expert in anything, but I’m probably, based on some categories, considered an “IT expert” (whatever that is).

Nothing I write and show here is the opinion of an expert on the coronavirus. I am attempting to explain to myself (and share with you) what is happening and how quickly the virus spreads.

People who are adamant that we should only let recognized experts deal with this situation are missing a key problem in that approach.

The media is not perfect: it is sometimes biased and clickbaity, and it sometimes conveys unverified information.

Some less-informed people spread stories along the lines of: if you put cabbage on your left elbow, you can’t contract anything. The only problem is, you have to walk around with cabbage on your left elbow.

This is an attempt to explain the situation based on data and facts. I cannot emphasize enough how important unbiased analysis of data is. There is an amount of personal interpretation, but the focus is on the graphs and the data behind them.

I invite everyone interested to get involved! Use the data and these graphs as a source of information. Come up with (interesting) ideas so we can improve this. I’m sure some people can do this a lot better, and they are welcome to use it as a starting point.

Everyone can run the things I present and they will get the same results. If you find any mistakes, feel free to let me know.

As promised last time, I have made all the content open and free to everyone.

In the first blog post, the focus was on data visualization. In this post, I will clarify how this growth is progressing and what trend is behind the data. I’ll make a simple regression where we can see how the virus itself is progressing. That will enable us to guess how it will continue to progress.

ATTENTION! This version is very imprecise and not a true prediction.

Keep in mind that the data we are using is a bit older, but is also more accurate for that reason.

Did I forget to point out that this is a projection, not the truth?

Data used

You can see the information we used here.

This information is from the “Johns Hopkins University Center for Systems Science and Engineering”. “Johns Hopkins CSSE” for short. Information is updated daily.

In the latest version, they changed the data format and file names so that the last data we used was from 03/27/2020. The visualization notebook used older data from 03/25/2020. More information here. When data is updated, you can refresh the data and get new visualizations.
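If you want to refresh the data yourself, it helps to reshape it first. The sketch below assumes the wide layout I believe the Johns Hopkins CSSE time-series files use (one row per region, one column per date, with the columns "Province/State", "Country/Region", "Lat" and "Long" in front). The column names are an assumption, and the numbers in the sample are made up purely for illustration, so check both against the file you actually download.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the ASSUMED wide layout: one row per region, one column per date.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Exampleland,0.0,0.0,1,3,7
North,Otherland,1.0,1.0,0,2,2
"""

# Columns that describe the region rather than a date (assumed names).
META_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(text):
    """Turn the wide table into (country, province, date, confirmed) records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column in META_COLUMNS:
                continue
            day = datetime.strptime(column, "%m/%d/%y").date()
            records.append(
                (row["Country/Region"], row["Province/State"], day, int(value))
            )
    return records


records = wide_to_long(SAMPLE)
for record in records[:3]:
    print(record)
```

With the data in this long form, picking one country or province and plotting it over time becomes a simple filter.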

The situation in Croatia and its surroundings

Confirmed cases in Croatia:

Confirmed cases in Croatia and the countries it borders:

We can see that Italy is far ahead and, compared to it, all other countries look like a flat line at the bottom of the graph. Let’s see the situation of all neighboring countries without Italy:

We see that Slovenia has the most cases, with Croatia and Serbia somewhere behind.

Let’s see the situation in Croatia at the time the movement restriction was introduced:

Here we see in red the day the restriction was declared (19/03/2020). Let’s look at the situation in Italy and when they declared quarantine:

Let’s look at the situation in Croatia again, this time focusing on the days when the coronavirus was more active in our country:

We can also look at Italy:

Aside from the numbers being much larger, we can see that the curve is going in the same direction. That’s interesting. To be able to think about how this situation will progress, it would be useful to have more information.

Let’s focus on China, where we have the most data. Some provinces are well on their way to preventing further spreading.

The situation in China

The situation in the provinces in China is as follows:

We see Hubei has the most cases and it might be useful to look at a graph without that province:

It would be nice to be able to show this situation on the map so it would be a little clearer. Here is the situation across Asia:

We can also focus a little more on China and look at the situation there:

Focus on the change in color rather than circle size. We won’t put it here, but there is a nice animation in the notebook showing how the number of confirmed cases increases.

If we want to look at how the virus is progressing, we need to focus on individual provinces. We can look at what is and what was happening there. Let’s look at the four provinces with the most cases.

Chinese provinces

The provinces we will look at are:

  1. Hubei
  2. Guangdong
  3. Henan
  4. Zhejiang

Let’s look at the situation in Hubei province:

The situation in Guangdong Province:

The situation in Henan Province:

The situation in Zhejiang Province:

Although the places and the numbers of confirmed cases differ, the pictures look very similar. There is a curve that looks like the letter S on each graph. Why?

That’s much better explained in this video and I don’t want to go into too much detail. I can try to explain it in a relatively simple way.

Let’s take one of these graphs.

On the Y-axis (up and down) we have the number of confirmed cases. On the X-axis (left and right) we have the day of the year. So 01/01/2020 is the first day of the year, 15/01/2020 is the fifteenth day of the year, and so on.
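As a small illustration of that X-axis, here is one way to turn a calendar date into its day of the year with the Python standard library. The helper name `day_of_year` is mine, not something taken from the notebook.

```python
from datetime import date


def day_of_year(d):
    """Return 1 for January 1st, 2 for January 2nd, and so on."""
    return d.timetuple().tm_yday


print(day_of_year(date(2020, 1, 1)))   # 1
print(day_of_year(date(2020, 1, 15)))  # 15
print(day_of_year(date(2020, 3, 27)))  # 87 (2020 is a leap year: 31 + 29 + 27)
```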

Every day we have a number of confirmed cases, and that number only goes up. The curve never drops, since a confirmed case is never removed from the count: when a case is confirmed, the graph jumps up and stays up. The problem starts when the number of new cases itself grows from day to day. If that happens, we can call it exponential growth, and we have a lot more new cases every day than the day before.

What would be nice to see is a flat line for a long time, so we can say: we have no new cases and the situation is under control.

We can see that the number of confirmed cases grows quickly up to about day 43. After that, we see a noticeable drop in growth following the quarantine.

There is one curve/function that nicely describes this growth. This function is the logistic function.
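For readers who like to see it written down, the logistic function is usually given as f(x) = L / (1 + e^(-k (x - x0))), where L is the plateau the curve levels off at, k controls how steep the rise is, and x0 is the day on which the curve reaches half of the plateau. A minimal sketch in Python follows; the parameter names and the example values are mine and purely illustrative.

```python
import math


def logistic(day, plateau, rate, midpoint):
    """S-shaped curve: near 0 early on, plateau / 2 at the midpoint, near plateau later."""
    return plateau / (1.0 + math.exp(-rate * (day - midpoint)))


# Hypothetical parameters, chosen only to show the shape of the curve.
for day in (0, 30, 40, 50, 100):
    print(day, round(logistic(day, plateau=1000.0, rate=0.25, midpoint=40.0), 1))
```

The slow start, the steep middle and the flattening at the end are exactly the three parts of the letter S seen on the province graphs.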

If we try to fit it into this image, it will look like this:

In orange, we see the fitted curve that attempts to describe the data we have, along with the current projection of the number of infected.

Assuming the situation remains stable, we can expect the number of infected to level off (blue line on the graph) somewhere below 70,000 people. Keep in mind that we are only looking at Hubei Province.

The red line marks today, and the blue dots on the graph are the true data we have for the confirmed cases.
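To give a feeling for what "fitting" such a curve means, here is a rough sketch that simply tries many parameter combinations and keeps the one with the smallest squared error. This is a brute-force stand-in, not necessarily the routine used in the notebook, and the data points are synthetic (generated from a known curve), not real case counts. All names and parameter ranges are assumptions made for the example.

```python
import math


def logistic(day, plateau, rate, midpoint):
    return plateau / (1.0 + math.exp(-rate * (day - midpoint)))


def sum_squared_error(days, cases, params):
    """How far the curve with these parameters is from the observed points."""
    return sum((logistic(d, *params) - c) ** 2 for d, c in zip(days, cases))


def grid_search_fit(days, cases, plateaus, rates, midpoints):
    """Try every combination on the grid and keep the best one."""
    best_params, best_error = None, float("inf")
    for plateau in plateaus:
        for rate in rates:
            for midpoint in midpoints:
                params = (plateau, rate, midpoint)
                error = sum_squared_error(days, cases, params)
                if error < best_error:
                    best_params, best_error = params, error
    return best_params, best_error


# Synthetic observations generated from a known curve (NOT real data).
true_params = (1000.0, 0.25, 40.0)
days = list(range(20, 61))
cases = [logistic(d, *true_params) for d in days]

best_params, best_error = grid_search_fit(
    days,
    cases,
    plateaus=[800.0 + 50.0 * i for i in range(9)],  # 800 .. 1200
    rates=[i / 20 for i in range(1, 11)],           # 0.05 .. 0.50
    midpoints=[30.0 + i for i in range(21)],        # 30 .. 50
)
print(best_params, best_error)
```

The first value of the best combination is the plateau, which plays the role of the estimated maximum number of confirmed cases. With real, noisy data the error will not be zero, and a proper optimizer would normally replace the grid.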

Let’s try to make it for the other provinces. Let’s look at Guangdong province:

We see the curve again in orange. The red line shows the current day, and the blue line shows the estimated maximum number of confirmed cases.

But let’s pay attention to the data we have. The curve was almost flat by the 70th day, but it began to rise again shortly before the 80th day.

The number of confirmed cases under this function should not exceed 1400, and today we have a number greater than that. Why? Because this function is an approximation, not a prediction of the future.

But let’s look at the situation further, Henan Province:

Here the simple regression looks fine again.

Also, let’s look at Zhejiang Province:

Here again, we see that this simple regression failed to capture/approximate the number of confirmed cases. In other words, this is a very rough way to evaluate something.

Now, we can try to do the same for other countries, but this will be even less accurate.

Italy

Confirmed cases:

Regression:

Spain

Confirmed cases:

Regression:

Germany

Confirmed cases:

Regression:

Number of new cases

To make the situation a little easier to read, we can change the view. Instead of showing how many cases we have confirmed in total, we can look at how many new cases occurred each day.

Then we get (hopefully) something that looks like a normal (Gaussian) curve. There will be few new cases at the beginning, and the number will grow until we reach the peak. Then it will decline until the number of new confirmed cases drops to zero.
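Going from the total number of confirmed cases to the number of new cases per day is just a matter of subtracting each day from the next one. A minimal sketch, with made-up numbers used only for illustration:

```python
def daily_new_cases(cumulative):
    """Difference between each day's total and the previous day's total."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


# Made-up cumulative totals (NOT real data).
cumulative = [1, 3, 7, 15, 20, 22, 22]
print(daily_new_cases(cumulative))  # [2, 4, 8, 5, 2, 0]
```

Note that the result is one element shorter than the input, because the first day has no previous day to compare with.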

What is the normal curve?

This is a function that looks like a bell:
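Written as code, a bell-shaped curve can be sketched like this. The parameter names (`peak_height`, `peak_day`, `width`) are my own labels: the first is the number of new cases on the worst day, the second is when that day happens, and the third says how wide the bell is.

```python
import math


def bell_curve(day, peak_height, peak_day, width):
    """Gaussian-shaped curve: highest at peak_day, symmetric on both sides."""
    return peak_height * math.exp(-((day - peak_day) ** 2) / (2.0 * width ** 2))


# Hypothetical parameters, chosen only to show the shape.
for day in (25, 35, 40, 45, 55):
    print(day, round(bell_curve(day, peak_height=100.0, peak_day=40.0, width=5.0), 1))
```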

Let’s look at the situation in Henan, China (now looking at the number of new confirmed cases):

So if you tried to approximate this with a normal function, you would get something like this:

Something we could try, though I can only mention it as an idea I wanted to explore when I had more time, is a “Gaussian process”. What is interesting about Gaussian processes is that they can give us confidence intervals. They are a very simple approximation of functions that does not rest on (complicated) models.
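One simple way to get such an approximation, and not necessarily the one used in the notebook, is to treat the daily new cases as weights: the weighted average of the days gives the peak day, the weighted spread gives the width, and the total number of cases fixes the height. The sketch below uses synthetic bell-shaped data (NOT real counts) so that we know what the answer should be; all names are assumptions made for the example.

```python
import math


def fit_bell_by_moments(days, new_cases):
    """Estimate (peak_height, peak_day, width) of a bell curve from daily counts."""
    total = sum(new_cases)
    peak_day = sum(d * n for d, n in zip(days, new_cases)) / total
    variance = sum(n * (d - peak_day) ** 2 for d, n in zip(days, new_cases)) / total
    width = math.sqrt(variance)
    # The area under a bell curve is height * width * sqrt(2 * pi).
    peak_height = total / (width * math.sqrt(2.0 * math.pi))
    return peak_height, peak_day, width


# Synthetic daily counts generated from a known bell (height 100, peak day 40, width 5).
days = list(range(0, 81))
new_cases = [100.0 * math.exp(-((d - 40.0) ** 2) / (2.0 * 5.0 ** 2)) for d in days]

peak_height, peak_day, width = fit_bell_by_moments(days, new_cases)
print(round(peak_height, 2), round(peak_day, 2), round(width, 2))
```

This only works reasonably when the whole bell is already visible in the data. While an outbreak is still growing, the estimate of the peak will be too early and too low.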

Confidence intervals and model-free estimation

Currently we can show confidence intervals. Let’s look again at Henan province. This is the data itself:

This is how the estimate of the future looks, and what will happen to the number of new cases. What matters to us is the confidence interval, which spans two standard deviations on each side of the estimate:
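To make the "two standard deviations" concrete: if a model gives us an estimate and a standard deviation for each day, the band is simply the estimate minus and plus twice the standard deviation. Under a normal distribution such a band covers roughly 95% of the probability. The sketch below uses hypothetical numbers (they are not output from the notebook) and only the Python standard library.

```python
from statistics import NormalDist


def two_sigma_band(means, stds):
    """Lower and upper bound for each day: mean - 2 * std and mean + 2 * std."""
    return [(m - 2.0 * s, m + 2.0 * s) for m, s in zip(means, stds)]


# Hypothetical predictions: estimated new cases and their standard deviations.
means = [120.0, 80.0, 0.0]
stds = [10.0, 25.0, 1500.0]
print(two_sigma_band(means, stds))

# Share of a normal distribution that lies within two standard deviations.
coverage = NormalDist().cdf(2.0) - NormalDist().cdf(-2.0)
print(round(coverage, 3))  # about 0.954
```

Notice that a large standard deviation produces a band that reaches below zero, which makes no sense for a number of new cases and is a sign that the estimate is not very informative.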

We can try to evaluate this for other countries as well. Let’s take Germany as an example:

We see that the predicted function is not the most accurate. It predicts that the number of new cases is around zero, with an interval between -3000 and 3000.

Let’s have a look at Spain:

This is not surprising when we consider that such a function does not know or understand anything about the virus or epidemics. In general – it takes a bunch of data and tries to create a function based on that data.

What is needed is a (simple?) model, but building one takes a lot of time and expertise.

It’s not something simple or fast, and with this, I have already used up the time I allotted for this visualization.

But the package I included, PyMC3, would be very useful here, especially since it could handle the probabilities in the model.

Conclusion

Based on simple regressions, we can get tentative numbers of the infected. Then we can, though very tentatively, estimate what may happen in the future.

The algorithms for describing the data used here are not complicated, new, or the best. The focus of this was simplicity above all: is there an easy way to explain how this virus spreads?

More advanced algorithms can probably lead to greater accuracy, but then we would lose the ability to interpret these same results.