In a recent blog post, Kristijan addressed the visualization and projection of Covid-19 data. As promised, the entire project is available here. You're stuck with me today, and we will go step by step through how to get results like these, but with a focus on Croatia.
I have to mention that Croatia is a small country with few infections compared to the rest of the world, and that in general projections become more accurate the more data we have. However, Croatia is a good example because we are here and we monitor the situation day by day.
We need two things for any kind of data analysis: the data, and the software we will use to process it.
In this case, we pull the data directly from GitHub, from Johns Hopkins University, which has been prominent in data collection since the start of the coronavirus crisis.
The software we will use is called Anaconda.
Step 1 - Anaconda
Step 2 - Environment
When you install Anaconda on Windows, you will be greeted with a homepage.
You will create the environment in the “Environments” tab.
With a fresh installation, you will only have the base (root) environment, so we will start from it to set up our own environment and install whatever we need. Start it by clicking ► and “Open terminal”.
We set the environment with the following commands.
(Enter them one by one and confirm each prompt, so that everything you need gets installed.)
conda create -n covid-analysis python=3.7
conda activate covid-analysis
Install JupyterLab, to work and present in a web browser:
conda install -c conda-forge jupyterlab
For data manipulation:
conda install numpy pandas
For fancy graphs:
conda install plotly seaborn
conda install pymc3
When that’s all set, we have our environment to work with.
What else do we need? Yes, data.
You will download it by typing in the terminal:
git clone https://github.com/CSSEGISandData/COVID-19.git
That’s it, now we have the data, so let’s get to work.
Note: If git is not working properly, you can find the solution here.
conda install -c anaconda git
Then try restarting the data download.
Step 3 - Jupyter Notebook
Feel free to close the terminal and you’re back to the Environments tab. Now you have a new environment called covid-analysis. Click on ► and “Open with Jupyter Notebook”. This will open one terminal (no need to touch it) and a window in your browser where you will see the files on your system.
If you followed the instructions correctly, you will see a folder called “COVID-19” that you can enter and where we will work. This is where all the information we downloaded is located. Within this folder we can open a new notebook under Python 3, in which we will analyze and visualize everything. I called that notebook “COVID-19 Tutorial Notebook”, and you’ll see it in the screenshots.
You will be greeted with an empty notebook, and now the real work starts.
Remember those things we installed? Now we will declare that we are going to use them.
#import and alias
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
Along the way, we give them aliases so that we don’t have to spell out ‘pandas’ every time. From now on, ‘pd’ means ‘pandas’, and Jupyter Notebook treats the two the same. The same goes for the others. These are the standard aliases that almost everyone uses!
pandas, numpy, plotly.express, plotly.graph_objects, matplotlib.pyplot
We will also set some starting points for them, such as default values and formats. (Whoever reads this and does this for a living is laughing at me now, but the goal is to explain it to non-programmers who are willing to learn.)
#set the seed – numpy
#style sheet – matplotlib.pyplot
plt.rcParams['figure.figsize'] = [12.0, 6.0]
plt.rcParams['figure.dpi'] = 80
#context (style) – seaborn
(When # is at the start of a line, that line is a comment: the program ignores it, and it is written for the people reading the code.)
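For readers wondering what “setting the seed” means, here is a minimal sketch. The value 42 is an arbitrary assumption for illustration, not the value used in the original notebook.

```python
import numpy as np

# Assumption: 42 is an arbitrary example seed, not the original notebook's value.
np.random.seed(42)
first_draw = np.random.rand(3)

# Re-seeding with the same value restarts the random sequence,
# so the same "random" numbers come out again.
np.random.seed(42)
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))
```

A fixed seed makes any step that relies on random numbers repeatable from one run to the next.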
Next, we need to tell the program where the data we are working with is located. You can open these files yourself to see what is in them, but this is what loading the data looks like.
(data set name) = (pandas).(read CSV file)(file location)
confirmed_df = pd.read_csv('./csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths_df = pd.read_csv('./csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered_df = pd.read_csv('./csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
This gives us 3 sets of data, which are the same as the tables we downloaded from GitHub.
You can run all these pieces of code by clicking ►∣ next to each segment, or through the menu (Cell, run all).
We can now review what we have loaded. We can type “confirmed_df” into the new cell and that will give us everything inside.
We can do the same with deaths_df, recovered_df.
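If you want to try loading and inspecting a table without the downloaded files, here is a small sketch that reads CSV text from memory instead of from disk. The two rows are made-up illustrative values in the same column layout, not real figures.

```python
import io
import pandas as pd

# Made-up two-row extract in the same column layout as the time series files.
csv_text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Croatia,45.1,15.2,0,0
Hubei,China,30.97,112.27,444,444
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.head())  # the first rows of the table
```

Note that the empty Province/State field for Croatia is read as a missing value (NaN).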
Ok, we have lots of data in tables that we don’t see completely. BUT WE HAVE DATA!
This is a good place for a coffee break.
The next thing to notice is that the data is currently in this form:
“Province” if any, Country, Latitude, Longitude, number of cases on 1/22/2020, number of cases on 1/23/2020, number of cases on 1/24/2020, etc.
We could, however, transform such data into something that will be easier to visualize. For example:
Croatia, 45.1,15.2, 1/22/20, 0
That is, put each date in a separate row in the data.
We do this with a function that already exists in pandas:
(new data name) = (what data we use).(pandas melt)(what we leave, what we are splitting on, what we call the produced variable)
How to write it?
confirmed_data_df = confirmed_df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name="Confirmed")
death_data_df = deaths_df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name="Deaths")
recovered_data_df = recovered_df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name="Date", value_name="Recovered")
And now we have 3 new data sets that are easier to read.
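To see what melt actually does, here is a minimal sketch on a tiny made-up table (the values are illustrative only). Every date column becomes its own row, while the identifying columns are repeated.

```python
import pandas as pd

# Made-up miniature of the wide layout: one column per date.
wide_df = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Croatia", "Slovenia"],
    "Lat": [45.1, 46.15],
    "Long": [15.2, 15.0],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
})

# id_vars stay as they are; the remaining column names go into "Date"
# and their values go into "Confirmed".
long_df = wide_df.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)

print(long_df)
```

Two countries times two dates gives four rows, one per country and date.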
Ok, now we have lists by date, even longer tables we don’t see! BUT WE HAVE SOMETHING!
Now, we need to check whether any data is missing somewhere, because that would really ruin our day.
In practice, the question is: is anything missing in the “Confirmed” column of confirmed_data_df?
Likewise for the other two data sets.
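One way such a check could look is sketched below, on a tiny made-up table that deliberately contains one gap. The table and variable names are illustrative assumptions, not the notebook's own.

```python
import numpy as np
import pandas as pd

# Made-up sample with one deliberately missing value.
sample_df = pd.DataFrame({
    "Country/Region": ["Croatia", "Croatia", "Croatia"],
    "Date": ["1/22/20", "1/23/20", "1/24/20"],
    "Confirmed": [0.0, np.nan, 1.0],
})

# isna() marks every missing cell as True.
has_missing = sample_df["Confirmed"].isna().any()    # is anything missing at all?
missing_count = sample_df["Confirmed"].isna().sum()  # how many cells are missing?

print(has_missing, missing_count)
```

On a column with no gaps, the first value would be False and the second would be 0.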
It says nothing is missing! AWESOME!
We are working with 3 data sets now: confirmed cases (confirmed_data_df), deaths (death_data_df), and recovered cases (recovered_data_df). We could put them all together, to work on only one set of data.
all_data_df = pd.concat([confirmed_data_df, death_data_df['Deaths'], recovered_data_df['Recovered']], axis=1).reset_index().drop(['index'], axis=1)
(name of new set) = (pandas concat)(origin table, what we connect 1 (deaths), what we connect 2 (recovered cases))
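A small sketch of how this kind of side-by-side concat behaves, using made-up tables. With axis=1, pandas lines rows up by their index (row number), not by country, so a table with fewer rows leaves gaps.

```python
import pandas as pd

# Made-up tables; "recovered" deliberately has one row fewer.
confirmed = pd.DataFrame({"Country/Region": ["A", "B", "C"], "Confirmed": [5, 3, 1]})
deaths = pd.DataFrame({"Country/Region": ["A", "B", "C"], "Deaths": [1, 0, 0]})
recovered = pd.DataFrame({"Country/Region": ["A", "B"], "Recovered": [2, 1]})

# axis=1 places the pieces next to each other, matching rows by index.
combined = pd.concat([confirmed, deaths["Deaths"], recovered["Recovered"]], axis=1)

print(combined)
```

The last row has no matching row in the shorter table, so its Recovered value comes out as NaN.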
Ok let’s see what we got.
Ok, now we have a seriously big table.
Let’s check for missing data again, this time for each column separately.
A LOT IS MISSING!
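A per-column check could be sketched like this, again on a tiny made-up table. The names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample: provinces and one recovered value are missing.
sample_df = pd.DataFrame({
    "Province/State": [np.nan, np.nan, "Hubei"],
    "Country/Region": ["Croatia", "Croatia", "China"],
    "Confirmed": [0, 1, 444],
    "Recovered": [np.nan, 0.0, 28.0],
})

# isna().sum() on the whole table counts missing cells per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# head(n) shows the first n rows, useful for eyeballing what is missing.
print(sample_df.head(2))
```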
We still have to think a little bit here and see what is missing…
We will display the first X entries in the table:
Provinces are missing for countries that are tracked as a whole rather than by province, and that’s OK.
Some of the case counts are missing as well.
These are the days when this value was 0 and therefore not recorded.
We will replace the missing values in Deaths, Recovered, and Confirmed with zeros. The provinces don’t matter, because we know that some countries are tracked by province and some as a whole. We can get a list of the latter with:
In the table, where Province/State is blank, list the unique Country/Region values.
Now to replace the missing values with zeros.
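The lookup described above could be sketched as follows, on a made-up table. This is one way to write it, not necessarily the line used in the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample: Croatia has no province, China does.
sample_df = pd.DataFrame({
    "Province/State": [np.nan, "Hubei", np.nan, "Beijing"],
    "Country/Region": ["Croatia", "China", "Croatia", "China"],
})

# Keep only rows with a blank province, then list each country once.
no_province = sample_df[sample_df["Province/State"].isna()]["Country/Region"].unique()

print(no_province)
```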
all_data_df[["Deaths", "Recovered", "Confirmed"]] = all_data_df[["Deaths", "Recovered", "Confirmed"]].apply(lambda row: row.fillna(0))
(what we’re working on) = (in which table) (in what columns) (apply) (if cell is empty, enter zero)
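Here is the same idea on a tiny made-up table, as a sketch. It calls fillna directly on the selected columns, which is a shorter way to get the same zeros; the province column is left untouched.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in the count columns and in the province column.
counts_df = pd.DataFrame({
    "Province/State": [np.nan, np.nan, np.nan],
    "Country/Region": ["Croatia", "Croatia", "Croatia"],
    "Confirmed": [0.0, 1.0, np.nan],
    "Deaths": [np.nan, 0.0, 0.0],
    "Recovered": [np.nan, np.nan, 0.0],
})

count_columns = ["Deaths", "Recovered", "Confirmed"]

# Replace missing values with 0, but only in the three count columns.
counts_df[count_columns] = counts_df[count_columns].fillna(0)

print(counts_df)
```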
One more thing: the US date format is a little awkward for us to use, so we are going to convert it.
all_data_df.loc[:, "Date"] = all_data_df["Date"].apply(lambda s: pd.to_datetime(s).date())
We can also record the last day for which we have data, so that graphs and everything else run up to that date.
latest_date = all_data_df["Date"].max()
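As a sketch of the date conversion on a few sample strings: this variant passes an explicit format to pd.to_datetime, which is an assumption about the month/day/two-digit-year layout of the column headers rather than the article's exact line.

```python
import datetime
import pandas as pd

# Sample date strings in US month/day/year order.
raw_dates = pd.Series(["1/22/20", "3/17/20", "2/4/20"])

# Parse the whole column at once; the format spells out month/day/2-digit year.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

# Keep only the calendar date (no time of day).
dates = parsed.dt.date

# The most recent day present in the data.
latest_date = parsed.max().date()

print(dates.tolist())
print(latest_date)
```

Parsing the column in one call is usually faster than converting row by row with apply.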
You can already apply this and list the results yourself. Ok that’s it, the data is cleaner now and we can work with it.
Now let's look at Croatia
The new data set is called cro_data_df and contains all the data where country is Croatia, sorted by date.
cro_data_df = all_data_df[(all_data_df["Country/Region"] == "Croatia")].sort_values('Date')
Ok that’s useful BUT it might be more useful to look only at dates when we have cases.
cro_data_df_conf = all_data_df[(all_data_df["Country/Region"] == "Croatia") & (all_data_df["Confirmed"] > 0)].sort_values('Date')
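How this kind of filtering works can be sketched on a small made-up table. Each condition produces a column of True/False values, and & keeps only the rows where both are True. The added .copy() is a suggestion of mine, not part of the original line.

```python
import pandas as pd

# Made-up sample rows, deliberately out of date order.
sample_df = pd.DataFrame({
    "Country/Region": ["Croatia", "Italy", "Croatia", "Croatia"],
    "Date": ["2020-02-26", "2020-02-25", "2020-02-24", "2020-02-25"],
    "Confirmed": [3, 322, 0, 1],
})

is_croatia = sample_df["Country/Region"] == "Croatia"
has_cases = sample_df["Confirmed"] > 0

# .copy() makes an independent table, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
cro_cases = sample_df[is_croatia & has_cases].sort_values("Date").copy()

print(cro_cases)
```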
Ok that makes sense now. We have the data. Let’s plot that now.
grid = sns.lineplot(data=cro_data_df_conf[(cro_data_df_conf["Confirmed"] > 0)], x="Date", y="Confirmed")
grid = sns.lineplot(data=cro_data_df_conf[(cro_data_df_conf["Confirmed"] > 0)], x="Date", y="Deaths")
grid = sns.lineplot(data=cro_data_df_conf[(cro_data_df_conf["Confirmed"] > 0)], x="Date", y="Recovered")
Ok it doesn’t look very good, it shrunk some dates with the same values …
We also have an error in the data for recovered cases, because on 3/17/2020 we did not have 4444 recovered. The error appeared when the university changed the data format, which introduced some mistakes. But again, we don’t have enough recovered cases to do anything with this data, so we’ll ignore it for the moment.
As requested, everything we applied to China and Europe in the previous article will now be applied to Croatia. We are moving on to the more complicated things, so if you need another coffee, now is the right time.
Before we go to the graphs themselves, we need to get a few more things done. For starters, we should turn dates into days of the year. Thus 15.1. becomes the 15th day of the year, while 4.2. becomes the 35th day of the year.
cro_data_df_conf['DateWeek'] = cro_data_df_conf['Date'].apply(lambda date_row: date_row.isocalendar())
cro_data_df_conf['DayOfYear'] = cro_data_df_conf['Date'].apply(lambda date_row: date_row.timetuple().tm_yday)
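The two example dates from the text can be checked with a short sketch that uses the same standard-library calls on a standalone pair of dates.

```python
import datetime
import pandas as pd

# The two example dates: 15 January and 4 February 2020.
dates = pd.Series([datetime.date(2020, 1, 15), datetime.date(2020, 2, 4)])

# tm_yday counts days from 1 January (which is day 1).
day_of_year = dates.apply(lambda d: d.timetuple().tm_yday)

# isocalendar() returns (ISO year, ISO week number, weekday); index 1 is the week.
iso_week = dates.apply(lambda d: d.isocalendar()[1])

print(day_of_year.tolist())
print(iso_week.tolist())
```

4 February is day 35 because January contributes 31 days and February adds 4 more.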
As we get into complex views, not every line will be explained, but you will be able to run the views yourself.
The next thing we need is the logistic regression functions, for which you can download the code from the repository or from the HTML files at the end of the article.
Now we can show the number of confirmed cases in Croatia like this:
ax = sns.barplot(data=cro_data_df_conf, x="DayOfYear", y="Confirmed", ci="sd", palette="Oranges_d")
ax.set_title('Croatia confirmed cases')
And make a projection:
fit_data_df = cro_data_df_conf
x_data = fit_data_df[‘DayOfYear’].values
y_data = fit_data_df[‘Confirmed’].values
show_logistic_regression(x_data, y_data, title='Croatia projection')
The projection tells us that, if growth continues along the approximated curve, the number of infected should ideally level off slightly above 1300. This function is an approximation, not a prediction!
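The show_logistic_regression helper comes from the repository and its internals are not shown in this article. As a rough sketch of the S-shaped curve that a logistic fit assumes, here is the function itself with hypothetical parameters chosen only for illustration; they are not fitted values.

```python
import numpy as np

def logistic(x, plateau, growth_rate, midpoint):
    """Logistic (S-shaped) curve: starts near 0 and levels off at `plateau`."""
    return plateau / (1.0 + np.exp(-growth_rate * (x - midpoint)))

# Hypothetical parameters for illustration only, not fitted to real data.
days = np.arange(60, 141)  # day-of-year values
curve = logistic(days, plateau=1300.0, growth_rate=0.2, midpoint=90.0)

# At the midpoint the curve is at exactly half of the plateau.
print(logistic(90.0, 1300.0, 0.2, 90.0))
# Far past the midpoint the curve approaches, but never exceeds, the plateau.
print(curve[-1])
```

Fitting means searching for the plateau, growth rate, and midpoint that bring this curve closest to the observed counts, which is why the plateau can be read as the level where cases would stop growing.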
Number of new cases
The last thing, which really sparked the most interest, was the Gaussian projection of the number of new cases, so we will plot it for Croatia. But as the data is limited, you can only consider the result a demonstration of the technique, not a realistic projection.
It is a little impractical to reproduce the code for this here. You can find it in the original repository, and in the full notebooks linked at the end of the article.
The first thing we can look at is a comparison of new cases by day.
Again, some values are missing where there was no difference, so we can fill them in.
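One way to get new cases per day from a running total is sketched below, on made-up numbers. Filling the gap with 0 is an assumption of this sketch; the full notebook may handle it differently.

```python
import pandas as pd

# Made-up cumulative confirmed counts for six consecutive days.
cumulative = pd.Series([1, 1, 3, 6, 6, 10], name="Confirmed")

# diff() subtracts the previous day's total from each day's total.
# The first day has no previous day, so its difference is missing (NaN);
# here we fill that gap with 0.
new_cases = cumulative.diff().fillna(0)

print(new_cases.tolist())
```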
Now we can plot it all on a graph.
show_gaussian_regression(x, y, params=np.array([100.0, 35.0, 1.0]))
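The show_gaussian_regression helper is defined in the repository, and the meaning of its params array is not explained here. As a rough sketch of the bell-shaped curve that such a fit assumes, with my own hypothetical parameter names and values:

```python
import numpy as np

def gaussian(x, amplitude, center, width):
    """Bell curve: peaks at `center` with height `amplitude`; `width` sets the spread."""
    return amplitude * np.exp(-((x - center) ** 2) / (2.0 * width ** 2))

# Hypothetical parameters for illustration only, not fitted to real data.
days = np.arange(60, 131)  # day-of-year values
daily_curve = gaussian(days, amplitude=100.0, center=95.0, width=10.0)

# The curve rises to its peak at `center`, then falls symmetrically.
peak_day = days[np.argmax(daily_curve)]
print(peak_day, daily_curve.max())
```

Read this way, a Gaussian fit to new cases per day suggests a peak day and a peak height, with the same caveat as before: with so little data it is an illustration, not a forecast.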
Or shown otherwise
Confidence intervals and model-free estimation
Ok, last thing for today: the confidence interval and the model-free estimation. We won’t go into the statistics behind all this, since it deserves a post of its own, but as it was part of the original article we will touch on it.
WARNING! There is really little data and such an estimate is incorrect!
As with the graphs above, I will not include the code here because it is long and cumbersome, but you know where to download it.
All this together would not be at all useful if you could not refresh the data.
To do this, return to the Environments tab and open the terminal in the covid-analysis environment (► and “Open terminal”).
You must enter the COVID-19 folder:
cd COVID-19
And refresh the data:
git pull
You can then reopen the Jupyter notebook and refresh it all through the menu: Kernel -> Restart and Run All
I leave you with two more things, the HTML versions of the notebooks, for future or more detailed testing, so you can easily copy the functions.
This was a long tutorial, both for you to do, and for us to produce.
I hope you had fun and learned something. This example is current because we are all quarantined at home. And it was made with very little data for Croatia. As always, more data means more accuracy.