Hello, I’m back! Yay.

Matija wrote the last two articles because I had to juggle work and an identity crisis.
So I'll contribute something, since Matija still finds it entertaining to refresh Google Analytics and watch. Wow, a CLICK! I guess he's not the only one.

The last article rounded out the story about retrieval, analysis, and visualization of coronavirus data.

While I'm very glad to see people laughing as I pass them in the street, and that we're all starting to feel optimistic because things are getting back to "normal", I'm afraid a jump like this can't bring anything good. I will be happy if the coming period proves me wrong. This isn't my field and this is only my opinion, but I don't see a rational reason for jumping straight from "barricade yourself at home" to "let's open all the borders, cafes and everything else, summer is coming, who the *uck cares."

A transition period to see how the virus reacts to such changes is needed; otherwise, why did we go through all these restrictions? But let's not get into politics, let's look at smarter things.

Introduction

The article that Matija wrote on the topic of analysis is pretty good.

Now it's my turn to generalize it, automate it, and stick other words onto "it" (tune-IT).

In this article, we will go through what Matija did in the last article, and we will share the Jupyter notebook for everyone to use. So you too can run it on your data and get indicators for your advertising. Yes, for free. The project is here.

The interesting thing in this article is that we can get data from Google Analytics completely automatically, load that data into Python, and run different analyses on it using graphs. Also, as the icing on the cake, there is an explanation of how to analyze the effectiveness of different advertising channels, not with the "lick your finger and see which way the wind blows" method, but with real statistics.

What do we want to achieve? We want to see how many clicks we have in a given period (by date) for a particular channel, how many sessions we have, how long the sessions last, etc.

Google Analytics API

The reporting capabilities of Google Analytics are very interesting. Using only the GA Reporting API, you can get very nice results and avoid a lot of unnecessary clicking. The original version from Matija used reports exported directly from GA, so it was quite difficult to merge them later.

What interests us here is the Python tutorial that shows what needs to be done to "pull" reports from Google Analytics in Python. The list of instructions can be found on this page; in short, you need to grant the application we will use access to GA, so that we can "ask" GA questions about events on our website.

What you need to have in the end is a file called "client_secrets.json". Do not put it anywhere online, under any circumstances. That file enables access to your data on GA.
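
Building the API client is the next step. As a rough sketch, assuming you went the service-account route from the quickstart (the file name below just reuses the one from this article; if you used the OAuth installed-app flow instead, the quickstart shows that variant), the analytics client used in the request below might be built like this:

from google.oauth2 import service_account
from googleapiclient.discovery import build

# Assumption: a service-account key saved under the article's file name;
# the OAuth installed-app flow from the quickstart works as well
SCOPES = ['https://www.googleapis.com/auth/analytics.readonly']
KEY_FILE = 'client_secrets.json'
VIEW_ID = '123456789'  # placeholder: the ID of your GA view

credentials = service_account.Credentials.from_service_account_file(KEY_FILE, scopes=SCOPES)
analytics = build('analyticsreporting', 'v4', credentials=credentials)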

Once the client is built, you can start asking the GA Reporting API your questions. Let's look at a short code snippet where we ask GA for data:

# Fetch the response from GA
response = analytics.reports().batchGet(
  body={
    'reportRequests': [
      {
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': '30daysAgo', 'endDate': 'today'}],
        'metrics': [
          {'expression': 'ga:pageviews'},
          {'expression': 'ga:sessions'},
          {'expression': 'ga:sessionDuration'},
          {'expression': 'ga:avgSessionDuration'}
        ],
        'dimensions': [{'name': 'ga:pagePath'}, {'name': 'ga:date'}, {'name': 'ga:sourceMedium'}]
      }
    ]
  }
).execute()

What this says is: we ask GA for a report, from 30 days ago until today, where as dimensions (grouping values) we want the date, the advertising channel, and the page the user was on.

From these grouped values we then ask for the number of pageviews, the number of sessions, the duration of the sessions, and the average duration of the sessions.

If you only need the last 15 days, change the value, and run it again! That’s it.
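
For example, the date-range line in the request above would simply become:

'dateRanges': [{'startDate': '15daysAgo', 'endDate': 'today'}],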

The list of possibilities and what you can get back is quite rich, and the list itself can be found here. Finally, we transfer the results to a table (more precisely, to CSV format) and load it into a pandas DataFrame, which we will use for further data analysis.
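
The notebook has the full version of this step; as a rough sketch (with simplified column handling), flattening the API response into a pandas DataFrame can look roughly like this:

import pandas as pd

def response_to_dataframe(response):
    # Flatten a GA Reporting API v4 response (dimensions + metrics) into rows
    rows = []
    for report in response.get('reports', []):
        header = report['columnHeader']
        dim_names = header.get('dimensions', [])
        metric_names = [m['name'] for m in header['metricHeader']['metricHeaderEntries']]
        for row in report['data'].get('rows', []):
            record = dict(zip(dim_names, row.get('dimensions', [])))
            for values in row.get('metrics', []):
                record.update(dict(zip(metric_names, values['values'])))
            rows.append(record)
    return pd.DataFrame(rows)

df = response_to_dataframe(response)
df.to_csv('ga_report.csv', index=False)  # keep a CSV copy, as mentioned above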

Data analysis with GA

Just looking at the data in the tables is boring. Matija asked me if it was possible to make dynamic interaction with the data itself so I tried it.

In the Jupyter notebook, you have an example where you can filter the data by ad channel; it could be improved further in the future.
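
As an illustration of the idea (ipywidgets is my assumption here; the notebook itself has the working version), filtering the dataframe by the ga:sourceMedium dimension with a dropdown looks roughly like this:

import ipywidgets as widgets
from IPython.display import display

channels = sorted(df['ga:sourceMedium'].unique())

def show_channel(channel):
    # Show only the rows for the selected advertising channel
    subset = df[df['ga:sourceMedium'] == channel]
    display(subset[['ga:date', 'ga:pagePath', 'ga:sessions', 'ga:avgSessionDuration']])

widgets.interact(show_channel, channel=channels)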

And yes, Cosmo wrote about us. I know.

The point is that you can build nice dynamic interactions with the data and watch it in real time. I even included an example with interactive graphs, if you are interested in looking at a larger amount of data visually. It's nice.
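
As a sketch of what such an interactive graph can look like (Plotly is my choice here; the notebook may use something else), sessions per day per channel could be plotted like this:

import pandas as pd
import plotly.express as px

# GA returns metrics as strings and dates as YYYYMMDD, so convert them first
df['ga:sessions'] = df['ga:sessions'].astype(int)
df['ga:date'] = pd.to_datetime(df['ga:date'], format='%Y%m%d')

daily = df.groupby(['ga:date', 'ga:sourceMedium'], as_index=False)['ga:sessions'].sum()
fig = px.line(daily, x='ga:date', y='ga:sessions', color='ga:sourceMedium',
              title='Sessions per day by advertising channel')
fig.show()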

If you are interested in something specific on how to do it, you can contact us. If you ask politely and understand the basics (questions about how to install something and the like are not very interesting to me), we will help you. Yes, for free.

Filtering and analysis

We filter all results to contain only pages related to the newer Mini Tesla project, using the following snippet:

df = df[df['ga:pagePath'].str.contains('mini') | df['ga:pagePath'].str.contains('razvoj-tehnologija-za-autonomno-vozilo')]

Once we have data only for the Tesla Mini, let's split it by advertising channel (a sketch of the split follows the list):

• Facebook video

• Facebook post

• Google (YouTube)

• LinkedIn

• Twitter
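
How exactly the channels are recognized depends on how the campaigns are tagged (UTM parameters), so the patterns below are only illustrative; telling a video ad apart from a post may also need the campaign dimension. A sketch of the split:

# Illustrative patterns only; adjust to your own source/medium and UTM naming
fb_video = df[df['ga:sourceMedium'].str.contains('facebook', case=False) & df['ga:sourceMedium'].str.contains('video', case=False)]
fb_post  = df[df['ga:sourceMedium'].str.contains('facebook', case=False) & ~df['ga:sourceMedium'].str.contains('video', case=False)]
google   = df[df['ga:sourceMedium'].str.contains('google|youtube', case=False)]
linkedin = df[df['ga:sourceMedium'].str.contains('linkedin', case=False)]
twitter  = df[df['ga:sourceMedium'].str.contains('twitter|t.co', case=False)]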

Once we separate it all, we can show that data as well.

Data visualization

Let’s show what we got in order.

Facebook video

Facebook post

Google

LinkedIn

Twitter

Correlation?

The question that Matija left "hanging" is whether there is any correlation in the existing data and, if so, why.

Let’s look at the data for all ad channels:
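
As a sketch, the correlation matrix over the numeric metrics (the column names match the metrics requested earlier) can be computed directly in pandas:

numeric_cols = ['ga:pageviews', 'ga:sessions', 'ga:sessionDuration', 'ga:avgSessionDuration']
corr = df[numeric_cols].astype(float).corr()
print(corr.round(2))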

It makes sense that there is a correlation between the number of sessions and the number of pageviews, since these are very closely related quantities; for the rest, we do not see many strong correlations.

What is the best advertising channel?

And now, the point of this post.

I will try to explain the general idea of how we got to the results, but if you are interested in more details, take a look at the Jupyter notebook itself and how it is solved there.

Unfortunately, the terms used here are quite rich. Not in the sense that they have a lot of money, but in the sense that they have a depth that is not visible until you start using them. That depth comes with time, and I can say for myself that I only understand the more superficial side of these statistical concepts. Statistics is a very rich field and there is a lot of material behind it that needs to be mastered.

So if it is not clear how we got to this or why something doesn't work, that's OK; the main thing is that you can see some benefit in all of it, and you can always explore the topic more deeply if you find it interesting.

What are we going to compare?

For a comparison, we need at least two things.

And those two things have to be similar. As the saying goes, you can't compare apples to oranges.

So we can't compare a video and a post. That doesn't make sense. Apart from the fact that the material/content is different, the medium itself is different and behaves differently.

We can stick to just the post. Let's take the same post on Facebook and on LinkedIn, look at how they behave, and find the difference. And what will we measure about the post?

So we're looking for the audience that resonates better with the content. In other words, the audience that reads it more (for longer). For this, we can take the average session time. The average session time is a pretty good indicator of how interested people are in the article itself, so let's see what that difference is between Facebook and LinkedIn.
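
As a quick first look (a rough, unweighted indicator, not the statistical comparison that follows), the average session duration per channel can be pulled straight from the dataframe:

avg_by_channel = (
    df.astype({'ga:avgSessionDuration': float})
      .groupby('ga:sourceMedium')['ga:avgSessionDuration']
      .mean()
      .sort_values(ascending=False)
)
print(avg_by_channel)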

How do we compare?

We can compare it by using statistics.

It is necessary to have a statistical test because the actual measurements are not absolutely accurate and always have some “noise” in them.

For example, someone lands on an article and goes on a coffee break. When he comes back, he closes the page, because a Mini Tesla that drives autonomously is "stupid" and he would rather watch cats on YouTube. The actual time that person spent on the article is maybe 15 seconds, but the recorded session time is 15 minutes (it was a long coffee break).

The standard method for comparing two groups is a statistical test. A statistical test involves stating a null hypothesis, which here would be that there is no difference between the duration of Facebook and LinkedIn sessions. We then compare the data distributions of the two groups and check how well the null hypothesis holds up.

There is usually some threshold for the p-value that serves to reject or accept the hypothesis. That threshold is essentially made up (I probably shouldn't have written this, but it is the truth) and the same cutoff is often used across completely different fields, although there is a whole science around what that value should actually be.
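
For completeness, such a classical test might look as follows (Welch's t-test via SciPy; the durations are made-up placeholders, the real ones come from the dataframe):

from scipy import stats

fb_sessions = [95, 120, 80, 150, 110, 130, 90, 105]  # placeholder durations in seconds
li_sessions = [40, 200, 15, 170, 60]                  # placeholder durations in seconds

# Null hypothesis: the two channels have the same mean session duration
t_stat, p_value = stats.ttest_ind(fb_sessions, li_sessions, equal_var=False)
print(f't = {t_stat:.2f}, p = {p_value:.3f}')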

It is easier to use an alternative "branch" of statistics, which is not frequentist but probabilistic (hah, I made myself laugh). Basically, using Bayes' theorem.

We try to estimate both distributions (how the data are arranged on the graphs). Then we look at their differences, i.e. how differently they are distributed (how different the data graphs are).

We also look at the uncertainty of that same estimate (how "sure" the algorithm is that the data has that distribution, that arrangement of data points on the graphs).

This is critical if you want to report a result whose interpretation includes our ignorance of the domain (we may not have thought of people who read the article for 15 seconds and drink coffee for 15 minutes), as well as the stochasticity of variables we did not take into account. That means some variables differ at different times because we don't have a complete picture of how they work (say, most of the time people look at an article for 15 seconds, but there are cases where they leave after 1 second). It's a very smart way of saying we have no idea why some values are what they are.
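
The notebook has the full model; a minimal sketch of this kind of Bayesian comparison (a BEST-style model written in PyMC, which is my choice here, with made-up placeholder durations) might look like this:

import numpy as np
import pymc as pm
import arviz as az

# Placeholder session durations in seconds; real values come from the GA dataframe
fb = np.array([95., 120., 80., 150., 110., 130., 90., 105.])
li = np.array([40., 200., 15., 170., 60.])

with pm.Model() as model:
    # Weakly informative priors for each group's mean and spread
    mu_fb = pm.Normal('mu_fb', mu=fb.mean(), sigma=fb.std() * 2)
    mu_li = pm.Normal('mu_li', mu=li.mean(), sigma=li.std() * 2)
    sigma_fb = pm.HalfNormal('sigma_fb', sigma=fb.std() * 2)
    sigma_li = pm.HalfNormal('sigma_li', sigma=li.std() * 2)

    # A Student-t likelihood is robust to the "coffee break" outliers
    nu = pm.Exponential('nu', 1 / 30.0) + 1
    pm.StudentT('obs_fb', nu=nu, mu=mu_fb, sigma=sigma_fb, observed=fb)
    pm.StudentT('obs_li', nu=nu, mu=mu_li, sigma=sigma_li, observed=li)

    # The quantity we care about: the difference between the two group means
    pm.Deterministic('diff_of_means', mu_fb - mu_li)

    trace = pm.sample(2000, tune=1000, random_seed=42)

print(az.summary(trace, var_names=['mu_fb', 'mu_li', 'diff_of_means']))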

A more detailed explanation of how the code works can be found here.

But, let’s look at the results of the average value of Facebook and LinkedIn sessions:

The width of these lines shows how much the values “differ”. The point in the middle is the mean value in the distribution.

The mean value for Facebook is higher than the value for LinkedIn, and it is much more stable: we see less spread in the Facebook values than in the LinkedIn ones.

In other words, on LinkedIn we have more uncertainty in the estimate: we have examples of people who read an article for a very long time, but also people who read it for a very short time. This does not necessarily mean that the LinkedIn audience is more diverse; it can simply mean that we have less data, so the spread is larger (the algorithm is less "sure" of its estimate).

On Facebook, the sessions are much more uniform (they have very similar values).

So, with the current data set, where we have a lot more data for Facebook, Facebook is much more predictable and its sessions are longer (the article is read for longer).

Conclusion

Can we use technology to make our lives easier? Is that enough? No?

It's hard to use the more advanced features of Google Analytics and do data analysis across different campaigns, sources, media, and more when you're not a developer. This content is completely free, and anyone can run it at the click of a button. It is ideal for quick reviews of campaign performance and further decision-making, and it can serve as an executive summary with some general indicators of how your marketing is working.

Data is the key to decisions. If you ever need a KPI (Key Performance Indicator) for your marketing, it is relatively easy to run something like this once a day, or to extend it a bit into an app that does this for you and serves you those results in real time when you open the site.