Introduction

Hello everyone,

Today we will deal with the analysis, consolidation, and visualization of complex data sets that are poorly prepared and formatted. Like with that sentence, we will turn a lot of data into something usable: “Today, we will get useful graphs from messy data.”

I’m Matija, and they call me an Excel expert (but really, at least once a week, someone calls to ask me something about Excel). Data analysis is something I’ve been doing for quite a while, and today I’m leveling up: Python, pandas, and Jupyter, and I’m taking you with me.

This is a continuation of the Jupyter notebook tutorial that you can check out here, but today we are going to use a data set that almost anyone with a website has: Google Analytics data.

Data used

Before we start, a brief explanation of the data set we used.

On 04/17/2020, we launched several campaigns to promote the development of technology for the Mini Tesla autonomous vehicle.

This article uses information up to and including 04/27/2020.

This gives us 11 days of data (not great, not terrible).

The biggest advantage we have with these campaigns, and why we can share this data freely, is that we don’t sell anything with the Mini Tesla campaigns. There is no hidden goal. We did something great that even the media followed up on, and that’s it: we are bragging and promoting Roberto, Aco, and Kristijan.

The landing page for all ads except Twitter is a huge article that, according to Read-o-Meter, takes 17 minutes and 34 seconds to read. The article also deals with a rather technical subject and contains several videos.

For marketing and analytics professionals: During this period, campaigns were modified and optimized, ads were added, the period was too short, etc., etc.
The purpose of this article is not to boast about the results (though they are great) but to make life easier for all of you who need to analyze and visualize a lot of data (expect a link to the Git repository when we publish a sequel to this article).

    Retrieving data

    We could have done this with the Google Analytics API, but we didn’t want to complicate things. The goal is to show what can be done with a messy data set, not to create an application. The data was downloaded as the widest possible set:

    Behavior → All Pages → Secondary Dimension (source/medium)

    In the spirit of simplicity and applicability, .csv files were downloaded for each of the days in the set. But why? Why didn’t we just review everything in Google Analytics? Google Analytics is great and can show you in detail what was happening on your page.

    But if you want to compare something from Analytics with campaign data from, say, Facebook, you’ll use something external. There are automated solutions, or you can pull it all through a spreadsheet (Excel, Sheets, or whatever floats your boat).

    If you are already using something external, why not use something where you can calculate everything and draw graphs for visual inspection or presentation? As a bonus, you’ll eliminate human or cell formatting errors.

    We will use Jupyter notebook (and pandas) for all this. Ok, let’s now take a look at what this information looks like.

    What a wonderful mess. And we have eleven of those.

    Just imagine how much time you would need to manually merge all 11 by Page + Source / Medium. Especially with all the fbclid URL spam.

    fbclid is a parameter that is appended to a URL when traffic comes from Facebook. If you google it, you will find theories that it is used for Facebook Analytics or for bypassing cookies, plus many conspiracy theories. For now, there is no 100% complete answer.

    (fbclid can be removed as a parameter, but I prefer to have all the data in its original form. When we remove something for “ease of use”, there is always a chance that we will need it in 3 months or 3 years.) If you still want to remove it, the link is here.
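If you do go down that route, a minimal sketch of the clean-up could look like this (the function name and the use of the original “Page” column are our own, hypothetical choices):

```python
import re

def strip_fbclid(url):
    """Remove the fbclid query parameter from a URL."""
    url = re.sub(r'fbclid=[^&]*&?', '', url)  # drop the parameter itself
    return url.rstrip('?&')                   # drop a dangling separator, if any

# applied to a daily export that still has its original "Page" column:
# df['Page'] = df['Page'].apply(strip_fbclid)
```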

    Pandas will merge the data for us in record time, without errors. Finally, we have a table: 1249 rows × 90 columns.
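For the curious, here is a hedged sketch of how that merge can be done. The file names are assumptions; the column names are the ones Google Analytics exports:

```python
from functools import reduce
import pandas as pd

# One export per day, 2020-04-17 through 2020-04-27 (file names are hypothetical)
dates = pd.date_range('2020-04-17', '2020-04-27').strftime('%Y%m%d')

daily = []
for d in dates:
    df = pd.read_csv(f'Analytics_{d}.csv')
    # suffix every metric column with its date so the merge keeps the days apart
    metric_cols = [c for c in df.columns if c not in ('Page', 'Source / Medium')]
    daily.append(df.rename(columns={c: f'{c}-{d}' for c in metric_cols}))

# Outer merge on Page + Source / Medium so no row from any day is lost
merged = reduce(
    lambda left, right: pd.merge(left, right, on=['Page', 'Source / Medium'], how='outer'),
    daily,
)

merged.shape  # (1249, 90) for our data set
```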

    This is the first exit point: if you want to continue working on the .csv file yourself, at least you know that it is merged properly.

    Also, to have a timeline, we renamed all the columns:

    • Page – pg
    • Source / Medium – sm
    • Pageviews – pv
    • Unique Pageviews – upv
    • Avg. Time on Page – atop
    • Entrances – et
    • Bounce Rate – br
    • % Exit – ex
    • Page Value – pval

    All columns are named as [abbreviation] – [date]

    So pv-20200425 is the Pageviews column for 04/25/2020.
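A sketch of that renaming, assuming the merged frame from above (so a column such as “Pageviews-20200425” becomes “pv-20200425”):

```python
# Google Analytics column names mapped to our abbreviations
abbrev = {
    'Page': 'pg',
    'Source / Medium': 'sm',
    'Pageviews': 'pv',
    'Unique Pageviews': 'upv',
    'Avg. Time on Page': 'atop',
    'Entrances': 'et',
    'Bounce Rate': 'br',
    '% Exit': 'ex',
    'Page Value': 'pval',
}

def shorten(column):
    """'Pageviews-20200425' -> 'pv-20200425'; the key columns carry no date."""
    name, _, date = column.partition('-')
    short = abbrev.get(name, name)
    return f'{short}-{date}' if date else short

merged = merged.rename(columns=shorten)
```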

    Goals

    Ok, now that we have all the data, we should define what we want to know. Today’s goal is to answer these questions:

    1. From which network (and campaigns/campaign sets) did people stay on the article the longest (hopefully reading it)?

    2. Is there any connection between impressions and time spent on page (most likely not)?

    3. Does any of the data collected correlate and if so, why? Is it causation or just correlation?

    4. What is the meaning of life?

    5. Which platform is best for spreading ideas?

    6. How should we spend the rest of the budget?

    Editing your data

    Before we do anything with the data, we should clean it up a bit so we don’t run into problems later.

    1. We will overwrite any non-existent values with zeros.

    2. We will transform the number of pageviews and unique pageviews from decimal (float) to integer (int). We can’t have one and a half pageviews. A pageview is counted when someone comes to the page and loads the analytics code.

    3. We will transform the average time on page from hh:mm:ss format to seconds.
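With the column naming above, those three steps might look roughly like this (a sketch, not the exact notebook code):

```python
# 1. Overwrite non-existent (NaN) values with zeros
merged = merged.fillna(0)

# 2. Pageviews and unique pageviews are whole numbers, so cast them to int
count_cols = [c for c in merged.columns if c.startswith(('pv-', 'upv-'))]
merged[count_cols] = merged[count_cols].astype(int)

# 3. Turn "hh:mm:ss" average time on page into seconds
def to_seconds(value):
    """'00:01:30' -> 90; the zeros introduced by fillna stay 0."""
    if isinstance(value, str):
        h, m, s = value.split(':')
        return int(h) * 3600 + int(m) * 60 + int(s)
    return int(value)

for c in [c for c in merged.columns if c.startswith('atop-')]:
    merged[c] = merged[c].map(to_seconds)
```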

    Original data:

    After editing:

    Now that we have data that is relatively easy to work with, we can look at where all the traffic to the page came from in a given period, with just one line of code.
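That single line can be as simple as asking pandas for the distinct source / medium values (a sketch):

```python
# Every distinct source / medium that brought traffic in the period
merged['sm'].unique()
```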

    All our traffic in a couple of lines. With this information, we can divide it into several categories:

    Facebook

    • fb / post
    • m.facebook.com / referral
    • facebook / paid_traffic
    • facebook.com / referral
    • fb / single_image
    • l.facebook.com / referral
    • fb / video
    • lm.facebook.com / referral
    • web.facebook.com / referral

    Google

    • google / cpc
    • youtube.com / referral
    • google / organic

    LinkedIn

    • linkedin.com / referral
    • lnkd.in / referral

    Twitter

    • t.co / referral

    Those we will not deal with now:

    • (direct) / (none)
    • cosmo / cosmo
    • ad-review-tool.twitter.biz / referral
    • s3.amazonaws.com / referral
    • loomen.carnet.hr / referral
    • alenn.ai / referral
    • github.com / referral
    • mail.google.com / referral
    • bing / organic
    • hexometer.com / referral

    Now we have a way to split that huge table to get what we want:

    We tagged the Facebook campaigns with URL parameters:

    • fb / post
    • facebook / paid_traffic
    • fb / single_image
    • fb / video

    So we will consider all rows with these source / medium values as Facebook results.

    For Google:
    • google / cpc
    • youtube.com / referral

    For LinkedIn:
    • linkedin.com / referral
    • lnkd.in / referral

    For Twitter:
    • t.co / referral

    Yes, a referral could come from a lot of things. But in this case, the Facebook campaigns are tagged, a YouTube referral is justified if someone found the link in the video description, and LinkedIn and Twitter had no activity other than the campaigns. Even if someone clicked on an organic link, in this case and with these goals, we can attribute it to the campaign.

    How you split the data will depend on what you promote, what your goals are, and what you intend to do with the answers you get. In our specific case, we are evaluating the channels.
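As a sketch, that split can be expressed as a dictionary of source / medium lists and one filtered frame per channel (the channel names are ours):

```python
channels = {
    'facebook': ['fb / post', 'facebook / paid_traffic', 'fb / single_image', 'fb / video'],
    'google':   ['google / cpc', 'youtube.com / referral'],
    'linkedin': ['linkedin.com / referral', 'lnkd.in / referral'],
    'twitter':  ['t.co / referral'],
}

# One DataFrame per channel, selected by the source / medium column
by_channel = {name: merged[merged['sm'].isin(sources)]
              for name, sources in channels.items()}
```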

    To avoid repeating many similar tables, we will focus on Facebook to see what stages the data goes through. We’ll include the other channels at the end when it’s time for results. You’ll see the full stages for each of them when we release the entire notebook. (Test data will be included; whether it will be our real data or a mockup, you will have to guess.)

    But let’s go step by step.

    First, we pulled out everything that came from Facebook, both organic and paid (in case we ever want to do additional analysis).

    After that, we make a subset of rows whose Page column contains “mini-tesla”. I emphasize “contains” rather than “equals” so that the fbclid URLs are included as well.

    And then from that set, we make another subset containing all the rows we know came from paid campaigns.
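Sketched in code, those three steps could look like this (the exact source / medium values we treat as Facebook, and as paid, are the ones listed above):

```python
# 1. Everything that came from Facebook, organic and paid alike
fb_sources = ['fb / post', 'facebook / paid_traffic', 'fb / single_image', 'fb / video',
              'facebook.com / referral', 'm.facebook.com / referral',
              'l.facebook.com / referral', 'lm.facebook.com / referral',
              'web.facebook.com / referral']
fb_all = merged[merged['sm'].isin(fb_sources)]

# 2. Only rows whose Page column *contains* "mini-tesla" (this also catches fbclid URLs)
fb_mini_tesla = fb_all[fb_all['pg'].str.contains('mini-tesla')]

# 3. Only rows we know came from the paid, tagged campaigns
fb_paid = fb_mini_tesla[fb_mini_tesla['sm'].isin(
    ['fb / post', 'facebook / paid_traffic', 'fb / single_image', 'fb / video'])]
```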

    So far we’ve been sorting and filtering the data, now it’s time to finally start formatting it so we can answer our questions.

    We will split Facebook results into two subsets:
    • Video campaigns aimed at high delivery volume
    • Web traffic campaigns aimed at landing interested people to a blog post

    We will remove the columns that contain the following data:
    • Entrances
    • Bounce rate
    • % Exits
    • Page value

    Although this is important data for optimization, it isn’t relevant to us in this case, because we are not comparing it with a previous period, nor is it useful to compare it day by day.
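A possible sketch of that split and clean-up; treating “fb / video” as the video campaigns and the other tagged values as web traffic is our assumption about how the campaigns were tagged:

```python
# Split the paid Facebook set by campaign type (assumption: 'fb / video' = video campaigns)
fb_video = fb_paid[fb_paid['sm'] == 'fb / video']
fb_web = fb_paid[fb_paid['sm'] != 'fb / video']

# Drop Entrances (et), Bounce Rate (br), % Exit (ex) and Page Value (pval) columns
drop_cols = [c for c in fb_paid.columns if c.startswith(('et-', 'br-', 'ex-', 'pval-'))]
fb_video = fb_video.drop(columns=drop_cols)
fb_web = fb_web.drop(columns=drop_cols)
```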

    We have to do some magical things with time, which is much easier in Python than in Excel. As we are interested in the average time of all users who came from a particular channel, we need to aggregate all users and compute the average time. Of course, we cannot just sum up the average times, because that is not how averages work.

    For each row, we calculate the total time of that specific case (Page, Source / Medium) using the formula:

    Total Time = Average Time * Pageviews

    This gives us the total time on a case-by-case basis. These values can be summed to give us total pageviews and total time, which we can then convert back into the average time spent on the page, now covering all cases.
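For one (hypothetical) day it boils down to a few lines; repeating it per date column gives the whole timeline:

```python
# Facebook web traffic, 2020-04-25, using the columns renamed earlier
total_time = (fb_web['atop-20200425'] * fb_web['pv-20200425']).sum()  # seconds
total_views = fb_web['pv-20200425'].sum()

# Average time across all (Page, Source / Medium) cases for that day
avg_time = total_time / total_views if total_views else 0
```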

    For Facebook web traffic campaigns:

    For Facebook video campaigns:


    And now, as promised, tables like this for each of the other channels:

    YouTube:

    LinkedIn:

    Twitter:

    Twitter is a little late in launching because they have silly and strict rules about metaphors. That’s why it gets a landing page of its own.

    And finally, we’re adding real data from Facebook Ads, Google Ads, LinkedIn Ads, and Twitter Ads.

    Facebook web traffic:

    Facebook video:

    YouTube:

    LinkedIn:

    Twitter:

    Ok, now we have a lot of information, so let’s plot it to see how the individual channels look (we’ll graph all the information we have).
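As a sketch, and assuming we have already arranged each channel into a small daily table (one row per day, one column per metric, e.g. a hypothetical fb_web_daily), drawing everything is a one-liner per channel:

```python
import matplotlib.pyplot as plt

# fb_web_daily: hypothetical frame with the dates as index and
# pageviews, average time on page, impressions, reach, ... as columns
ax = fb_web_daily.plot(figsize=(12, 6), title='Facebook web traffic, day by day')
ax.set_xlabel('Date')
plt.show()
```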

    Facebook web traffic:

    We can notice a few things right away.

    For this view, it would be ideal to use two different scales for the visualization (we will do that in the next article).

    Impressions and reach are correlated. Reach (how many people saw the ads) is tied to impressions (how many times we showed an ad), so this doesn’t surprise us.

    There are days when people who came from Facebook were simply uninterested.

    Facebook video:

    According to the data, and now confirmed visually, Video campaigns did almost nothing. The average time on the page is negligible.

    YouTube:

    We have a few more lines for YouTube because we also use information about how many people watched the video up to certain percentages (25%, 50%, 75%, 100%), but these numbers correlate with each other and show us nothing we didn’t already know.

    LinkedIn:

    LinkedIn is messy due to several factors.

    The first is that we changed campaigns three times during this timeframe. The first iteration promoted the post, then we switched to a video, then to an English post, and eventually settled on a LinkedIn article. The article has good results when we look at the number of people who read it, but because it is a summary, a smaller fraction of those people came to the website.

    Twitter:

    Twitter started a lot later, so you’ll only see something more relevant to it in the next article.

    By themselves, these graphs don’t tell us much, but we know we’re interested in the time spent on Mini Tesla related pages, so let’s look at that.

    First, we will compare the average time spent on a page by channel:

    Total time spent on the page.

    And page views from different channels:
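A hedged sketch of both comparisons, built from the per-channel frames prepared above:

```python
import pandas as pd
import matplotlib.pyplot as plt

dates = pd.date_range('2020-04-17', '2020-04-27').strftime('%Y%m%d')

def channel_summary(df):
    """Total pageviews and the overall average time on page across all days."""
    pv = df[[f'pv-{d}' for d in dates]].to_numpy()
    atop = df[[f'atop-{d}' for d in dates]].to_numpy()
    total_views = pv.sum()
    total_time = (atop * pv).sum()
    return pd.Series({'pageviews': total_views,
                      'avg_time_s': total_time / total_views if total_views else 0})

summary = pd.DataFrame({name: channel_summary(frame)
                        for name, frame in by_channel.items()}).T

summary['avg_time_s'].plot.bar(title='Average time on page by channel (seconds)')
plt.show()

summary['pageviews'].plot.bar(title='Pageviews by channel')
plt.show()
```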

    Results and conclusions

    At the beginning, we asked a few questions that we can now answer.

    1. From which network (and campaigns/campaign sets) did people stay on the article the longest (hopefully reading it)?

    2. Is there any connection between impressions and time spent on page (most likely not)?

    3. Does any of the data collected correlate and if so, why? Is it causation or just correlation?

    4. What is the meaning of life?

    5. Which platform is best for spreading ideas?

    6. How should we spend the rest of the budget?

    From which network did people stay on the article the longest?

    Facebook Web Traffic campaigns have proven to be the best, thanks to a combination of factors: wide reach and the possibility of detailed targeting.

    Is there any connection between impressions and time spent on a page?

    As we guessed, there is no direct link between the number of impressions and the time spent on page (at least not yet; we’ll look at it again in a few days).

    In addition to the audience itself, timing plays a big part in promoting such an article. It depends on whether the person who is interested in it has 20 minutes to read it right now.

    Does any of the data collected correlate and if so, why?

    We can see some of what correlates, and why, visually, but we will give a direct answer in the next article, where we will find the correlations by calculation, not just by visual inspection.

    What is the meaning of life?

    Jupyter won’t help us much there, but take care of yourself and do things that make you happy.

    Which platform is best for spreading ideas?

    It depends on the idea and the material you have available.

    With an awesome video, it’s a good idea to try YouTube, while Facebook offers you good and precise audience control. In the worst case, try it, analyze after a short period, and then decide.

    How should we spend the rest of the budget?

    In case you are interested, we decided to simply turn off Facebook Video, because it really did nothing for us, while we continue with the other campaigns. In any case, you will get even better information when we publish the second part of this article.

    Conclusion

    Perhaps the biggest advantage of digital advertising today is that it is absolutely measurable, either out of the box or in some roundabout way.

    This generates really large data sets from several sources, which are sometimes difficult to analyze objectively. This is where data analysis can make your life much easier, as you consolidate information from multiple sources in one place.

    This theme will continue, with additional analysis capabilities, and we will eventually share a Git repository where you will be able to try out similar visualizations with your own data.

    If you have a lot of data that you don’t know what to do with, and you need analysis and answers to some questions, feel free to contact us.

     Take care of yourself and do what makes you happy.