JavaScript vs Logs Mon 06 January 2020

A war has been raging over the last few years between consumers and advertisers. A never ending cat and mouse game where each side tries to out-do the other. It’s tracking vs blocking, clicks vs privacy, readers vs publishers.

Gone are the days when adding the GA/et-al snippet to your blog captured almost every user. My theory is that, for tech oriented blogs, GA misses a significant portion of traffic. Tech readers are more aware and actually care about their privacy online.

But how big is the problem? What am I not seeing in GA? Am I talking a load of nonsense? To find out I used GoAccess to parse & track the stats for this blog from the nginx logs. The results have been quite revealing. Let’s look at February to July of 2018:

That’s a huge difference! Much more than I ever expected. Note that this after some cleanup. GoAccess uses servers logs so all traffic counts - including crawlers, static files and bad actors. The logs were filtered and processed as so:

# GET requests, 200 responses, no static/invalid files
grep "GET" access-feb-jun.log | \
    grep '" 200 ' | \
    grep -vE "\?|\x|http" | \
    grep -vE "\.(png|php|css|ico|xml|js|jpg|zip|cgi|gif|woff|woff2|eot|svg|ttf)" \
    > access-feb-jun-clean.log

# Parse with goaccess, removing redirects and crawlers
goaccess \
    --log-format=COMBINED \
    --ignore-crawlers \
    --ignore-status=301 \
    --ignore-status=302 \
    -o report.csv \
    access-feb-jun-clean.log

The result gives a clean(ish) log of legit traffic - which is roughly what GA should be tracking. As we can see, this is not the case - not even close. Even if we account for 10% still incorrect logs - GA is missing roughly 70% of traffic to this blog.

I have removed GA as it’s both useless and a privacy concern for anyone not using a blocker. The results provided by GoAccess are more than enough, even if they contain a few false positives.


Bonus item!: using server logs means tracking (rough) RSS readership is possible!

# GET requests, 200 responses, no static/invalid files
grep "/index.xml" access-feb-jun.log \
    > access-feb-jun-rss.log

goaccess \
    --log-format=COMBINED \
    -o report.csv access-feb-jun-rss.log