Exploring Web Analytics of the City of San Francisco's Website

weighted bounce rate and more!

Above is a dashboard created with Tableau to identify webpages with a high weighted bounce rate. Read on to see how I made it.

 
 

Data analysis is a very wide world, and you can use it in just about any field to further your understanding of almost anything. What makes data science so interesting is the exponentially growing amount of data collected every day with the aim of making more informed decisions. Web analytics is a particularly hot field right now. It analyzes web data to identify and understand the behaviour of existing and potential clients and, in turn, to optimize web usage. There are many ways of establishing indicators and measuring the performance of your website. However, before listing the KPIs, we first have to ask ourselves: what is the intent of the business? Google Analytics Academy lists five types of businesses and their desired measurable outcomes:

  • Content publisher: showing ads to visitors;
  • Ecommerce: selling products;
  • Lead generation: collecting contact information for sales prospects;
  • Online info/support: helping visitors find information;
  • Branding: driving awareness and engagement.

In my first exploration of web analytics, I will be analyzing web usage of the City of San Francisco's website. The purpose of this project is to create a dashboard and to identify webpages that might be underperforming or need our attention.

Screenshot of SF's Homepage.

five questions before starting your analysis

 
 

Before getting into the technical aspects, it is important to clarify why we're doing this project (purpose), how we will accomplish it (methodology) and what the desired outcomes are (results). Too often, data scientists create great-looking data visualizations that are informative ("nice to know") but don't necessarily lead to actionable insights. Taken from Cardinal Path's webinar, here are five questions we should ask ourselves before diving into any data analysis project:

1. Who is the audience?
Since the purpose of this project is to evaluate how efficiently visitors can find information, the communication and IT departments are the main audience. Indeed, improving the website can simply mean reworking the content, revamping the design of landing pages or creating more effective links between pages.

2. Why do your stakeholders need this report or dashboard?
Most people rely on the Internet to find important information about the city they live in. Evaluating the performance of the City of San Francisco's website can provide valuable insight, pinpoint areas in need of improvement and, in turn, improve communication between the City and its stakeholders.

3. What kind of information needs to be presented?
To provide context, I will be doing a data visualization of the different web metrics included in the dataset. Once we have an overview, we can start digging and identify underperforming webpages, either by plotting them visually or by applying some SQL-like conditions. In the end, the objective is to create an analytical dashboard with around five KPIs that will allow the stakeholders to locate underperforming webpages.

4. How will it be "enabled"?
I will be using Tableau to perform my analysis. With its ease of use, intuitive interface, and powerful data visualization features, Tableau is the way to go!

5. When will updates occur?
No updates will occur since this is simply a web analytics exercise. In reality, if the City of San Francisco actually requested a dashboard to evaluate its website, it would be a more complicated task. There would be a need to create an automated workflow that connects to and retrieves web traffic data (e.g. Google Analytics) and instantly updates the dashboard through a server (e.g. Tableau Server). This would result in an operational-type dashboard. For the purpose of this exercise, we won't dive into this.

 

getting started

This dataset was acquired from SF's OpenData portal, which contains an abundant collection of datasets and easily surpasses the City of Toronto's own collection in both quality and quantity. The collected web data is not the most comprehensive or complete you can find; however, it's often easier to start small and work your way up. Our dataset, titled Web Analytics for SFGov sites - 2016 (Q1+Q2+Q3), compiles web traffic for the first three quarters of 2016. Unfortunately, we don't have any geographical or time series data since the numbers are totals aggregated by webpage. Here's what the first five rows look like:

 
 
| Page Title | Pageviews | Unique Pageviews | Avg. Time on Page | Entrances | Bounce Rate | % Exit |
| --- | --- | --- | --- | --- | --- | --- |
| SFGOV | 1,427,743 | 1,190,028 | 0:02:44 | 1,160,906 | 75.54% | 70.27% |
| CCSF ePayroll – Online Paystubs \| Office of the Controller | 1,107,404 | 342,323 | 0:03:24 | 332,963 | 13.49% | 30.41% |
| Employee Gateway \| SFGOV | 910,810 | 599,486 | 0:07:43 | 583,795 | 23.45% | 64.73% |
| City and County of San Francisco | 652,760 | 542,980 | 0:02:46 | 487,789 | 74.11% | 67.82% |
| Property Tax Payments \| Treasurer & Tax Collector | 501,634 | 295,433 | 0:02:44 | 153,136 | 52.73% | 45.75% |
 
 

overview of the data

There are 4,996 rows, or sites, in this dataset, which is a lot! This includes PDF documents relating to municipal regulations, outdated press releases and public announcements. In order to focus our attention on the more important content, we can add filters in Tableau right after importing the CSV file. For this analysis, I kept the 10% most popular webpages based on pageviews. After that, it's really easy to visualize the data in Tableau, this time in a more visual way. Here's the same table as above, but with some blended visuals:

The easiest and fastest way to filter your data is to add a filter directly in Tableau.
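If you'd rather express that 10% cut as a calculated field instead of Tableau's built-in Top filter, a sketch along these lines could work. It assumes the column is named [Pageviews] and that your data source allows PERCENTILE inside a level-of-detail expression (Tableau extracts do):

// Hypothetical boolean field "Top 10% Pages": TRUE for webpages whose pageviews
// sit at or above the 90th percentile across the whole dataset
[Pageviews] >= { FIXED : PERCENTILE([Pageviews], 0.90) }

Dropping such a field on the Filters shelf and keeping only True would give the same top-10% selection.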

 
 

everything is relative 

The first dashboard is a major improvement, but it's still hard to tell which sites are performing well and which ones need our attention. One way to do this is to compare each metric against the median for the overall website. We will then be able to tell, for example, which sites have both a higher bounce rate and more page views than the median. All we need is a calculated field for each of our metrics, and the result is a more compelling dashboard.
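As a sketch, here's roughly what one of those calculated fields could look like for bounce rate. The field name matches the one used in the query further down, the column names come from the dataset, and it assumes MEDIAN is allowed in a level-of-detail expression (it is for Tableau extracts); the other metrics follow the same pattern:

// Hypothetical field "Bounce Rate - Comparison": relative difference between a
// page's bounce rate and the median bounce rate across all pages
// (a value of 0.2 means the page bounces 20% more often than the median page)
([Bounce Rate] - { FIXED : MEDIAN([Bounce Rate]) }) / { FIXED : MEDIAN([Bounce Rate]) }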

 
 

let's start digging! 

Now that we have each metric compared against the median, we can use some SQL-like conditional logic in a calculated field and start filtering. In this example, I will write a simple query to identify webpages that fulfill all of these conditions:

  • 20% higher bounce rate than the median
  • 20% higher page views than the median
  • 20% higher entrances than the median

Here's what my query looks like; after running these conditions, I can flag the pages returning True with a star ("*"). Easy peasy!

IF [Bounce Rate - Comparison] > 0.2
   AND [PageViews - Comparison] > 0.2
   AND [Entrance - Comparison] > 0.2
THEN '*'
ELSE ''
END
 
 

bounce rate vs exit rate 

Another metric that we haven't touched on is the exit rate, and it differs from the bounce rate. Exit rate is the percentage of all views of a page that were the last view of a session. Bounce rate, on the other hand, is the percentage of all sessions starting on a page in which that page was the only one viewed. For example, if a page is viewed 1,000 times and 400 of those views end the session, its exit rate is 40%; if 500 sessions start on that page and 300 of them go no further, its bounce rate is 60%. To better understand the relationship between the two, I created a scatterplot. Of course, just doing that doesn't help us find the gold nuggets: the webpages that need our attention. To help with that, I calculated the weighted bounce rate and identified popular sites with a high bounce rate (a sketch of the calculation follows the list below). As an added bonus, I included three parameters users can dynamically change:

  1. Bounce Rate Cutoff
  2. Exit Rate Cutoff
  3. Number of Top Sites based on Weighted Bounce Rate
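The exact formula behind the weighted bounce rate isn't spelled out here, but one plausible way to build it as a calculated field is to weight each page's bounce rate by its share of traffic, for example:

// Hypothetical "Weighted Bounce Rate": bounce rate scaled by the page's share of
// all unique pageviews, so busy pages with high bounce rates rise to the top
// (entrances would work as an alternative weight)
[Bounce Rate] * [Unique Pageviews] / { FIXED : SUM([Unique Pageviews]) }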

The result is a dynamic four-dimensional scatterplot (bounce rate, exit rate, weighted bounce rate, and unique pageviews). This allows us to visualize all four metrics in one place and to reveal more in-depth insights with ease.
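For reference, the three parameters could be wired into a single filter field along these lines. [Bounce Rate Cutoff], [Exit Rate Cutoff] and [Top N Sites] are placeholder parameter names, the view is assumed to have one mark per page title, and since RANK() is a table calculation the whole field behaves as one:

// Hypothetical filter "Pages of Interest": TRUE for pages above both cutoffs that
// also rank in the top N by weighted bounce rate (RANK defaults to descending order)
AVG([Bounce Rate]) >= [Bounce Rate Cutoff]
AND AVG([% Exit]) >= [Exit Rate Cutoff]
AND RANK(AVG([Weighted Bounce Rate])) <= [Top N Sites]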