Analyzing Seattle's Bike Share 

Meet Pronto, Seattle's bike sharing service.  (John Lok - Seattle Times)

I've never been to Seattle, but I've heard from other folks that it's one of those cities you either love or hate living in. While exploring Kaggle's datasets, I came upon one about Pronto, Seattle's bike-sharing program. What makes this particular dataset interesting is the wealth of information it compiles. As with the version hosted on Kaggle, Pronto's website lets you download three types of datasets:

  • Trips: information about each trip, including start day and time, end day and time, and start and end stations.
  • Weather information per day
  • Bike and dock availability per minute per station.

These datasets give us the opportunity to cross-analyze them and see whether there is a correlation between, for example, ridership and the weather. However, in this analysis we will limit ourselves to the first dataset (Trips).


What does the data look like?

  | trip_id | starttime        | stoptime         | bikeid   | tripduration | from_station_name   | to_station_name                                    | from_station_id | to_station_id | usertype | gender | birthyear
0 | 431     | 10/13/2014 10:31 | 10/13/2014 10:48 | SEA00298 | 985.935      | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06          | PS-04         | Member   | Male   | 1960.0
1 | 432     | 10/13/2014 10:32 | 10/13/2014 10:48 | SEA00195 | 926.375      | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06          | PS-04         | Member   | Male   | 1970.0
2 | 433     | 10/13/2014 10:33 | 10/13/2014 10:48 | SEA00486 | 883.831      | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06          | PS-04         | Member   | Female | 1988.0
3 | 434     | 10/13/2014 10:34 | 10/13/2014 10:48 | SEA00333 | 865.937      | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06          | PS-04         | Member   | Female | 1977.0
4 | 435     | 10/13/2014 10:34 | 10/13/2014 10:49 | SEA00202 | 923.923      | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06          | PS-04         | Member   | Male   | 1971.0

Above is a preview of the data we'll be working with. Between October 13, 2014 and August 31, 2016, a total of 235,675 trips were recorded. That's a lot of data to play with! The columns that interest me the most are starttime, stoptime, tripduration, and birthyear.

Now, before we dive into the analysis and the visualization, let's check whether any data is missing. To do this, we can draw a quick heat map showing all the null values. The results below show that about half of the gender and birthyear values are missing, which is something we should take into consideration if we want to use them in our analysis.


In yellow are rows with missing values.
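
For readers who want to reproduce this check, here is a minimal sketch. It assumes pandas (plus seaborn and matplotlib for the plot) and uses a tiny made-up table in place of the real trips file; the helper name plot_missing is hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

# Made-up stand-in for the trips table (the real one has 235,675 rows).
trips = pd.DataFrame({
    "tripduration": [985.935, 926.375, 883.831, 865.937],
    "gender": ["Male", "Female", None, None],
    "birthyear": [1960.0, 1988.0, None, None],
})

# Boolean mask: True wherever a value is missing.
null_mask = trips.isnull()

# Share of missing values per column, largest first.
missing_share = null_mask.mean().sort_values(ascending=False)
print(missing_share)


def plot_missing(mask):
    """Draw the null mask as a heat map (hypothetical helper; needs seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    # With the viridis colormap, missing cells (True) show up in yellow.
    ax = sns.heatmap(mask, cbar=False, yticklabels=False, cmap="viridis")
    ax.set_title("Missing values by column")
    plt.show()
```

Calling plot_missing(null_mask) draws the heat map, while missing_share gives the same information as plain numbers, which is handy when a column is only partly empty.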


Summary statistics

Next, let's calculate some basic statistics on our numerical values. Age and trip duration were the ones I was interested in. I had to do some simple math to convert the riders' birth years to ages and the trip durations from seconds to minutes. Looking at these two features across the three years, several points are worth mentioning:

  • The maximum trip duration is close to 8 hours in all three years, which suggests that we have some outliers. The median is around 10 minutes, which doesn't seem like a lot, but it makes sense considering how close the stations are to each other.
  • The oldest rider is 85 years old, which I find adorable!
  • Riders seem to get younger over time, whereas trip duration has seen a slight increase.
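
The "simple math" mentioned above could look something like the sketch below. It assumes pandas and that age is taken as the year of the trip minus the birth year, which is my guess at a reasonable definition and not necessarily the exact one used here. The three rows are copied from the preview table.

```python
import pandas as pd

# Three rows copied from the preview of the trips table.
trips = pd.DataFrame({
    "starttime": pd.to_datetime(
        ["10/13/2014 10:31", "10/13/2014 10:32", "10/13/2014 10:33"],
        format="%m/%d/%Y %H:%M",
    ),
    "tripduration": [985.935, 926.375, 883.831],  # seconds
    "birthyear": [1960.0, 1970.0, 1988.0],
})

# Assumption: age = year of the trip minus birth year.
trips["year"] = trips["starttime"].dt.year
trips["age"] = trips["year"] - trips["birthyear"]

# Seconds to minutes.
trips["duration_min"] = trips["tripduration"] / 60

# count, mean, std, min, quartiles and max for each year of trips.
summary = trips.groupby("year")[["age", "duration_min"]].describe()
print(summary)
```

Grouping by year before calling describe() is what makes it possible to compare the median and maximum from one year to the next.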


Age and gender distribution

Let's take a closer look at the people who ride those Pronto bikes. Who are they? If we plot their age, we can observe that millennials make up most of the clientele. This is not surprising. However, it is interesting to note how clearly 29-year-olds dominate every other age, at more than 10% to be precise! Now, it's important to remember that both the gender and birth year columns have a lot of missing values. Hence, this observation can't be conclusive.

When it comes to gender, males largely outnumber females (73% vs. 27%). Since women are typically more sensitive to their surroundings than men, this could mean that they perceive the built environment as unsafe or less practical for cycling. It would be interesting to compare this ratio with other cities and establish what would be considered an acceptable ratio.
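
As a rough illustration of how such shares can be computed, here is a small sketch assuming pandas. The six riders are invented, so the resulting percentages are not the ones quoted above.

```python
import pandas as pd

# Invented riders; None marks the missing gender and age values.
riders = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male", None, None],
    "age": [29.0, 29.0, 35.0, 41.0, None, None],
})

# value_counts() ignores missing values by default, so these shares are
# computed among riders whose gender or age is known.
gender_share = riders["gender"].value_counts(normalize=True)
age_share = riders["age"].value_counts(normalize=True)

# Fraction of rows where gender is actually recorded.
known_fraction = riders["gender"].notna().mean()

print(gender_share)
print(age_share)
print(f"gender known for {known_fraction:.0%} of rows")
```

Printing known_fraction next to the shares is a reminder that the percentages only describe the riders whose details were recorded.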

Duration of trips

For the duration of trips, we saw earlier that the median is about 10 minutes and that there are some outliers in our data. Although not ideal, we can filter out the rows that don't fall between the 5th and 95th percentiles of the trip durations, which results in more readable visualizations.
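
A minimal sketch of this kind of trimming, assuming pandas and reading "5% and 95%" as the 5th and 95th percentiles. The durations are made up, with one 8-hour outlier thrown in.

```python
import pandas as pd

# Made-up trip durations in minutes, including one 8-hour outlier.
trips = pd.DataFrame({
    "duration_min": [1.0, 5.0, 8.0, 9.0, 10.0, 10.0, 11.0, 12.0, 15.0, 480.0],
})

# 5th and 95th percentiles of the durations.
low, high = trips["duration_min"].quantile([0.05, 0.95])

# Keep only the rows that fall inside that band (both ends inclusive).
trimmed = trips[trips["duration_min"].between(low, high)]

print(f"kept {len(trimmed)} of {len(trips)} rows")
```

The trimmed rows are only meant for plotting; the summary statistics are better computed on the full data so the outliers stay visible there.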

A KDE plot showing the joint distribution of two variables: in this case, the duration of trips and the hour of the day.

A boxplot of the duration of trips by membership type.


Trips based on month and hour of the day

Finally, I wanted to find out whether the number of trips varies with the hour of the day and the time of year. To do this, we can group the trips by month and hour and plot the counts. My preferred method is a heat map. Here, we can note the most popular hours and months. Not surprisingly, people ride the most during peak hours and during the warmer months.
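
The grouping step can be sketched as follows, assuming pandas (and seaborn for the plot). The five timestamps are invented, and plot_counts is a hypothetical helper name.

```python
import pandas as pd

# Invented start times: three trips in July, two in December.
trips = pd.DataFrame({
    "starttime": pd.to_datetime([
        "2015-07-01 08:05", "2015-07-02 08:40", "2015-07-02 17:15",
        "2015-12-01 08:10", "2015-12-03 17:30",
    ]),
})

trips["month"] = trips["starttime"].dt.month
trips["hour"] = trips["starttime"].dt.hour

# One row per month, one column per hour, each cell holding a trip count.
counts = trips.groupby(["month", "hour"]).size().unstack(fill_value=0)
print(counts)


def plot_counts(table):
    """Draw the month-by-hour counts as a heat map (hypothetical helper; needs seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(table, cmap="viridis")
    ax.set_xlabel("Hour of the day")
    ax.set_ylabel("Month")
    plt.show()
```

Passing fill_value=0 to unstack() keeps month and hour combinations with no trips as zeros, so they don't show up as blank cells in the heat map.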