TTC: Analysis of Toronto's Bus Delays

Many passengers await a TTC bus arrival

Overview

The public transit system in Toronto, like those in many other cities, experiences delays. In this project, I set out to answer three questions: “Where do bus delays occur?”, “Why do delays occur at these locations?”, and “What can we do to reduce delays?”
This project covers Toronto Bus Delays from January 2021 to June 2022. I used data that goes back to 2014 to get a broader view of some relationships.


This was a very ambitious and challenging project. Many factors affect bus delays, so it took a lot of research and effort to source the right data and merge it with my original dataset. The data-wrangling stages were especially difficult because most of the stop names in the original dataset had been entered incorrectly. I took on this project out of sheer curiosity and came out of it with a lot of new knowledge.

Tools 

Python: Data wrangling, Data visualization,  GeoSpatial Analysis, Machine Learning (Supervised & Unsupervised), Time Series Forecasting
SQL: Database creation and Joins
Excel: Data wrangling
Tableau: Reporting Results


Links

The primary datasets used in this analysis were made available by the TTC (Toronto Transit Commission) on Toronto's Dashboard.

Census data was pulled from Statistics Canada.
Datasets can be found in this project's GitHub repo (Final dataset).



Project Life Cycle

Data Sourcing: I conducted research on bus delays. This helped me ask the right questions and source the right data for the analysis.

Data understanding: I imported the datasets into Python and got a general understanding of them (the shape of each dataset, what each column represents, and the data types). I asked TTC operators about entries I didn’t understand in the delays dataset.

Data Preparation: Here I prepared the data for analysis by performing data wrangling procedures. This involved extensive cleaning, merging datasets, and generating geojson data.

Analysis: I performed advanced analysis on the final dataset. I made use of supervised and unsupervised machine learning to draw conclusions.

Presentation: I presented my final results in Tableau.

Challenges

While most of the project was challenging, the data sourcing and data wrangling stages had me brainstorming and researching for hours!

Data Sourcing Challenge

Having never been to Toronto, I had no idea what would cause their bus delays. Could it be the same factors that cause bus delays in other cities? 

The visual below shows the research I carried out to solve this problem and ask some key questions.

Visual showing steps taken for research.

Step 1: Ran Google searches on the causes of bus delays in Toronto and in general.

Step 2: Used public transit in my city for the duration of the project.

Step 3: Spoke to friends in Toronto who use the transit system.

Key Questions

How does the ridership of a bus affect its services?

How do the demographics of an area affect its bus delays?

Are there certain features (e.g. malls) in an area that could increase bus delays?

Does weather play an important role in ridership/bus delays?

Traffic congestion plays a role in delays. Are there areas more likely to experience traffic congestion?

Are there areas more likely to experience collisions?



After several data quality checks, I ended up using TTC ridership data, Toronto census data, Toronto ward features and Toronto weather data. 


Data Preparation Challenge

To perform this analysis, I needed to know the longitude and latitude of each bus stop listed in the delays dataset. This information was not in the delays dataset.  

The TTC released a folder, TTC Bus Routes and Schedules, that contains information on its bus routes and schedules (let’s call this group A).
Using the stops dataset in this folder, I discovered that most of the stop names in the delays dataset were entered incorrectly.

I corrected these entries by finding the closest matching stop name on each bus route in the official TTC dataset. This wasn’t straightforward, since the stops dataset had no route information.

The datasets in group A were connected by primary and foreign keys. I created an SQL database and ran a join query to extract the relevant information. I converted the resulting dataset into a GeoJSON file with geopandas, then used the fuzzywuzzy library to fuzzy match each stop to its correct name. This left me with about 1000 unfixable entries.
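The join step can be sketched roughly as follows. This is a minimal illustration, not the project's actual query: it assumes the group A folder follows a GTFS-like layout with routes, trips, stop_times and stops tables, and every table name, column name and row below is a made-up placeholder. It uses Python's built-in sqlite3 module so it runs without a database server.

```python
import sqlite3

# In-memory database holding a tiny, made-up sample.
# Table and column names are assumptions modelled on a GTFS-style feed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_short_name TEXT);
CREATE TABLE trips (trip_id TEXT PRIMARY KEY,
                    route_id TEXT REFERENCES routes(route_id));
CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_name TEXT,
                    stop_lat REAL, stop_lon REAL);
CREATE TABLE stop_times (trip_id TEXT REFERENCES trips(trip_id),
                         stop_id TEXT REFERENCES stops(stop_id));

INSERT INTO routes VALUES ('r1', '29'), ('r2', '52');
INSERT INTO trips VALUES ('t1', 'r1'), ('t2', 'r1'), ('t3', 'r2');
INSERT INTO stops VALUES
    ('s1', 'Stop A', 43.70, -79.45),
    ('s2', 'Stop B', 43.71, -79.46),
    ('s3', 'Stop C', 43.72, -79.44);
INSERT INTO stop_times VALUES
    ('t1', 's1'), ('t1', 's2'), ('t2', 's1'), ('t3', 's3');
""")

# The stops table has no route column, so the route is reached by
# walking routes -> trips -> stop_times -> stops.
query = """
SELECT DISTINCT r.route_short_name, s.stop_name, s.stop_lat, s.stop_lon
FROM routes r
JOIN trips t ON t.route_id = r.route_id
JOIN stop_times st ON st.trip_id = t.trip_id
JOIN stops s ON s.stop_id = st.stop_id
ORDER BY r.route_short_name, s.stop_name
"""
route_stops = conn.execute(query).fetchall()
for row in route_stops:
    print(row)
```

The DISTINCT matters here: the same stop appears once per trip in stop_times, so without it each route and stop pair would be repeated many times.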

(Refer to the script to view the entire wrangling process.)
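The fuzzy-matching idea can be illustrated with a short sketch. To stay dependency-free, it swaps in difflib from Python's standard library in place of fuzzywuzzy, so the scores will not be identical to the ones the project produced. The stop names, the route numbers, the match_stop helper and the 0.6 cutoff are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical official stop names, grouped by route
# (the kind of lookup the route/stop join would produce).
official_stops = {
    "29": ["DUFFERIN ST AT BLOOR ST WEST", "DUFFERIN ST AT KING ST WEST"],
    "52": ["LAWRENCE AVE WEST AT KEELE ST", "LAWRENCE AVE WEST AT DUFFERIN ST"],
}

def match_stop(route, raw_name, cutoff=0.6):
    """Return the closest official stop name on the route, or None.

    Restricting candidates to the stops on the same route keeps the
    search small and avoids matching a stop on the wrong route.
    """
    candidates = official_stops.get(route, [])
    cleaned = raw_name.upper().strip()
    matches = get_close_matches(cleaned, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_stop("29", "duferin st at blor st w"))
print(match_stop("52", "xyz"))
```

Entries that come back as None are the ones no candidate resembles closely enough; in a workflow like this they would be set aside as unfixable rather than forced onto a poor match.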


Analysis

Delay Locations

After wrangling, I was able to plot the delays for each bus stop on a map.  I found that most bus delays occurred at bus stations.

Using geopandas, I calculated the distance from every stop to the nearest station. This would help me look for any relationships.
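As a rough illustration of the nearest-station calculation, the sketch below uses the haversine formula in plain Python rather than geopandas, so it is a stand-in for the approach and not the project's code. The station names and all coordinates are made up.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest_station(stop, stations):
    """Return (station_name, distance_km) for the station closest to stop."""
    return min(
        ((name, haversine_km(stop[0], stop[1], lat, lon))
         for name, (lat, lon) in stations.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical station and stop coordinates (latitude, longitude).
stations = {
    "Station A": (43.70, -79.40),
    "Station B": (43.78, -79.35),
}
stop = (43.71, -79.41)

name, dist = nearest_station(stop, stations)
print(name, round(dist, 2))
```

Working in latitude and longitude directly keeps the example short; with geopandas one would normally reproject to a metric coordinate system before measuring distances.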

Map of Toronto with dots representing bus stops. Bigger dots are stops that experienced more delays.



To be able to work with demographics, I split the city into its 25 wards.

I plotted the delays on a choropleth map and found that a large share of bus delays occurred in York Centre.
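Counting delays per ward comes down to deciding which ward polygon each delay location falls in. The sketch below shows that idea with a plain-Python ray-casting test; in practice a geopandas spatial join would be the usual tool, and this is only an assumption about how the counts could be produced. The two square "wards" and the delay points are invented.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical ward boundaries as lists of (x, y) vertices.
wards = {
    "Ward A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Ward B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

# Hypothetical delay locations, one point per recorded delay.
delays = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]

def ward_of(x, y):
    """Name of the ward containing the point, or a fallback label."""
    for ward_name, polygon in wards.items():
        if point_in_polygon(x, y, polygon):
            return ward_name
    return "Outside wards"

delay_counts = Counter(ward_of(x, y) for x, y in delays)
print(delay_counts)
```

The resulting per-ward counts are the values a choropleth map shades by.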


Map of Toronto divided into 25 city wards. Shades of green represent the number of delays; darker wards experienced more delays.


Ridership Analysis

With the help of a few TTC operators and riders, I discovered that factors like traffic congestion and ridership caused bus delays.




Reddit page showing answers to questions about dataset.






Without any high-quality traffic congestion data, I settled for using ridership data.

This project would have gone much faster had there been ridership data for each bus stop, but I had to settle for total monthly ridership data from 2014 to 2022.

Hypothesis: More delays occur when more people ride the bus

I tested this hypothesis using linear regression and found that monthly ridership explained about 52% of the variation in the monthly delay count.

Given how many factors affect bus delays, I considered this significant. From this point on, my working assumption was: "If a factor could affect ridership, it could also affect bus delays".
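For readers unfamiliar with how a figure like "52%" arises, the sketch below fits a one-variable least-squares line and computes R squared, the share of variation in the outcome that the line accounts for. Reading the 52% as an R squared value is my interpretation. The numbers are invented for illustration and are not the project's data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up monthly ridership (millions of rides) and monthly delay counts.
ridership = [20.1, 22.4, 25.0, 27.3, 30.2, 31.8]
delay_count = [3100, 3600, 3400, 4200, 4100, 4800]

slope, intercept = fit_line(ridership, delay_count)
r2 = r_squared(ridership, delay_count, slope, intercept)
print(f"slope={slope:.1f}, intercept={intercept:.1f}, R^2={r2:.2f}")
```

An R squared of 0.52 would mean that about half of the month-to-month variation in delay counts moves with ridership, leaving the rest to other factors.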


Scatter plot of monthly ridership and monthly delay count

Correlations

I plotted a correlation matrix to explore any possible relationships between my variables.

The correlation I decided to explore was the positive correlation between the population of children in a ward and the number of delays in that ward.

I chose this over the correlation between stop delay count and distance to the nearest station because I was looking for a linear relationship. When plotted on a scatter plot, the child population relationship was linear, while the distance to the nearest station was not.
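A correlation matrix of this kind can be built by computing the Pearson coefficient for every pair of variables. The sketch below does that in plain Python; the variable names and ward-level values are hypothetical and chosen only to show the mechanics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up ward-level values, one entry per ward.
ward_data = {
    "child_population": [12000, 18000, 9000, 22000, 15000],
    "median_income": [42000, 35000, 51000, 30000, 38000],
    "delay_count": [310, 450, 200, 520, 330],
}

names = list(ward_data)
corr_matrix = {
    a: {b: pearson(ward_data[a], ward_data[b]) for b in names}
    for a in names
}

for a in names:
    print(a, [round(corr_matrix[a][b], 2) for b in names])
```

Pearson's coefficient only measures linear association, which is why a relationship that looks curved on a scatter plot can still show a misleadingly moderate value in the matrix.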

Correlation matrix heat map

Linear Regression Results

Hypothesis: A ward with more children will experience more delays than a ward with fewer children. 

To test this hypothesis, I ran a linear regression and found that the child population explained only 18% of the variation in the data. While some wards followed this pattern, others did not. I decided I would have to take a non-linear approach.

map showing plots for ward child population and ward delay count

Cluster Analysis Results

After seeing that most of the correlations I’d found only held for some bus stops, I performed a cluster analysis, which grouped the bus stops into seven groups. Thinking in terms of ridership made each cluster simple to interpret.

The results showed that stops at or near stations experienced more delays than stops farther from stations (stations are transport hubs, so most people go to a station to take the bus). This varied based on the location of the stop.

Stops located in wards with higher populations experienced more delays than stops located in wards with lower populations (high populations could mean high ridership).

Stops located in wards with a lower median individual income experienced more delays than stops located in wards with a higher median individual income (people with lower income may prefer to take public transit instead of using a car).

Stops located outside the designated city wards typically experienced very few delays (not many people are located here for ridership to be high). This varied based on the popularity of a stop. If a stop was located somewhere like Pearson airport, the ridership at that stop would be higher, so the delay count would also be higher.

Stops located in the downtown Toronto wards experienced very few delays because of easier access to transit. 

Refer to the Tableau storyboard for in-depth cluster analysis results.
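The write-up does not say which clustering algorithm or features were used, so the following is only a sketch of one common choice, k-means, written in plain Python. The two features (distance to the nearest station and delay count), the sample points, and the choice of two clusters for the toy data are all assumptions; the project itself grouped stops into seven clusters.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means. Returns (labels, centroids).

    Uses the first k points as starting centroids so the result is
    deterministic; real implementations usually use random restarts.
    """
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [
        min(range(k), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]
    return labels, centroids

# Made-up stops: (distance to nearest station in km, scaled delay count).
stops = [(0.1, 9.0), (5.0, 1.0), (0.2, 8.0), (4.8, 1.5), (0.3, 9.5), (5.2, 0.5)]

labels, centroids = kmeans(stops, k=2)
print(labels)
print(centroids)
```

Because k-means relies on Euclidean distance, features on very different scales (kilometres versus raw delay counts, for instance) would normally be standardized first so that no single feature dominates the grouping.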

Map showing all the bus stops divided into clusters, bus stations are marked by red stars


Interesting Find: Station location

While I was interpreting the cluster analysis, I plotted all the bus stations on a map.

I found that all the transit options in Toronto are served by the same stations, which could be a cause of overcrowding at stations.

I also found that some areas are prioritized over others in terms of station placement. Only a few wards have a high concentration of stations, while the others have barely any.

Maps of Toronto bus stations and Toronto transit stations side by side


Conclusions

Bus delays are generally short. Longer delays are less frequent and tend to occur outside peak operating hours.

Stops at or near stations are more likely to experience delays than stops far away from stations.

Wards with higher populations and lower incomes tend to experience more delays than wards with lower populations and higher incomes.

The TTC transit planning affects bus delays in an area.

Recommendations

TTC riders would benefit from more bus stations. Wards with high populations and low median incomes should be prioritized when building new stations.

The stations should only serve buses.

Limitations

There was not enough data available on TTC ridership.

Employment data would have been helpful in this case study.

Traffic congestion affects bus delays, but there is no reliable data on Toronto traffic congestion.

Next Steps

Explore the effects of subway ridership on TTC bus delays.

Explore more factors that could affect bus delays.

Perform deep analysis of bus delay times.

Run a classification algorithm to predict the locations and times of bus delays in Toronto.

Links


GitHub Repo
Dashboard