Using Google BigQuery to Analyze 1.1 Billion NYC Taxi and Uber Trips!

In 2015, New York City released detailed data on its taxi trips. The dataset covers pickup and drop-off times and locations for over 1.1 billion taxi trips from January 2009 through June 2015, including both yellow and green cabs.

Todd Schneider also obtained Uber ride data, thanks to FiveThirtyEight, covering nearly 19 million Uber rides in NYC from April–September 2014 and January–June 2015. He combined the cab data with the less detailed Uber data and published the result in a GitHub repository for anyone to use.

Todd describes his setup as follows: "I used PostgreSQL to store the data and PostGIS to perform geographic calculations, including the heavy lifting of mapping latitude/longitude coordinates to NYC census tracts and neighborhoods. The full dataset takes up 267 GB on disk, before adding any indexes. For more detailed information on the database schema and geographic calculations, take a look at the GitHub repository."
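To give a feel for what mapping a latitude/longitude coordinate to a census tract or neighborhood involves, here is a minimal Python sketch of a point-in-polygon lookup using the ray-casting test. It is only an illustration, not how PostGIS does it internally: the `ZONES` rectangles, the zone names and the helper functions are all hypothetical, and real tract boundaries are far more irregular.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a horizontal ray to the right of the point
    and flip 'inside' each time it crosses a polygon edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(point: Point, zones: Dict[str, List[Point]]) -> Optional[str]:
    """Return the name of the first zone containing the point, or None."""
    for name, polygon in zones.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical rectangular zones, NOT real census tract boundaries.
ZONES: Dict[str, List[Point]] = {
    "zone_a": [(-74.00, 40.74), (-73.97, 40.74), (-73.97, 40.77), (-74.00, 40.77)],
    "zone_b": [(-73.97, 40.74), (-73.94, 40.74), (-73.94, 40.77), (-73.97, 40.77)],
}

if __name__ == "__main__":
    pickup = (-73.985, 40.755)  # made-up pickup coordinate
    print(find_zone(pickup, ZONES))
```

Running a check like this for every pickup and drop-off across more than a billion trips is why this step counts as heavy lifting; a spatial database with spatial indexes avoids testing every polygon for every point.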

Insights from NYC Cab Data

Todd’s analysis offers some interesting insights:

  • The worst hour to travel to an airport is 4–5 PM.
  • The median taxi trip leaving Midtown for JFK Airport between 4 and 5 PM takes 64 minutes, with 10% of trips taking over 84 minutes.
  • If you leave Midtown for JFK between 10 and 11 AM, you face a median trip time of 38 minutes, with a 90% chance of getting there in less than 50 minutes.
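Figures like "a median of 64 minutes, with 10% taking over 84 minutes" are a median and a 90th percentile of trip durations for one route and hour. The sketch below shows how such numbers could be computed in Python. The `trip_minutes` values are made up for illustration and are not taken from the dataset, and the nearest-rank percentile used here is just one of several common definitions.

```python
import math
from statistics import median
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest observed value such that
    at least pct percent of the observations are at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical trip durations in minutes for one route and one hour of day.
trip_minutes = [52, 58, 61, 63, 64, 64, 66, 70, 79, 88]

if __name__ == "__main__":
    print("median:", median(trip_minutes))
    print("90th percentile:", percentile(trip_minutes, 90))
```

Reading the result the way the article does: half of the trips finish within the median time, and about one trip in ten takes longer than the 90th percentile.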

Here are some charts which illustrate the data for the NYC cabs.

Midtown to LaGuardia Cab Drive

Midtown to JFK Cab Drive

Midtown to Newark Cab Drive

Querying 1.1 billion Cab Trips for Faster Reports

Another big data enthusiast, Mark Litwintschik, took Todd’s data and used Google’s BigQuery to run very fast reports. He had earlier used Redshift, Presto on AWS EMR, Elasticsearch and even PostgreSQL on the same data, and found BigQuery to be really fast at querying the metadata of 1.1 billion NYC cab trips.

What is Google BigQuery?

This is what Google says about its own tool:

Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast SQL queries against append-only tables using the processing power of Google’s infrastructure. Simply move your data into BigQuery and let us handle the hard work. You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data.
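As a rough idea of the kind of SQL report this refers to, the sketch below runs a simple aggregate query (trip count and average fare per cab type) from Python. To keep it runnable anywhere, it uses Python's built-in `sqlite3` with an in-memory table as a local stand-in; it does not talk to BigQuery. The `trips` table, its columns and the sample rows are assumptions made for this example, not the schema used by Todd or Mark.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Each row: (cab_type, passenger_count, total_amount). Hypothetical schema.
TripRow = Tuple[str, int, float]


def trips_per_cab_type(rows: Iterable[TripRow]) -> List[Tuple[str, int, float]]:
    """Load rows into an in-memory table and return
    (cab_type, trip_count, avg_amount) sorted by cab type."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE trips ("
            "cab_type TEXT, passenger_count INTEGER, total_amount REAL)"
        )
        conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT cab_type,
                   COUNT(*)          AS trip_count,
                   AVG(total_amount) AS avg_amount
            FROM trips
            GROUP BY cab_type
            ORDER BY cab_type
            """
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample rows, not real trip records.
SAMPLE_TRIPS: List[TripRow] = [
    ("yellow", 1, 12.5),
    ("yellow", 2, 7.5),
    ("green", 1, 10.0),
]

if __name__ == "__main__":
    for cab_type, trip_count, avg_amount in trips_per_cab_type(SAMPLE_TRIPS):
        print(cab_type, trip_count, avg_amount)
```

The query itself is ordinary SQL. The point of a hosted service is that a statement of this shape can be run over billions of rows without you provisioning the hardware yourself.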

The availability of data, together with powerful big data querying and analysis tools, will change the way we use the information we gather. Our experiences, captured as data points, can be harnessed to help us chart our future. Google, or someone else, could use this data to help us plan our travel around big cities. Perhaps it could soon be a feature on Google Maps?

Featured Image Source: Flickr
