Peer reviews are a cornerstone of building high-quality data engineering projects. Beyond catching bugs and inefficiencies, they create opportunities for improvement, learning, and robust collaboration. In this two-part series, I’m diving into a peer review of the tfl-data-visualization project, which leverages public Transport for London (TfL) Oyster card data to uncover insights about passenger flows across London’s extensive rail network.
In Part 1, we’ll focus on the project’s foundation: the problem it tackles, its cloud-native architecture, and the orchestration of data ingestion. The goal is to demonstrate how a senior-level data engineering project is structured, documented, and automated for real-world impact.
Project Overview: What is tfl-data-visualization?
This project is a modern data engineering pipeline designed to analyze footfall data from London Tube and TfL Rail stations. By using open data on Oyster card tap-ins and tap-outs, the project enables granular analysis of how passengers move through the city’s transport network. The end product is a Looker Studio dashboard powered by data pipelines that automate everything from raw data ingestion to warehouse transformations.
1. Problem Description
A well-defined business problem is the first step towards a meaningful solution. This project excels here. The README clearly articulates:
- The business context: Understanding passenger flows can help optimize station management, reduce congestion, and support infrastructure decisions.
- The data source: Publicly available TfL Oyster card tap count data.
- Project goals: Automate data collection, processing, and visualization to enable data-driven insights for stakeholders.
Review Comment:
Excellent articulation of the problem and its real-world significance. The clarity helps the reader quickly understand the project’s goals and value.
2. Cloud Infrastructure and IaC
Modern data engineering projects are built for the cloud, and this project embodies that ethos. The pipeline is developed for Google Cloud Platform (GCP), featuring:
- BigQuery as the data warehouse: Scalable, cost-efficient, and optimized for analytics.
- Google Cloud Storage (GCS) for raw data: Centralized, secure cloud storage for source files.
- Infrastructure as Code (IaC) with Terraform: All GCP resources are provisioned automatically, ensuring repeatability and minimizing manual setup (the resources involved are sketched just below).
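The project provisions these resources with Terraform; purely as an illustration of what gets created, here is a minimal Python sketch using the official GCP client libraries. The project ID, bucket name, dataset name, and location are hypothetical stand-ins, not values from the repo.

```python
# Hedged sketch: the project itself uses Terraform for IaC. This Python
# equivalent only illustrates the provisioned resources; all names below
# are hypothetical.
from google.cloud import bigquery, storage

PROJECT = "my-gcp-project"       # hypothetical project ID
BUCKET = "tfl-raw-data-bucket"   # hypothetical raw-data bucket
DATASET = "tfl_oyster"           # hypothetical BigQuery dataset

# Raw-data landing zone in Google Cloud Storage.
storage.Client(project=PROJECT).create_bucket(BUCKET, location="EU")

# Analytics dataset in BigQuery.
dataset = bigquery.Dataset(f"{PROJECT}.{DATASET}")
dataset.location = "EU"
bigquery.Client(project=PROJECT).create_dataset(dataset, exists_ok=True)
```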
Review Comment:
Outstanding use of cloud technologies and automation. Leveraging Terraform for GCP infra shows strong cloud engineering practice.
3. Data Ingestion: Batch Processing and Workflow Orchestration
Automation is at the heart of reliability and scalability. The project uses Kestra as its workflow orchestrator, building a robust, automated batch pipeline that covers:
- Automated downloading of multi-year historical CSV data from TfL’s open data portal.
- Uploading files to Google Cloud Storage (these first two steps are sketched in Python after this list).
- Loading data into BigQuery, both as external tables over the raw files and as native, partitioned tables for consolidated analysis (also sketched below).
- DAG orchestration: Kestra flows define the data ingestion DAG, with scheduled weekly updates and modular subflows for maintainability.
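To make the download-and-land steps concrete, here is a hedged Python sketch. The source URL, bucket, and object path are hypothetical, and the actual project drives equivalent steps from Kestra tasks rather than a standalone script.

```python
# Hypothetical sketch of the download-and-land steps; the real pipeline
# runs equivalent tasks inside Kestra flows.
import requests
from google.cloud import storage

SOURCE_URL = "https://example.org/tfl/oyster-taps-2023.csv"  # hypothetical URL
BUCKET = "tfl-raw-data-bucket"                               # hypothetical bucket
BLOB = "raw/oyster-taps-2023.csv"                            # hypothetical object path

# Download one historical CSV from the open data portal.
response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()

# Upload the raw file unchanged to Google Cloud Storage.
blob = storage.Client().bucket(BUCKET).blob(BLOB)
blob.upload_from_string(response.content, content_type="text/csv")
```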
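The BigQuery side can be sketched the same way: an external table that queries the raw GCS files in place, plus a load into a native, date-partitioned table. The table names and partition column below are assumptions for illustration, not taken from the project.

```python
# Hypothetical sketch of the BigQuery loading step; table names and the
# partition column are illustrative assumptions.
from google.cloud import bigquery

bq = bigquery.Client()
GCS_URI = "gs://tfl-raw-data-bucket/raw/*.csv"          # hypothetical
EXTERNAL_TABLE = "my-project.tfl_oyster.taps_external"  # hypothetical
NATIVE_TABLE = "my-project.tfl_oyster.taps"             # hypothetical

# External table: query the raw CSVs in place, with no storage duplication.
ext_config = bigquery.ExternalConfig("CSV")
ext_config.source_uris = [GCS_URI]
ext_config.autodetect = True
ext_table = bigquery.Table(EXTERNAL_TABLE)
ext_table.external_data_configuration = ext_config
bq.create_table(ext_table, exists_ok=True)

# Native table: load the files into a date-partitioned table for
# consolidated, efficient analysis.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(field="travel_date"),  # hypothetical column
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
bq.load_table_from_uri(GCS_URI, NATIVE_TABLE, job_config=job_config).result()
```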
Review Comment:
Great job with fully automated workflow orchestration. Consider including visual examples or screenshots of Kestra flows to further clarify the orchestration structure.
Wrapping Up Part 1
In this first part, we’ve established the strong foundation on which the tfl-data-visualization project is built: a clearly defined problem, cloud-native architecture, and automated data ingestion. These elements are critical for any ambitious data engineering initiative, ensuring not only technical excellence but also business relevance and operational scalability.
Stay tuned for Part 2, where we’ll dive deeper into data warehousing strategies, transformation logic with dbt, dashboarding, reproducibility, and overall peer review feedback.