Introduction

In the growing creator economy, YouTube channels publish new videos every week, and creators need a simple dashboard that answers the following questions:

  • How has my channel grown over time?
  • What is the best day and time to post a video?
  • What are the most engaging videos in my channel?

To tackle this, I built an end-to-end data pipeline that automatically pulls, processes, and visualizes YouTube data.

This article goes over the steps I took to achieve this.

Part 1: Installing the Tools Natively on an Azure Ubuntu VM

Instead of using Docker, I installed everything natively because it gave me more control, better performance tuning, and more flexibility than a containerized environment would.

1. Install Apache Airflow

Apache Airflow schedules and manages all the ETL jobs automatically.

Step 1.1: Create a dedicated airflow user

First, it’s good practice to create a separate user to run Airflow (avoiding permission issues later).

sudo adduser airflow

Follow the prompts (you can skip the extra info fields). Then add the user to sudoers:

sudo usermod -aG sudo airflow

Log in as the new user:

sudo su - airflow

Step 1.2: Install system dependencies

Update the system first:

sudo apt update && sudo apt upgrade -y

Install the required libraries:

sudo apt install -y python3-pip python3-venv libpq-dev
python3-venv → To create isolated Python environments

libpq-dev → For Postgres client libraries (needed later)

Step 1.3: Set up Python Virtual Environment for Airflow

It’s highly recommended to run Airflow inside a virtual environment:

python3 -m venv airflow_venv
source airflow_venv/bin/activate

Now, inside the virtualenv, install Airflow:

pip install apache-airflow
✅ Installing Airflow takes a while (~5 mins) because it pulls many dependencies.

Step 1.4: Initialize Airflow Database

Airflow uses a metadata database to track DAG runs and tasks. We'll point it at PostgreSQL later, but for now initialize it with the defaults:

export AIRFLOW_HOME=~/airflow
airflow db init

This creates:

  • ~/airflow/airflow.cfg

  • a SQLite DB initially (we'll connect it to PostgreSQL later)

Step 1.5: Create an Admin User

Airflow needs an admin user to log in to the web UI.

airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email [email protected]

Step 1.6: Run Airflow services

You need two services running:

  • Webserver (UI)

  • Scheduler (Job runner)

Start them:

airflow webserver --port 8080

In another SSH session/tab:

airflow scheduler

You can now access Airflow at http://<your-server-ip>:8080.

At this point, verify:

  • The Airflow UI is accessible.

  • The admin login works.

2. Install Apache Spark (Data Processing Engine)

Spark handles the heavy lifting for data transformation. Make sure you have Java and Scala installed first.

Step 2.1: Download and extract Spark

wget https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xvzf spark-3.4.0-bin-hadoop3.tgz

Move it to /opt (standard for system-wide apps):

sudo mv spark-3.4.0-bin-hadoop3 /opt/spark

Step 2.2: Set Spark Environment Variables

Edit your bash profile:

nano ~/.bashrc

Add the following at the bottom:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply changes:

source ~/.bashrc

Step 2.3: Test Spark installation

Check Spark version:

spark-submit --version
✅ If you see version info without errors, Spark is installed properly.

3. Install PostgreSQL (Data Warehouse)

PostgreSQL will store all cleaned YouTube data.

Step 3.1: Install PostgreSQL Server

sudo apt update
sudo apt install -y postgresql postgresql-contrib

Step 3.2: Create Database and User

Switch to Postgres superuser:

sudo -u postgres psql

Inside the psql shell:

CREATE DATABASE youtube_analytics;
CREATE USER airflow WITH ENCRYPTED PASSWORD 'yourpassword';
GRANT ALL PRIVILEGES ON DATABASE youtube_analytics TO airflow;
\q

Now Airflow and your scripts will connect using the airflow user.
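
To double-check the new credentials outside of psql, here's a minimal connection test in Python. It's a sketch under the assumption that psycopg2-binary is installed in your virtualenv; the host, database, user, and password come straight from the steps above.

import psycopg2

# Connect with the credentials created in Step 3.2 (the password is a placeholder)
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="youtube_analytics",
    user="airflow",
    password="yourpassword",
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), current_user;")
    print(cur.fetchone())  # expect: ('youtube_analytics', 'airflow')

conn.close()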

At this point, verify:

  • psql connects successfully.

  • youtube_analytics database exists.

4. Install Grafana (Dashboarding)

Grafana visualizes the results beautifully.

Step 4.1: Add Grafana Repo

sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo apt-get update

Step 4.2: Install Grafana

sudo apt-get install grafana

Enable and start the service:

sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Step 4.3: Access Grafana

Grafana will be available at:
http://<your-server-ip>:3000

Default credentials:

  • Username: admin

  • Password: admin (you'll be prompted to change it at first login)

At this point, your server is fully ready:
Airflow + Spark + Postgres + Grafana all installed, running natively.

Part 2: Code Walkthrough

1. The Core ETL Pipeline (main.py)

This file is the heart of the whole system. It does three main things:

  • Pulls data from the YouTube API

  • Processes and enriches it using PySpark

  • Returns clean DataFrames ready to be stored

Load API Key and Initialize Spark + YouTube Client

from dotenv import load_dotenv
import os
from googleapiclient.discovery import build
from pyspark.sql import SparkSession

load_dotenv()
google_api_key = os.getenv('API_KEY')

This section loads your .env file and gets the API key for authenticating with the YouTube Data API.
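
Since a missing key only fails later with a cryptic API error, a small guard right after loading it can help. This just continues the snippet above:

# Fail fast if the .env file wasn't found or API_KEY isn't set
if not google_api_key:
    raise RuntimeError("API_KEY not set - check the .env file next to main.py")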

def get_spark():
    return SparkSession.builder \
        .appName("YoutubeAnalytics") \
        .config("spark.jars", "/path/to/postgresql-42.6.0.jar") \
        .master("local[*]") \
        .getOrCreate()

Creates a local Spark session and includes the JDBC driver to connect to PostgreSQL later.

def get_youtube():
    return build('youtube', 'v3', developerKey=google_api_key)

Builds the YouTube API client using the API key.

Get Subscriber Count (getSubscribers())

request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
response = request.execute()
subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])

This gets the current subscriber count for the specified channel.

subscriber_data = [(date.today(), subscriber_count)]
df = spark.createDataFrame(subscriber_data, ["date", "subscribers"])

Stores the date + subscriber count as a Spark DataFrame. This helps visualize growth over time later.
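
Putting those two snippets together, getSubscribers() ends up roughly like the sketch below (error handling omitted; the channel ID and column names match the excerpts above):

from datetime import date

def getSubscribers():
    spark = get_spark()
    youtube = get_youtube()

    # Fetch the channel's statistics block from the Data API
    request = youtube.channels().list(part='statistics', id='UCtxD0x6AuNNqdXO9Wp5GHew')
    response = request.execute()
    subscriber_count = int(response['items'][0]['statistics']['subscriberCount'])

    # One row per run: today's date plus the current subscriber count
    subscriber_data = [(date.today(), subscriber_count)]
    return spark.createDataFrame(subscriber_data, ["date", "subscribers"])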

Get Top Videos by Engagement (get_videos)

# Loop through playlist videos
request = youtube.playlistItems().list(part='snippet,contentDetails', playlistId=playlist_id, maxResults=50)

Pulls all video IDs from the "Uploads" playlist.

# Pull stats for each video
data_request = youtube.videos().list(part='snippet,statistics', id=','.join(video_ids))

Fetches views, likes, and comments for each video.

engagement_rate = round((like_count + comment_count) / view_count, 2)

Calculates engagement rate to identify top-performing videos.

best_df = df.orderBy(col("engagement_rate").desc()).limit(5)

Sorts and selects the top 5 most engaging videos.

Best Time to Post (best_post_time)

df = df.withColumn("hour", hour(col('published_date')))
df = df.withColumn("day", date_format(col('published_date'), "E"))

Extracts the day of week and hour from the video publish time.

df = df.withColumn("engagement", col('likes') + col('view_count') + col('comments'))

Creates an engagement score to use for averaging.

grouped = df.groupBy("day", "hour").avg("engagement")

Aggregates average engagement by day/hour pair.

result = grouped.select("day_hour", "avg_engagement").orderBy(col("avg_engagement").desc()).limit(3)

Returns the best 3 day-hour combinations for posting.
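
That final select references day_hour and avg_engagement, which implies a rename and a string concat that the excerpts skip. A sketch of that intermediate step (column names assumed from the query itself):

from pyspark.sql.functions import avg, col, concat_ws

# Average engagement per (day, hour) bucket, then build a readable "Mon 14"-style label
grouped = df.groupBy("day", "hour").agg(avg("engagement").alias("avg_engagement"))
grouped = grouped.withColumn(
    "day_hour", concat_ws(" ", col("day"), col("hour").cast("string")))

result = grouped.select("day_hour", "avg_engagement") \
    .orderBy(col("avg_engagement").desc()) \
    .limit(3)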

2. The Controller File (controller.py)

This script is responsible for executing the ETL logic and pushing the results to PostgreSQL.

from main import getSubscribers, get_videos, best_post_time

subscriber_df = getSubscribers()
videos_list, best_videos_df = get_videos()
best_post_time_df = best_post_time(videos_list)

It calls the three main ETL functions and stores their outputs.

subscriber_df.write.jdbc(url=jdbc_url, table="subscriber_data", mode='append', properties=properties)

Writes each DataFrame into its corresponding PostgreSQL table.

Tables used:

  • subscriber_data

  • best_post_time

  • best_performing_videos

Clean separation between data logic and storage logic.
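
For completeness, the JDBC settings in controller.py look roughly like this. The host and password are placeholders; only the table names and the append mode come from the article:

jdbc_url = "jdbc:postgresql://localhost:5432/youtube_analytics"
properties = {
    "user": "airflow",
    "password": "yourpassword",          # password set in Part 1, Step 3.2
    "driver": "org.postgresql.Driver",   # provided by the JAR passed via spark.jars
}

# One write per DataFrame; each run appends a fresh snapshot
subscriber_df.write.jdbc(url=jdbc_url, table="subscriber_data", mode="append", properties=properties)
best_post_time_df.write.jdbc(url=jdbc_url, table="best_post_time", mode="append", properties=properties)
best_videos_df.write.jdbc(url=jdbc_url, table="best_performing_videos", mode="append", properties=properties)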

3. The Airflow DAG (youtube_dag.py)

This DAG automates running the pipeline daily.

default_args = {
    "owner": "Batru",
    "start_date": datetime(2025, 4, 23),
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "email_on_failure": True,
    "email": ["[email protected]"]
}

Defines retry behavior and email alerts on failure.

task_youtube = BashOperator(
    task_id="task_youtube",
    bash_command="""
        source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &&
        python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
    """
)

The task activates the Python virtual environment and runs controller.py, kicking off the full ETL.
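
Stitching the pieces together, the DAG file has roughly this shape. The dag_id and the daily schedule are my assumptions; the default_args and the BashOperator are the ones shown above:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "Batru",
    "start_date": datetime(2025, 4, 23),
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
    "email_on_failure": True,
    "email": ["<alert-email>"],  # placeholder; the address is elided in the article
}

# dag_id and schedule are assumed; the article only shows the task itself
with DAG(
    dag_id="youtube_analytics",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    task_youtube = BashOperator(
        task_id="task_youtube",
        bash_command="""
            source /home/mombasa/projects/youtubue_analytics_dashboard/venv/bin/activate &&
            python3 /home/mombasa/projects/youtubue_analytics_dashboard/controller.py
        """,
    )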

Part 3: Visualizing YouTube Data with Grafana

With the data successfully loaded into PostgreSQL from our Spark job (via Airflow), it’s time to bring it to life visually using Grafana.

Grafana is already installed on our Ubuntu server. Here's how we set it up and created dashboards that answer our key analytics questions.

Connect Grafana to PostgreSQL

  • Log in to Grafana with the default credentials:

Username: admin

Password: admin (you'll be asked to reset it on first login)

  • Add PostgreSQL as a Data Source and fill in:

Host: localhost:5432 or your DB host

Database: youtube_analytics

User and Password: your PostgreSQL credentials

  • Click Save & Test.

Create the Dashboards

Now that the DB is connected, let’s create panels to visualize the insights.

1. Channel Growth Over Time

Query from subscriber_data table:

SELECT date, subscribers FROM subscriber_data ORDER BY date;

Visualization: Use a Time Series panel

Y-axis: Subscriber count

X-axis: Date

This shows subscriber growth trends over time.

2. Best Performing Videos by Engagement

Query from best_performing_videos:

SELECT title, engagement_rate FROM best_performing_videos ORDER BY engagement_rate DESC LIMIT 5;

Visualization: Bar Chart

X-axis: Video titles

Y-axis: Engagement rate

This reveals which videos truly resonate with the audience.

3. Best Time to Post

Query from best_post_time:

SELECT day_hour, avg_engagement FROM best_post_time ORDER BY avg_engagement DESC LIMIT 5;

Visualization: Table or Bar Chart

The final dashboard ends up looking something like this:

(Screenshot of the final Grafana dashboard with the three panels above.)