🔍 Exploratory Data Analysis and Data Visualization with Python
Streaming services have become a crucial part of our entertainment routine. In this blog, we’ll walk through an Exploratory Data Analysis (EDA) and Data Visualization of Amazon Prime Video content using Python. This includes data cleaning, visualization, and insight extraction. Let’s dive in! 💻📊
First things first
Clone the repository using the link below and install the libraries listed in the requirements.txt file.
Link to the GitHub repository: Amazon-prime
⚙️ Setting Up the Environment
We begin by importing the essential Python libraries for data analysis and visualization.
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import matplotlib.colors
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
from datetime import datetime
import datetime as dt
📥 Loading and Understanding the Data
Let's load the dataset and get an initial understanding of its structure.
amazon_csv = pd.read_csv('Amazon_prime_titles.csv')
amazon_csv.shape
📊 Exploratory Data Analysis
np.random.seed(0) #seed for reproducibility
amazon_csv.sample(5)
Yes, it looks like there are some missing values.
Let's do some initial data exploration and data cleaning.
🧼 Checking Missing Data
# Find null/missing values
print('Percentage of null values in the given data set:')
for col in amazon_csv.columns:
    null_count = amazon_csv[col].isna().sum()/len(amazon_csv)*100
    if null_count > 0:
        print(f'{col} : {round(null_count,2)}%')
The loop goes through each column and prints the percentage of null values for every column that has any.
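The same percentages can also be computed without an explicit loop. Here is a minimal sketch on a made-up toy frame (the `toy` data and the `null_pct` name are only for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Amazon Prime data (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "United States", np.nan, "India"],
})

# isna() gives a boolean frame; the mean of each column is its fraction of nulls
null_pct = toy.isna().mean().mul(100).round(2)

# Keep only the columns that actually have missing values, largest first
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```

Sorting the result makes it easy to see at a glance which columns need the most attention during cleaning.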
✅ Validating Columns
# Define valid values
valid_types = {"Movie", "TV Show"}
# Find invalid values
invalid_types = amazon_csv[~amazon_csv['type'].isin(valid_types)]
# Check if any invalid values exist
if invalid_types.empty:
    print("Valid values")
else:
    print("\nInvalid 'type' values:")
    print(invalid_types['type'].unique())
The code above checks whether the type column has any values other than "Movie" and "TV Show".
# Check if there are any invalid ratings (less than 0 or greater than 10)
if ((amazon_csv['rating'] < 0.0) | (amazon_csv['rating'] > 10.0)).any():
    print("Invalid")
else:
    print("Valid range")
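If the check ever reports "Invalid", it helps to see which values are out of range. A small sketch of that idea, using a made-up `ratings` Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Made-up ratings, including two out-of-range values and one missing value
ratings = pd.Series([7.1, 0.0, 10.0, 11.2, -1.0, np.nan], name="rating")

# between() is inclusive on both ends by default and returns False for NaN,
# so missing ratings are excluded explicitly with notna()
out_of_range = ratings[ratings.notna() & ~ratings.between(0, 10)]
print(out_of_range.tolist())  # [11.2, -1.0]
```

Missing ratings are deliberately not treated as invalid here, since they are handled separately in the cleaning step.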
🎨 Amazon Prime Color Palette
Let's visualize the dataset using Seaborn, with the Amazon Prime color palette for all the visualizations.
# Amazon Prime Video brand colors
sns.palplot(['#00A8E1', '#232F3E', '#FFFFFF', '#B4B4B4'])
plt.title("Amazon Prime Brand Palette", loc='left', fontweight="bold", fontsize=16, color='#4a4a4a', y=1.2)
plt.show()
📈 Summary Statistics
# summary statistics
print("\nSummary Statistics:\n", amazon_csv.describe())
The above code gives us the summary statistics of the data set.
📉 Histograms and Box Plots
# new column 'duration_numeric'
amazon_csv['duration_numeric'] = amazon_csv['duration'].str.extract(r'(\d+)').astype(float)
#seaborn style and color palette
sns.set(style="whitegrid", rc={"font.family": "serif"})
palette = ['#00A8E1', '#232F3E', '#FFFFFF', '#B4B4B4']
#subplots for histograms
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle("Exploratory Data Analysis by plotting the Histogram and Box Plots", fontsize=16, fontweight='bold', family='serif')
# aesthetics
for ax in axes:
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
# Histogram for Release Year
sns.histplot(amazon_csv['release_year'], bins=30, kde=True, ax=axes[0], color=palette[0])
axes[0].set_title("Distribution of Release Years")
axes[0].set_xlabel("Year")
# Histogram for Rating
sns.histplot(amazon_csv['rating'].dropna(), bins=30, kde=True, ax=axes[1], color=palette[1])
axes[1].set_title("Distribution of Ratings")
axes[1].set_xlabel("Rating")
# Histogram for Duration
sns.histplot(amazon_csv['duration_numeric'].dropna(), bins=30, kde=True, ax=axes[2], color=palette[3])
axes[2].set_title("Distribution of Duration (Minutes)")
axes[2].set_xlabel("Duration (Minutes)")
plt.tight_layout()
plt.show()
# box plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax in axes:
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
sns.boxplot(y=amazon_csv['release_year'], ax=axes[0], color=palette[0])
axes[0].set_title("Box Plot of Release Years")
sns.boxplot(y=amazon_csv['rating'], ax=axes[1], color=palette[1])
axes[1].set_title("Box Plot of Ratings")
sns.boxplot(y=amazon_csv['duration_numeric'], ax=axes[2], color=palette[3])
axes[2].set_title("Box Plot of Duration (Minutes)")
plt.tight_layout()
plt.show()
Histograms: These display the frequency distributions of release year, rating, and duration. Ratings have a roughly bell-shaped spread between 0 and 10, with a peak around 7.0. Release years are skewed toward 2005 and later, which shows that the catalog leans toward newer content.
Box plots: These show the distribution of release years, ratings, and durations, highlighting outliers. The duration box plot shows a value of about 600 minutes, which could be an outlier or simply a very long movie.
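To decide whether a value like 600 minutes counts as an outlier, one common convention is the 1.5 × IQR rule that box plot whiskers are usually based on. A minimal sketch with made-up durations (the numbers below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up durations in minutes, with one very long title
durations = pd.Series([85, 90, 92, 95, 100, 105, 110, 600], name="duration_numeric")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = durations.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier
outliers = durations[(durations < lower) | (durations > upper)]
print(outliers.tolist())  # [600]
```

A flagged value is only a candidate for review; a genuinely long movie would still be valid data.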
⚠️ Insights on Missing Values
Missing values were found in several columns, including director, cast, country, date_added, and rating.
- The date_added column has the most missing values
- The rating column has the fewest missing values
- A fair share of values are missing in director, followed by cast, country, and age_group.
🧹 Data Cleaning
🌍 Handling Missing Country Data
Let's deal with the missing data.
For the given dataset, with N = 7786, my approach to data cleaning is:
I will replace NULL values in the 'country' column with the MODE value.
Why: Since the 'country' column is categorical, replacing missing values with the most common country (the mode) seems like a good choice. It keeps every row in the analysis and fills the gaps with the most frequent value.
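One thing to keep in mind with mode imputation is that it increases the share of the most frequent category. A small sketch with a made-up `country` Series shows the effect:

```python
import numpy as np
import pandas as pd

# Made-up country values with two missing entries
country = pd.Series(
    ["India", "United States", np.nan, "United States", np.nan], name="country"
)

# Share of each country before filling (NaN is excluded by default)
share_before = country.value_counts(normalize=True)

# Fill missing values with the mode, then recompute the shares
filled = country.fillna(country.mode()[0])
share_after = filled.value_counts(normalize=True)

print(round(share_before["United States"], 2))  # 0.67
print(round(share_after["United States"], 2))   # 0.8
```

This is worth remembering later when reading the country-wise counts, because the leading country partly reflects the filled-in values.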
🗑️ Dropping Columns and Rows
I will drop a row if its 'Cast' value is NULL.
Why: The 'Cast' column is crucial for understanding a movie’s popularity. It’s better to drop such a row so it doesn't distort the analysis.
I will drop the whole 'Date Added' column since it might not provide any significant insight, especially since there are so many NULL values.
Why: 'Date Added' has too many NULL values, and it’s probably not useful for the analysis.
🎭 Handling Missing Cast and Director Data
I will keep the 'Director' as it is.
Why: The 'Director' column seems important, and there doesn't seem to be much missing data here. So, it’s best to leave it untouched because it might provide important information for understanding the movie.
#Drop the column date_added
amazon_csv.drop(columns=['date_added'], inplace=True)
# Replace missing values in 'country' with its mode (most frequent value)
amazon_csv['country'] = amazon_csv['country'].fillna(amazon_csv['country'].mode()[0])
# Replace missing values in 'director' and 'cast' with "Not available"
amazon_csv[['director', 'cast']] = amazon_csv[['director', 'cast']].fillna("Not available")
# Drop rows where 'age group' or 'rating' are missing
amazon_csv.dropna(subset=['age_group', 'rating'], inplace=True)
# Reset index after dropping rows
amazon_csv.reset_index(drop=True, inplace=True)
📽️ Data Visualization
🕰️ Amazon Prime Timeline
Let's create a timeline to show the progress of Amazon Prime over the years. The timeline plotting code is inspired by Subin An's notebook.
# Set figure & Axes
fig, ax = plt.subplots(figsize=(15, 4), constrained_layout=True)
ax.set_ylim(-2, 1.75)
ax.set_xlim(0, 10)
# Timeline : line
ax.axhline(0, xmin=0.1, xmax=0.68, c='#B4B4B4', zorder=1)
# these values go on the numbers below
tl_dates = [
"1994\nFounded",
"2006\nIntroduced Prime",
"2015\n100M Users",
"2024\nPrime membership hits 250M"
]
tl_x = [1, 2.6, 4.3,6.8]
## these go on the numbers
tl_sub_x = [2.3,3.5,4.1,5.1,5.8]
tl_sub_times = [
"2005","2010","2014","2016","2020"
]
# these values go on Stemplot : vertical line
tl_text = [
"Amazon Prime Launched","First Amazon Original","Amazon adds streaming benefits", "India Launch (My birthplace)", "Prime Video Goes Global"]
# Timeline : Date Points
ax.scatter(tl_x, np.zeros(len(tl_x)), s=120, c='#4a4a4a', zorder=2)
ax.scatter(tl_x, np.zeros(len(tl_x)), s=30, c='#fafafa', zorder=3)
# Timeline : Time Points
ax.scatter(tl_sub_x, np.zeros(len(tl_sub_x)), s=50, c='#4a4a4a',zorder=4)
# Date Text
for x, date in zip(tl_x, tl_dates):
    ax.text(x, -0.55, date, ha='center',
            fontweight='bold',
            color='#4a4a4a',fontfamily='serif',fontsize=12)
# Stemplot : vertical line
levels = np.zeros(len(tl_sub_x))
levels[::2] = 0.3
levels[1::2] = -0.3
markerline, stemline, baseline = ax.stem(tl_sub_x, levels)
plt.setp(baseline, zorder=0)
plt.setp(markerline, marker=',', color='#4a4a4a')
plt.setp(stemline, color='#232F3E')
# Remove the Spine around the plot
for spine in ["left", "top", "right", "bottom"]:
    ax.spines[spine].set_visible(False)
# Ticks
ax.set_xticks([])
ax.set_yticks([])
# Title
ax.set_title("Amazon through the years", fontweight="bold", fontsize=16,fontfamily='serif', color='#4a4a4a')
ax.text(2.4,1.57,"From Fast Shipping to Global Streaming - Popular streaming service for movies and TV shows.", fontsize=12, fontfamily='serif',color='#4a4a4a')
# Text
for idx, x, time, txt in zip(range(1, len(tl_sub_x)+1), tl_sub_x, tl_sub_times, tl_text):
    ax.text(x, 1.3*(idx%2)-0.5, time, ha='center',
            fontweight='bold',
            color='#4a4a4a' if idx!=len(tl_sub_x)-1 else '#00A8E1', fontfamily='serif',fontsize=11)
    ax.text(x, 1.3*(idx%2)-0.6, txt, va='top', ha='center',fontfamily='serif',
            color='#4a4a4a' if idx!=len(tl_sub_x)-1 else '#00A8E1')
plt.show()
🌐 Country-Wise Content Analysis
Helper code for a cleaner display of country names. Since a few titles list more than one country, I keep only the first one. The code then groups by country, totals the counts, and sorts them in descending order.
amazon_csv['count'] = 1
# Extract the first country from the 'country' column
amazon_csv['first_country'] = amazon_csv['country'].apply(lambda x: x.split(",")[0])
# Shorten country names
amazon_csv['first_country'] = amazon_csv['first_country'].replace({'United States': 'USA','United Kingdom': 'UK','South Korea': 'S. Korea'})
# Group by country and get total count
df_country = amazon_csv.groupby('first_country')['count'].sum().sort_values(ascending=False)
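Keeping only the first country is a simplification. As a point of comparison, here is a sketch of an alternative that counts a co-production once for every country it lists. The `toy` frame and the variable names are made up for illustration:

```python
import pandas as pd

# Made-up titles, two of which list more than one country
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "United Kingdom, United States"],
})

# First-listed country only (the approach used in this post)
first_only = toy["country"].str.split(",").str[0].str.strip().value_counts()

# Alternative: one row per (title, country) pair
exploded = toy.assign(country=toy["country"].str.split(",")).explode("country")
exploded["country"] = exploded["country"].str.strip()
all_listed = exploded["country"].value_counts()

print(first_only["United States"], all_listed["United States"])  # 1 2
```

Which version is better depends on the question: first-country counts keep one row per title, while the exploded version gives every listed country credit.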
Let's use Plotly's choropleth map to show the country-wise origin of Movies/TV Shows on Amazon Prime.
Load a second CSV with the ISO codes of countries, since the choropleth needs a location code for each country as input.
Merge it with the country counts from the Amazon data to map them onto the choropleth.
# Load the ISO code CSV data
iso_df = pd.read_csv('country_iso_codes.csv')
# Merge the country counts with the ISO codes
merged_df = pd.merge(df_country, iso_df, left_index=True, right_on='Country')
# Define an Amazon Prime themed color scale for better visibility
# Positions run from 0.0 (lowest count) to 1.0 (highest count)
custom_colorscale = [
    (0.0, '#00A8E1'), # Prime Blue (lowest counts)
    (0.1, '#0073CF'), # Deep Blue
    (0.2, '#005BB5'), # Darker Blue
    (0.3, '#004494'), # Navy Blue
    (0.4, '#002E73'), # Deep Navy
    (0.5, '#001A52'), # Dark Indigo
    (0.6, '#232F3E'), # Amazon Dark Blue
    (1.0, '#0F1111') # Almost Black (highest counts)
]
# Create the choropleth map
fig = go.Figure(data=go.Choropleth(
locations=merged_df['ISO Code'], # Country ISO codes
z=merged_df['count'], # Data values
text=merged_df['Country'], # Hover text
colorscale=custom_colorscale,
marker_line_color='darkgray',
marker_line_width=0.6,
colorbar_title='Count',
))
# Update layout
fig.update_layout(
title={
'text': "TV Shows / Movies origin country available on Amazon Prime",
'x': 0.5,
'xanchor': 'center',
'yanchor': 'top',
'font': dict(
size=30,
family='serif',
weight='bold'
),
'pad': dict(t=20)
},
geo=dict(
showframe=False,
showcoastlines=True,
projection_type='equirectangular'
)
)
# Show the figure
fig.show()
We can clearly see that the majority of the content on Amazon Prime is produced in the USA, followed by India, the UK, South Korea, France, and Australia. This could be because Hollywood and Bollywood are major producers of Movies and TV Shows. K-dramas and K-pop also give South Korea a strong foothold. It could also partly be a result of the population of the respective countries, which increases content production.
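A merge on country names only works when the names match exactly, so shortened labels such as 'USA' can drop out of the map without any error. Below is a hedged sketch of one way to check this, assuming the ISO file uses full country names. The `counts`, `iso`, and `full_names` objects are made-up stand-ins, and "Unknownland" is a deliberately fake entry:

```python
import pandas as pd

# Made-up country counts using the shortened display names
counts = pd.Series({"USA": 5, "India": 3, "S. Korea": 1, "Unknownland": 1}, name="count")
counts.index.name = "first_country"

# Made-up ISO lookup table with full country names
iso = pd.DataFrame({
    "Country": ["United States", "India", "South Korea"],
    "ISO Code": ["USA", "IND", "KOR"],
})

# Undo the display-friendly short names before joining
full_names = {"USA": "United States", "UK": "United Kingdom", "S. Korea": "South Korea"}
counts_df = counts.rename(index=full_names).reset_index()

# A left merge with indicator=True marks rows that found no ISO match
merged = counts_df.merge(iso, left_on="first_country", right_on="Country",
                         how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "first_country"].tolist()
print(unmatched)  # ['Unknownland']
```

An empty `unmatched` list would mean every country in the counts made it onto the map.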
🍿 Movie vs TV Show Ratio
Let's plot the share of Movies and TV Shows on Amazon Prime.
# Find the ratio of type
x=amazon_csv.groupby(['type'])['type'].count()
y=len(amazon_csv)
r=((x/y)).round(2)
mf_ratio = pd.DataFrame(r).T
print(mf_ratio)
The code above computes the share of Movies and TV Shows, which is used in the next step for the visualization.
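The same ratio can be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up `types` Series that mirrors the 80/20 split:

```python
import pandas as pd

# Made-up 'type' column with 8 movies and 2 TV shows
types = pd.Series(["Movie"] * 8 + ["TV Show"] * 2, name="type")

# normalize=True returns shares instead of raw counts
ratio = types.value_counts(normalize=True).round(2)
mf_ratio = ratio.to_frame().T

# Labels for the plot can be derived from the data instead of being typed in
movie_label = f"{ratio['Movie']:.0%}"
show_label = f"{ratio['TV Show']:.0%}"
print(movie_label, show_label)  # 80% 20%
```

Deriving the labels from `ratio` keeps the annotations in sync if the dataset changes.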
# 'mf_ratio' as calculated above
mf_ratio = pd.DataFrame({'Movie': [0.8], 'TV Show': [0.2]})
fig, ax = plt.subplots(figsize=(6, 1.5))
# single horizontal stacked bar
ax.barh(0, mf_ratio['Movie'], color='#00A8E1', alpha=0.9)
ax.barh(0, mf_ratio['TV Show'], left=mf_ratio['Movie'], color='#232F3E', alpha=0.9)
# Remove axes details
ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticks([])
ax.spines[['top', 'left', 'right', 'bottom']].set_visible(False)
# Annotations for Movie & TV Show
ax.annotate("80%", xy=(mf_ratio['Movie'][0] / 2, 0), va='center', ha='center', color='white', fontsize=20, fontweight='bold', fontfamily='serif')
ax.annotate("20%", xy=(mf_ratio['Movie'][0] + mf_ratio['TV Show'][0] / 2, 0), va='center', ha='center', fontsize=20, color='white', fontweight='bold', fontfamily='serif')
fig.text(1.1, 1, 'Insight', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(1.1, 0.40, '''
The graph shows that the content available
on Amazon Prime is mostly Movies (80%),
with TV Shows adding up to only 20%. This could
be a result of the country distribution which
we saw in the previous plot, or maybe
TV Shows are simply a newer trend than movies.
'''
, fontsize=8, fontweight='light', fontfamily='serif')
# Creates a vertical line to separate the insight
import matplotlib.lines as lines
l1 = lines.Line2D([1, 1], [0, 1], transform=fig.transFigure, figure=fig,color='black',lw=0.2)
fig.lines.extend([l1])
# Title
fig.text(0.125, 1.1, 'Movie & TV Show Distribution', fontsize=14, fontweight='bold', fontfamily='serif')
fig.text(0.64,0.9,"Movie", fontweight="bold", fontfamily='serif', fontsize=15, color='#00A8E1')
fig.text(0.77,0.9,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.8,0.9,"TV Show", fontweight="bold", fontfamily='serif', fontsize=15, color='#221f1f')
plt.show()
The graph shows that the content available on Amazon Prime is mostly Movies (80%), with TV Shows adding up to only 20%. This could be a result of the country distribution we saw in the previous plot, or maybe TV Shows are simply a newer trend than movies.
🎬 Top Directors on Amazon Prime
Now let's analyze the director column and find the top 6 directors.
# Filter out 'Not Available' values before counting occurrences
filtered_directors = amazon_csv[amazon_csv['director'] != "Not available"]
# Count the number of titles per director
director_counts = filtered_directors['director'].value_counts()
# Select the top 6 directors
top_directors = director_counts[:6]
# Define bar colors, one per bar
colors = ['#232F3E'] * len(top_directors) # Default dark color
colors[:3] = ['#00A8E1'] * 3 # Highlight top 3 directors
# Create figure and axis for the bar chart
fig, ax = plt.subplots(figsize=(12, 6))
# Plot the bar chart
ax.bar(top_directors.index, top_directors.values, width=0.5, color=colors, edgecolor='darkgray', linewidth=0.6)
# Get the maximum value for dynamic annotation positioning
max_value = top_directors.max()
# Add value labels above the bars
for i, director in enumerate(top_directors.index):
    ax.annotate(f"{top_directors[director]} Content",
                xy=(i, top_directors[director] + max_value * 0.05), # 5% above the bar
                ha='center', va='bottom', fontweight='light', fontfamily='serif')
# Remove top, left, and right borders
for side in ['top', 'left', 'right']:
    ax.spines[side].set_visible(False)
# Set x-axis labels
ax.set_xticks(range(len(top_directors.index))) # Correct tick positioning
ax.set_xticklabels(top_directors.index, fontfamily='serif', rotation=0, ha='center')
ax.set_yticklabels('', fontfamily='serif', rotation=0, ha='center')
# Add horizontal grid lines for better readability
ax.grid(axis='y', linestyle='-', alpha=0.4)
ax.set_axisbelow(True) # Place grid lines behind bars
# Add a thick bottom border
plt.axhline(y=0, color='black', linewidth=1.3, alpha=0.7)
fig.text(0.95, 0.75, 'Insight', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.95, 0.45, '''
I have chosen only the Top 6 for better visualization
because the majority of directors have a single-digit
number of titles under their name.
Director Mark Knight has the most
Movies/TV Shows hosted on Amazon Prime,
with 104 titles, followed by
Cannis Holder (60) and Moonbug E (36).
It's interesting that although India and South Korea are among
the top producers, their directors don't hold a monopoly.
'''
, fontsize=8.5, fontweight='light', fontfamily='serif')
# Remove tick marks
ax.tick_params(axis='both', which='major', labelsize=12, length=0)
fig.text(0.65,0.75,"Top 3", fontweight="bold", fontfamily='serif', fontsize=15, color='#00A8E1')
fig.text(0.71,0.75,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.73,0.75,"Rest 3", fontweight="bold", fontfamily='serif', fontsize=15, color='#221f1f')
# Title styling
plt.title('Top 6 Directors on Amazon Prime', fontsize=16, fontweight='bold', fontfamily='serif')
# Display the plot
plt.show()
Director Mark Knight has the most Movies/TV Shows hosted on Amazon Prime, with 104 titles, followed by Cannis Holder (60) and Moonbug E (36).
It's interesting that although India and South Korea are among the top producers, their directors don't hold a monopoly.
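The observation that most directors have only a handful of titles can be checked with a quick calculation. A sketch on a made-up `directors` Series (the names and counts are placeholders, not real data):

```python
import pandas as pd

# Made-up director column: one prolific director and several with few titles
directors = pd.Series(["A"] * 12 + ["B"] * 3 + ["C", "D", "E"], name="director")

# Titles per director, sorted from most to fewest
counts = directors.value_counts()

# Top N directors for the bar chart
top = counts.head(2)

# Share of directors with fewer than 10 titles
single_digit_share = (counts < 10).mean()
print(list(top.index), single_digit_share)  # ['A', 'B'] 0.8
```

A high share here supports limiting the bar chart to a small Top N.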
⏳ Chronological availability of Movies & TV Shows
# To create a layout with length and width defined
fig, ax = plt.subplots(figsize=(12, 6))
color = ["#00A8E1", "#221f1f"]
# Annotation
for i, mtv in enumerate(amazon_csv['type'].value_counts().index):
    mtv_rel = amazon_csv[amazon_csv['type']==mtv]['release_year'].value_counts().sort_index()
    ax.plot(mtv_rel.index, mtv_rel, color=color[i], label=mtv)
    ax.fill_between(mtv_rel.index, 0, mtv_rel, color=color[i], alpha=0.9)
ax.yaxis.tick_right()
# Creates a horizontal line below in x axis
ax.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)
# Limits for the y axis ticks
ax.set_ylim(0, 1000)
#To remove the box around
for s in ['top', 'right','bottom','left']:
    ax.spines[s].set_visible(False)
ax.grid(False)
plt.xticks(fontfamily='serif')
plt.yticks(fontfamily='serif')
# Insights
fig.text(0.30, 0.85, 'Chronological availability of Movies & TV Shows', fontsize=16, fontweight='bold', fontfamily='serif')
fig.text(0.13, 0.75, 'Insight', fontsize=15, fontweight='bold', fontfamily='serif')
fig.text(0.13, 0.55,
'''It shows that majority of the content on Amazon Prime
is new. We can see a steep curve in the range of 2000-2020.
Definitely this will give us an insight of the age group
of the consumers.
Could this mean contents are mainly focused on youth or teens?
'''
, fontsize=8.5, fontweight='light', fontfamily='serif')
# Legends
fig.text(0.13,0.2,"Movie", fontweight="bold", fontfamily='serif', fontsize=15, color='#00A8E1')
fig.text(0.19,0.2,"|", fontweight="bold", fontfamily='serif', fontsize=15, color='black')
fig.text(0.2,0.2,"TV Show", fontweight="bold", fontfamily='serif', fontsize=15, color='#221f1f')
# removes the ticks
ax.tick_params(axis=u'both', which=u'both',length=0)
# Display the graph
plt.show()
It shows that the majority of the content on Amazon Prime is new: the curve rises steeply between 2000 and 2020. This may give us some insight into the age group of the consumers. Could this mean the content is mainly focused on youth or teens?
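The per-year counts behind a plot like this can be put in a single table, which also makes it easy to quantify how much of the catalog is recent. A minimal sketch on a made-up `toy` frame, with 2020 as an arbitrary cutoff chosen for illustration:

```python
import pandas as pd

# Made-up titles with type and release year
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [2019, 2020, 2020, 2020, 2021],
})

# Rows are years, columns are types, cells are title counts
per_year = toy.groupby(["release_year", "type"]).size().unstack(fill_value=0)
print(per_year)

# Share of titles released in or after the cutoff year
share_recent = (toy["release_year"] >= 2020).mean()
print(share_recent)  # 0.8
```

Putting a number on "new" content makes the claim easier to compare across datasets.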
Next, let's plot a heatmap of target age groups by country.
# Get the unique countries
unique_countries = set()
# Get the first country if multiple are listed
for entry in amazon_csv['country'].dropna():
    first_country = entry.split(', ')[0]
    unique_countries.add(first_country)
# Filter the dataset
amazon_csv['first_country'] = amazon_csv['country'].apply(lambda x: x.split(', ')[0] if isinstance(x, str) else x)
# Get the top 10 countries
top_countries = amazon_csv['first_country'].value_counts().head(10).index
# Creating a dictionary for clubbing
ratings_ages = {
'TV-Y': 'Kids', 'TV-G': 'Kids', 'G': 'Kids', 'ALL': 'Kids', 'ALL_AGES': 'Kids',
'TV-Y7': 'Older Kids', '7+': 'Older Kids', 'TV-PG': 'Older Kids', 'PG': 'Older Kids',
'PG-13': 'Teens', 'TV-14': 'Teens', '13+': 'Teens', '16+': 'Teens', '16': 'Teens', 'AGES_16_': 'Teens',
'18+': 'Adults', 'R': 'Adults', 'TV-MA': 'Adults', 'NC-17': 'Adults', 'NR': 'Adults', 'TV-NR': 'Adults',
'UNRATED': 'Adults', 'AGES_18_': 'Adults', 'NOT_RATE': 'Adults'
}
# Create 'target_age_group' column *before* filtering
amazon_csv['target_age_group'] = amazon_csv['age_group'].map(ratings_ages)
# Filter data for only the top 10 countries (copy so the original frame is not modified)
filtered_data = amazon_csv[amazon_csv['first_country'].isin(top_countries)].copy()
filtered_data['first_country'] = filtered_data['first_country'].replace({'United States': 'USA',
                                                                          'United Kingdom': 'UK',
                                                                          'South Korea': 'S. Korea'})
# Create a pivot table with counts of each age group per country
heatmap_data = filtered_data.pivot_table(index='target_age_group', columns='first_country', aggfunc='size', fill_value=0)
# Convert counts to percentages
heatmap_data_percentage = heatmap_data.div(heatmap_data.sum(axis=0), axis=1) * 100
# custom colormap for Amazon
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#FFFFFF', '#B4B4B4', '#00A8E1'])
# Plot the heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(heatmap_data_percentage, cmap=cmap, annot=True, fmt='.1f', linewidths=2.5, square=True,
cbar=False, vmax=60, vmin=5, annot_kws={"fontsize": 12, "fontfamily": 'serif', "fontweight": 'bold', "color": 'black'})
# Annotations
for t in plt.gca().texts:
    t.set_text(t.get_text() + "%")
    t.set_fontsize(10)
# Title and labels
plt.title("Target Age Group Distribution by Country (%)", fontsize=16, fontweight='bold', fontfamily='serif', pad=20, color='#232F3E')
plt.xlabel("", fontsize=10, labelpad=10, fontfamily='serif', color='#232F3E')
plt.ylabel("", fontsize=10, labelpad=10, fontfamily='serif', color='#232F3E')
plt.xticks(fontsize=10, fontfamily='serif')
plt.yticks(fontsize=10, fontfamily='serif')
# Add labels
plt.text(10.5, 0.2, "Insight", fontweight="bold", fontfamily='serif', fontsize=16)
plt.text(10.5, 0.9, '''Yes, we can see that Most of the countries
target Teens and Adults for their viewer engagement.
Italy targets 70% of younger viewers.
''', fontfamily='serif', fontsize=8.5)
# Show plot
plt.show()
Yes, we can see that most of the countries target Teens and Adults for viewer engagement.
In Italy's case, about 70% of the content targets younger viewers.
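The column-wise percentages used in the heatmap can also be produced directly with `pd.crosstab`. A small sketch on made-up data, reusing the `first_country` and `target_age_group` column names from above:

```python
import pandas as pd

# Made-up titles with a country and a target age group
toy = pd.DataFrame({
    "first_country": ["India", "India", "India", "India", "Italy", "Italy"],
    "target_age_group": ["Teens", "Teens", "Adults", "Kids", "Kids", "Adults"],
})

# normalize="columns" makes each country's column sum to 1, then scale to percent
pct = pd.crosstab(toy["target_age_group"], toy["first_country"], normalize="columns") * 100
print(pct.round(1))
```

Because each column sums to 100, the values compare age-group mix within a country, not catalog size between countries.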
Could the number of seasons of a TV Show have any relation to its rating? TV Shows usually add seasons when they are a success or have good engagement.
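# Extract the season count and convert it to a number so it can be used in .corr()
amazon_csv['num_seasons'] = pd.to_numeric(amazon_csv['duration'].str.extract(r'(\d+) Season[s]?', expand=False), errors='coerce')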
This pulls the number out of duration values that contain 'Season', which also confirms that the title is a TV Show.
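To see what the pattern does, here is a small sketch on a made-up `duration` Series. Movie durations and missing values do not match the pattern and come back as NaN:

```python
import pandas as pd

# Made-up duration values: one movie, two TV shows, one missing
duration = pd.Series(["113 min", "1 Season", "4 Seasons", None], name="duration")

# expand=False returns a Series; to_numeric turns the captured text into numbers
num_seasons = pd.to_numeric(
    duration.str.extract(r"(\d+) Season", expand=False), errors="coerce"
)
print(num_seasons.tolist())  # [nan, 1.0, 4.0, nan]
```

Rows with NaN are then ignored automatically when the correlation is computed.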
# Correlation between number of seasons and rating
correlation = amazon_csv['num_seasons'].corr(amazon_csv['rating'])
print(f"Correlation Coefficient: {correlation}")
This computes the correlation coefficient between the number of seasons of a TV Show and its rating. The coefficient measures the strength of the linear relationship; on its own it is not a test of statistical significance.
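One way to get a rough sense of significance without extra libraries is a permutation test: shuffle one variable many times and see how often a correlation at least as strong appears by chance. The sketch below is an illustration only; `permutation_pvalue` is a hypothetical helper, and the `seasons` and `rating` lists are made up:

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # drop pairs with a missing value
    x, y = x[keep], y[keep]
    observed = np.corrcoef(x, y)[0, 1]
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link between x and y
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return observed, (hits + 1) / (n_perm + 1)

# Made-up data standing in for num_seasons and rating
seasons = [1, 1, 2, 2, 3, 4, 5, 6, 7, 8]
rating = [5.9, 6.1, 6.0, 6.5, 6.4, 7.0, 7.2, 7.1, 7.8, 8.0]
r, p = permutation_pvalue(seasons, rating)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A small p-value says the correlation is unlikely to be a chance pattern; it still says nothing about which variable drives the other.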
# Make it numeric to extract the number of seasons
amazon_csv['num_seasons'] = pd.to_numeric(amazon_csv['num_seasons'], errors='coerce')
# Create the regplot / Scatter plot
plt.figure(figsize=(12, 6))
sns.regplot(x=amazon_csv['num_seasons'], y=amazon_csv['rating'], scatter_kws={"s": 3, "color": "#232F3E"}, line_kws={"color": "#00A8E1", "linewidth": 2})
# Set title and labels
plt.title("Number of Seasons vs Rating", fontsize=16, fontweight='bold', family='serif')
plt.xlabel("Number of Seasons", fontsize=12, fontweight='normal', family='serif')
plt.ylabel("Rating", fontsize=12, fontweight='normal', family='serif')
# Grid customization
plt.grid(True, axis='both', linestyle='-', alpha=0.4)
# Make the background color white for a cleaner look
plt.gca().set_facecolor('white')
plt.text(22, 15.5, 'Insight', fontsize=16, fontweight='bold', fontfamily='serif')
plt.text(22, 13,
'''There is a strong correlation of 0.67 to show
that rating might have tempted the Directors to
increase the number of seasons.
'''
, fontsize=8.5, fontweight='light', fontfamily='serif')
# Customize the ticks
plt.xticks(fontsize=8, family='serif')
plt.yticks(fontsize=8, family='serif')
# Remove the borders (spines)
for spine in plt.gca().spines.values():
spine.set_visible(False)
# Display the plot
plt.tight_layout()
plt.show()
There is a strong correlation of 0.67, which suggests that good ratings might have tempted the directors to increase the number of seasons. But correlation doesn't imply causation, so this analysis alone cannot confirm that.
Exploratory Data Analysis (EDA) of Amazon Prime Titles
This analysis explores Amazon Prime’s content library using data from 9,668 titles, including movies and TV shows. The dataset contains attributes like title, type, release year, rating, and genre. The objective is to clean the data, handle missing values, and uncover trends in content distribution.
Data Cleaning and Preprocessing
Missing values were found in multiple columns:
- "date_added" (98.4%) – Dropped due to excessive missing data.
- "country" (10.3%) – Filled with mode value.
- "director" (21.54%) and "cast" (12.75%) – Replaced with "Not available."
- "rating" (2.93%) – Rows with missing values dropped.
Checks ensured "type" only had "Movie" and "TV Show" values, and all ratings were valid.
Content Distribution and Trends
- Movies vs TV Shows: The dataset contains 7,814 movies (80%) and 1,854 TV shows (20%).
- Release Year Trends: Content production increased after 2000, with a sharp rise post-2015.
- Newer Content Preference: 50% of content is from 2016 or later.
Genre and Audience Analysis
- Popular Categories: "Drama," "Comedy," and "Suspense" dominate the dataset.
- Ratings Breakdown: PG, R, and TV-MA are common ratings, showing a mix of family-friendly and mature content.
- Country-wise Production:
  - USA leads content production.
  - India, UK, South Korea, and France follow.
  - Hollywood, Bollywood, and K-Dramas contribute significantly.
Duration and Rating Analysis
- Movies: Average duration is 90 minutes.
- TV Shows: Ratings correlate strongly (0.67) with the number of seasons, suggesting successful shows get extended seasons.
- Highest Rated Content: Median rating is 6.3, with a maximum of 9.8.
Key Findings and Business Insights
- Content Type: The catalog has far more Movies than TV Shows.
- Content Age: The number of titles rises significantly for release years after 2010, so most of the content is newer releases.
- Regional Availability: The USA dominates, but India and South Korea also have a strong presence, possibly due to Hollywood, Bollywood, and K-Dramas.
- Directors: Mark Knight (104 titles) leads, followed by Cannis Holder (60) and Moonbug E (36). Notably, Indian and Korean directors hold no monopoly even though much of the content originates there.
- Statistical Insights: The correlation between rating and number of TV show seasons suggests that audience engagement may influence newer releases.
Conclusion
Amazon Prime’s content is heavily movie-focused. The platform’s diverse catalog mostly targets adult viewers but mixes content for both family and mature audiences. These insights can help optimize content curation and engagement strategies.