PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment.

Jupyter Notebook is used to create interactive notebook documents that can contain live code, equations, visualizations, media and other computational outputs. Jupyter Notebook is often used by programmers, data scientists and students to document and demonstrate coding workflows or simply experiment with code.

Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. It was originally designed by Google.


The Kubernetes control-plane and worker node addresses are:

192.168.56.115
192.168.56.116
192.168.56.117

Kubernetes cluster nodes:
You can install Helm via the Helm installation link:


Install Spark on Kubernetes via the Bitnami Helm chart

The steps:

You can install the Spark Helm chart via the Bitnami helm chart link:

Important: the Spark version deployed by the Helm chart must be the same as the PySpark version installed in Jupyter.

Install Spark via the Bitnami Helm chart:


$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm search repo bitnami
$ helm install kayvan-release bitnami/spark --version 8.7.2
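
The chart version pins a specific Spark release, and the Important note above applies: once the notebook from the next step is running, it is worth confirming that its PySpark version matches the deployed Spark version. A minimal check in a notebook cell:

import pyspark

# The PySpark version bundled with the notebook image; it should match
# the Spark version deployed by the Bitnami chart.
print(pyspark.__version__)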

Deploy the Jupyter workloads:

jupyter.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupiter-spark
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark
  template:
    metadata:
      labels:
        app: spark
    spec:
      containers:
        - name: jupiter-spark-container
          image: docker.arvancloud.ir/jupyter/all-spark-notebook
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8888
          env:
            - name: JUPYTER_ENABLE_LAB
              value: "yes"
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-svc
  namespace: default
spec:
  type: NodePort
  selector:
    app: spark
  ports:
    - port: 8888
      targetPort: 8888
      nodePort: 30001
---
apiVersion: v1
kind: Service
metadata:
  name: jupiter-spark-driver-headless
spec:
  clusterIP: None
  selector:
    app: spark

Apply the manifest:

kubectl apply -f jupyter.yaml

The installed pods:

and the Services (a headless one for the stateful workloads):

Note: the Spark master URL is:

spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077

Open a Jupyter notebook, write some Python code based on PySpark, and press Shift + Enter in each cell to execute it:

#import os

#os.environ['PYSPARK_SUBMIT_ARGS']='pyspark-shell'
#os.environ['PYSPARK_PYTHON']='/opt/bitnami/python/bin/python'
#os.environ['PYSPARK_DRIVER_PYTHON']='/opt/bitnami/python/bin/python'

import socket
from pyspark.sql import SparkSession

# Connect to the Spark master deployed by the Helm chart and advertise
# the Jupyter pod's IP address so the executors can reach the driver.
spark = SparkSession.builder.master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077")\
            .appName("Mahla").config('spark.driver.host', socket.gethostbyname(socket.gethostname()))\
            .getOrCreate()

Note:

socket.gethostbyname(socket.gethostname()) returns the Jupyter pod's IP address, which is passed as spark.driver.host so the executors can connect back to the driver.
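
Alternatively, since jupyter.yaml also creates the jupiter-spark-driver-headless service selecting the same pod, the driver can be advertised by that service's DNS name instead of the raw pod IP. A minimal sketch, assuming the default namespace and the service name from the manifest above:

from pyspark.sql import SparkSession

# Assumption: the jupiter-spark-driver-headless service from jupyter.yaml resolves
# to the Jupyter pod, so its DNS name can stand in for the pod IP as the driver host.
driver_host = "jupiter-spark-driver-headless.default.svc.cluster.local"

spark = SparkSession.builder\
            .master("spark://kayvan-release-spark-master-0.kayvan-release-spark-headless.default.svc.cluster.local:7077")\
            .appName("Mahla")\
            .config("spark.driver.host", driver_host)\
            .getOrCreate()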

Enjoy sending Python code to the Spark cluster on Kubernetes via Jupyter.

Note: of course, you can also work with single-node PySpark installed in Jupyter without Kubernetes, and once you are sure the code is correct, submit it to the Spark cluster on Kubernetes via spark-submit or with a session like the one above.
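
For that local-first workflow, a minimal sketch (the DataFrame here is just example data) is to develop against the built-in local master and switch the master URL only once the code works:

from pyspark.sql import SparkSession

# local[*] runs Spark inside the Jupyter pod itself, using all available cores.
spark = SparkSession.builder.master("local[*]").appName("local-test").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
# Once the logic is correct, replace local[*] with the cluster master URL above.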


Deploy on Docker Desktop:

docker-compose.yml:

version: '3.6'

services:

  spark-master:
    container_name: spark
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root   
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    ports:
      - 127.0.0.1:8081:8080
      - 127.0.0.1:7077:7077
    networks:
      - spark-network

  spark-worker:
    image: docker.arvancloud.ir/bitnami/spark:3.5.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=2
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=root
      - PYSPARK_PYTHON=/opt/bitnami/python/bin/python3
    networks:
      - spark-network


  jupyter:
    image:  docker.arvancloud.ir/jupyter/all-spark-notebook:latest
    container_name: jupyter
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_ENABLE_LAB=yes
    networks:
      - spark-network
    depends_on:
      - spark-master


networks:
  spark-network:

Run in cmd:

docker-compose up --scale spark-worker=2

Copy the CSV file into the Spark worker containers:

docker cp file.csv spark-worker-1:/opt/file
docker cp file.csv spark-worker-2:/opt/file

Open a Jupyter notebook, write some Python code based on PySpark, and press Shift + Enter in each cell to execute it:

from pyspark.sql import SparkSession

# Create a Spark session; the master is reachable by its container name "spark"
# on the shared spark-network (port 7077), matching SPARK_MASTER_URL above.
spark = SparkSession.builder.appName("YourAppName")\
            .master("spark://spark:7077").getOrCreate()
data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()
spark.stop()
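
Note that header=True only takes the column names from the first row; every column is still read as a string. A minimal sketch of two common ways to get typed columns (the "name" and "age" columns are hypothetical, and an active SparkSession named spark is assumed, created as above before spark.stop()):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option 1: let Spark sample the file and infer column types.
data = spark.read.csv("/opt/file/file.csv", header=True, inferSchema=True)

# Option 2: declare the schema explicitly (hypothetical columns).
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
data = spark.read.csv("/opt/file/file.csv", header=True, schema=schema)

data.printSchema()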

Note again: you can work with single-node PySpark installed in Jupyter without the Spark cluster, and once you are sure the code is correct, submit it via spark-submit or with code like the above to the Spark cluster on Docker Desktop.


Execute on the PySpark installed in Jupyter, not on the Spark cluster

Copy the CSV file into the Jupyter container:

docker cp file.csv jupyter:/opt/file

Then, in a notebook cell:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("YourAppName").getOrCreate()

data = spark.read.csv("/opt/file/file.csv", header=True)
data.limit(3).show()

spark.stop()

You can also practice with single-node PySpark in Jupyter:


Congratulations 🍹