Databases are tricky. They hold ever more state as we move towards stateless architectures. Kubernetes setups often start with the stated intention of providing governance over all infrastructure, until they encounter databases. Databases also turn out to be the largest cost item in many workloads, because of compute requirements, DR, backups and skills.
Database costs are fixed, hard to reduce, and form almost a third of cloud costs for many customers. As the cofounder of a startup or the VP of Cloud at a bank, you start to think of this cost item as the cost of doing business, the way it has always been. Until it isn't. A database, like any other application, is highly configurable, and here we look at the major levers that make a $138-a-month database perform like a $3,000-a-month one. A notional Porsche 917K at the price of a Renault Clio.
We clearly cannot start from managed options here, because that is just a more expensive starting point. Many of the activities and learnings are deployable on managed databases, but we want to start at a garage-sale price for our shiny high-octane database. We start with an ordinary node, in this case an E2E Cloud 4-core, 16 GB c3.16GB-288 instance, which costs $66 a month.
We will truly snowflake the database by naming the server david. We will fold these learnings into a K8S setup later, de-snowflaking it, but for now we want it to be one of a kind.
$ OLD_HOSTNAME=$(hostname)
$ sudo hostnamectl set-hostname david
$ sudo sed -i "s/$OLD_HOSTNAME/david/g" /etc/hosts
Now let's set up SysBench as our reference benchmark to test david like a dojo testing a pupil until it reaches mental and physical toughness.
First we will install MySQL and SysBench and run the control test cycle. The control gives us the baseline performance, against which we will measure how far we can push the machine towards the frontier of a $3,000 DB server.
david$ sudo apt install sysbench mysql-server
I will leave setting credentials on MySQL to the reader. We run the benchmark like this:
david$ sysbench oltp_read_write \
--tables=10 \
--table-size=1000000 \
--threads=4 \
--time=60 \
--db-driver=mysql \
--mysql-db=sbtest \
--mysql-user=sbtest \
--mysql-password=password \
run
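Before the first run, sysbench needs the test schema and data in place; assuming the same database and credentials as above, the prepare step is:

```shell
david$ sysbench oltp_read_write \
  --tables=10 \
  --table-size=1000000 \
  --db-driver=mysql \
  --mysql-db=sbtest \
  --mysql-user=sbtest \
  --mysql-password=password \
  prepare
```

This creates the ten sbtest tables of a million rows each that the run phase exercises.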
We anchor the readings here as our baseline, with the key metrics that matter in mixed read/write performance:
- Transaction/Sec which baselines at 112.24 (higher is better)
- Queries/Sec at 2,244.84 (higher is better)
- Avg Latency (ms) 35.63 (lower is better)
- 95th %ile Latency (ms) 94.10 (lower is better)
- Min Latency (ms) 4.31 (lower is better)
- Max Latency (ms) 1,065.46 (lower is better)
We will be improving on each of these key performance numbers as we go through the retrofitting for the Porsche makeover. Before claiming improvements, we should level-set: these are quite good numbers to start with, and they meet or exceed customer requirements in the vast majority of use-cases.
Step 1: InnoDB Block Sizes and File I/O
The choice of block sizes in the database and the filesystem is the single most important change you can make to improve database performance. Ostensibly XFS is the best filesystem for database performance, but we can't use block sizes larger than 4k because that requires a Linux kernel recompilation. However, if we set the InnoDB page size to 16k, it aligns neatly with four 4k blocks, so it is still a performance win.
Add to /etc/mysql/mysql.conf.d/mysqld.cnf, under the [mysqld] section:
innodb_page_size=16384
With a single-line change we have almost doubled the performance of our $66 instance. To reach this performance without the config change, we would have had to spend three times our current budget. We have thus increased the value of our setup by $132.
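A caveat worth knowing: innodb_page_size only takes effect when the data directory is initialised, so on an existing instance the data directory must be re-initialised (or restored) for the change to stick. Assuming default paths, both sides of the alignment can be verified like so:

```shell
# Fundamental block size of the filesystem under the data directory (typically 4096)
david$ stat -fc %S /var/lib/mysql
# InnoDB page size as seen by the server (should report 16384)
david$ mysql -e "SELECT @@innodb_page_size;"
```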
Step 2: Swappiness and network parameters
Reference Only – Not Applied in Benchmarks
There are several popular swap- and network-related sysctl parameters which change kernel behaviour at runtime. The network ones are irrelevant here, since in my setup networking is loopback, which is sufficiently fast in the normal course. The swap settings are risky, harmful, or both in a virtualised environment. We include them for the sake of completeness, but we did not apply or test them in our environment.
# MySQL Swap Settings
vm.swappiness = 10
vm.vfs_cache_pressure = 50
# File System Settings
fs.file-max = 65536
fs.aio-max-nr = 1048576
# Network Settings - General
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 4000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# TCP Settings
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10000 65000
net.ipv4.tcp_slow_start_after_idle = 0
# Memory Management for Database Workloads
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 500
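For reference, if you do decide to trial any of these, the usual mechanism is a drop-in file under /etc/sysctl.d/ (the filename below is arbitrary), loaded at runtime without a reboot:

```shell
# Reload all sysctl drop-in files, e.g. /etc/sysctl.d/99-mysql-tuning.conf
david$ sudo sysctl --system
# Or apply a single candidate setting ad hoc while testing
david$ sudo sysctl vm.swappiness=10
```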
It is imperative that you test your workload with each of these changes before ever bringing them to production.
Step 3: MySQL tune-ups
MySQL has a history of being the database of the Internet; almost every site, large and small, has hosted on MySQL. With that comes a myriad of features, bells and whistles, dials and buttons. We decided to press a few of these buttons to extract MySQL performance.
Add to /etc/mysql/mysql.conf.d/mysqld.cnf, under the [mysqld] section:
# InnoDB Buffer Pool Settings
innodb_buffer_pool_size = 12G
innodb_buffer_pool_instances = 4
innodb_old_blocks_time = 1000
# Transaction and Log Settings
innodb_log_file_size = 3G
innodb_log_buffer_size = 32M
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1
# I/O and Threading
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_page_cleaners = 4
# File and Storage Settings
innodb_flush_method = O_DIRECT
innodb_file_per_table = ON
innodb_open_files = 400
table_definition_cache = 1400
table_open_cache = 2000
# Query Optimization
join_buffer_size = 2M
sort_buffer_size = 4M
read_rnd_buffer_size = 2M
read_buffer_size = 1M
tmp_table_size = 64M
max_heap_table_size = 64M
# Connection and Thread Settings
max_connections = 150
thread_cache_size = 16
max_prepared_stmt_count = 16384
These numbers have been worked out for the specific CPU, RAM and disk configuration we used. We assumed that MySQL will be the only tenant on this system and can fully utilise its resources for maximum performance.
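The headline number, innodb_buffer_pool_size = 12G, follows the common rule of thumb of roughly 70 to 75% of RAM on a dedicated database host; you can sanity-check the figure against the machine itself:

```shell
# ~75% of MemTotal on a 16 GB node lands close to 12 GiB
david$ awk '/MemTotal/ {printf "%.1f GiB\n", $2 * 0.75 / 1024 / 1024}' /proc/meminfo
```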
With these valuable config changes, we have once again doubled the performance of our system, which is still being charged at $66 per month. Notionally, the value we now extract from it is close to that of a 16-core system, which would cost us around $660, unlocking $600 of value.
Step 4: Deadlines and IO Schedulers
IO schedulers are important, but not that important in today's SSD and NVMe world. Both kinds of devices have sophisticated ways of scheduling and prioritising the IO workload. Ubuntu by default includes two schedulers, none and mq-deadline. We stick to the default. No changes. No performance gains.
Step 5: More NVMe Controllers, More IOPS, More Throughput
At this point, the simplest thing to do is add more volumes, which increases IOPS and throughput. We have two options for doing this. The first and simplest is to attach a single large disk and reap high IOPS:
256 GB -> 5000 IOPS
512 GB -> 10000 IOPS
1TB -> 20000 IOPS
The second and more interesting method is to set up RAID, a Redundant Array of Inexpensive Disks, or in this case a Redundant Array of Inexpensive NVMe (RAIN). We have to increase our budget to do this: we take four 256 GB volumes at E2E Cloud, priced at $8 per volume per month. Our revised total cost of ownership for this new setup is $138 per month. Once attached, these volumes show up as available disks vd*.
david$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sr0 11:0 1 366K 0 rom
vda 253:0 0 139.7G 0 disk
├─vda1 253:1 0 1M 0 part
└─vda2 253:2 0 139.7G 0 part /
vdb 253:16 0 232.8G 0 disk
vdc 253:32 0 232.8G 0 disk
vdd 253:48 0 232.8G 0 disk
vde 253:64 0 232.8G 0 disk
Now we have four disks available to be set up as RAID. Instead of an mdadm-based setup, we use a ZFS striped pool, which is quite easy to set up:
david$ sudo apt install zfsutils-linux
david$ sudo zpool create mysqlpool /dev/vdb /dev/vdc /dev/vdd /dev/vde
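One hedge worth considering at pool-creation time: ZFS auto-detects the sector size, and on some virtio disks it can settle on 512-byte sectors. Pinning ashift=12 (4K sectors) at creation avoids that, since ashift cannot be changed later:

```shell
david$ sudo zpool create -o ashift=12 mysqlpool /dev/vdb /dev/vdc /dev/vdd /dev/vde
```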
We can also apply some performance garnishes in ZFS for slight improvements:
# Disable access time updates
david$ sudo zfs set atime=off mysqlpool
# Set optimal record size for MySQL
david$ sudo zfs set recordsize=16K mysqlpool
# Cache only metadata to prevent double-caching
david$ sudo zfs set primarycache=metadata mysqlpool
# Enable LZ4 compression for better performance/storage ratio
david$ sudo zfs set compression=lz4 mysqlpool
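The applied properties can be confirmed in one shot:

```shell
david$ sudo zfs get atime,recordsize,primarycache,compression mysqlpool
```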
Finally, we stop the MySQL server, move the directory /var/lib/mysql aside, and mount the new ZFS pool as the new /var/lib/mysql:
david$ sudo systemctl stop mysql
david$ sudo mv /var/lib/mysql /var/lib/mysql.old
david$ sudo zfs set mountpoint=/var/lib/mysql mysqlpool
david$ sudo rsync -av /var/lib/mysql.old/ /var/lib/mysql/
david$ sudo chown -R mysql:mysql /var/lib/mysql
david$ sudo systemctl restart mysql
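A quick check confirms MySQL's data directory now lives on the pool:

```shell
david$ df -h /var/lib/mysql   # the Filesystem column should show mysqlpool
```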
After this massive exercise, looking much like an entire car being reassembled on a race track, we are back to running the benchmark again. Anchor this: we have increased the IOPS count 4x, to 20K compared to 5K in the default system.
We have once again more than doubled all key parameters, bringing performance to stratospheric levels. This is where "enterprise" databases live, where core counts float in rivers of RAM. These performance levels are not available below $3,000 from a cloud provider, with the managed version more expensive and the serverless version more expensive still.
Benchmark overview
We have improved our system on every key performance point for this benchmark, and here is the comparison in a snapshot. While this is great performance-wise, we can also forgo some performance for a higher-resilience setup by using RAIDZ instead of striping, at a marginal 18% loss of performance on the same hardware.
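For reference, the resilient variant differs only at pool-creation time; raidz survives the loss of one of the four volumes:

```shell
david$ sudo zpool create mysqlpool raidz /dev/vdb /dev/vdc /dev/vdd /dev/vde
```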
Now that we have these improvements, how do we integrate them back into a K8S setup?
Step 6: Kubernetes Setup with ZFS CSI Driver Method
This approach uses a Container Storage Interface (CSI) driver to dynamically provision ZFS volumes.
Install ZFS on Worker Nodes
Apply the ZFS installation to each worker node that will host MySQL data:
# Run these commands on each designated worker node
sudo apt update
sudo apt install -y zfsutils-linux
# Create ZFS pool as before
sudo zpool create mysqlpool /dev/vdb /dev/vdc /dev/vdd /dev/vde
# Apply MySQL performance optimizations
sudo zfs set atime=off mysqlpool
sudo zfs set recordsize=16K mysqlpool
sudo zfs set primarycache=metadata mysqlpool
sudo zfs set compression=lz4 mysqlpool
Label the Nodes with ZFS Storage
$ kubectl label nodes <node-name> storage=zfs
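Confirm the label took effect:

```shell
$ kubectl get nodes -l storage=zfs
```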
Install ZFS CSI Driver
# Clone the ZFS CSI driver repository
git clone https://github.com/democratic-csi/democratic-csi.git
cd democratic-csi
# Install using Helm
helm repo add democratic-csi https://democratic-csi.github.io/charts/
helm repo update
# Install the ZFS CSI driver
helm install zfs-csi democratic-csi/democratic-csi \
--namespace kube-system \
--create-namespace \
-f ./examples/zfs-local-dataset.yaml
Create Storage Class for MySQL
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-mysql-sc
provisioner: org.democratic-csi.zfs
allowVolumeExpansion: true
parameters:
  fsType: zfs
  poolName: mysqlpool
  recordsize: "16k"
  primarycache: "metadata"
  compression: "lz4"
  atime: "off"
Deploy MySQL as a StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: database
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 1
  template:
    metadata:
      labels:
        app: mysql
    spec:
      nodeSelector:
        storage: zfs
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-secret
              key: password
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: zfs-mysql-sc
      resources:
        requests:
          storage: 50Gi
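The StatefulSet references a database namespace and a mysql-secret that must exist before the rollout. Assuming the manifest above is saved locally (the filename here is illustrative), the sequence is:

```shell
$ kubectl create namespace database
$ kubectl create secret generic mysql-secret \
    --namespace database --from-literal=password='<root-password>'
$ kubectl apply -f mysql-statefulset.yaml
```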
Create MySQL Service
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: database
spec:
  ports:
  - port: 3306
  selector:
    app: mysql
  clusterIP: None
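With clusterIP: None this is a headless service, so its DNS name resolves directly to the pod. A throwaway client pod is a quick way to verify connectivity:

```shell
$ kubectl run mysql-client --rm -it --image=mysql:8.0 --restart=Never \
    -- mysql -h mysql.database.svc.cluster.local -u root -p
```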
Conclusion: From Garage Sale Renault Clio to Enterprise Gold Porsche 917K
Databases may be the last bastion of state in an increasingly stateless world—but that doesn’t mean they must also be the last bastion of uncontrolled cost. What we’ve demonstrated is a principled, low-budget approach to extract maximum performance per dollar by:
- Leveraging block-level tuning and file system optimizations.
- Exploiting MySQL configuration levers to align with system limits.
- Deploying ZFS RAID setups for high IOPS, with compression and smart caching.
- Seamlessly porting all these wins into Kubernetes using a CSI-driven ZFS architecture.
At every step, we doubled performance, not by spending more, but by knowing more.
What this shows is that infrastructure as craft—when applied with deliberate design and mechanical sympathy—can give you serverless-grade performance without serverless-grade cost.
This isn’t just a benchmark hack—it’s a mindset shift.