Most devs assume running AI models requires Python, GPUs, or cloud APIs. But modern browsers are capable of running full neural network inference, using ONNX Runtime Web with WebAssembly — no backend, no cloud, no server.
In this tutorial, we’ll build a fully client-side AI inference engine that runs a real ONNX model (like sentiment analysis or image classification) entirely in the browser using WebAssembly — perfect for privacy-focused tools, offline workflows, or local-first apps.
Step 1: Choose a Small ONNX Model
To keep things performant, pick a lightweight ONNX model; distilled or quantized variants work best in the browser.
Let's use a small text model for simplicity: TinyBERT.
Download the ONNX model:
wget https://huggingface.co/onnx/tinybert-distilbert-base-uncased/resolve/main/model.onnx
Store this file in your public assets directory (e.g., public/models/model.onnx).
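Before wiring up the runtime, it can help to confirm the browser can actually fetch the file from that path. A quick optional check, assuming your dev server exposes public/ at the site root (run it in an ES module or the browser console):
// Optional sanity check: confirm the model file is reachable from the page.
const res = await fetch("/models/model.onnx", { method: "HEAD" });
console.log("model reachable:", res.ok, "size (bytes):", res.headers.get("content-length"));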
Step 2: Set Up ONNX Runtime Web
Install the ONNX Runtime Web package:
npm install onnxruntime-web
Then, initialize the inference session in your frontend code:
import * as ort from "onnxruntime-web";

let session;

async function initModel() {
  session = await ort.InferenceSession.create("/models/model.onnx", {
    executionProviders: ["wasm"],
  });
}
This loads the ONNX model into a WASM-based runtime, running entirely in-browser.
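A minimal usage sketch: call initModel() once at startup and reuse the session for every inference. The logging and error handling here are illustrative, not part of ONNX Runtime Web itself:
// Load the model once when the page starts; reuse `session` for every request.
initModel()
  .then(() => console.log("Model ready. Inputs:", session.inputNames, "Outputs:", session.outputNames))
  .catch((err) => console.error("Failed to load ONNX model:", err));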
Step 3: Tokenize Input Text (No HuggingFace Needed)
ONNX models expect pre-tokenized inputs. Instead of using HuggingFace or Python tokenizers, we’ll use a compact JavaScript tokenizer like bert-tokenizer:
npm install bert-tokenizer
Then tokenize user input:
import BertTokenizer from "bert-tokenizer";
const tokenizer = new BertTokenizer();
const { input_ids, attention_mask } = tokenizer.encode("this is great!");
Prepare inputs for ONNX:
const input = {
  // The model expects int64 tensors, so convert the plain JS numbers to BigInt values
  input_ids: new ort.Tensor("int64", BigInt64Array.from(input_ids, (v) => BigInt(v)), [1, input_ids.length]),
  attention_mask: new ort.Tensor("int64", BigInt64Array.from(attention_mask, (v) => BigInt(v)), [1, attention_mask.length]),
};
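Depending on how the model was exported, it may also expect a token_type_ids input (common for BERT-family models). A hedged sketch that adds it only when the session asks for it, assuming an all-zeros segment is appropriate for single-sentence input:
// Some BERT-family exports also require token_type_ids; check the session's input names.
if (session.inputNames.includes("token_type_ids")) {
  // new BigInt64Array(n) is zero-filled, which is what a single-segment input needs
  input.token_type_ids = new ort.Tensor("int64", new BigInt64Array(input_ids.length), [1, input_ids.length]);
}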
Step 4: Run Inference in the Browser
Now run the model, right in the user's browser:
const results = await session.run(input);
const logits = results.logits.data;
Interpret the logits for your task; for classification, take the argmax index, as in the sketch below. Note that the output name ("logits" here) depends on how the model was exported; check session.outputNames if you're unsure.
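To turn raw logits into something presentable, apply a softmax and pick the highest-scoring class. A small sketch; the label order below is an assumption and should be verified against the model card:
// Softmax over the logits, then pick the top class.
const scores = Array.from(logits, (v) => Number(v));
const maxScore = Math.max(...scores);
const exps = scores.map((v) => Math.exp(v - maxScore)); // subtract max for numerical stability
const total = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((v) => v / total);

const labels = ["negative", "positive"]; // hypothetical label order; verify against the model card
const best = probs.indexOf(Math.max(...probs));
console.log(`Prediction: ${labels[best]} (${(probs[best] * 100).toFixed(1)}%)`);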
You’ve just run a transformer-based AI model with zero server calls.
Step 5: Add WebAssembly Optimizations (Optional)
ONNX Runtime Web also supports WebAssembly SIMD and multithreading when the browser allows them. Set these flags before creating the inference session:
// Configure the WASM backend before calling ort.InferenceSession.create()
ort.env.wasm.numThreads = 2;
ort.env.wasm.simd = true;
Enabling these can significantly improve inference speed, especially for larger models.
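One caveat: WASM multithreading relies on SharedArrayBuffer, so the page must be cross-origin isolated. If you serve the static files yourself, that means sending COOP/COEP headers. Here is a hypothetical Express static-server sketch; any server or bundler dev-server setting that adds the same headers works equally well:
// serve.mjs: static file server that enables cross-origin isolation for WASM threads.
import express from "express";

const app = express();
app.use((req, res, next) => {
  res.set("Cross-Origin-Opener-Policy", "same-origin");
  res.set("Cross-Origin-Embedder-Policy", "require-corp");
  next();
});
app.use(express.static("public"));
app.listen(3000, () => console.log("Serving on http://localhost:3000"));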
✅ Pros:
- 🧠 Full AI model execution directly in the browser
- 🔐 No cloud, no server, fully private
- 📴 Works offline — ideal for PWAs or local-first apps
- 🚀 Uses ONNX: works with models exported from PyTorch or TensorFlow
⚠️ Cons:
- 🐢 Limited to lightweight models (mobile-scale)
- 👀 Manual preprocessing and tokenization required
- 📦 Bundle size can grow due to model + tokenizer
- ❌ Not supported in all browsers (e.g., some mobile browsers may limit WASM features)
Summary
Running AI inference in the browser used to sound like science fiction — now it’s just WebAssembly + ONNX. With this setup, you can deliver powerful, privacy-preserving AI capabilities entirely client-side: from offline transcription to secure chat assistants to smart document processors. The performance is real, and the applications are endless — especially in health, security, and creative tools.
Give users smart features without compromising speed or privacy — no server required.
If this was helpful, you can support me here: Buy Me a Coffee ☕