Most devs assume running AI models requires Python, GPUs, or cloud APIs. But modern browsers can run full neural network inference using ONNX Runtime Web with WebAssembly: no backend, no cloud, no server.

In this tutorial, we’ll build a fully client-side AI inference engine that runs a real ONNX model (like sentiment analysis or image classification) entirely in the browser using WebAssembly — perfect for privacy-focused tools, offline workflows, or local-first apps.


Step 1: Choose a Small ONNX Model

To keep things performant, pick a lightweight ONNX model. Good candidates include distilled transformers such as DistilBERT or TinyBERT for text, or compact vision models such as MobileNet or SqueezeNet for image classification.

Let's use a text model for simplicity: a distilled BERT variant.

Download the ONNX model:

wget https://huggingface.co/onnx/tinybert-distilbert-base-uncased/resolve/main/model.onnx

Store this file in your public assets directory (e.g., public/models/model.onnx).
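Assuming a bundler that serves a public/ directory as static assets (Vite, Create React App, and similar; the src/inference.js file name is just illustrative), the layout might look like this:

public/
  models/
    model.onnx
src/
  inference.js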


Step 2: Set Up ONNX Runtime Web

Install the ONNX Runtime Web package:

npm install onnxruntime-web

Then, initialize the inference session in your frontend code:

import * as ort from "onnxruntime-web";

let session;
async function initModel() {
  session = await ort.InferenceSession.create("/models/model.onnx", {
    executionProviders: ["wasm"],
  });
}

This loads the ONNX model into a WASM-based runtime, running entirely in-browser.
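One practical note: onnxruntime-web also fetches its own .wasm binaries at runtime. If your bundler doesn't resolve them automatically, you can copy them from node_modules/onnxruntime-web/dist into your static assets and point the runtime at them. A minimal sketch, assuming you serve them from /wasm/:

// Optional: tell the runtime where to find its .wasm binaries.
// Must be set before the first InferenceSession is created.
// The "/wasm/" prefix is an assumed location in your static assets.
ort.env.wasm.wasmPaths = "/wasm/";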


Step 3: Tokenize Input Text (No HuggingFace Needed)

ONNX models expect pre-tokenized inputs. Instead of using HuggingFace or Python tokenizers, we’ll use a compact JavaScript tokenizer like bert-tokenizer:

npm install bert-tokenizer

Then tokenize user input:

import BertTokenizer from "bert-tokenizer";

const tokenizer = new BertTokenizer();
const { input_ids, attention_mask } = tokenizer.encode("this is great!");
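Tokenizer APIs vary between packages and builds. If yours only returns token IDs, you can derive the attention mask yourself, since it is simply 1 for every real token (and 0 for padding). A minimal sketch, with a hypothetical tokenize method name:

// Fallback if your tokenizer only returns token IDs
// (exact method name varies by package; `tokenize` is assumed here):
const input_ids = tokenizer.tokenize("this is great!");
// With no padding, every position is a real token, so the mask is all 1s:
const attention_mask = input_ids.map(() => 1);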

Prepare inputs for ONNX:

// BERT-style models expect int64 tensors, so convert the plain numbers to BigInt.
const input = {
  input_ids: new ort.Tensor("int64", BigInt64Array.from(input_ids.map(BigInt)), [1, input_ids.length]),
  attention_mask: new ort.Tensor("int64", BigInt64Array.from(attention_mask.map(BigInt)), [1, attention_mask.length])
};

Step 4: Run Inference in the Browser

Now run the model, right in the user's browser:

const results = await session.run(input);
// "logits" must match the model's output name (see session.outputNames if unsure).
const logits = results.logits.data;

Interpret the logits for your task (e.g., choose the argmax index for classification).
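For a binary sentiment model (assumed here), that can be a softmax followed by an argmax. The label order below is an assumption; check it against your model's config:

// Convert raw logits to probabilities and pick the most likely class.
const scores = Array.from(logits);
const maxScore = Math.max(...scores);
const exps = scores.map((x) => Math.exp(x - maxScore));
const sumExp = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((x) => x / sumExp);

const labels = ["NEGATIVE", "POSITIVE"]; // assumed label order
const prediction = labels[probs.indexOf(Math.max(...probs))];
console.log(prediction, probs);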

You’ve just run a transformer-based AI model with zero server calls.


Step 5: Add WebAssembly Optimizations (Optional)

ONNX Runtime Web also supports WebAssembly SIMD and multithreading when the browser allows them; set these flags before creating the inference session:

ort.env.wasm.numThreads = 2;
ort.env.wasm.simd = true;

Enabling these can significantly improve inference speed. Note that multithreading relies on SharedArrayBuffer, which browsers only expose on cross-origin isolated pages (served with COOP and COEP headers).
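A defensive sketch that only requests extra threads when the page is actually cross-origin isolated (crossOriginIsolated and navigator.hardwareConcurrency are standard browser globals):

// Only request multiple threads when SharedArrayBuffer is available;
// otherwise stay single-threaded.
ort.env.wasm.numThreads = self.crossOriginIsolated
  ? Math.min(4, navigator.hardwareConcurrency || 1)
  : 1;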


✅ Pros:

  • 🧠 Full AI model execution directly in the browser
  • 🔐 No cloud, no server, fully private
  • 📴 Works offline — ideal for PWAs or local-first apps
  • 🚀 Uses ONNX: works with any exported PyTorch/TensorFlow model

⚠️ Cons:

  • 🐢 Limited to lightweight models (mobile-scale)
  • 👀 Manual preprocessing and tokenization required
  • 📦 Bundle size can grow due to model + tokenizer
  • ❌ Not every browser exposes the advanced WASM features (e.g., some mobile browsers limit SIMD or multithreading)

Summary

Running AI inference in the browser used to sound like science fiction — now it’s just WebAssembly + ONNX. With this setup, you can deliver powerful, privacy-preserving AI capabilities entirely client-side: from offline transcription to secure chat assistants to smart document processors. The performance is real, and the applications are endless — especially in health, security, and creative tools.

Give users smart features without compromising speed or privacy — no server required.

