Most devs assume running AI models requires Python, GPUs, or cloud APIs. But modern browsers can run full neural network inference using ONNX Runtime Web with WebAssembly: no backend, no cloud, no server.

In this tutorial, we’ll build a fully client-side AI inference engine that runs a real ONNX model (like sentiment analysis or image classification) entirely in the browser using WebAssembly — perfect for privacy-focused tools, offline workflows, or local-first apps.


Step 1: Choose a Small ONNX Model

To keep things performant, pick a lightweight ONNX model. Good candidates include distilled transformers such as DistilBERT or TinyBERT for text, or compact vision models such as MobileNet or SqueezeNet for image classification.

Let's use a text model for simplicity: a distilled BERT variant.

Download the ONNX model:

wget https://huggingface.co/onnx/tinybert-distilbert-base-uncased/resolve/main/model.onnx

Store this file in your public assets directory (e.g., public/models/model.onnx).
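Assuming a bundler that serves a public/ directory as static assets (Vite, Create React App, and similar; the src/inference.js file name is just illustrative), the layout might look like this:

public/
  models/
    model.onnx
src/
  inference.js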


Step 2: Set Up ONNX Runtime Web

Install the ONNX Runtime Web package:

npm install onnxruntime-web

Then, initialize the inference session in your frontend code:

import * as ort from "onnxruntime-web";

let session;
async function initModel() {
  session = await ort.InferenceSession.create("/models/model.onnx", {
    executionProviders: ["wasm"],
  });
}

This loads the ONNX model into a WASM-based runtime, running entirely in-browser.
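One practical note: onnxruntime-web also fetches its own .wasm binaries at runtime. If your bundler doesn't resolve them automatically, you can copy them from node_modules/onnxruntime-web/dist into your static assets and point the runtime at them. A minimal sketch, assuming you serve them from /wasm/:

// Optional: tell the runtime where to find its .wasm binaries.
// Must be set before the first InferenceSession is created.
// The "/wasm/" prefix is an assumed location in your static assets.
ort.env.wasm.wasmPaths = "/wasm/";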


Step 3: Tokenize Input Text (No HuggingFace Needed)

ONNX models expect pre-tokenized inputs. Instead of using HuggingFace or Python tokenizers, we’ll use a compact JavaScript tokenizer like bert-tokenizer:

npm install bert-tokenizer

Then tokenize user input:

import BertTokenizer from "bert-tokenizer";

const tokenizer = new BertTokenizer();
const { input_ids, attention_mask } = tokenizer.encode("this is great!");
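Tokenizer APIs vary between packages and builds. If yours only returns token IDs, you can derive the attention mask yourself, since it is simply 1 for every real token (and 0 for padding). A minimal sketch, with a hypothetical tokenize method name:

// Fallback if your tokenizer only returns token IDs
// (exact method name varies by package; `tokenize` is assumed here):
const input_ids = tokenizer.tokenize("this is great!");
// With no padding, every position is a real token, so the mask is all 1s:
const attention_mask = input_ids.map(() => 1);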

Prepare inputs for ONNX:

// BERT-style models expect int64 tensors, so convert the plain numbers to BigInt.
const input = {
  input_ids: new ort.Tensor("int64", BigInt64Array.from(input_ids.map(BigInt)), [1, input_ids.length]),
  attention_mask: new ort.Tensor("int64", BigInt64Array.from(attention_mask.map(BigInt)), [1, attention_mask.length])
};

Step 4: Run Inference in the Browser

Now run the model, right in the user's browser:

const results = await session.run(input);
// "logits" must match the model's output name (see session.outputNames if unsure).
const logits = results.logits.data;

Interpret the logits for your task (e.g., choose the argmax index for classification).
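For a binary sentiment model (assumed here), that can be a softmax followed by an argmax. The label order below is an assumption; check it against your model's config:

// Convert raw logits to probabilities and pick the most likely class.
const scores = Array.from(logits);
const maxScore = Math.max(...scores);
const exps = scores.map((x) => Math.exp(x - maxScore));
const sumExp = exps.reduce((a, b) => a + b, 0);
const probs = exps.map((x) => x / sumExp);

const labels = ["NEGATIVE", "POSITIVE"]; // assumed label order
const prediction = labels[probs.indexOf(Math.max(...probs))];
console.log(prediction, probs);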

You’ve just run a transformer-based AI model with zero server calls.


Step 5: Add WebAssembly Optimizations (Optional)

ONNX Runtime Web also supports WebAssembly SIMD and multithreading when the browser allows them; set these flags before creating the inference session:

ort.env.wasm.numThreads = 2;
ort.env.wasm.simd = true;

Enabling these can significantly improve inference speed. Note that multithreading relies on SharedArrayBuffer, which browsers only expose on cross-origin isolated pages (served with COOP and COEP headers).
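A defensive sketch that only requests extra threads when the page is actually cross-origin isolated (crossOriginIsolated and navigator.hardwareConcurrency are standard browser globals):

// Only request multiple threads when SharedArrayBuffer is available;
// otherwise stay single-threaded.
ort.env.wasm.numThreads = self.crossOriginIsolated
  ? Math.min(4, navigator.hardwareConcurrency || 1)
  : 1;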


✅ Pros:

  • 🧠 Full AI model execution directly in the browser
  • 🔐 No cloud, no server, fully private
  • 📴 Works offline — ideal for PWAs or local-first apps
  • 🚀 Uses ONNX: works with any exported PyTorch/TensorFlow model

⚠️ Cons:

  • 🐢 Limited to lightweight models (mobile-scale)
  • 👀 Manual preprocessing and tokenization required
  • 📦 Bundle size can grow due to model + tokenizer
  • ❌ Not every browser exposes the advanced WASM features (e.g., some mobile browsers limit SIMD or multithreading)

Summary

Running AI inference in the browser used to sound like science fiction — now it’s just WebAssembly + ONNX. With this setup, you can deliver powerful, privacy-preserving AI capabilities entirely client-side: from offline transcription to secure chat assistants to smart document processors. The performance is real, and the applications are endless — especially in health, security, and creative tools.

Give users smart features without compromising speed or privacy — no server required.

