Running a 3.7GB language model in a browser tab
Gemma 4 is 3.75GB. We run it in a browser tab. Here is how the stack works.
Open a browser tab. Load a page. Wait a minute or two. Now you have a 3.75GB language model running locally, generating text with no server, no API key, and no data leaving your device.
That sentence should feel slightly absurd. A year ago it would have been. But the stack that makes it possible is real, shipping, and surprisingly practical. Here is how every piece fits together.
The stack
Four layers make in-browser LLM inference work.
Transformers.js is the application layer. It is a JavaScript port of Hugging Face's Transformers library. It handles tokenisation, model loading, the generate loop, and the pipeline API. You point it at a model, give it a prompt, and it returns tokens. Under the hood, it delegates all the heavy linear algebra to the layer below.
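In sketch form, the application layer boils down to one call. The helper below is ours, the model id is illustrative, and `pipeline` (the factory from `@huggingface/transformers`) is passed in as a parameter so the sketch can run without the real library:

```javascript
// Minimal sketch of the application layer. `pipeline` is the factory from
// '@huggingface/transformers'; it is a parameter here so the helper can be
// exercised with a stub. The model id and option values are illustrative.
async function createGenerator(pipeline, modelId, onProgress) {
  return pipeline('text-generation', modelId, {
    device: 'webgpu',              // or 'wasm' on machines without WebGPU
    dtype: 'q4',                   // 4-bit quantised weights
    progress_callback: onProgress, // drives the download progress bar
  });
}
```

The returned generator is itself a function: something like `await generator(prompt, { max_new_tokens: 128 })` produces text.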
ONNX Runtime Web is the inference engine. ONNX (Open Neural Network Exchange) is a standard format for representing neural network graphs. ONNX Runtime executes those graphs. The "Web" build compiles ONNX Runtime's C++ core to WebAssembly and also provides a WebGPU execution provider. It handles operator dispatch, memory allocation, and the actual matrix multiplications that make a language model produce text.
WebGPU is the hardware acceleration layer. It gives JavaScript access to the GPU through a modern, low-level API. When ONNX Runtime uses the WebGPU backend, tensor operations run as compute shaders on the GPU instead of as WASM instructions on the CPU. This is where the real speed comes from.
Quantisation makes the file sizes manageable. In fp32, every weight takes 32 bits, so the full-precision Gemma 4 weights would be roughly eight times larger than what we ship. We use q4 quantisation, which stores each weight in 4 bits instead of 32 — an eightfold reduction in raw weight storage, minus some overhead for quantisation scales and the few layers kept at higher precision. That brings the download to around 3.75GB and the memory footprint to something a browser tab can handle. Quality drops slightly, but for most tasks the difference is hard to notice.
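The size arithmetic is worth making concrete. A sketch, with an illustrative parameter count (real q4 files carry extra overhead for scales and mixed-precision layers):

```javascript
// Rough model size in GB for a given parameter count and bits per weight.
// Real quantised files are somewhat larger: group scales, and embeddings or
// normalisation layers kept at higher precision, add overhead.
function approxSizeGB(numParams, bitsPerWeight) {
  const bytes = numParams * (bitsPerWeight / 8);
  return bytes / (1024 ** 3);
}

// A hypothetical 7.5B-parameter model:
// fp32 → ~28GB, fp16 → ~14GB, q4 → ~3.5GB
```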
Loading a multi-gigabyte model
Downloading 3.75GB into a browser tab is not instant. The loading process has to be designed around that reality.
CDN delivery from Hugging Face. Model files are hosted on Hugging Face's CDN. Transformers.js fetches them as standard HTTP requests. The model is split into multiple shards (typically 500MB-1GB each) so the browser does not need to hold the entire download in memory at once.
Web Worker isolation. All model loading and inference runs in a Web Worker. This keeps the main thread free for UI updates, progress bars, and user interaction. Without a Worker, the page would freeze for the entire download and model compilation phase.
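A sketch of the worker side. The message names are ours, `loadModel` and `generate` stand in for the Transformers.js calls, and `post` is `self.postMessage` injected as a parameter:

```javascript
// Worker-side message protocol (message names are illustrative).
function createWorkerHandler(loadModel, generate, post) {
  let model = null;
  return async function onMessage({ data }) {
    switch (data.type) {
      case 'load':
        // Report download progress so the main thread can draw a bar.
        model = await loadModel((pct) => post({ type: 'progress', pct }));
        post({ type: 'ready' });
        break;
      case 'generate':
        // Stream each token back to the main thread as it is produced.
        await generate(model, data.prompt, (token) => post({ type: 'token', token }));
        post({ type: 'done' });
        break;
    }
  };
}

// In the worker:
// self.onmessage = createWorkerHandler(loadModel, generate, self.postMessage.bind(self));
```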
Cache Storage API for persistence. After the first download, model files are stored in the browser's Cache Storage. This is the same API that service workers use. It persists across sessions, survives tab closes, and does not have the size limits of localStorage. On a return visit, the model loads from cache in seconds instead of minutes.
The loading flow looks like this:
1. User clicks "load model"
2. Worker checks Cache Storage for existing model files
3. If cached, load from cache (fast). If not, fetch from CDN (slow, with progress)
4. ONNX Runtime compiles the model graph and allocates memory
5. Worker signals ready. Inference can begin.
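Steps 2 and 3 amount to a cache-first fetch. A minimal sketch, with the browser's Cache object and `fetch` passed in as parameters (the helper name is ours):

```javascript
// Cache-first fetch for one model shard. In the browser, `cache` comes from
// caches.open(...) and `fetchFn` is the global fetch.
async function fetchShard(cache, fetchFn, url) {
  const hit = await cache.match(url);
  if (hit) return hit;                     // warm path: load from Cache Storage
  const response = await fetchFn(url);     // cold path: download from the CDN
  await cache.put(url, response.clone());  // persist for the next visit
  return response;
}
```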
First load on a decent connection takes one to three minutes. Subsequent loads take five to fifteen seconds.
WebGPU vs WASM: why the backend matters
ONNX Runtime Web supports two execution backends. The difference between them is enormous.
WASM backend. All tensor operations run as WebAssembly instructions on the CPU. This works everywhere that supports WASM (which is every modern browser). It is reliable and portable. It is also slow. Generating text with a 3.75GB model on the WASM backend produces roughly 2-5 tokens per second on a fast CPU. That is usable but not pleasant.
WebGPU backend. Tensor operations run as compute shaders on the GPU. A mid-range discrete GPU produces 15-30 tokens per second. A high-end GPU pushes 40+. That is 5-10x faster than WASM, and the difference is immediately obvious. Text streams out fluidly instead of appearing word by word.
The catch: WebGPU is not available everywhere yet. As of early 2026, it is enabled by default in Chrome and Edge on desktop. Firefox has it behind a flag. Safari has partial support. Mobile support is limited. When WebGPU is unavailable, we fall back to the WASM backend automatically.
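The fallback itself is just provider ordering. ONNX Runtime Web accepts an ordered `executionProviders` list and tries entries in turn, so a sketch of the selection looks like:

```javascript
// Pick execution providers in preference order. Listing 'wasm' last gives
// an automatic fallback when the WebGPU provider cannot initialise.
function pickExecutionProviders(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// In the browser, hasWebGPU comes from something like:
//   !!navigator.gpu && !!(await navigator.gpu.requestAdapter())
```

The resulting list then goes into something like `InferenceSession.create(modelBuffer, { executionProviders })`.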
The fp16 question
Modern GPUs handle 16-bit floating point (fp16) operations much faster than 32-bit (fp32). Using fp16 for inference roughly doubles throughput on supported hardware.
But not every GPU supports fp16 compute shaders. We detect this at runtime by checking for the shader-f16 feature in the WebGPU adapter.
```javascript
const adapter = await navigator.gpu.requestAdapter();
const supportsF16 = adapter.features.has('shader-f16');
const device = await adapter.requestDevice({
  requiredFeatures: supportsF16 ? ['shader-f16'] : [],
});
```

If the GPU supports it, we request a device with the feature enabled and ONNX Runtime uses fp16 kernels. If not, we fall back to fp32 shaders. The user never sees this decision. They just get the best speed their hardware can deliver.
The Janus workaround
Not every model runs cleanly on a single backend. We learned this the hard way with Janus, a multimodal model that handles both text and image generation.
Janus has a vision tower (the part that processes images) and a language model (the part that generates text). The language model runs well on WebGPU. The vision tower does not. Specifically, the ReduceMean operator in the vision tower triggers a WebGPU bug that produces incorrect output on some GPU drivers.
Our workaround: hybrid device routing. The vision tower runs on the WASM backend (CPU). The language model runs on the WebGPU backend (GPU). ONNX Runtime supports multiple execution providers in a single session, so we assign operators to backends based on which subgraph they belong to.
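In code, the routing is just a device map keyed by submodule. A sketch (the module names are illustrative, not Janus's actual graph names):

```javascript
// Hybrid device routing for a Janus-style model. The vision subgraph pins
// to CPU while the language model keeps GPU acceleration.
function janusDeviceMap(hasWebGPU) {
  const gpu = hasWebGPU ? 'webgpu' : 'wasm';
  return {
    vision_encoder: 'wasm', // ReduceMean misbehaves on some drivers: keep on CPU
    language_model: gpu,
    lm_head: gpu,
  };
}
```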
This is ugly. It works. The vision tower is a small fraction of total compute, so running it on CPU has minimal impact on overall speed. The language model still gets full GPU acceleration.
We expect this to become unnecessary as WebGPU implementations mature and driver bugs get fixed. For now, it is a practical solution to a real problem.
Memory limits
Chrome allocates roughly 4GB of memory per tab (this varies by platform and system RAM). A q4-quantised 3.75GB model, plus the ONNX Runtime engine, plus intermediate buffers during inference, lands close to that ceiling.
This means:
- Close other heavy tabs before loading a large model. Each tab has its own memory budget, but system RAM is shared.
- Very long conversations accumulate KV-cache memory. If the model starts producing nonsense after a long exchange, the context window is likely exhausted.
- Larger models (7B+ parameters) are currently impractical in a browser tab. The quantised weights alone would exceed the memory budget before inference even starts.
The 2-4B parameter range is the sweet spot for in-browser inference today. Models like Gemma 4 2B, SmolLM2, Phi-3 Mini, and Qwen2.5 3B all fit comfortably.
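The KV-cache point above can be made concrete. A rough estimate, with illustrative dimensions rather than any specific model's config:

```javascript
// Rough KV-cache size: two tensors (K and V) per layer, per token.
// bytesPerEl defaults to 2 (fp16 cache entries).
function kvCacheMB(layers, kvHeads, headDim, seqLen, bytesPerEl = 2) {
  const bytes = 2 * layers * kvHeads * headDim * seqLen * bytesPerEl;
  return bytes / (1024 ** 2);
}

// e.g. 26 layers, 8 KV heads, head dim 128, 4096-token context, fp16:
// 2 * 26 * 8 * 128 * 4096 * 2 bytes = 416MB on top of the weights
```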
What actually happens during generation
Once the model is loaded, text generation follows a standard autoregressive loop.
1. The prompt is tokenised into integer IDs
2. The token IDs are fed through the model graph as a tensor
3. The model outputs logits (unnormalised scores over every token in its vocabulary)
4. A sampling strategy (temperature, top-k, top-p) selects the next token
5. The selected token is appended to the sequence
6. Repeat from step 2 until the model produces a stop token or hits the length limit
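Stripped of the real model, the loop fits in a few lines. A sketch where `forward` stands in for the ONNX graph and returns logits over a toy vocabulary:

```javascript
// Autoregressive generation with temperature sampling. `forward` stands in
// for the real model: it maps a token sequence to logits over the vocabulary.
// `sample` is injected so the loop can be made deterministic in tests.
function generateTokens(forward, promptIds, { maxNewTokens, temperature = 1.0, stopId, sample = Math.random }) {
  const ids = [...promptIds];
  for (let i = 0; i < maxNewTokens; i++) {
    const logits = forward(ids);                      // one forward pass → one token
    const scaled = logits.map((l) => l / temperature);
    const max = Math.max(...scaled);                  // subtract max for numerical stability
    const exps = scaled.map((l) => Math.exp(l - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    // Sample the next token from the softmax distribution.
    let r = sample() * sum;
    let next = exps.length - 1;
    for (let t = 0; t < exps.length; t++) {
      r -= exps[t];
      if (r <= 0) { next = t; break; }
    }
    if (next === stopId) break;
    ids.push(next);
  }
  return ids;
}
```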
Each iteration through this loop is one "forward pass" and produces one token. The speed (tokens per second) depends almost entirely on how fast the matrix multiplications in step 2 run, which is why the WebGPU vs WASM distinction matters so much.
Transformers.js yields each token as it is generated, so we can stream text to the UI in real time. The user sees words appearing progressively, just like a cloud-based chatbot, except everything is running on their own hardware.
What is next
In-browser LLM inference is improving fast. Several developments will push it further in the next year or two.
Better WebGPU compute shaders. Current WebGPU implementations do not yet expose all the GPU features that native inference engines use. As browser vendors add support for subgroups, cooperative matrix operations, and more flexible memory access patterns, inference speed will improve without any changes to the models themselves.
Smaller quantisation formats. Research into 2-bit and 1.58-bit quantisation (like BitNet) is progressing. If these formats prove practical, a model that currently needs 3.75GB could fit in under 1GB. That changes the loading story completely.
Speculative decoding. This technique uses a small, fast "draft" model to generate candidate tokens, then verifies them in batches with the full model. It can double effective throughput without changing the model weights. Some early implementations already work in ONNX Runtime.
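A toy version of the accept/reject logic, greedy for simplicity. In a real implementation the verification of all k draft tokens happens in one batched forward pass; here the target calls are sequential to keep the sketch small:

```javascript
// Toy speculative decoding step, greedy variant. `draft` and `target` both
// map a token sequence to the next token id; draft is cheap, target is the
// real model. Returns the sequence after one draft-and-verify round.
function speculativeStep(draft, target, ids, k) {
  // 1. Draft model proposes k tokens autoregressively (cheap).
  const proposed = [];
  const ctx = [...ids];
  for (let i = 0; i < k; i++) {
    const t = draft(ctx);
    proposed.push(t);
    ctx.push(t);
  }
  // 2. Target model verifies each proposal; keep the longest agreeing prefix,
  //    then take the target's own token at the first disagreement.
  const accepted = [...ids];
  for (const t of proposed) {
    const want = target(accepted);
    accepted.push(want);
    if (want !== t) break; // mismatch: discard the rest of the draft
  }
  return accepted;
}
```

When the draft model agrees with the target most of the time, each round emits several tokens for roughly the cost of one full-model pass.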
WebGPU on mobile. Once mobile browsers ship stable WebGPU support, the same models that run on desktop GPUs will run on phone GPUs. Mobile GPUs are less powerful, but they are more than fast enough for 2B parameter models at q4 precision.
The point
A year ago, running a language model meant paying for API calls and trusting a cloud provider with your prompts. Today, you can load a 3.75GB model into a browser tab, type a question, and get an answer without any data leaving your device.
The stack is real. The performance is usable. The privacy is architectural, not policy-based. Your prompts never hit a server because there is no server.
Try it at /llm. Load a model, ask a question, and check the Network tab in DevTools. Nothing leaves.