LLM in the Browser: run AI models without a server
Your browser can download and run a language model locally. No API key, no server, no data leaves your device.
What this actually is
Your browser tab downloads a language model and runs it locally. The model weights sit in your browser cache. Inference happens on your CPU or GPU. Nothing leaves your device. No API key, no account, no server.
That is not a metaphor. It is literally what happens when you open Unwrite LLM.
Why it matters
Most AI tools work like this: you type something, it goes to a server, a big model processes it, and a response comes back. That round trip means your data touches infrastructure you do not control.
Browser-based LLMs flip that model. The trade-off is smaller models, but you get something valuable in return:
- Privacy. Your prompts never leave your machine. No logging, no training data contribution, no terms of service to read.
- Cost. Zero. No API credits, no subscription tiers, no usage caps.
- No account. Open the page and start. No sign-up flow, no email verification, no password to manage.
- Offline capable. Once the model is cached, you can use it without an internet connection.
How it works under the hood
Three technologies make this possible:
Transformers.js
Hugging Face's Transformers.js library loads ONNX-format models directly in the browser. It handles tokenisation, model loading, and inference. The same library powers chat, vision, speech, and image generation models.
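The core of that API is a single `pipeline` call. The sketch below assumes the `@huggingface/transformers` package and is not executed here; `loadChat` and `buildMessages` are hypothetical helper names, and `<model-id>` is a placeholder for a real ONNX model id:

```javascript
// Sketch: loading a chat model with Transformers.js.
// Assumes the @huggingface/transformers package; the import is deferred
// so nothing runs until loadChat is actually called in a browser.
async function loadChat(modelId) {
  const { pipeline } = await import('@huggingface/transformers');
  // Downloads the ONNX weights (cached by the browser) and returns a generator.
  return pipeline('text-generation', modelId);
}

// Pure helper: the messages array chat pipelines expect.
function buildMessages(system, user) {
  return [
    { role: 'system', content: system },
    { role: 'user', content: user },
  ];
}

// Usage in a browser:
//   const generate = await loadChat('<model-id>');
//   const out = await generate(buildMessages('Be brief.', 'Hi'), { max_new_tokens: 64 });
```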
WebGPU and WASM
WebGPU gives the browser access to your GPU for parallel computation. When WebGPU is not available (older browsers, some operating systems), the library falls back to WebAssembly (WASM), which runs on the CPU. WebGPU is significantly faster for larger models.
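The fallback decision can be sketched as a one-line check: WebGPU support surfaces as `navigator.gpu` in the browser. `chooseBackend` is a hypothetical helper written to be testable outside a browser; passing the chosen value as Transformers.js's `device` option is an assumption based on its documented API:

```javascript
// Backend choice mirrors the fallback described above: WebGPU when the
// browser exposes navigator.gpu, otherwise WASM on the CPU.
function chooseBackend(nav) {
  return nav && 'gpu' in nav ? 'webgpu' : 'wasm';
}

// In a browser: const backend = chooseBackend(navigator);
// Transformers.js accepts this as an option, e.g. { device: backend }.
```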
Web Workers
All inference runs in a Web Worker, a background thread separate from the main page. This keeps the UI responsive while the model is thinking. You can still scroll, type, and interact with the page normally.
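The main page and the worker talk only through messages. A minimal sketch, with the routing logic kept pure so it can be tested off-browser; `handle`, `runModel`, and the message shapes are hypothetical, not the tool's actual protocol:

```javascript
// worker.js sketch. `generate` stands in for the model call.
function handle(msg, generate) {
  if (msg.type !== 'generate') {
    return { type: 'error', error: `unknown message: ${msg.type}` };
  }
  return { type: 'result', text: generate(msg.prompt) };
}

// In the worker thread:
//   self.onmessage = (e) => self.postMessage(handle(e.data, runModel));
// On the main page, which stays responsive while the worker computes:
//   const worker = new Worker('worker.js');
//   worker.postMessage({ type: 'generate', prompt: 'Hello' });
//   worker.onmessage = (e) => console.log(e.data);
```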
Available models
The tool supports models ranging from 77 million to 3.8 billion parameters. That is a wide spread, and the quality difference is real.
| Category | What you get |
|---|---|
| Chat | Text generation and conversation, from small fast models to larger capable ones |
| Vision | Image understanding and description |
| Text-to-speech | Natural voice synthesis with multiple voices and languages |
| Speech recognition | Audio transcription from your microphone or files |
| Image generation | Create images from text prompts |
Smaller models (77M to 360M parameters) download fast and respond quickly but produce noticeably weaker output. Larger models (1B to 3.8B) take longer to download and run slower, but the quality improvement is substantial.
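A rough rule of thumb for those download sizes: parameters times bytes per parameter, where 4-bit quantised weights take about 0.5 bytes each and fp16 takes 2. Actual ONNX files vary with format and quantisation, so treat this as an order-of-magnitude guide only:

```javascript
// Rough download-size estimate: parameters × bytes per parameter.
// 4-bit quantised ≈ 0.5 bytes/param; fp16 = 2 bytes/param.
function approxDownloadGB(params, bytesPerParam) {
  return (params * bytesPerParam) / 1e9;
}

// A 360M model at 4-bit: approxDownloadGB(360e6, 0.5)  → ~0.18 GB
// A 3.8B model at fp16:  approxDownloadGB(3.8e9, 2)    → ~7.6 GB
```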
Getting started
1. Open Unwrite LLM
2. Pick a model from the dropdown
3. Wait for the download (one-time, cached afterwards)
4. Start chatting
The first load takes a while depending on your connection and the model size. After that, the model loads from cache in seconds.
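You can check for a cached model yourself from the browser console. This sketch assumes Transformers.js stores downloads in a Cache API bucket whose name contains `transformers` (e.g. `transformers-cache`); the helper is written to take a plain list of names so it can be tested anywhere:

```javascript
// Filter Cache API bucket names down to ones that look like model caches.
// Assumption: Transformers.js names its cache with 'transformers' in it.
function modelCaches(names) {
  return names.filter((n) => n.includes('transformers'));
}

// In a browser console:
//   const cached = modelCaches(await caches.keys());
//   console.log(cached.length ? 'model cached' : 'first visit');
```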
If your browser supports WebGPU, you will see a noticeable speed improvement. Chrome and Edge on desktop have the best support currently.
Honest limitations
This needs to be said plainly.
Small models are not GPT-5. A 1B parameter model running in your browser is not comparable to a 400B+ parameter model running on a data centre full of GPUs. The gap is large. Small models hallucinate more, struggle with complex reasoning, and produce less coherent long-form output.
Gemma 3 is good, but limited. Among the available models, the Gemma 3 1B is surprisingly capable for its size. It handles straightforward questions well and produces readable text. But its context window is limited, and it falls apart on tasks that require holding lots of information in memory.
Download sizes are real. Larger models can be several gigabytes. On a slow connection, that initial download is painful. The caching helps, but the first visit is a commitment.
Hardware matters. A modern laptop with a decent GPU will run these models comfortably. An older machine or a phone will struggle with anything above the smallest models.
When to use a remote server instead
If you need stronger models, longer context windows, or tool calling, run a local server. Ollama is the simplest option. Install it, pull a model, and you have a local API running models up to 70B+ parameters.
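Talking to that local API from a browser page is a couple of fetch calls. The sketch below follows Ollama's documented HTTP API (default port 11434, chat endpoint at `/api/chat`); `ollamaChatRequest` and `askOllama` are hypothetical helper names, and the network call is not run here:

```javascript
// Build the request for Ollama's /api/chat endpoint (default port 11434).
// stream: false asks for a single JSON response instead of chunks.
function ollamaChatRequest(model, messages) {
  return {
    url: 'http://localhost:11434/api/chat',
    body: JSON.stringify({ model, messages, stream: false }),
  };
}

// Not executed here: requires a running Ollama instance.
async function askOllama(model, prompt) {
  const req = ollamaChatRequest(model, [{ role: 'user', content: prompt }]);
  const res = await fetch(req.url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: req.body,
  });
  return (await res.json()).message.content;
}
```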
We built a clean browser interface for exactly this use case at Unwrite LLM Remote. It connects to your Ollama instance (or any OpenAI-compatible server) and gives you a proper chat UI without installing another Electron app.
For anything requiring frontier-level reasoning, Unwrite GPT connects to commercial APIs where the biggest models live.
The practical middle ground
Browser LLMs are not a replacement for cloud AI. They are a complement. Use them when privacy matters more than capability, when you want something quick without signing into anything, or when you are offline.
For serious work, run Ollama locally and connect to it from Unwrite LLM Remote. For maximum capability, use the commercial APIs. For a quick, private, no-setup conversation with a small model, open Unwrite LLM and start typing.
The point is not that browser LLMs are better. The point is that they exist, they work, and they cost nothing. That is worth knowing about.