Tips & Best Practices

18 AI models, one browser tab

Chat, text-to-speech, speech recognition, vision, and image generation. All running locally in your browser.


The full catalogue

Unwrite LLM ships 18 models across five categories. Every one of them downloads to your browser and runs locally. No server, no API key, no data leaving your device.

Here is the complete list.

| Model | Parameters | Download | Type |
| --- | --- | --- | --- |
| SmolLM2 135M | 135M | ~270 MB | Chat |
| SmolLM2 360M | 360M | ~720 MB | Chat |
| Qwen 2.5 0.5B | 500M | ~950 MB | Chat |
| Qwen 2.5 1.5B | 1.5B | ~1.6 GB | Chat |
| Gemma 3 1B | 1B | ~1.3 GB | Chat |
| Gemma 4 1B | 1B | ~1.3 GB | Chat |
| Llama 3.2 1B | 1B | ~1.3 GB | Chat |
| Llama 3.2 3B | 3B | ~3.2 GB | Chat |
| Phi 3.5 Mini | 3.8B | ~3.6 GB | Chat |
| SmolLM2 135M (WASM) | 135M | ~270 MB | Chat (CPU) |
| SmolLM2 360M (WASM) | 360M | ~720 MB | Chat (CPU) |
| Kokoro TTS | - | ~350 MB | Text-to-speech |
| Granite ASR | - | ~160 MB | Speech recognition |
| Florence-2 Base | 230M | ~500 MB | Vision |
| Florence-2 Large | 770M | ~1.5 GB | Vision |
| Janus Pro 1B | 1B | ~2.2 GB | Image generation |
| Moondream 0.5B | 500M | ~1.1 GB | Vision |
| Moondream 2B | 2B | ~3.6 GB | Vision |

All sizes are approximate. Quantised formats keep downloads smaller than the raw parameter count would suggest.
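As a rule of thumb, a quantised download is roughly parameter count times bits per weight, plus some overhead for embeddings, tokenizer files, and metadata. A sketch of that arithmetic (the 25% overhead figure is an assumption for illustration, not something the tool documents):

```typescript
// Rough download-size estimate for a quantised model.
// bitsPerWeight is 8 for q8-style quantisation, 4 for q4, 16 for fp16.
// The 1.25 overhead factor is an assumption, not a spec.
function estimateDownloadGB(params: number, bitsPerWeight: number): number {
  const weightBytes = (params * bitsPerWeight) / 8;
  return (weightBytes * 1.25) / 1e9; // decimal gigabytes
}
```

At 8 bits per weight, a 1B-parameter model comes out around 1.25 GB, in line with the ~1.3 GB entries in the table.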

Chat models: honest comparisons

Not all 1B models are equal. Here is what to expect.

SmolLM2 (135M and 360M)

The smallest options. They respond almost instantly and download fast. Good for testing, quick throwaway questions, or machines without much horsepower. Do not expect nuanced answers. The 135M model in particular will produce incoherent output on anything beyond simple prompts.

Qwen 2.5 (0.5B and 1.5B)

Solid mid-range options. The 1.5B variant handles instructions well and produces more structured output than SmolLM2. Good for summarisation and straightforward Q&A. Multilingual support is decent.

Gemma 3 and Gemma 4 (1B)

Google's entries. The Gemma 4 1B is arguably the best model in this size class. It follows instructions more reliably than its competitors and produces cleaner prose. The context window is limited, though, so do not try to feed it a long document and ask for analysis. Keep prompts focused.

Llama 3.2 (1B and 3B)

Meta's models. The 3B variant is one of the largest chat models in the catalogue (only Phi 3.5 Mini is bigger), and it shows: noticeably better reasoning, more coherent long responses, and fewer hallucinations than the 1B models. The trade-off is a 3.2 GB download and slower inference, especially without WebGPU.

Phi 3.5 Mini (3.8B)

Microsoft's compact model. The largest in the catalogue at 3.8B parameters. Strong at structured tasks like code generation and step-by-step reasoning. The download is hefty and inference is slow on CPU. WebGPU makes a real difference here.

WASM variants

The SmolLM2 WASM models run purely on CPU without WebGPU. They exist for browsers that do not support WebGPU yet, or for situations where you need maximum compatibility. Same models, different runtime, slightly slower.

Kokoro TTS: 54 voices, 8 languages

Kokoro is genuinely impressive for a browser-based text-to-speech engine. The voice quality is natural, with proper intonation and pacing that does not sound robotic.

Kokoro offers 54 voices across English (American and British), French, Spanish, Italian, Portuguese, German, Japanese, and Korean. You pick a voice, type or paste text, and it generates audio locally.

The output is good enough for draft narration, accessibility testing, or just hearing how your writing sounds out loud. It is not broadcast quality, but it is far better than the flat monotone you might expect from a browser-based tool.
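Unwrite LLM handles playback for you, but if you are curious how browser TTS output typically becomes a saveable file: in-browser engines generally return raw Float32 PCM samples, which can be wrapped in a minimal WAV container. A generic sketch, not the tool's actual code:

```typescript
// Wrap raw mono Float32 PCM samples (the typical output of an in-browser
// TTS engine) in a minimal 16-bit WAV container for download or replay.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const dataSize = samples.length * 2;          // 16-bit = 2 bytes/sample
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeStr = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };
  writeStr(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);       // RIFF chunk size
  writeStr(8, "WAVE");
  writeStr(12, "fmt ");
  view.setUint32(16, 16, true);                 // fmt chunk size
  view.setUint16(20, 1, true);                  // PCM format
  view.setUint16(22, 1, true);                  // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);     // byte rate
  view.setUint16(32, 2, true);                  // block align
  view.setUint16(34, 16, true);                 // bits per sample
  writeStr(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the browser, the resulting buffer can go straight into `new Blob([buffer], { type: "audio/wav" })` and an `<audio>` element or download link.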

IBM Granite ASR: speech recognition

Granite handles speech-to-text. Feed it audio from your microphone or upload a file and it transcribes locally. Accuracy is reasonable for clear speech in English. Background noise and accents reduce accuracy noticeably.

The model is small (around 160 MB) and transcribes quickly. Useful for quick notes, rough transcriptions, or any situation where you want speech-to-text without sending audio to a cloud service.
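The tool wires up the microphone for you, but the underlying browser plumbing is the standard MediaRecorder pattern: collect audio chunks as they arrive, merge them into one Blob, hand the Blob to the transcription model. A sketch of the chunk-collection step (the `RecorderLike` interface mirrors MediaRecorder's event surface; this is illustrative, not Unwrite LLM's code):

```typescript
// Collect chunks from a MediaRecorder-style recorder into a single Blob
// ready to hand to a local speech-recognition model.
interface RecorderLike {
  start(): void;
  stop(): void;
  ondataavailable: ((e: { data: Blob }) => void) | null;
  onstop: (() => void) | null;
}

function recordUntilStopped(recorder: RecorderLike): { stop(): Promise<Blob> } {
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
  return {
    stop: () =>
      new Promise((resolve) => {
        recorder.onstop = () => resolve(new Blob(chunks, { type: "audio/webm" }));
        recorder.stop();
      }),
  };
}
```

In the browser you would build the recorder with `new MediaRecorder(await navigator.mediaDevices.getUserMedia({ audio: true }))`.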

Florence-2: vision

Florence-2 analyses images. Upload a photo and it can describe what it sees, identify objects, or answer questions about the image content.

The base model (230M) is fast but shallow. It will tell you "a dog on a beach" but not much more. The large model (770M) provides more detail and handles complex scenes better.

Useful for quick image descriptions, accessibility alt-text drafts, or understanding what is in an image without uploading it to a third-party service.

Moondream: vision

Moondream is an alternative vision model. The 0.5B version is lightweight and fast. The 2B version is more capable and provides richer descriptions. Both handle visual question answering, where you upload an image and ask specific questions about it.

Janus Pro: image generation

This one required a workaround. Janus Pro is a hybrid model that uses WebGPU for most of its processing but falls back to WASM for specific operations that WebGPU cannot handle efficiently in the browser. The result is slower than a native implementation but it works.

Type a prompt, wait a while, and get an image. The output is 384x384 pixels. Quality varies. Simple subjects (a red ball, a sunset) work well. Complex scenes with specific details are hit or miss. Faces are unreliable.

It is a proof of concept more than a production tool. But the fact that it runs entirely in your browser, generating images from text with zero server involvement, is noteworthy.

Tradeoffs worth knowing

Download sizes add up

If you try every model, you are looking at 20+ GB of downloads. The browser caches them, so subsequent loads are fast, but the initial hit is real. Pick the models you actually need.
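You can check how much of your origin's storage quota those cached models are consuming with the standard StorageManager API. A sketch (the helper takes the storage object as a parameter so it stays testable; in the browser, pass `navigator.storage`):

```typescript
// Report cached-model storage usage against the origin's quota before
// committing to another multi-gigabyte download.
interface StorageEstimator {
  estimate(): Promise<{ usage?: number; quota?: number }>;
}

const formatGB = (bytes: number): string => (bytes / 1e9).toFixed(1) + " GB";

async function describeModelCache(storage: StorageEstimator): Promise<string> {
  const { usage = 0, quota = 0 } = await storage.estimate();
  return `Model cache: ${formatGB(usage)} of ${formatGB(quota)} quota`;
}
```

Usage in a supporting browser: `describeModelCache(navigator.storage).then(console.log)`.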

WebGPU matters enormously

On a machine with WebGPU support, the larger models are usable. Without it, anything above 500M parameters becomes painfully slow. Check your browser's WebGPU support before committing to a large model download.
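A quick way to check is the `navigator.gpu` API. The property merely existing is not enough: a browser can expose the API surface yet fail to find a usable adapter, so requesting one is the real test. A minimal sketch:

```typescript
// WebGPU feature check: the API must be exposed AND an adapter must be
// obtainable. requestAdapter resolves to null when no usable GPU exists.
type GpuLike = { requestAdapter(): Promise<unknown | null> };

async function supportsWebGPU(gpu: GpuLike | undefined): Promise<boolean> {
  if (!gpu) return false; // API not exposed at all
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}
```

In the browser, call it as `supportsWebGPU(navigator.gpu)`.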

Quality variance is wide

The gap between a 135M model and a 3.8B model is enormous. Do not judge browser LLMs by the smallest model. Try Gemma 4 1B or Llama 3.2 3B before forming an opinion.

Privacy is the constant

Regardless of which model you pick, the privacy story is the same. Everything runs locally. Your prompts, your images, your voice, your generated content. None of it leaves your browser. That is the whole point.

Try it

Open Unwrite LLM, pick a model, and see what your browser can do. Start with something in the 1B range for a reasonable balance of speed and quality. If your machine handles it well, try the larger models.