Tool Guides

Connect your own LLM server from the browser

Run big models on your own hardware and chat with them from a clean browser UI. No SaaS middleman.

8 min read
Free Guide

The idea

You have a machine with a GPU. You want to run large language models on it. You do not want to pay for API credits or send your data to someone else's servers.

Unwrite LLM Remote is a browser-based chat interface that connects to any OpenAI-compatible API. You bring the server, we provide the UI. Your prompts go from your browser to your server and back. Nobody else is involved.
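The round trip is a single POST to your server's chat completions endpoint. Here is a minimal sketch in Python of the request the browser effectively makes, assuming Ollama's default port and a pulled `llama3.2` model (swap in your own base URL and model name):

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:11434",          # your server, nobody else's
    "llama3.2",
    [{"role": "user", "content": "Hello"}],
)
# Sending it requires the server to be running:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same request shape works against every server listed below; only the base URL changes.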

Supported servers

Any server that speaks the OpenAI chat completions API will work. That includes:

  • **Ollama** - The simplest option. One binary, one command to pull a model.
  • **LM Studio** - Desktop app with a built-in server mode and model browser.
  • **llama.cpp** - The original. Maximum control, minimal overhead.
  • **Jan** - Another desktop app with OpenAI-compatible server built in.
  • **vLLM** - Production-grade serving with PagedAttention. Heavier setup, better throughput.
  • **TGI** - Hugging Face's inference server. Good for multi-user setups.
  • **LocalAI** - Supports multiple model formats and backends.

If your server exposes /v1/chat/completions, it will work.

Quick start: Ollama

Ollama is the fastest path from zero to running.

Install

On macOS:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows has a downloadable installer at ollama.com.

Pull a model

ollama pull llama3.2

Pick whatever suits your hardware. llama3.2 (3B) runs on most machines. llama3.1:70b needs serious GPU memory. gemma2:27b is a good middle ground if you have 16+ GB of VRAM.

Enable CORS

This is the step most people miss. Browsers block cross-origin requests by default. Ollama needs to be told to allow requests from your browser.

Set the environment variable before starting Ollama:

OLLAMA_ORIGINS="*" ollama serve

On macOS, if Ollama is running as an app, set it system-wide:

launchctl setenv OLLAMA_ORIGINS "*"

Then restart Ollama.
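What the browser does with that setting can be sketched as a simplified check (real CORS also involves preflight requests and credential rules, which this ignores): the response must carry an `Access-Control-Allow-Origin` header that is either `*` or the page's exact origin.

```python
def cors_allows(request_origin, allow_origin_header):
    """Simplified mirror of the browser's CORS response check."""
    if allow_origin_header is None:
        return False  # header absent: the browser blocks the response
    return allow_origin_header in ("*", request_origin)
```

Setting `OLLAMA_ORIGINS="*"` makes Ollama send `Access-Control-Allow-Origin: *`, which passes this check for any page.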

Connect

  1. Open Unwrite LLM Remote
  2. Enter your server URL: http://localhost:11434
  3. The tool fetches your available models automatically
  4. Pick a model and start chatting

That is it. You are now running a large language model on your own hardware with a clean browser interface.
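The automatic model discovery in step 3 uses the standard companion endpoint on OpenAI-compatible servers, `/v1/models`. A sketch of that lookup, assuming the response follows the usual `{"data": [{"id": ...}]}` shape:

```python
import json
import urllib.request

def models_url(base_url):
    """The standard listing endpoint on OpenAI-compatible servers."""
    return base_url.rstrip("/") + "/v1/models"

def extract_model_ids(models_response):
    """Pull the model IDs out of a /v1/models response body."""
    return [m["id"] for m in models_response.get("data", [])]

# Fetching requires the server to be running:
# with urllib.request.urlopen(models_url("http://localhost:11434")) as resp:
#     print(extract_model_ids(json.load(resp)))
```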

Quick start: LM Studio

LM Studio is more visual. Download it, browse models, click download, then:

  1. Open the "Local Server" tab in LM Studio
  2. Load a model
  3. Start the server (defaults to port 1234)
  4. In Unwrite LLM Remote, enter http://localhost:1234

LM Studio handles CORS by default, so no extra configuration is needed.

Accessing from other devices

Same network (LAN)

If your server machine and your browser are on the same local network, use the server machine's local IP address instead of localhost. For example: http://192.168.1.50:11434.

Make sure your firewall allows connections on the relevant port.
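A quick way to verify that the port is reachable before blaming the browser is a plain TCP connection test. A small helper, usable from any machine on the network (the IP below is the hypothetical example from above):

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example check against a LAN Ollama server:
# port_open("192.168.1.50", 11434)
```

If this returns False, the problem is the firewall or the server binding, not CORS or the UI.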

Tailscale

Tailscale creates a private network between your devices. Install it on your server machine and your browsing device, and you get a stable IP address that works from anywhere. No port forwarding, no dynamic DNS, no exposing anything to the public internet.

This is the cleanest way to access your home server from a laptop at a coffee shop. Your traffic is encrypted and goes directly between your devices.

Features beyond basic chat

Model management

The tool lists all models available on your server. Switch between them mid-conversation. Compare outputs from different models on the same prompt.

System prompts

Set a system prompt to shape the model's behaviour. Useful for role-specific conversations or constraining output format.
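In the OpenAI-compatible API, a system prompt is just a message with `role: "system"` placed ahead of the user turn. A sketch (the prompt text is an illustrative example):

```python
def with_system_prompt(system_prompt, user_message):
    """Build a messages list with a system prompt ahead of the user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = with_system_prompt(
    "You are a terse code reviewer. Answer in bullet points only.",
    "Review this function for bugs.",
)
```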

Tool calling

If your server supports function calling (Ollama does for compatible models), the tool can handle structured tool-use workflows.
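Function calling uses the OpenAI-style `tools` array in the request body. A sketch of the payload shape, with a hypothetical `get_weather` tool (the name, description, and parameters are illustrative, not part of any server's built-ins):

```python
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

payload = {
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [weather_tool],
}
```

A model that supports tool use responds with a structured tool call instead of prose; your client runs the function and sends the result back as a `tool` message.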

Conversation history

Conversations stay in your browser's local storage. They never leave your device. Clear them whenever you want.

Why not just use ChatGPT

Fair question. Here is why someone would bother with local models:

Privacy. Your prompts do not go to OpenAI, Anthropic, or Google. If you are working with confidential documents, proprietary code, or personal data, this matters.

Cost. After the hardware investment, running local models costs electricity. No per-token billing, no subscription fees, no surprise invoices.

Control. You choose the model, the parameters, the context length. No content filters you did not ask for. No model deprecation breaking your workflow.

Speed for some workloads. A local GPU with a quantised model can be faster than a round trip to an API server, especially for shorter prompts.

Availability. No rate limits, no outages you cannot control, no "we're experiencing high demand" messages.

The trade-off is obvious: local models are smaller and less capable than frontier models like GPT-4 or Claude. For tasks that need maximum reasoning ability, the APIs win. For everything else, local is worth considering.

Troubleshooting

"Failed to connect" or network error

Check that your server is actually running. Try opening the server URL directly in your browser. For Ollama, http://localhost:11434 should return "Ollama is running".

CORS errors

The browser console will show a CORS error if the server is not configured to accept cross-origin requests. For Ollama, set OLLAMA_ORIGINS="*" and restart. For other servers, consult their documentation for CORS configuration.

Wrong URL format

Some servers need /v1 appended to the base URL. If model listing fails, try adding /v1 to the end of your server URL. For example: http://localhost:1234/v1.
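That retry can be captured in a one-liner helper, safe to apply to URLs that already carry the suffix:

```python
def try_v1_suffix(url):
    """Return the server URL with /v1 appended if it is not already there."""
    url = url.rstrip("/")
    return url if url.endswith("/v1") else url + "/v1"
```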

Slow responses

Large models on insufficient hardware will be slow. Check your GPU utilisation. If the model is running on CPU, expect much slower inference. Quantised models (Q4, Q5) are faster than full precision.

The setup is worth it

Five minutes of setup gets you a private, unlimited AI chat running on your own hardware with a clean browser interface. No subscriptions, no data sharing, no accounts.

Start with Ollama, pull a model, set CORS, and open Unwrite LLM Remote. If you want browser-only models with zero server setup, Unwrite LLM runs smaller models directly in your tab.