Indent

At Indent, we run coding agents in cloud sandboxes. We strive to make these sandboxes feel like local development, and one part of that is exposing a responsive, full-featured terminal interface.

Many products already offer remote terminals on the web, including vscode.dev, GitHub Codespaces, and Cursor. But under the hood, most of them hit the same tradeoffs:

Reconnection is slow. They stream raw terminal output to the browser, so when you reload a tab or lose your connection, the client has to replay the buffered session output through its emulator to reconstruct the screen.
There’s no canonical screen. Each client builds its own copy of terminal state from the byte stream. Two browsers looking at the same session can drift out of sync.
Variable connections hurt. A phone on 3G and a workstation on localhost get the same firehose of data. There’s no way to adapt to what each client can actually handle.

Tools like tmux and Mosh solve some of these problems by keeping canonical state on the server and sending screen updates, but neither speaks a protocol browsers can use.

Blit brings a tmux-style architecture to the browser. The rest of this post covers how that works: how we diff and compress the terminal grid, how the browser renders frames with WebGL, and how per-client congestion control adapts delivery to each connection.

How terminals work

A terminal is a grid of cells; on a modern display it might be 200 columns wide and 50 rows tall, for 10,000 cells. Each cell holds:

A character (if there is one)
A foreground and background color
Style flags like bold, italic, or underline

Programs don’t draw to this grid directly. When you run ls or htop, the program writes bytes to its standard output. Most of those bytes are plain text, but mixed in are escape sequences (ANSI codes) that say things like “move the cursor to row 5, column 10” or “set the foreground color to red.” A program like htop creates a full-screen interface even though it’s really just printing characters to a stream. It falls to terminal emulators like Alacritty, xterm, or Ghostty to parse that byte stream and render it as a grid of styled characters on screen.
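To make the byte stream concrete, here is a small sketch of the two escape sequences quoted above and a toy tokenizer that separates them from plain text. The function and type names are illustrative, and the tokenizer only handles the simple numeric CSI form; a real emulator covers far more cases.

```typescript
const ESC = "\x1b";

// CUP ("cursor position"): ESC [ row ; col H, with 1-based coordinates.
function cursorTo(row: number, col: number): string {
  return `${ESC}[${row};${col}H`;
}

// SGR ("select graphic rendition"): 31 sets a red foreground, 0 resets all styles.
const RED_FG = `${ESC}[31m`;
const RESET_STYLE = `${ESC}[0m`;

// What a program might write to stdout to print a red "hello" at row 5, column 10.
const sampleOutput = cursorTo(5, 10) + RED_FG + "hello" + RESET_STYLE;

type StreamToken =
  | { kind: "text"; text: string }
  | { kind: "csi"; params: number[]; final: string };

// Split a stream into plain text runs and CSI sequences (ESC [ params final-byte).
function tokenizeAnsi(input: string): StreamToken[] {
  const tokens: StreamToken[] = [];
  const csi = /\x1b\[([0-9;]*)([@-~])/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = csi.exec(input)) !== null) {
    if (m.index > last) {
      tokens.push({ kind: "text", text: input.slice(last, m.index) });
    }
    const rawParams = m[1] ?? "";
    const params = rawParams === "" ? [] : rawParams.split(";").map(Number);
    tokens.push({ kind: "csi", params, final: m[2] ?? "" });
    last = m.index + m[0].length;
  }
  if (last < input.length) {
    tokens.push({ kind: "text", text: input.slice(last) });
  }
  return tokens;
}
```

Running tokenizeAnsi(sampleOutput) yields a cursor move, a color change, the text "hello", and a style reset, which is the raw material an emulator turns into grid cells.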

In a local setup the terminal emulator runs on your machine and draws the grid directly, but in a remote setup we need to get that grid into a browser.

Where to run the terminal emulation

We have a program running on the server, outputting bytes. We have a browser that needs to display a grid. How do we get one to the other? There are really two choices, and they come down to where the terminal emulator runs:

In the browser. Stream the raw bytes to the browser and let it run a terminal emulator in JavaScript to build the grid. This is how most web terminals work today (VS Code, Codespaces), typically via a library called xterm.js. It comes with the tradeoffs described above.
On the server. Run the terminal emulator on the server, maintain the canonical grid there, and send structured cell data to the browser. The browser just displays what it’s told. This is the approach tmux and Mosh take, and it’s what Blit does for the browser.

Server-side emulation gives you near-instant reconnection (a new client gets the current grid, not a replay of history), one source of truth (every client sees the same screen), and per-client pacing (the server adapts to each client’s connection speed). The obvious objection is bandwidth: a 200x50 terminal is about 120 KB of cell data. A program like htop redraws multiple times per second, so naively shipping full grids means megabytes per second per client. That’s a compression problem, and it’s what the next section solves.
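The bandwidth objection is easy to check with arithmetic. The sketch below assumes 12 bytes per cell (the figure implied by 120 KB for 10,000 cells) and an illustrative redraw rate of 10 updates per second; the function name is made up for this example.

```typescript
// Bytes per second needed to ship the whole grid on every update.
function fullGridBytesPerSecond(
  cols: number,
  rows: number,
  bytesPerCell: number,
  updatesPerSecond: number,
): number {
  return cols * rows * bytesPerCell * updatesPerSecond;
}

const oneGridBytes = fullGridBytesPerSecond(200, 50, 12, 1);   // 120,000 bytes per full grid
const naiveBytesPerSec = fullGridBytesPerSecond(200, 50, 12, 10); // 1,200,000 bytes/s at 10 updates/s
```

At ten full redraws per second a single client would already need about 1.2 MB/s, before any diffing or compression.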

Diffing and compressing the grid

Sending 120 KB on every update is too much, so the first optimization is to only send what changed. The server keeps a copy of what each client last saw, and on every update it compares the current grid against that snapshot. To make this comparison cheap, every cell is encoded as a fixed-size 12-byte slot:

2 bytes for style flags (bold, italic, underline, etc.)
3 bytes for the foreground color
3 bytes for the background color
4 bytes for the character

On a 200x50 terminal, that’s 10,000 cells and 120,000 bytes.
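A fixed-width layout like this can be sketched as follows. The field order follows the list above, but the flag bit assignments, the little-endian byte order, and storing the character as a Unicode code point are assumptions made for illustration, not Blit's documented wire format.

```typescript
const CELL_BYTES = 12;

interface GridCell {
  flags: number;                  // 16-bit style bitfield
  fg: [number, number, number];   // foreground RGB
  bg: [number, number, number];   // background RGB
  codepoint: number;              // Unicode code point (assumed encoding), 0 = empty
}

// Hypothetical bit assignments for the style flags.
const STYLE_BOLD = 1 << 0;
const STYLE_ITALIC = 1 << 1;
const STYLE_UNDERLINE = 1 << 2;

function allocGrid(cols: number, rows: number): Uint8Array {
  return new Uint8Array(cols * rows * CELL_BYTES);
}

// Layout: [0..2) flags, [2..5) fg, [5..8) bg, [8..12) character.
function writeCell(grid: Uint8Array, index: number, cell: GridCell): void {
  const view = new DataView(grid.buffer, grid.byteOffset + index * CELL_BYTES, CELL_BYTES);
  view.setUint16(0, cell.flags, true);
  for (let i = 0; i < 3; i++) {
    view.setUint8(2 + i, cell.fg[i]);
    view.setUint8(5 + i, cell.bg[i]);
  }
  view.setUint32(8, cell.codepoint, true);
}

function readCell(grid: Uint8Array, index: number): GridCell {
  const view = new DataView(grid.buffer, grid.byteOffset + index * CELL_BYTES, CELL_BYTES);
  return {
    flags: view.getUint16(0, true),
    fg: [view.getUint8(2), view.getUint8(3), view.getUint8(4)],
    bg: [view.getUint8(5), view.getUint8(6), view.getUint8(7)],
    codepoint: view.getUint32(8, true),
  };
}
```

Because every cell occupies exactly CELL_BYTES, cell n always starts at byte n * 12, so no index or length prefix is needed to find it.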

Because every cell is the same width in memory, comparing two grids is just walking both arrays and checking which 12-byte slices differ, with no parsing or alignment needed. The server sends only the changed cells. A single keystroke typically touches one or two cells out of ten thousand, so most updates are tiny.
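The comparison described above can be sketched in a few lines. This is a minimal illustration over two flat byte arrays, with a hypothetical function name; it returns the indices of cells whose 12-byte slots differ.

```typescript
const CELL_BYTES = 12;

// Walk both grids slot by slot and collect the indices of cells that changed.
function changedCells(prev: Uint8Array, next: Uint8Array): number[] {
  if (prev.length !== next.length || next.length % CELL_BYTES !== 0) {
    throw new Error("grids must have the same size, in whole cells");
  }
  const changed: number[] = [];
  for (let cell = 0, off = 0; off < next.length; cell++, off += CELL_BYTES) {
    for (let b = 0; b < CELL_BYTES; b++) {
      if (prev[off + b] !== next[off + b]) {
        changed.push(cell);
        break; // one differing byte is enough to mark the cell
      }
    }
  }
  return changed;
}
```

For a single keystroke on a 200x50 grid, the result is typically an array with one or two entries out of 10,000 candidates.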

That gets us from 120 KB down to a handful of changed cells, but we can compress further. If you just concatenate the changed cells and run LZ4 over them, each cell’s bytes are interleaved (flags, then colors, then character, then flags again for the next cell), and LZ4 finds few repeated sequences. So we rearrange the data before compressing it: instead of sending each cell’s 12 bytes together, we group all the first bytes of every changed cell, then all the second bytes, and so on (a columnar transpose, the same idea as Parquet). This puts bytes of the same kind, which are often identical, next to each other: all the background reds together, all the foreground greens together, all the style flags together.

Then we compress with LZ4, which excels at repeated byte sequences. An 80-column row of terminal output is 960 bytes uncompressed. Without striping, LZ4 compresses it to around 275 bytes. With striping, around 115 bytes.
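Here is a minimal sketch of the striping step and its inverse. It does not run LZ4; instead it counts runs of equal bytes as a rough, illustrative proxy for how much repetition a compressor has to work with. The function names are hypothetical.

```typescript
const CELL_BYTES = 12;

// Interleaved (cell-major) -> striped (byte-position-major).
function stripeCells(cells: Uint8Array): Uint8Array {
  if (cells.length % CELL_BYTES !== 0) throw new Error("input must be whole cells");
  const count = cells.length / CELL_BYTES;
  const out = new Uint8Array(cells.length);
  for (let c = 0; c < count; c++) {
    for (let b = 0; b < CELL_BYTES; b++) {
      out[b * count + c] = cells[c * CELL_BYTES + b];
    }
  }
  return out;
}

// Striped -> interleaved, used by the receiver after decompression.
function unstripeCells(striped: Uint8Array): Uint8Array {
  if (striped.length % CELL_BYTES !== 0) throw new Error("input must be whole cells");
  const count = striped.length / CELL_BYTES;
  const out = new Uint8Array(striped.length);
  for (let c = 0; c < count; c++) {
    for (let b = 0; b < CELL_BYTES; b++) {
      out[c * CELL_BYTES + b] = striped[b * count + c];
    }
  }
  return out;
}

// Number of maximal runs of equal bytes; fewer runs means more repetition.
function countRuns(bytes: Uint8Array): number {
  let runs = bytes.length > 0 ? 1 : 0;
  for (let i = 1; i < bytes.length; i++) {
    if (bytes[i] !== bytes[i - 1]) runs++;
  }
  return runs;
}
```

For a row where every cell shares a style and only the character varies, the striped form collapses the flag and color bytes into a few long runs, while the interleaved form restarts the pattern every 12 bytes.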

The combination of diffing and striped compression works well for normal typing, but scrolling exposes a tradeoff of the server-side approach. A browser-side emulator like xterm.js sees the scroll escape sequence directly and just shifts rows. Blit works from the grid, so it has to rediscover that a scroll happened: to a plain cell-by-cell diff, every row shifting up by one looks like the entire screen changed. To avoid that, the server compares rows between frames to detect when content has shifted by a constant offset, sends a tiny instruction that says “shift all rows up by one,” and then patches only the new content at the bottom. One small instruction replaces a full-screen retransmit.
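One way to rediscover a scroll from two grids is to test small candidate offsets and check whether every surviving row matches. The sketch below is an illustration of that idea, not Blit's actual detector; the names and the maxShift limit are assumptions.

```typescript
function rowsMatch(a: Uint8Array, aRow: number, b: Uint8Array, bRow: number, rowBytes: number): boolean {
  const aOff = aRow * rowBytes;
  const bOff = bRow * rowBytes;
  for (let i = 0; i < rowBytes; i++) {
    if (a[aOff + i] !== b[bOff + i]) return false;
  }
  return true;
}

// Returns k > 0 if row r of `next` equals row r + k of `prev` for every row that
// survives the shift, meaning content scrolled up by k rows. Returns 0 otherwise.
function detectScrollUp(
  prev: Uint8Array,
  next: Uint8Array,
  rows: number,
  rowBytes: number,
  maxShift: number = 8,
): number {
  // Identical frames are not a scroll, even if a shift would also match.
  let identical = true;
  for (let r = 0; r < rows && identical; r++) {
    identical = rowsMatch(prev, r, next, r, rowBytes);
  }
  if (identical) return 0;

  for (let shift = 1; shift <= maxShift && shift < rows; shift++) {
    let ok = true;
    for (let r = 0; r + shift < rows; r++) {
      if (!rowsMatch(next, r, prev, r + shift, rowBytes)) {
        ok = false;
        break;
      }
    }
    if (ok) return shift;
  }
  return 0;
}
```

When a shift of k is found, the sender only needs to transmit the shift instruction plus the last k rows, since the client can move everything else locally.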

Rendering in the browser

Once a compressed diff arrives in the browser, it needs to become pixels as fast as possible.

The most common approach is to use the DOM, which is what xterm.js does: create a <span> for each styled run of text and let the browser’s layout engine place them on screen. This works, but the browser ends up doing a lot of unnecessary work because it runs layout, reflow, and style recalculation on every update, even though a terminal grid has completely fixed geometry: every cell is the same size, at a known position, and never needs to reflow.

Blit uses WebGL instead, which lets us skip the DOM entirely and talk to the GPU directly, since a terminal is really just colored rectangles and pre-rendered characters at fixed positions. The two big levers are batching GPU work and avoiding data copies.

Batching draw calls. If you drew a terminal cell by cell (paint a rectangle, draw a letter, move to the next cell, repeat), you would make thousands of tiny GPU calls per frame. GPUs are fast at drawing lots of the same thing in one go, but slow when you ask them to constantly switch between rectangles and text. Blit avoids this by splitting the terminal into three layers and drawing each in a single batch:

Backgrounds. Walk each row left to right, merging adjacent cells that share the same color into one rectangle. A 120-character prompt with a colored background becomes one rectangle instead of 120, because a new rectangle only starts when the color changes or the row ends.
Glyphs. The first time a character appears, it gets rasterized into a texture atlas (a sprite sheet of every character the terminal has ever displayed), so drawing a letter after that is just “copy this rectangle from the sprite sheet to this position.” By the end of the first screenful, almost every character is already cached.
Cursor. Drawn last, on top of everything else.

Each batch does only one kind of work, which lets the GPU process each layer efficiently.
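The background pass can be sketched as a run-merging loop. This example assumes one packed 0xRRGGBB color per cell in row-major order, which is an illustrative simplification; the names are hypothetical.

```typescript
interface BgRect {
  row: number;
  col: number;    // first column of the run
  width: number;  // number of cells covered
  color: number;  // packed 0xRRGGBB
}

// Walk each row left to right and emit one rectangle per run of equal colors.
function mergeBackgroundRuns(colors: Uint32Array, rows: number, cols: number): BgRect[] {
  const rects: BgRect[] = [];
  for (let row = 0; row < rows; row++) {
    const base = row * cols;
    let start = 0;
    for (let col = 1; col <= cols; col++) {
      // Close the current run when the row ends or the color changes.
      if (col === cols || colors[base + col] !== colors[base + start]) {
        rects.push({ row, col: start, width: col - start, color: colors[base + start] });
        start = col;
      }
    }
  }
  return rects;
}
```

Each BgRect would then become one quad in the background batch, so a uniformly colored 120-cell row contributes a single rectangle.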

Zero-copy between WASM and WebGL. The straightforward approach would be: decode the diff in WASM, serialize the result to JavaScript, then upload to the GPU. Each handoff copies all the data, so Blit skips the middle step entirely. The WASM module and the WebGL renderer share the same block of memory, so when a diff arrives, WASM updates the grid in place and produces vertex buffers directly in shared memory. The WebGL code reads those buffers with no intermediate copy or serialization. The only remaining copies are the ones every WebGL application makes: uploading vertex data and textures to the GPU.
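The aliasing that makes this work is plain typed-array behavior: a view over a buffer sees writes made through any other view of the same buffer. In the sketch below a regular ArrayBuffer stands in for WASM linear memory (in a browser it would be the buffer of a WebAssembly.Memory), and the offset and sizes are made-up values.

```typescript
// Stand-in for the WASM module's linear memory.
const linearMemory = new ArrayBuffer(64 * 1024);

// Hypothetical location of the vertex buffer inside that memory.
const VERTEX_PTR = 1024;   // byte offset, must be a multiple of 4 for Float32Array
const VERTEX_FLOATS = 8;

// "WASM side": writes vertex data in place.
function produceVertices(mem: ArrayBuffer): void {
  const out = new Float32Array(mem, VERTEX_PTR, VERTEX_FLOATS);
  for (let i = 0; i < VERTEX_FLOATS; i++) {
    out[i] = i * 0.5;
  }
}

// "Renderer side": a view over the same bytes, not a copy. A view like this
// can be handed to a WebGL upload call such as gl.bufferData.
function vertexView(mem: ArrayBuffer): Float32Array {
  return new Float32Array(mem, VERTEX_PTR, VERTEX_FLOATS);
}
```

One caveat with real WASM memory: if the memory grows, the old buffer is detached and existing views become unusable, so views have to be recreated after a growth.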

Skipping idle frames. When no data has arrived from the server, the renderer skips vertex generation entirely, although events like cursor blinks still trigger a repaint by reusing the vertex buffers from the last frame rather than recomputing them.
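A minimal way to express this is two flags: one for "the grid changed, vertices are stale" and one for "something needs repainting." The class below is an illustrative sketch with hypothetical names, using counters in place of real vertex generation and draw calls.

```typescript
class RenderLoop {
  private gridDirty = false;   // vertex buffers no longer match the grid
  private paintNeeded = false; // something on screen must be redrawn
  vertexRebuilds = 0;
  paints = 0;

  // Called after a diff from the server has been applied to the grid.
  onDiffApplied(): void {
    this.gridDirty = true;
    this.paintNeeded = true;
  }

  // Cursor blinks change pixels but not the grid contents.
  onCursorBlink(): void {
    this.paintNeeded = true;
  }

  // Called once per display refresh.
  onAnimationFrame(): void {
    if (!this.paintNeeded) return;      // idle: skip everything
    if (this.gridDirty) {
      this.vertexRebuilds++;            // regenerate vertex buffers
      this.gridDirty = false;
    }
    this.paints++;                      // draw, reusing buffers when still valid
    this.paintNeeded = false;
  }
}
```

With this split, a blink-only frame costs a draw call but no vertex generation, and a fully idle frame costs nothing.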

Congestion control

The naive approach is to send frames as fast as the server can produce them. On a fast network this works fine, but on a slow one frames pile up in buffers faster than the client can drain them and latency spikes. Blit runs a per-client congestion controller inspired by Google’s BBR.

The client measures its display refresh rate and reports it to the server, so the server never pushes frames faster than the screen can paint them; within that ceiling, it paces delivery using round-trip time estimates. Each frame gets an ACK when the browser finishes decompressing and applying the diff; the elapsed time gives a round-trip sample that includes network transit plus client-side decode, though not GPU paint. Paint timing is tracked separately: the browser periodically reports its backlog depth, apply latency, and how many frames sit buffered ahead of the last paint, and the server uses these signals alongside queue delay to decide when to back off.
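The ACK-based timing can be sketched as a small pacer. The refresh-rate ceiling and the ACK round-trip sample come from the description above; the cap on unacknowledged frames and the 1/8 smoothing factor are assumptions added to make the example complete, and all names are hypothetical.

```typescript
class FramePacer {
  private readonly sentAtMs = new Map<number, number>();
  private readonly minIntervalMs: number;
  private lastSendMs = -Infinity;
  private smoothedRttMs: number | null = null;

  // refreshHz is the display refresh rate reported by the client.
  constructor(refreshHz: number) {
    this.minIntervalMs = 1000 / refreshHz;
  }

  // Never send faster than the screen can paint, and (assumed policy)
  // never keep more than maxInFlight frames unacknowledged.
  canSend(nowMs: number, maxInFlight: number): boolean {
    const spaced = nowMs - this.lastSendMs >= this.minIntervalMs;
    return spaced && this.sentAtMs.size < maxInFlight;
  }

  onSend(frameId: number, nowMs: number): void {
    this.sentAtMs.set(frameId, nowMs);
    this.lastSendMs = nowMs;
  }

  // The ACK arrives after the client has decoded and applied the diff, so the
  // sample covers network transit plus client-side decode.
  onAck(frameId: number, nowMs: number): number | null {
    const sent = this.sentAtMs.get(frameId);
    if (sent === undefined) return null; // unknown or duplicate ACK
    this.sentAtMs.delete(frameId);
    const sample = nowMs - sent;
    this.smoothedRttMs =
      this.smoothedRttMs === null
        ? sample
        : this.smoothedRttMs + 0.125 * (sample - this.smoothedRttMs);
    return sample;
  }

  rttMs(): number | null {
    return this.smoothedRttMs;
  }
}
```

A fuller controller would also fold in the client's reported backlog depth and apply latency when deciding whether to back off; this sketch leaves those signals out.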

Throughput is measured over sliding time windows (at least 20ms each) rather than per-frame, because a single fast burst is not a reliable indicator of sustained capacity. The server smooths these samples with an exponential moving average and subtracts a jitter penalty based on recent sample-to-sample variation, so it budgets conservatively when the connection is unstable.
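The smoothing and jitter penalty might look something like the sketch below. The 20 ms minimum window comes from the text; the smoothing factor, the penalty weight, and the exact jitter formula are illustrative assumptions, and the class name is hypothetical.

```typescript
class ThroughputEstimator {
  private emaBps: number | null = null;       // smoothed throughput, bytes/s
  private jitterBps = 0;                      // smoothed sample-to-sample variation
  private lastSampleBps: number | null = null;
  private readonly alpha: number;
  private readonly jitterWeight: number;
  private readonly minWindowMs: number;

  constructor(alpha: number = 0.25, jitterWeight: number = 1.0, minWindowMs: number = 20) {
    this.alpha = alpha;
    this.jitterWeight = jitterWeight;
    this.minWindowMs = minWindowMs;
  }

  // Record bytes delivered over one measurement window. Windows shorter than
  // the minimum are rejected because a brief burst says little about capacity.
  addWindow(bytes: number, windowMs: number): boolean {
    if (windowMs < this.minWindowMs) return false;
    const sample = (bytes / windowMs) * 1000;
    if (this.emaBps === null) {
      this.emaBps = sample;
    } else {
      this.emaBps += this.alpha * (sample - this.emaBps);
    }
    if (this.lastSampleBps !== null) {
      const variation = Math.abs(sample - this.lastSampleBps);
      this.jitterBps += this.alpha * (variation - this.jitterBps);
    }
    this.lastSampleBps = sample;
    return true;
  }

  smoothedBps(): number {
    return this.emaBps ?? 0;
  }

  // Conservative send budget: smoothed throughput minus a jitter penalty.
  budgetBps(): number {
    return Math.max(0, this.smoothedBps() - this.jitterWeight * this.jitterBps);
  }
}
```

On a steady connection the jitter term stays near zero and the budget tracks the average; when samples swing widely, the budget drops well below the average.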

The active terminal gets first claim on the send budget, while background terminals are guaranteed at least 25% of available bandwidth but run at a lower frame rate, so they never starve and never crowd out the terminal the user is typing in.
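One possible reading of this policy, sketched below: reserve up to 25% of the budget for background terminals if they have data waiting, let the active terminal take what it needs from the rest, and hand any leftover back to the background. This is an interpretation for illustration, not Blit's confirmed allocator, and the function name is hypothetical.

```typescript
interface BudgetSplit {
  active: number;
  background: number;
}

// All quantities are bytes available or wanted in the current send interval.
function splitBudget(
  totalBytes: number,
  activeDemand: number,
  backgroundDemand: number,
  backgroundShare: number = 0.25,
): BudgetSplit {
  // Background gets its guaranteed share only if it actually has data to send.
  const reserved = Math.min(backgroundDemand, totalBytes * backgroundShare);
  // The active terminal takes what it needs from everything that is not reserved.
  const active = Math.min(activeDemand, totalBytes - reserved);
  // Whatever the active terminal leaves unused goes to the background.
  const background = Math.min(backgroundDemand, totalBytes - active);
  return { active, background };
}
```

With a 1,000-byte budget and both sides saturated, the split is 750 to 250; if the active terminal only needs 100 bytes, the background can use the remaining 900.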

Try Blit

Share a terminal in one command:

curl -sf https://install.blit.sh | sh
blit share

blit share starts a session and prints a URL. Anyone with the link can watch your terminal in their browser, streamed over WebRTC.

Embed Blit in your product:

Blit ships frontend components for React and Solid, plus a server-side integration via multiple protocols. A minimal React setup:

import { BlitTerminal, BlitWorkspace, WebSocketTransport } from "@blit-sh/react";

// `passphrase` and `wasm` are not defined in this snippet; they are assumed to be provided by the surrounding application.
const transport = new WebSocketTransport("wss://your-server/blit", passphrase);
const workspace = new BlitWorkspace({ wasm, connections: [{ id: "default", transport }] });

See EMBEDDING.md for the full integration guide.