v0.2.0 · Apache 2.0

The open-source private-cloud AI coworker.

Same product shape as Claude Cowork, Gemini Enterprise Agent, and GPT‑6 Atlas. None of the data egress. Single‑tenant, deployed end-to-end into the customer's own Northflank project — control plane and H100 GPU under one roof. Every line auditable. Data locality is mechanically verifiable, not marketed.

Schedule a demo View on GitHub How it works

GPU per tenant: 1× H100; 80 GB · sm_90 · native FP8
License: Apache 2.0; OSI-approved · patent grant
Vendor egress: 0 bytes; provable with tcpdump

Why it exists

Same product shape. None of the data egress.

Between January and April 2026, every frontier lab converged on the same product: an agentic AI coworker with a task inbox, saved schedules, document memory, and direct access to local files and connected apps. Claude Cowork defined the category. Gemini Enterprise Agent is the identical-shaped response. GPT‑6 + Atlas is the unified version.

The problem

Every one of those products is structurally cloud-hosted and sends your data to the vendor's servers on every request. For firms whose data contractually or legally cannot leave their own infrastructure — legal, healthcare, accounting, finance, government, and everyone adjacent — that category is unreachable.

The answer

FlatClaw is the same product shape, built out of open-source components, running entirely inside infrastructure the operator controls. Apache 2.0. Pulled, audited, deployed. Every line is yours.

What's in the box

A complete coworker stack — not a framework.

Eight pre-integrated components. Pull, deploy, use. Each one is replaceable and auditable on its own.

FlatClaw Portal

Next.js 16 + React 19 product surface. Chat, agent fleet, approvals, cron, MCP services, workspace files, Memory, Admin (RBAC). SSE-streamed tool use.

OpenClaw runtime

Self-hosted agent loop. Sessions, multi-step planning, sandboxed tool execution, RBAC enforced at every tool call. Owns per-agent memory.

Inference

Patched SGLang + Gemma 4 31B Dense on a single NVIDIA H100 (80 GB, native FP8) on Northflank's managed GPU fleet, served at the model's native 256K context.

Per-agent memory

OpenClaw's built-in per-agent SQLite memory — keyword (BM25) search over each agent's MEMORY.md and memory/ files, seeded automatically for every agent. The agent maintains it across sessions. Semantic recall via bge-m3 lands in v0.3.

MCP services

First-party Model Context Protocol servers: Google (Gmail/Calendar/Drive/Docs/Sheets), CalDAV/IMAP, and Jira. Per-user credentials scoped (tenant, user, service), never tenant-wide.

RBAC + per-user creds

Multiple users per tenant. Per-user Tool Access (allow/deny over built-in + MCP tools) on OpenClaw's native tools.deny, plus always-on cross-user isolation. Per-user credentials scoped (tenant, user, service).

Single-tenant by design

Each customer gets their own Northflank project. Strict isolation, dedicated H100, no shared state across tenants.

One image, every tenant

ghcr.io/skytruax/flatclaw-inference:latest — public on GHCR, ~18 GB, no baked weights. Every deployment pulls the same image.

Architecture

Single-tenant. Customer-owned. End-to-end.

Everything — Portal, OpenClaw Gateway, Inference (H100), and the weights-server — lives in one Northflank project. Customer holds the Northflank account directly. Nothing leaves their tenancy.

Customer's Northflank project · one per tenant

Browser

The user — an admin or end-user inside the customer's org

↓ HTTPS · cookie-auth

FlatClaw Portal

Next.js 16 + React 19 + SQLite

ChatAgentsApprovalsCronMCP servicesMemoryAdmin

↓ server-owned WebSocket · ws://:18789

OpenClaw Gateway

Agent runtime · sessions · tool dispatch · RBAC enforced at every tool call

↓ MCP (per-agent, deny-glob scoped) · per-user Tool Access (native tools.deny)

Google

Gmail · Calendar · Drive · Docs · Sheets · Contacts

CalDAV / IMAP

calendar · contacts · mail

Jira

Atlassian Cloud

Sandbox

per-tool exec · per-user scoped credentials

MCP services are first-party servers the agent calls over Model Context Protocol; per-user credentials scoped per (tenant, user, service). RBAC is OpenClaw's native per-agent tools.deny — always-on cross-user roster isolation (each agent sees only its own servers' tools) plus a per-user Tool Access panel that toggles built-in and MCP tools off; denied tools are filtered from the roster before the model sees them.

Per-agent memory uses OpenClaw's built-in per-agent SQLite engine — keyword (BM25) search over each agent's MEMORY.md + memory/ files. Semantic recall via bge-m3 (on its own GPU card) and RAGFlow cited-document retrieval land in v0.3.

↓ internal Northflank network · TLS · bearer-authenticated

Inference (GPU) · same Northflank project

Inference service

Patched SGLang · Gemma 4 31B-IT (FP8) · 256K context · NVIDIA H100 (80 GB · sm_90 · native FP8)

Weights served by the in-project weights-server pod over a Northflank-managed volume — staged once via Kaggle, never moved at boot.

No vendor egress

Zero packets to Anthropic, OpenAI, Google AI, Hugging Face, ElevenLabs, or any third-party inference endpoint. Verifiable with tcpdump.

Customer holds the account

Customer's own Northflank account, billed directly to them. We never touch the bill or the data. No second cloud, no BYOC plumbing.

One image, every tenant

ghcr.io/skytruax/flatclaw-inference:latest. SGLang base + entrypoint, no baked weights. Public, auditable, reproducible.

Token Economics

≈ $2,000 / month per tenant. Flat rate, not per token.

Indicative monthly cost at Northflank's published list pricing, single tenant, prod held warm 24/7. The H100 dominates; everything else combined is under $200. The rate is per tenant and scales with the tenant — not metered per token or per seat.

Monthly cost breakdown

Inference (H100 80GB, held warm)	~$1,800
Portal — nf-compute-400	~$50
OpenClaw Gateway — nf-compute-400	~$50
RAGFlow + corpus volume	~$30
weights-server + 200 GB nvme	~$30
Egress · TLS · observability	included
Total per tenant, all-in	~$2,000 / mo

List prices, round numbers. Committed-use or annual deals on Northflank typically reduce the GPU line. No second cloud, no BYOC plumbing — one bill, one vendor relationship.

How one H100 carries a tenant — and how it scales

Concurrency, not headcount, sets the load.

What the GPU serves is peak concurrent active sessions, not the tenant's total user count — people skim a result, edit a doc, take a call, ask a follow-up. The H100 is sized to that concurrent peak; the per-tenant rate doesn't move with seat count.

The 31B path handles 8–12 concurrent streams.

One H100 SGLang process at Gemma 4 31B FP8 sustains ~8–12 concurrent streaming chats with first-token latency in the 1–2 s range. SGLang's RadixAttention prefix cache earns most of that on conversational reuse.

Most user actions don't touch the LLM at all.

Memory recall, RAG retrieval, file reads, OAuth tool invocations — all gateway- or skill-side. The LLM is invoked for chat turns and tool-call planning. A typical coworker session is a handful of LLM calls, not hundreds.

Headroom for bursts, then a cascade.

Gemma 4 31B FP8 (~33 GB) + KV cache + bge-m3 fits in 80 GB with ~25 GB free. The v0.3 cascade lands a co-resident smaller Gemma in that headroom for fast-turn / planning traffic — same hardware, ~2× concurrent capacity.

Tenants scale the GPU plan, not the architecture.

When a tenant outgrows one card, the next step is a higher-tier Northflank GPU plan or a multi-GPU node — or a second inference service for triage. Same project, same architecture, same per-tenant model.

Private LLM

Mechanically provable, not marketed.

The privacy story is not a marketing claim. It is a test you can run yourself.

1Provision a tenant in your own Northflank project.
2Exercise the v0.1.0-shipped features end-to-end (chat, memory recall/write, scheduled-task fire, GPU cold-boot). As skills land in v0.2, each is added to this test loop.
3Run tcpdump on the tenant's Northflank project egress for the full session.
4Confirm zero packets to Anthropic, OpenAI, Google AI, Hugging Face, ElevenLabs, Chroma Cloud, or any third-party inference endpoint. Inference traffic stays inside the project — Portal → Gateway → H100 is all internal Northflank network. The only external egress: services the user explicitly connected via OAuth.

This check runs mechanically on every release. It is the promise the project exists to keep.

Technology

Best-in-class open-source, end to end.

Every dependency is MIT / Apache / BSD compatible. Nothing here is a vendor lock-in.

InferencePatched SGLang + Gemma 4 31B Dense

SiliconNVIDIA H100 · 80 GB · sm_90 · native FP8

SubstrateNorthflank's managed GPU fleet — one project per tenant

ContextTurboQuant turbo4 KV — 1M tokens on a single card (roadmap)

Agent runtimeOpenClaw — RBAC at every tool call · per-agent memory built in

FrontendNext.js 16 + React 19 + TypeScript + SQLite

Authbetter-auth (v1) · WorkOS SSO (v2)

MemoryOpenClaw built-in per-agent SQLite — keyword search, seeded per agent

RetrievalRAGFlow — cited document answers (v0.3)

Embeddings (v0.3)bge-m3 — semantic memory + RAG, on its own GPU card

Voice (v0.3)VoxCPM2 — open-weight cloning + TTS

Image (v0.3)ComfyUI + SDXL

Roadmap

Shipping in the open.

What's working today vs. what's coming next is honest, enumerated, and verifiable.

0.2.0

v0.2.0

This release

▸Live inference — Gemma 4 31B-IT FP8 on a dedicated H100 at the native 256K context
▸MCP service integrations — Google, CalDAV/IMAP, Jira, per-user credentials
▸Matured Portal — streamed chat with token + compaction meter, sessions, workspace files, MCP services, Admin
▸Per-agent memory — built-in SQLite keyword search, seeded for every agent
▸Per-user Tool Access — admin allow/deny over built-in + connected-MCP tools via OpenClaw's native tools.deny, on top of always-on cross-user isolation

0.3

v0.3

▸One-command tenant provisioning — provision-tenant.sh / destroy-tenant.sh (full Northflank tenant lifecycle)
▸RAGFlow — cited document retrieval behind a stable interface
▸Semantic memory + embeddings via bge-m3 — on its own GPU card
▸Scrapling web fetch + a first CRM connector, as MCP services
▸Voice — VoxCPM2 open-weight cloning + TTS
▸Image — ComfyUI + SDXL
▸Cascade routing + TurboQuant turbo4 — 1M-token context on a single card

0.4+

v0.4+

Future

▸WorkOS SSO for enterprise tenants (Okta / Azure AD / Google Workspace)
▸Optional shared-GPU multi-tenancy for an entry tier below the dedicated-GPU threshold
▸Audio/video transcription ingest in RAGFlow
▸A "studio" for users to author their own skills

Get started

Pull it. Audit it. Run it.

Apache 2.0 — explicit patent grant. OSI-approved. Bring your own infra.

github.com/skytruax/FlatClaw Inspect the inference image