
Deploy a GenAI web app

A GenAI web app — a customer support assistant, an internal copilot, an “AI feature” inside a SaaS — is a different shape from a local coding agent. It serves many users at once, each session has different scope, and the security boundary needs to enforce per-tenant rules without becoming a bottleneck.

This guide shows how to put such an app behind OpenFirma. It builds on every other guide; cross-links are inline. The example app: a multi-tenant assistant where each user has their own session and the app calls OpenAI and a vendor SaaS on their behalf.

┌──────────────────┐                ┌─────────────────────────┐
│ HTTPS clients    │──── HTTPS ────►│ Web app (Node/Py/Go)    │
└──────────────────┘                │ issues per-session      │
                                    │ capabilities, makes     │
                                    │ outbound LLM calls      │
                                    └────────┬────────────────┘
                                             │ HTTP_PROXY=...
                                    ┌─────────────────────────┐
                                    │ Sidecar                 │ ◄── per-pod or per-host
                                    │ (enforcement + inject)  │
                                    └────────┬────────────────┘
                                             ├──► api.openai.com       (allowed)
                                             ├──► api.acme-vendor.com  (allowed)
                                             └──X paste.rs             (denied)
                                    ┌─────────────────────────┐
                                    │ Authority               │ ◄── shared by all Sidecars
                                    │ (issuance + bundles)    │
                                    └─────────────────────────┘

Three deployment shapes for the Sidecar:

  • Per-pod sidecar (Kubernetes / containers). Standard sidecar pattern: each app pod has a Sidecar container, the app talks to it over loopback. Right for production — strong tenant isolation, easy to scale horizontally.
  • Per-host daemon (VM / bare metal). One Sidecar per host, all app processes on the host route through it. Right for simpler topologies.
  • Embedded (gRPC interceptor mode). Sidecar in-process with the app via the gRPC interceptor. Right when you control the app’s HTTP client and want zero proxy footprint.

The rest of this guide uses per-pod sidecar. The patterns translate to the other shapes with minor config changes.

In a multi-tenant web app, “the agent” is not the app. The agent is the session: what the app is doing on behalf of one user. The choice that matters most is what agent_id represents:

| Choice | Meaning | Use when |
|---|---|---|
| agent_id = <tenant_id> | One agent identity per tenant; sessions are sub-units. | Per-tenant policies, shared models. |
| agent_id = <user_id> | One agent identity per end user. | Strict per-user audit isolation. |
| agent_id = <app_name> | A single agent identity for the whole app. | Single-tenant; minimal isolation. |

This guide uses agent_id = <tenant_id> and session_id = <user-session-uuid>. That gives you per-tenant policies plus per-session isolation in the audit log.
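
A minimal sketch of that mapping, assuming the app derives both identifiers itself. The names `SessionIdentity` and `session_identity` are hypothetical and not part of OpenFirma; the `tenant-` prefix follows the policy examples in this guide.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionIdentity:
    agent_id: str    # one identity per tenant, e.g. "tenant-acme"
    session_id: str  # one UUID per user session


def session_identity(tenant_id: str) -> SessionIdentity:
    """Build the (agent_id, session_id) pair for a new user session."""
    # Reject ids that would not form a clean "tenant-<id>" principal.
    if not tenant_id or not tenant_id.replace("-", "").isalnum():
        raise ValueError(f"unexpected tenant id: {tenant_id!r}")
    return SessionIdentity(
        agent_id=f"tenant-{tenant_id}",
        session_id=str(uuid.uuid4()),
    )
```

Two sessions for the same tenant share an agent_id but get distinct session_id values, which is what separates them in the audit log.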

One Authority serves the whole deployment. It signs every capability your app mints, streams policy bundles, and broadcasts revocations.

# /etc/firma/firma.toml — the [authority] section
[authority]
listen_addr = "0.0.0.0:50051" # or behind an internal load balancer
policy_dir = "/etc/firma/policies"
issuance_policy_dir = "/etc/firma/issuance"
revocation_file = "/var/lib/firma/revocations.txt"
key_file = "/etc/firma/firma-authority.key"
max_ttl_seconds = 3600 # capabilities live at most 1h
bundle_ttl_seconds = 30 # push bundle updates every 30s
log_level = "info"

In production, run the Authority on a hardened host with limited access. Treat its signing key with the same care as a CA key.

The CA for HTTPS MITM (if you’re using it) lives separately on each Sidecar host — see Enable HTTPS MITM. Each host has its own CA; you don’t share one across hosts.

The issuance policy decides whether the app can mint a capability at all. It runs once per session rather than once per request, so it can afford richer checks.

/etc/firma/issuance/issuance.cedar:

// Tenant-scoped agents may request these classes.
permit (
    principal,
    action in [
        Firma::Action::"model.inference.chat",
        Firma::Action::"communication.external.send"
    ],
    resource
) when {
    // tenant ids we recognize
    principal == Firma::Agent::"tenant-acme" ||
    principal == Firma::Agent::"tenant-globex" ||
    principal == Firma::Agent::"tenant-soylent"
};

// No tenant gets payment classes from this app.
forbid (
    principal,
    action == Firma::Action::"payment.transfer",
    resource
);

The tenant list itself is declarative. When you onboard a new tenant, you add a line and push the bundle — no code change. When you offboard one, you remove the line and revoke their active capabilities.
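
If the tenant list lives in a database or a config file, the `when` body can be generated instead of hand-edited. The sketch below is one way to do that; `render_tenant_condition` is a hypothetical helper, and it assumes tenant ids map to principals with the `tenant-` prefix used above.

```python
def render_tenant_condition(tenant_ids: list[str]) -> str:
    """Render the `when` body that lists the recognized tenants."""
    if not tenant_ids:
        raise ValueError("at least one tenant is required")
    # Sort and de-duplicate so the generated policy is stable across runs.
    clauses = [
        f'principal == Firma::Agent::"tenant-{tid}"'
        for tid in sorted(set(tenant_ids))
    ]
    return " ||\n".join(clauses)


print(render_tenant_condition(["globex", "acme", "soylent"]))
```

Onboarding then means adding an id to the list and regenerating the file before pushing the bundle.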

/etc/firma/policies/genai-app.cedar:

// LLM calls: permitted to OpenAI only. The principal is unconstrained
// here; tenants are vetted by the issuance policy.
permit (
    principal,
    action == Firma::Action::"model.inference.chat",
    resource
) when {
    resource has "host" &&
    resource.host == "api.openai.com"
};

// Vendor SaaS: permitted to one specific endpoint, with a per-session
// call cap enforced via context.action_count.
permit (
    principal,
    action == Firma::Action::"communication.external.send",
    resource
) when {
    resource has "host" &&
    resource.host == "api.acme-vendor.com" &&
    context.action_count <= 100
};

// Hard floor: no exfiltration destinations, ever.
forbid (
    principal,
    action == Firma::Action::"communication.external.send",
    resource
) when {
    resource has "host" &&
    (resource.host == "paste.rs" ||
     resource.host == "transfer.sh" ||
     resource.host == "0x0.st")
};

context.action_count is the per-session call counter. The rule “max 100 vendor calls per session” caps a runaway loop.
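
To make the cap concrete, here is a toy model of a per-session counter. It is not the Sidecar's implementation. It assumes the counter includes the call being evaluated, which this guide does not specify, and `SessionCounters` is a hypothetical name.

```python
from collections import defaultdict

VENDOR_CALL_CAP = 100  # mirrors `context.action_count <= 100` in the policy


class SessionCounters:
    """Toy model of a per-session call counter."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = defaultdict(int)

    def allow_vendor_call(self, session_id: str) -> bool:
        # Assumption: action_count counts the call under evaluation.
        count = self._counts[session_id] + 1
        if count > VENDOR_CALL_CAP:
            return False  # denied calls do not advance the counter
        self._counts[session_id] = count
        return True
```

Under that assumption a runaway loop in one session is cut off after 100 vendor calls, while other sessions keep their own budgets.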

Each app pod runs a Sidecar with this config:

# /etc/firma/firma.toml — the [sidecar.*] sections
[sidecar.interceptor]
mode = "http_proxy"
listen_addr = "127.0.0.1:8080"
drain_timeout_secs = 30
[sidecar.interceptor.https_mitm]
enabled = true
intercept_hosts = ["api.openai.com", "api.acme-vendor.com"]
strict_hosts = ["api.acme-vendor.com"] # never fall back to CONNECT here
[sidecar.ca]
dir = "/etc/firma/firma-ca"
[sidecar.mapping]
rules_path = "/etc/firma/mapping-rules.toml"
rules_paths = []
default_protected = true # production!
[sidecar.policy]
dir = "/etc/firma/cache/policies" # populated by Authority stream
authority_url = "https://firma-authority.internal:50051"
[sidecar.constraint_enforcement]
bundle_ttl_seconds = 90
enforcement_timeout_ms = 50
[sidecar.capability_seed]
paths = [] # capabilities arrive via gRPC, not seed files
[sidecar.authority]
public_key_path = "/etc/firma/firma-authority.pub"
ca_cert_path = "/etc/firma/authority-ca.crt"
[sidecar.connector]
default_timeout_ms = 30000
[[sidecar.connector.hosts]]
host = "api.openai.com"
rps = 100
burst = 20
timeout_ms = 30000
[[sidecar.connector.hosts]]
host = "api.acme-vendor.com"
rps = 50
burst = 10
timeout_ms = 15000
[[sidecar.credentials]]
host = "api.openai.com"
mode = "vault"
header = "Authorization"
prefix = "Bearer "
secret_path = "secret/data/openai/api-key"
secret_key = "value"
[[sidecar.credentials]]
host = "api.acme-vendor.com"
mode = "vault"
header = "x-api-key"
secret_path = "secret/data/acme-vendor/api-key"
secret_key = "value"
[sidecar.credentials.vault]
addr = "https://vault.internal:8200"
# token via AppRole, configured via env
[sidecar.audit]
sink = "grpc"
grpc_url = "https://audit-collector.internal:9090"
signing_key_path = "/etc/firma/audit.key"
[sidecar.log]
level = "info"

A few things worth highlighting:

  • default_protected = true: anything not in the mapping rules is denied. Production posture.
  • authority_url uses https:// and [sidecar.authority] sets ca_cert_path: the Sidecar verifies the Authority’s identity before trusting streamed bundles and revocations.
  • grpc audit sink: events go to a centralized collector, not to a local file. Multiple Sidecars feed one collector.
  • Vault for credentials: no API keys on disk. The Sidecar pulls them on first use and caches them in memory.
  • strict_hosts on the vendor: if MITM fails (e.g. cert mismatch), the call is denied rather than falling back to the weaker CONNECT-only policy.

The app’s request handler issues a fresh capability for each user session. Pseudocode (Python):

import grpc
from firma_proto import authority_pb2, authority_pb2_grpc


def issue_capability_for_session(tenant_id: str, user_session_id: str) -> str:
    """Mint a capability for one user session and return its raw token."""
    # Opening a channel per call keeps the example short; in a real app,
    # create the channel once and reuse it.
    channel = grpc.secure_channel(
        "firma-authority.internal:50051",
        grpc.ssl_channel_credentials(...),  # supply your TLS material here
    )
    stub = authority_pb2_grpc.AuthorityStub(channel)
    req = authority_pb2.IssuanceRequest(
        agent_id=f"tenant-{tenant_id}",
        session_id=user_session_id,
        requested_actions=[
            "model.inference.chat",
            "communication.external.send",
        ],
        resource_scope="*",
        requested_ttl_seconds=900,  # 15 minutes
    )
    resp = stub.IssueCapability(req)
    # The response carries either `denied` (with a reason) or `allowed`.
    if resp.HasField("denied"):
        raise PermissionError(resp.denied.reason)
    return resp.allowed.raw_token

When a user starts a session, the app calls issue_capability_for_session(...) and hands the resulting raw token to the Sidecar. The channel is your design decision: a header on the proxied request, or a side channel. From then on, the Sidecar validates every call from that session against that capability.
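
As an illustration of the header option, the sketch below attaches the token to a request routed through the loopback Sidecar, using only the standard library. The header name `X-Firma-Capability` is a placeholder invented for this example, not a documented OpenFirma header; use whatever your Sidecar is configured to read.

```python
import urllib.request

SIDECAR_PROXY = "http://127.0.0.1:8080"   # listen_addr from the Sidecar config
CAPABILITY_HEADER = "X-Firma-Capability"  # hypothetical header name


def build_proxied_request(url: str, body: bytes, raw_token: str) -> urllib.request.Request:
    """Build a POST that carries the session's capability token in a header."""
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    req.add_header(CAPABILITY_HEADER, raw_token)
    return req


def build_opener() -> urllib.request.OpenerDirector:
    """Opener that routes HTTP and HTTPS traffic through the loopback Sidecar."""
    proxy = urllib.request.ProxyHandler({"http": SIDECAR_PROXY, "https": SIDECAR_PROXY})
    return urllib.request.build_opener(proxy)
```

A handler would call `build_opener().open(build_proxied_request(...))`. Note that the request carries no Authorization header, because the Sidecar injects vendor credentials.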

For a 15-minute TTL with 1000 active sessions, the Authority issues 1000 capabilities every 15 minutes. The Sidecar holds them in its CapabilityMap. The hot path is unchanged.
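
The arithmetic behind that load estimate, as a small helper. It assumes every active session renews exactly once per TTL and that renewals are spread evenly over time; `issuance_rate_per_second` is a name chosen for this sketch.

```python
def issuance_rate_per_second(active_sessions: int, ttl_seconds: int) -> float:
    """Average issuances per second if each session renews once per TTL."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return active_sessions / ttl_seconds


rate = issuance_rate_per_second(active_sessions=1000, ttl_seconds=900)
print(f"{rate:.2f} issuances/s")  # 1.11 issuances/s
```

Halving the TTL doubles the issuance rate, so shorter-lived capabilities trade Authority load for a smaller revocation window.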

Set the app’s HTTP client to use the loopback Sidecar:

Python (httpx / requests):

HTTP_PROXY=http://127.0.0.1:8080 \
HTTPS_PROXY=http://127.0.0.1:8080 \
SSL_CERT_FILE=/etc/firma/firma-ca/firma-ca.crt \
REQUESTS_CA_BUNDLE=/etc/firma/firma-ca/firma-ca.crt \
python -m gunicorn app:app

Node (using an HTTP client or agent that honors HTTPS_PROXY):

HTTPS_PROXY=http://127.0.0.1:8080 \
NODE_EXTRA_CA_CERTS=/etc/firma/firma-ca/firma-ca.crt \
node app.js

The app does not read OPENAI_API_KEY or vendor secrets. They live in Vault; the Sidecar pulls them and injects them, so the app makes its calls without auth headers.

Every request the app proxies produces an audit event tagged with agent_id = tenant-<id>, session_id = <user-session>. Ship those events to your collector keyed on agent_id and you have per-tenant audit by construction — no app-side instrumentation needed.

For per-user accounting on top of that, the session_id is the unit. If you record the mapping (session_id → user_id) somewhere, you can join the audit stream against it.
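
A minimal sketch of that join, assuming audit events arrive as dictionaries with at least the agent_id and session_id fields. The `action` field and the `attach_user_ids` helper are illustrative, not the documented event schema.

```python
from typing import Iterable, Iterator


def attach_user_ids(
    events: Iterable[dict], session_to_user: dict[str, str]
) -> Iterator[dict]:
    """Yield copies of audit events with a user_id looked up by session_id."""
    for event in events:
        user_id = session_to_user.get(event["session_id"])  # None if unmapped
        yield {**event, "user_id": user_id}


events = [
    {"agent_id": "tenant-acme", "session_id": "s-1", "action": "model.inference.chat"},
    {"agent_id": "tenant-acme", "session_id": "s-2", "action": "communication.external.send"},
]
enriched = list(attach_user_ids(events, {"s-1": "user-42"}))
```

Events from unmapped sessions keep a user_id of None instead of being dropped, so gaps in the mapping stay visible.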

A few practices that come up only at production scale.

Authority HA. The Authority is a single point of contact for capability issuance. Run two of them behind a load balancer; both point at the same policy_dir and key file. The Sidecar’s gRPC stream is independent per-Sidecar, and Sidecars reconnect automatically.

Bundle propagation latency. In the worst case, a new policy version takes bundle_ttl_seconds to propagate (Sidecars pull, Authority pushes). Plan for this when rolling out tightening rules: deploy the stricter rule first, wait for propagation, and only then announce the change to tenants.

Revocation propagation. A revocation added with firma authority revocations add <token_id> propagates within a second on the gRPC stream. For “kill this tenant immediately”, revoke every active capability for that tenant.
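
A sketch of the "revoke everything for a tenant" loop. It assumes the app keeps its own record of the token ids it minted per tenant, which this guide does not provide, and that the `firma` CLI is on PATH. `revoke_tenant` is a hypothetical helper, and the `run` parameter exists so the loop can be tested without the CLI.

```python
import subprocess
from typing import Callable, Iterable


def revoke_tenant(
    token_ids: Iterable[str],
    run: Callable[[list[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> list[str]:
    """Revoke each token id via the CLI and return the ids that were revoked."""
    revoked = []
    for token_id in token_ids:
        # Same command as above, once per active capability.
        run(["firma", "authority", "revocations", "add", token_id])
        revoked.append(token_id)
    return revoked
```

With the default runner, `check=True` stops the loop at the first failing command, so a partial revocation surfaces as an exception and is not silently skipped.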

Capacity planning. Each Sidecar holds active capabilities + the policy bundle in memory. With 1000 active sessions and a 100 KB bundle, you’re well under 100 MB resident. The hot path stays bounded by the perf budgets (Stage 1 < 1ms, Stage 2 < 200µs) regardless of session count.

Failure modes. If the Authority is unreachable for longer than the Sidecar’s bundle_ttl_seconds (90 seconds in the config above), the Sidecar denies everything (PolicyBundleStale). This is the right behavior, because stale policy is not safe, but it makes the Authority a critical dependency for your app’s availability. Monitor it accordingly.

Putting it all together, the new-tenant workflow is:

  1. Add Firma::Agent::"tenant-newco" to the issuance policy.
  2. Add any tenant-specific runtime rules (a permit with principal == Firma::Agent::"tenant-newco", etc.).
  3. Push the policy bundle. Sidecars pick it up within bundle_ttl_seconds.
  4. Configure the app to use agent_id = "tenant-newco" for that tenant’s sessions.
  5. First session for the tenant: app calls IssueCapability, gets a token, app makes calls, Sidecar validates.

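Offboarding is the inverse: remove the tenant’s entries from the issuance and runtime policies and push the bundle. New capability requests for that tenant are then denied, and existing capabilities expire at the end of their TTL. Revoke them if you need an immediate cutoff.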

PolicyBundleStale denials in production. Your Sidecars lost contact with the Authority. Check the network path, the Authority’s health, and consider raising bundle_ttl_seconds slightly to give yourself headroom for transient blips.

CapabilityScopeMismatch for legitimate calls. The capability’s resource_scope doesn’t match the request. Adjust the scope at issuance time so that it covers the calls the session legitimately makes. Match the scope to the agent’s mission rather than reaching for a wildcard: '*' is a smell in production.

Audit volume. A busy app produces a lot of events. Plan for the storage and the cost of shipping them. grpc sink + a horizontally scaled collector is the right shape.

Vault token rotation. AppRole renewal needs to happen before the token expires; the Sidecar does not auto-renew. Use a sidecar-of-the-sidecar (e.g. Vault Agent) to keep credentials fresh.

Sidecar restart drops in-memory capabilities. When a pod restarts, the Sidecar comes back with no CapabilityMap entries until the app issues new capabilities for its sessions. The app should handle CapabilityNotFound by re-issuing the capability and retrying the call.
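
One way to structure that retry is sketched below. `CapabilityNotFound` here is a hypothetical app-side exception that your HTTP layer would raise when the Sidecar reports that error. How the error appears on the wire is not covered in this guide, and `call_with_reissue` is a name chosen for this sketch.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CapabilityNotFound(Exception):
    """Hypothetical app-side error for a Sidecar CapabilityNotFound denial."""


def call_with_reissue(
    make_call: Callable[[str], T],
    issue_token: Callable[[], str],
    token: str,
    max_reissues: int = 1,
) -> T:
    """Run make_call(token); on CapabilityNotFound, mint a new token and retry."""
    for attempt in range(max_reissues + 1):
        try:
            return make_call(token)
        except CapabilityNotFound:
            if attempt == max_reissues:
                raise  # give up and surface the denial to the caller
            token = issue_token()
    raise ValueError("max_reissues must be >= 0")
```

Bounding the number of re-issues keeps a persistent failure, such as a tenant removed from the issuance policy, from turning into a retry loop against the Authority.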