Offline eval + online monitoring, one scoring engine. Self-host the open-source platform (launching soon), or let TokenSurf Cloud run it for you.
    import tokensurf as ts

    @ts.track  # capture each run as a Trace, NOT in your data path
    def my_agent(question: str) -> str:
        ...
Decorate your agent, define scorers, run it in CI. Python-first — other SDKs on the roadmap.
    import tokensurf as ts
    from tokensurf.scorers import LLMJudge, ToolSequence, NoLoops

    @ts.track  # capture each run as a Trace, NOT in your data path
    def my_agent(question: str) -> str:
        return agent.run(question)

    # Define what "good" means for this agent
    suite = ts.Suite(
        dataset="cases.jsonl",
        scorers=[
            LLMJudge("helpfulness"),  # provider-agnostic judge (via litellm)
            ToolSequence(["search", "answer"]),
            NoLoops(max_repeats=2),
        ],
    )

    # In CI: fail the build when quality regresses
    report = suite.run(my_agent)
    report.assert_passing(threshold=0.9)
Agents are non-deterministic. A prompt tweak can silently break tool use, send the agent into loops, or produce confident wrong answers, and you find out from your users.
A model swap or prompt edit passes your unit tests but quietly degrades answer quality. Nothing errors — it just gets worse.
The agent calls the wrong tool, in the wrong order, or loops on a step. You can't catch that without grading the whole trajectory.
Once it ships you have no quality metric — only support tickets. You can't tell a good run from a bad one at a glance.
Capture every run, score it, get a verdict. Track → score → report.
Decorate your agent with @ts.track. Every call is recorded as a Trace — framework-agnostic, and not in your data path.
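As a rough illustration of what a tracing decorator of this kind does, here is a minimal sketch in plain Python. It is not TokenSurf's implementation: `track`, `Trace`, and the in-memory `TRACES` list are hypothetical stand-ins, and a real SDK would presumably export traces somewhere rather than keep them in a list. The point it shows is that the wrapper only observes the call; the agent's return value and any exception pass through unchanged.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Trace:
    """One recorded run: inputs, output or error, and wall-clock duration."""
    name: str
    args: tuple
    kwargs: dict = field(default_factory=dict)
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


TRACES = []  # hypothetical in-memory sink, for illustration only


def track(fn):
    """Record every call of `fn` as a Trace without altering its behavior."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace = Trace(name=fn.__name__, args=args, kwargs=kwargs)
        start = time.perf_counter()
        try:
            trace.output = fn(*args, **kwargs)
            return trace.output
        except Exception as exc:
            trace.error = repr(exc)
            raise  # the caller still sees the original exception
        finally:
            trace.duration_s = time.perf_counter() - start
            TRACES.append(trace)
    return wrapper


@track
def my_agent(question: str) -> str:
    return f"echo: {question}"
```

Calling `my_agent("hi")` returns the same string it would without the decorator, and one `Trace` is appended to `TRACES` as a side effect.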
Grade traces with LLM-judge, deterministic, reference-based, and agent-trajectory checks. You decide what "good" means.
A pass/fail report in CI blocks regressions before they ship — plus a live quality signal in production.
Framework-agnostic: works with LangChain, LlamaIndex, or plain Python.
One scoring engine, four ways to grade a run. Mix and match per suite.
Score accuracy, completeness, relevance, and helpfulness on a 1–10 scale (normalized to 0–1). Provider-agnostic via litellm — judge with any model.
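The page does not say which mapping takes a 1–10 judge score to the 0–1 range. One plausible choice is min-max scaling, sketched below; dividing by 10 would be another. The function name and its bounds parameters are hypothetical and are not part of the TokenSurf API.

```python
def normalize_judge_score(raw: float, lo: float = 1.0, hi: float = 10.0) -> float:
    """Map a judge score on [lo, hi] onto [0, 1] using min-max scaling (an assumed mapping)."""
    if not lo <= raw <= hi:
        raise ValueError(f"score {raw} is outside the {lo}-{hi} scale")
    return (raw - lo) / (hi - lo)


print(normalize_judge_score(1))    # 0.0
print(normalize_judge_score(10))   # 1.0
print(normalize_judge_score(5.5))  # 0.5
```

Under this mapping, a `threshold=0.9` corresponds to a raw judge score of about 9.1 out of 10.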
Fast, free, and repeatable — no model call. Assert the structural facts a correct run must satisfy.
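A deterministic check can be as small as a pure function over the agent's output. The sketch below assumes, purely for illustration, an agent that is supposed to return a JSON object with certain keys; `has_required_keys` is a hypothetical helper, not a built-in TokenSurf scorer.

```python
import json
from typing import Iterable


def has_required_keys(output: str, required: Iterable[str]) -> bool:
    """Pass only if `output` parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    return isinstance(parsed, dict) and all(key in parsed for key in required)


print(has_required_keys('{"answer": "42", "sources": []}', ["answer", "sources"]))  # True
print(has_required_keys("not json", ["answer"]))                                    # False
```

Because nothing here calls a model, the same input always produces the same verdict, which is what makes checks like this cheap enough to run on every commit.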
Embedding similarity against a known-good answer — catch drift when there's no single exact string to match.
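Reference-based scoring of this kind usually comes down to comparing two embedding vectors. The sketch below assumes cosine similarity and takes the vectors as given (how they are produced is out of scope here); the function names and the 0.8 threshold are hypothetical, not TokenSurf defaults.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_reference(candidate: Sequence[float],
                      reference: Sequence[float],
                      threshold: float = 0.8) -> bool:
    """Pass if the candidate embedding is close enough to the known-good one."""
    return cosine_similarity(candidate, reference) >= threshold
```

Two answers phrased differently but meaning the same thing should land near each other in embedding space, so they pass even though no exact string matches.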
Most tools score only the final answer. These grade the entire multi-step trajectory — the tool calls, their order, and how the agent recovers.
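To make trajectory grading concrete, here is a minimal sketch of two checks over the ordered list of tool names an agent called. The semantics are assumptions: "sequence" is read as "the expected tools appear in this order, with other calls allowed in between", and "loop" as "the same tool called more than `max_repeats` times in a row". The actual `ToolSequence` and `NoLoops` scorers may define these differently.

```python
from typing import List


def follows_sequence(calls: List[str], expected: List[str]) -> bool:
    """True if `expected` appears within `calls` in order (other calls may interleave)."""
    remaining = iter(calls)
    # `tool in remaining` advances the iterator, so each match must come after the last.
    return all(tool in remaining for tool in expected)


def has_no_loops(calls: List[str], max_repeats: int = 2) -> bool:
    """True if no tool is called more than `max_repeats` times consecutively."""
    run_length = 0
    previous = None
    for tool in calls:
        run_length = run_length + 1 if tool == previous else 1
        if run_length > max_repeats:
            return False
        previous = tool
    return True


print(follows_sequence(["search", "fetch", "answer"], ["search", "answer"]))  # True
print(follows_sequence(["answer", "search"], ["search", "answer"]))           # False
print(has_no_loops(["search", "search", "search", "answer"]))                 # False
```

Neither check looks at the final answer, which is why a run can return a plausible response and still fail here.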
The same scorers run before you ship and after. Production traces flow back into your eval set.
Catch regressions before they ship.
Know how your agent behaves with real users.
Not just a library — a self-hostable platform. The SDK runs evals; the server stores every run, shows trends over time, and is where you configure everything. Your data never leaves your infrastructure.
Run the server, the dashboard, and a Postgres database yourself. Your eval data and your agents' inputs and outputs stay inside your trust boundary — the answer to the data-residency objection.
docker compose up brings up the app and Postgres together as the easy path. Prefer your own setup? It's a standard service that runs against any Postgres you already have.
Every eval run is stored and charted over time, organized by project — one project per repo or app. See pass-rate trends, not just the last run.
Ping Slack, email, or a webhook when quality regresses — so a bad run finds you, not your users.
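A webhook alert of this sort amounts to a JSON POST sent when a metric drops below its threshold. The sketch below builds such a request with the standard library and leaves sending to the caller. The function name, the message wording, and the `{"text": ...}` payload shape (the form Slack incoming webhooks accept) are assumptions for illustration, not the platform's alert format.

```python
import json
import urllib.request
from typing import Optional


def build_regression_alert(webhook_url: str,
                           project: str,
                           pass_rate: float,
                           threshold: float) -> Optional[urllib.request.Request]:
    """Return a ready-to-send POST request if quality regressed, otherwise None."""
    if pass_rate >= threshold:
        return None  # nothing to report
    body = {
        "text": f"{project}: pass rate {pass_rate:.0%} is below the {threshold:.0%} threshold"
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_regression_alert("https://example.com/hook", "demo", 0.85, 0.9)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Separating "decide and build" from "send" keeps the regression logic testable without network access.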
Set thresholds like "fail the build if pass-rate < 90%." Wire eval results straight into CI.
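A threshold gate like the one quoted above reduces to a pass-rate calculation and a process exit code, since CI systems treat a non-zero exit as a failed build. This is a minimal sketch under that assumption; `pass_rate` and `ci_exit_code` are hypothetical helpers, and in the SDK the equivalent step is `report.assert_passing(threshold=0.9)`.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def ci_exit_code(results: List[bool], threshold: float = 0.9) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(results) >= threshold else 1


print(ci_exit_code([True] * 9 + [False]))      # 0: 90% meets a 0.9 threshold
print(ci_exit_code([True] * 8 + [False] * 2))  # 1: 80% is below it
```

In a CI script, the last line would typically be `sys.exit(ci_exit_code(results))` so the build fails when quality regresses.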
Configure scorers and your own judge keys once on your server; CI pulls the config centrally.
The server & dashboard are in active development — open source, launching soon. The Python SDK is the runner that feeds them.
Everything you need to test and watch agent quality — in CI and in production.
Free to self-host. Cloud when you want it.
The open-source platform is the complete product. TokenSurf Cloud is the same platform, hosted by us.
The whole platform is free to self-host. The cloud is the same platform, hosted by us — never a paywall around the essentials.
Free, Apache-2.0. The whole platform — no asterisks.
The same platform, hosted by us. No infra to run.
Self-host free, or join the cloud waitlist. No per-request fees, no lock-in.
Self-host the whole platform. Free forever, Apache-2.0.
Hosted eval + monitoring, managed for your team. Pricing at general availability.
Enterprise needs — SSO, on-prem support, custom scorers?
Talk to us
Self-host the open-source framework, or let TokenSurf Cloud run eval + monitoring for you.
    import tokensurf as ts

    @ts.track  # capture each run, not in your data path
    def my_agent(q): ...

    ts.Suite("cases.jsonl", scorers=[...]).run(my_agent)  # pass/fail in CI
AI agent evaluation, monitoring, and self-hosting — answered.
Decorate your agent with @ts.track to capture each run as a trace, define scorers (LLM-judge, deterministic, reference-based, and agent-trajectory checks), then run the suite in CI to get a pass/fail report on every change.
Runs are captured by the @ts.track decorator; it never sits between your app and the model, so it can't add latency or a point of failure to production.