Framework-agnostic · Not in your data path · Provider-agnostic judge

Quality testing & monitoring for AI agents

Offline eval + online monitoring, one scoring engine. Self-host the open-source platform (launching soon), or let TokenSurf Cloud run it for you.

Open source — launching soon
import tokensurf as ts

@ts.track          # capture each run as a Trace — NOT in your data path
def my_agent(question: str) -> str:
    ...

4 scorer families
Offline + Online: one scoring engine
Python-first: @track SDK
Open source: launching soon

Quickstart

Decorate your agent, define scorers, run it in CI. Python-first — other SDKs on the roadmap.

import tokensurf as ts
from tokensurf.scorers import LLMJudge, ToolSequence, NoLoops

@ts.track                       # capture each run as a Trace — NOT in your data path
def my_agent(question: str) -> str:
    return agent.run(question)  # `agent` is your existing agent object, defined elsewhere

# Define what "good" means for this agent
suite = ts.Suite(
    dataset="cases.jsonl",
    scorers=[
        LLMJudge("helpfulness"),          # provider-agnostic judge (via litellm)
        ToolSequence(["search", "answer"]),
        NoLoops(max_repeats=2),
    ],
)

# In CI: fail the build when quality regresses
report = suite.run(my_agent)
report.assert_passing(threshold=0.9)

Agents fail quietly

Agents are non-deterministic. A prompt tweak can silently break tool use, send the agent into a loop, or produce confidently wrong answers, and you find out from your users.

Silent regressions

A model swap or prompt edit passes your unit tests but quietly degrades answer quality. Nothing errors — it just gets worse.

Broken tool use

The agent calls the wrong tool, in the wrong order, or loops on a step. You can't catch that without grading the whole trajectory.

No signal in prod

Once it ships you have no quality metric — only support tickets. You can't tell a good run from a bad one at a glance.

How It Works

Capture every run, score it, get a verdict. Track → score → report.

@track Step 1

Capture each run

Decorate your agent with @ts.track. Every call is recorded as a Trace — framework-agnostic, and not in your data path.
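
As a rough illustration of what "not in your data path" can mean for a decorator, the sketch below records each call and hands the agent's result back unchanged. It is not TokenSurf's implementation: `track`, `TRACES`, and the trace fields are hypothetical names chosen for this example.

```python
import functools
import time

# Hypothetical in-memory sink. A real SDK would more likely hand traces
# to a background exporter instead of a module-level list.
TRACES = []

def track(fn):
    """Record input, output, error, and duration of each call, then pass the result through."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        trace = {"name": fn.__name__, "args": args, "kwargs": kwargs,
                 "output": None, "error": None}
        try:
            trace["output"] = fn(*args, **kwargs)
            return trace["output"]          # result is returned untouched
        except Exception as exc:
            trace["error"] = repr(exc)
            raise                           # the caller still sees the original exception
        finally:
            trace["duration_s"] = time.perf_counter() - start
            TRACES.append(trace)
    return wrapper

@track
def my_agent(question: str) -> str:
    # Stand-in for a real agent call.
    return f"echo: {question}"

answer = my_agent("What is 2 + 2?")
```

The wrapper only observes: the call to the model (here a stand-in) happens inside the decorated function exactly as before.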

🧪 Step 2

Define scorers

Grade traces with LLM-judge, deterministic, reference-based, and agent-trajectory checks. You decide what "good" means.

📋 Step 3

Get a pass/fail report

A pass/fail report in CI blocks regressions before they ship, and the same scores become a live quality signal in production (online monitoring is on the roadmap).

Framework-agnostic: works with LangChain, LlamaIndex, or plain Python.

The 4 scorer families

One scoring engine, four ways to grade a run. Mix and match per suite.

LLM-judge

Model-graded quality

Score accuracy, completeness, relevance, and helpfulness on a 1–10 scale (normalized to 0–1). Provider-agnostic via litellm — judge with any model.

accuracy · completeness · relevance · helpfulness
Deterministic

Exact, rule-based checks

Fast, free, and repeatable — no model call. Assert the structural facts a correct run must satisfy.

ExactMatch · Regex · JSONSchemaValid · ToolCalled · LatencyUnder
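
To make the idea of rule-based checks concrete, here is a minimal sketch of four of them as plain functions that return 1.0 or 0.0. The function names and signatures are assumptions for illustration only and do not reflect how the TokenSurf scorers of the same names are defined.

```python
import re

def exact_match(output: str, expected: str) -> float:
    """1.0 when the output equals the expected string, ignoring surrounding whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def regex_match(output: str, pattern: str) -> float:
    """1.0 when the pattern occurs anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0

def tool_called(tool_calls: list[str], tool: str) -> float:
    """1.0 when the named tool appears at least once in the run."""
    return 1.0 if tool in tool_calls else 0.0

def latency_under(duration_s: float, limit_s: float) -> float:
    """1.0 when the run finished strictly under the limit."""
    return 1.0 if duration_s < limit_s else 0.0

# A hypothetical recorded run to check against.
run = {"output": "Order #4521 ships Monday.", "tool_calls": ["lookup_order"], "duration_s": 1.2}

scores = {
    "regex": regex_match(run["output"], r"Order #\d+"),
    "tool": tool_called(run["tool_calls"], "lookup_order"),
    "latency": latency_under(run["duration_s"], 2.0),
}
```

None of these checks call a model, which is why they are cheap and repeatable.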
Reference-based

Compare to expected

Embedding similarity against a known-good answer — catch drift when there's no single exact string to match.

EmbeddingSimilarity
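
Embedding similarity is commonly computed as the cosine of the angle between two vectors. The sketch below assumes that metric and uses toy three-dimensional vectors in place of real embeddings. Whether TokenSurf's EmbeddingSimilarity uses cosine, and how it obtains embeddings, is not stated on this page.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_score(a: list[float], b: list[float]) -> float:
    """Clamp cosine similarity into the 0 to 1 range used for scores."""
    return max(0.0, min(1.0, cosine_similarity(a, b)))

# Toy vectors standing in for the embeddings of an answer and a known-good reference.
answer_vec = [0.9, 0.1, 0.0]
reference_vec = [1.0, 0.0, 0.0]
score = similarity_score(answer_vec, reference_vec)
```

A paraphrased but correct answer tends to land close to the reference and score near 1, while an unrelated answer scores near 0.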
Agent-trajectory · the differentiator

Grade the whole run

Most tools score only the final answer. These grade the entire multi-step trajectory — the tool calls, their order, and how the agent recovers.

ToolSequence · NoLoops · StepBudget · TaskCompletion · Recovery
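
A sketch of what trajectory checks might look like over a list of tool names, under two stated assumptions: "tool sequence" means the expected tools appear in order with other calls allowed in between, and "no loops" means no tool is called more than `max_repeats` times in a row. The real scorers may define these differently, and the function names here are illustrative.

```python
def tool_sequence(calls: list[str], expected: list[str]) -> float:
    """1.0 if `expected` is an in-order subsequence of `calls`."""
    remaining = iter(calls)
    # `tool in remaining` advances the iterator, so order is enforced.
    return 1.0 if all(tool in remaining for tool in expected) else 0.0

def no_loops(calls: list[str], max_repeats: int) -> float:
    """1.0 if no tool is called more than `max_repeats` times consecutively."""
    streak, previous = 0, None
    for call in calls:
        streak = streak + 1 if call == previous else 1
        if streak > max_repeats:
            return 0.0
        previous = call
    return 1.0

def step_budget(calls: list[str], max_steps: int) -> float:
    """1.0 if the run used at most `max_steps` tool calls."""
    return 1.0 if len(calls) <= max_steps else 0.0

good_run = ["search", "search", "answer"]
stuck_run = ["search", "search", "search", "search"]
```

Here `good_run` passes all three checks with `max_repeats=2` and a budget of 3, while `stuck_run` fails both the sequence check (it never answers) and the loop check.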

One engine, two modes

The same scorers run before you ship and after. Production traces flow back into your eval set.

Offline · in CI

pytest for agents · At launch

Catch regressions before they ship.

  • Run test cases + scorers on every pull request
  • Fail the build when quality drops
  • Compare each run against a baseline
Online · in production

Datadog for agent quality · Roadmap

Know how your agent behaves with real users.

  • Capture production traces, auto-scored
  • Surface drift and failure patterns
  • Feed real traces back into your eval set

Run your own quality platform · Launching soon

Not just a library — a self-hostable platform. The SDK runs evals; the server stores every run, shows trends over time, and is where you configure everything. Your data never leaves your infrastructure.

Self-hosted, on your infrastructure

Run the server, the dashboard, and a Postgres database yourself. Your eval data and your agents' inputs and outputs stay inside your trust boundary, which addresses data-residency requirements.

One command to start — but Docker's optional

docker compose up brings up the app and Postgres together as the easy path. Prefer your own setup? It's a standard service that runs against any Postgres you already have.

Your runs, remembered

Every eval run is stored and charted over time, organized by project — one project per repo or app. See pass-rate trends, not just the last run.

Python SDK · Server + dashboard · Postgres · Your infra
tokensurf · dashboard preview
project: support-agent · pass-rate, last 8 runs
run #1042 · 0.90 · pass
run #1041 · 0.78 · fail
run #1040 · 0.88 · pass
run #1039 · errored
🔔

Notifications

Ping Slack, email, or a webhook when quality regresses — so a bad run finds you, not your users.

🚫

Quality gates

Set thresholds like "fail the build if pass-rate < 90%." Wire eval results straight into CI.
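
The threshold rule quoted above comes down to a comparison and an exit code. A minimal sketch, assuming per-case results arrive as booleans; `pass_rate` and `gate_exit_code` are illustrative names, not part of the TokenSurf SDK.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed. An empty run counts as 0.0 so it cannot pass the gate."""
    return sum(results) / len(results) if results else 0.0

def gate_exit_code(results: list[bool], threshold: float = 0.9) -> int:
    """Return 0 (success) when pass-rate meets the threshold, 1 (failure) otherwise."""
    return 0 if pass_rate(results) >= threshold else 1

passing_run = [True] * 9 + [False]        # 9 of 10 cases pass
failing_run = [True] * 8 + [False] * 2    # 8 of 10 cases pass

# In a CI script, the exit code would be handed to the shell, for example:
#   import sys; sys.exit(gate_exit_code(results))
```

Most CI systems treat a non-zero exit code as a failed step, which is what blocks the merge.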

⚙️

Centralized config

Configure scorers and your own judge keys once on your server; CI pulls the config centrally.

The server & dashboard are in active development — open source, launching soon. The Python SDK is the runner that feeds them.

Built for agent quality

Everything you need to test and watch agent quality — in CI and in production.

Scoring
Four scorer families
LLM-judge, deterministic, reference-based, and agent-trajectory checks — one suite, mixed and matched per agent. Pick exactly what "good" means for your use case.
📝
Custom scorers
Write your own check as a plain Python function. Return a 0–1 score; TokenSurf handles aggregation, thresholds, and reporting. Your domain knowledge, expressed as code.
Rule-based assertions
Regex, JSON-schema validation, exact-match, and "tool was called" checks — fast, free, and deterministic. No model call, no flakiness.
🔗
Trajectory grading
Grade the whole multi-step run, not just the final answer: tool order, loops, step budget, task completion, and recovery. The part most eval tools miss.
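
For the custom scorers item above, a plain Python function that returns a 0 to 1 score could look like the sketch below. The signature is an assumption: this page does not show how TokenSurf passes outputs to a custom scorer or how one is registered with a suite.

```python
def mentions_required_terms(output: str, required: list[str]) -> float:
    """Fraction of required terms that appear in the output (case-insensitive)."""
    if not required:
        return 1.0  # nothing to check, so nothing can be missing
    text = output.lower()
    hits = sum(1 for term in required if term.lower() in text)
    return hits / len(required)

# Hypothetical domain rule: a refund answer should mention these three things.
score = mentions_required_terms(
    "Refunds take 5 days and appear in the billing portal.",
    ["refund", "billing", "invoice"],
)
```

Partial credit (here two of three terms) is what distinguishes a graded scorer from a pass/fail assertion.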
Eval workflow
📄
Datasets as JSONL
Keep test cases in version control as plain JSONL. Review changes in pull requests like any other code — no proprietary format, no separate database.
CI gate
Run the suite on every pull request and fail the build when a scorer drops below your threshold. Compare against a baseline so you see exactly what regressed.
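
JSONL means one JSON object per line, which is what makes test cases easy to diff in a pull request. The sketch below loads such a file with the standard library only. The field names `input` and `expected` are assumptions, since the schema TokenSurf expects in cases.jsonl is not given here.

```python
import json

# Two hypothetical test cases, as they might appear in cases.jsonl.
SAMPLE_JSONL = """\
{"input": "What is your refund window?", "expected": "30 days"}
{"input": "Do you ship to Canada?", "expected": "Yes"}
"""

def load_cases(text: str) -> list[dict]:
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            cases.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Report the offending line so a bad edit is easy to find in review.
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc
    return cases

cases = load_cases(SAMPLE_JSONL)
```

Because each case is a single line, adding or changing a case shows up as a one-line diff.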
Online monitoring · Roadmap
📡
Production trace capture
Stream live agent runs as Traces and auto-score them in production — the same scorers you use offline. Coming soon.
📈
Drift & failure detection
Surface quality drift and failure clusters from real traffic before users complain. Coming soon.
Reporting
📋
Eval reports
A pass, fail, or errored verdict for every case and scorer, with the inputs and outputs that produced it.
📊
Run comparison
Diff two runs to see exactly which cases improved or regressed after a model swap or prompt change. Numbers, not vibes.
📨
Export & webhooks
Push results to CI annotations, Slack, or your own dashboard. Wire quality into the tools your team already watches.
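
Run comparison, as described above, can be pictured as a per-case diff of scores between a baseline run and a candidate run. This is a sketch under the assumption that each run reduces to a mapping from case ID to a 0 to 1 score; `diff_runs` and the case IDs are hypothetical.

```python
def diff_runs(baseline: dict[str, float], candidate: dict[str, float],
              tolerance: float = 0.0) -> dict[str, dict[str, float]]:
    """Group cases present in both runs by whether their score went up or down."""
    improved, regressed = {}, {}
    for case_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[case_id] - baseline[case_id]
        if delta > tolerance:
            improved[case_id] = delta
        elif delta < -tolerance:
            regressed[case_id] = delta
    return {"improved": improved, "regressed": regressed}

before = {"refund-policy": 1.0, "shipping": 0.5, "greeting": 0.75}
after = {"refund-policy": 0.5, "shipping": 1.0, "greeting": 0.75}
report = diff_runs(before, after)
```

A small `tolerance` is useful when scores come from an LLM judge, where tiny fluctuations between runs are expected.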
Privacy & trust
🛡
Not in your data path
TokenSurf observes runs via @track. It never sits between your app and the model, so it can't add latency or a new failure point to production.
🔒
Self-hosted platform
The server, dashboard, and Postgres run inside your own trust boundary with your provider keys — no data leaves your infra. Open source — launching soon.
Provider-agnostic judge
Judge with any model via litellm — OpenAI, Anthropic, Google, or a local model. No lock-in to one provider's grader.
Platform
🧩
Framework-agnostic SDK
@track works with LangChain, LlamaIndex, or plain Python functions. One decorator, any stack.
💾
Run history in Postgres
The platform stores every run in your own Postgres so you get trends over time, organized by project. SDK results are plain files you can inspect with git or jq too.
Cloud option · Early access
Hosted storage, dashboards, and team features for when you'd rather not run infra. Join the waitlist for early access.

Free to self-host. Cloud when you want it.

The open-source platform is the complete product. TokenSurf Cloud is the same platform, hosted by us.

Self-host: get notified at launch · Join the cloud waitlist

Open core

The whole platform is free to self-host. The cloud is the same platform, hosted by us — never a paywall around the essentials.

Self-host

Open source · Launching soon

Free, Apache-2.0. The whole platform — no asterisks.

  • Python SDK (the eval runner)
  • Self-hosted server + dashboard
  • Your own Postgres database
  • All four scorer families
  • Runs in your infra — data never leaves
Self-host — get notified at launch
TokenSurf Cloud

Managed · Early access

The same platform, hosted by us. No infra to run.

  • Everything in open source
  • We host the server + dashboard
  • Managed Postgres + backups
  • Team features & collaboration
  • We run and update it for you
Join the waitlist

Pricing

Self-host free, or join the cloud waitlist. No per-request fees, no lock-in.

Open Source

Free · Launching soon

Self-host the whole platform. Free forever, Apache-2.0.

  • SDK + server + dashboard
  • All four scorer families
  • Your own Postgres database
  • Your trust boundary, your keys
  • Community support
Self-host — get notified at launch
TokenSurf Cloud

Early access · Waitlist

Hosted eval + monitoring, managed for your team. Pricing at general availability.

  • Everything in open source
  • Hosted storage + dashboard
  • Online monitoring (managed)
  • Team roles & collaboration
  • Priority support
Join the waitlist

Enterprise needs — SSO, on-prem support, custom scorers?

Talk to us

Ship agents you can trust

Self-host the open-source platform, or let TokenSurf Cloud run eval + monitoring for you.

import tokensurf as ts

@ts.track   # capture each run — not in your data path
def my_agent(q): ...

ts.Suite("cases.jsonl", scorers=[...]).run(my_agent)  # pass/fail in CI
Open source — launching soon

Frequently asked questions

AI agent evaluation, monitoring, and self-hosting — answered.

What is AI agent evaluation?
AI agent evaluation measures whether an agent does its job correctly — right answers, correct tool use, no loops — by running it against test cases and scoring the results. TokenSurf runs these evals in CI so quality regressions are caught before they ship.
How do you test AI agents?
Decorate your agent with @ts.track to capture each run as a trace, define scorers (LLM-judge, deterministic, reference-based, and agent-trajectory checks), then run the suite in CI to get a pass/fail report on every change.
Can I self-host TokenSurf?
Yes. TokenSurf is a self-hostable platform — the SDK, server, dashboard, and a Postgres database run on your own infrastructure, so your eval data and your agents' inputs and outputs never leave your environment. Open source, launching soon.
Does TokenSurf sit in my data path?
No. TokenSurf observes runs through the @track decorator; it never sits between your app and the model, so it can't add latency or a point of failure to production.
What is the difference between offline eval and online monitoring?
Offline eval runs test cases in CI before you ship — pytest for agents. Online monitoring scores real production traces to catch drift and failures after you ship — Datadog for agent quality. TokenSurf uses one scoring engine for both.
Is TokenSurf open source?
The platform is open source under Apache-2.0 and launching soon. Until then you can join the launch list, and TokenSurf Cloud offers the same platform hosted for you, in early access.