You can set up adaptive A/B tests with the TensorZero Gateway to automatically distribute inference requests to the best-performing variants (prompts, models, etc.) of your system. TensorZero supports any number of variants in an adaptive A/B test. In simple terms, you define:
  • A TensorZero function (“task”)
  • A set of candidate variants (prompts, models, etc.) to experiment with
  • A metric to optimize for
And TensorZero takes care of the rest. TensorZero’s experimentation algorithm is designed to efficiently identify the best variant of your system at a specified level of statistical confidence. You can add more variants over time and TensorZero will adjust the experiment accordingly while maintaining statistical soundness.
See the Tutorial to learn more about functions and variants in TensorZero.

Configure

Let’s set up an adaptive A/B test with TensorZero.
You can find a complete runnable example of this guide on GitHub.

1. Configure your function

Let’s configure a function (“task”) with two variants (gpt-5-mini with two different prompts), a metric to optimize for, and the experimentation configuration.
tensorzero.toml
# Define a function for the task we're tackling
[functions.extract_entities]
type = "json"
output_schema = "output_schema.json"

# Define variants to experiment with (here, we have two different prompts)
[functions.extract_entities.variants.gpt-5-mini-good-prompt]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "good_system_template.minijinja"
json_mode = "strict"

[functions.extract_entities.variants.gpt-5-mini-bad-prompt]
type = "chat_completion"
model = "openai::gpt-5-mini"
templates.system.path = "bad_system_template.minijinja"
json_mode = "strict"

# Define the experiment configuration
[functions.extract_entities.experimentation]
type = "track_and_stop" # the experimentation algorithm
candidate_variants = ["gpt-5-mini-good-prompt", "gpt-5-mini-bad-prompt"]
metric = "exact_match"
update_period_s = 60  # low for the sake of the demo (recommended: 300)

# Define the metric we're optimizing for
[metrics.exact_match]
type = "boolean"
level = "inference"
optimize = "max"

2. Deploy TensorZero

You must set up Postgres to use TensorZero’s automated experimentation features.

3. Make inference requests

Make an inference request just like you normally would and keep track of the inference ID or episode ID. You can use the TensorZero Inference API or the OpenAI-compatible Inference API.
from tensorzero import TensorZeroGateway

# Initialize the TensorZero client (assumes the gateway is running at this URL)
t0 = TensorZeroGateway.build_http(gateway_url="http://localhost:3000")

response = t0.inference(
    function_name="extract_entities",
    input={
        "messages": [
            {
                "role": "user",
                "content": datapoint.input,  # the text to extract entities from
            }
        ]
    },
)
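
Alternatively, here is a rough sketch of the same request via the OpenAI-compatible Inference API, assuming the gateway is running at http://localhost:3000 (adjust the base URL for your deployment):

from openai import OpenAI

# Point the OpenAI SDK at the TensorZero Gateway's OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:3000/openai/v1",
    api_key="placeholder",  # the OpenAI SDK requires a value here
)

response = client.chat.completions.create(
    # Address the TensorZero function by name using the tensorzero:: prefix
    model="tensorzero::function_name::extract_entities",
    messages=[{"role": "user", "content": datapoint.input}],
)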

4. Send feedback for your metric

Send feedback for your metric and assign it to the inference ID or episode ID.
t0.feedback(
    metric_name="exact_match",
    value=True,  # e.g. whether the output matched the expected entities
    inference_id=response.inference_id,
)
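
If your metric were defined at the episode level (level = "episode" in the metric configuration), you would attach the feedback to the episode instead. A minimal sketch, using a hypothetical episode-level metric named task_success:

t0.feedback(
    metric_name="task_success",  # hypothetical metric defined with level = "episode"
    value=True,
    episode_id=response.episode_id,  # episode ID returned with the inference response
)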

5. Track your experiment

That’s it. TensorZero will automatically adjust the distribution of inference requests between the two candidate variants based on their performance.

You can track the experiment in the TensorZero UI: visit the function’s detail page to see the variant weights and the estimated performance.

If you run the code example, TensorZero starts by splitting traffic between the two variants but quickly shifts more and more traffic toward the gpt-5-mini-good-prompt variant. After a few hundred inferences, TensorZero becomes confident enough to declare it the winner and serves all traffic to it.
Experimentation in the TensorZero UI
You can add more variants at any time and TensorZero will adjust the experiment accordingly in a principled way.
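
To give a concrete sense of the code example, here is a minimal driver loop in the same spirit. It assumes a small labeled dataset where each datapoint carries the input text and the expected entities (illustrative names; the actual script on GitHub differs):

# Sketch of an experiment driver: run inferences and grade them with exact_match.
# `dataset`, `datapoint.input`, and `datapoint.expected_output` are illustrative.
for datapoint in dataset:
    response = t0.inference(
        function_name="extract_entities",
        input={"messages": [{"role": "user", "content": datapoint.input}]},
    )

    # For a JSON function, the parsed output is available on response.output.parsed
    correct = response.output.parsed == datapoint.expected_output

    t0.feedback(
        metric_name="exact_match",
        value=correct,
        inference_id=response.inference_id,
    )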

Advanced

Configure fallback-only variants

In addition to candidate_variants, you can also specify fallback_variants in your configuration. If a variant fails for any reason, TensorZero first resamples from the remaining candidate_variants; once those are exhausted, it samples from fallback_variants with uniform probability. See the Configuration Reference for more details.
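
For example, the experimentation block might add a fallback variant (here, a hypothetical gpt-5-mini-fallback variant defined elsewhere in the config):

tensorzero.toml
[functions.extract_entities.experimentation]
type = "track_and_stop"
candidate_variants = ["gpt-5-mini-good-prompt", "gpt-5-mini-bad-prompt"]
fallback_variants = ["gpt-5-mini-fallback"]  # sampled uniformly once the candidates are exhausted
metric = "exact_match"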

Customize the experimentation algorithm

The track_and_stop algorithm has multiple parameters that can be customized. For example, you can trade off the speed of the experiment with the statistical confidence of the results. The default parameters are sensible for most use cases, but advanced users might want to customize them. See the Configuration Reference for more details.