The AI-powered evals framework
for ecommerce search.

Ecommerce search evaluation powered by LLM-as-a-Judge. Domain-aware scoring, deterministic checks, and full IR metrics.

For search teams tired of eyeballing result pages.

$ pip install veritail

Everything you need to evaluate search

LLM-as-a-Judge

Every search result scored for relevance using structured, domain-aware rubrics tailored to your industry vertical.

14 Industry Verticals

Built-in scoring expertise for automotive, electronics, fashion, groceries, medical, and 9 more industries — with domain-specific guidance that automatically adapts to each query.

Deterministic Quality Checks

Catch price outliers, missing titles, duplicate SKUs, and ranking mistakes with fast, rule-based checks.

Metrics & Observability

IR metrics, rich reports, and trace exports to observability platforms like Langfuse.

What you get after a run

veritail --queries queries.csv --vertical home-improvement

IR Metrics

MetricValue
NDCG@50.9829
NDCG@100.9829
MRR0.9800
MAP0.9751

Avg Score by Position

Bars should decrease top to bottom if ranking is correct.

#1
2.88
#2
3.00
#3
2.52
#4
1.84
#5
1.16

Query Corrections

OriginalCorrectedVerdict
intrior paint interior paint appropriate
led lighrs red lights inappropriate

Veritail catches when autocorrect fixes a typo but changes the product.

Worst Performing Queries

QueryTypeNDCG@10
self drilling drywall anchors long-tail 0.7829
pex crimp tool broad 0.8554
1/2 inch pex tubing red long-tail 0.9697

Autocomplete Quality

MetricValue
Avg Relevance 2.94/3
Avg Diversity 2.50/3
Total Flagged 4

Deterministic Checks

CheckPassedFailed
Near-duplicate products 103 22
Out-of-stock prominence 124 1
Price outlier 108 17

Search Result Judgment Query: 52 inch white ceiling fan with light and remote

#1 Hunter 52" White Ceiling Fan with LED Light Kit and Remote Control 3/3 match

Exact match on all criteria. 52-inch size, white finish, integrated LED light kit, and included remote control all satisfy the query. Correctly ranked at position 1.

#2 30" Bronze Outdoor Ceiling Fan 1/3 mismatch

Wrong size (30" vs. requested 52"), wrong finish (bronze vs. white), no light kit, and no remote. Only the broad product category — ceiling fan — is relevant to the query.

#3 52" White Ceiling Fan with LED Light (Pull-Chain Only) 2/3 partial

Correct size, finish, and light kit, but controlled by pull-chain instead of the remote the query specifies. A usable result, but missing a key requested feature.

Autocomplete Judgment Prefix: "dec"

deck screws deck stain deck boards composite deck railing decaf coffee deck post cap
Relevance 2/3 Diversity 2/3

Five of six suggestions target deck materials, but "decaf coffee" is completely off-domain — a grocery item with no relevance to home improvement, degrading the suggestion set.

Three steps to scored results

1

Load queries

Import your search queries. Veritail classifies each query and assigns domain-specific overlays automatically.

2

Evaluate

Your search adapter returns results. Deterministic checks flag defects, then the LLM judge scores each result on a 0–3 rubric.

3

Report

Get IR metrics, per-query breakdowns, and rich HTML reports. Compare two search configurations side by side.

Up and running in minutes

Install
$ pip install veritail
# With Anthropic / Gemini support:
$ pip install "veritail[cloud]"
Create an adapter
# my_adapter.py
from veritail import SearchAdapter, SearchResult

class MyAdapter(SearchAdapter):
    def search(self, query: str, **kwargs) -> list[SearchResult]:
        # Call your search engine here
        return [
            SearchResult(
                title="Product Name",
                price=29.99,
                url="https://example.com/product",
            )
        ]
Run evaluation
$ veritail --adapter my_adapter.MyAdapter \
          --queries queries.csv \
          --vertical electronics