Ecommerce search evaluation powered by LLM-as-a-Judge. Domain-aware scoring, deterministic checks, and full IR metrics.
For search teams tired of eyeballing result pages.
Features
Every search result scored for relevance using structured, domain-aware rubrics tailored to your industry vertical.
Built-in scoring expertise for automotive, electronics, fashion, groceries, medical, and 9 more industries — with domain-specific guidance that automatically adapts to each query.
Catch price outliers, missing titles, duplicate SKUs, and ranking mistakes with fast, rule-based checks.
IR metrics, rich reports, and trace exports to observability platforms like Langfuse.
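As a rough illustration of what rule-based checks of this kind can look like, here is a minimal sketch in plain Python. It is not Veritail's implementation: the dictionary fields (`sku`, `title`, `price`), the function names, and the median-based outlier rule with its `factor` threshold are all assumptions made for the example.

```python
from statistics import median


def find_duplicate_skus(results):
    """Return each SKU that appears more than once, in first-repeat order."""
    seen, dupes = set(), []
    for r in results:
        sku = r["sku"]
        if sku in seen and sku not in dupes:
            dupes.append(sku)
        seen.add(sku)
    return dupes


def find_missing_titles(results):
    """Return SKUs whose title is absent, None, or only whitespace."""
    return [r["sku"] for r in results if not (r.get("title") or "").strip()]


def find_price_outliers(results, factor=3.0):
    """Flag SKUs priced more than `factor` times above or below the median.

    The median rule and the default factor are illustrative assumptions.
    """
    prices = [r["price"] for r in results]
    if not prices:
        return []
    mid = median(prices)
    return [
        r["sku"]
        for r in results
        if r["price"] > mid * factor or r["price"] < mid / factor
    ]


# Hypothetical result page used only to exercise the checks.
sample = [
    {"sku": "A", "title": "Ceiling fan", "price": 100.0},
    {"sku": "B", "title": "", "price": 110.0},
    {"sku": "A", "title": "Ceiling fan", "price": 100.0},
    {"sku": "C", "title": "Ceiling fan XL", "price": 900.0},
]
```

Checks like these need no model call, so they can run on every result page before any LLM scoring happens.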
Sample output
| Metric | Value |
|---|---|
| NDCG@5 | 0.9829 |
| NDCG@10 | 0.9829 |
| MRR | 0.9800 |
| MAP | 0.9751 |
Bars should decrease top to bottom if ranking is correct.
| Original | Corrected | Verdict |
|---|---|---|
| intrior paint | interior paint | appropriate |
| led lighrs | red lights | inappropriate |
Veritail catches when autocorrect fixes a typo but changes the product.
| Query | Type | NDCG@10 |
|---|---|---|
| self drilling drywall anchors | long-tail | 0.7829 |
| pex crimp tool | broad | 0.8554 |
| 1/2 inch pex tubing red | long-tail | 0.9697 |
| Metric | Value |
|---|---|
| Avg Relevance | 2.94/3 |
| Avg Diversity | 2.50/3 |
| Total Flagged | 4 |
| Check | Passed | Failed |
|---|---|---|
| Near-duplicate products | 103 | 22 |
| Out-of-stock prominence | 124 | 1 |
| Price outlier | 108 | 17 |
Exact match on all criteria. 52-inch size, white finish, integrated LED light kit, and included remote control all satisfy the query. Correctly ranked at position 1.
Wrong size (30" vs. requested 52"), wrong finish (bronze vs. white), no light kit, and no remote. Only the broad product category — ceiling fan — is relevant to the query.
Correct size, finish, and light kit, but controlled by pull-chain instead of the remote the query specifies. A usable result, but missing a key requested feature.
Five of six suggestions target deck materials, but "decaf coffee" is completely off-domain — a grocery item with no relevance to home improvement, degrading the suggestion set.
How it works
Import your search queries. Veritail classifies each query and assigns domain-specific overlays automatically.
Your search adapter returns results. Deterministic checks flag defects, then the LLM judge scores each result on a 0–3 rubric.
Get IR metrics, per-query breakdowns, and rich HTML reports. Compare two search configurations side by side.
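To make the metrics step concrete, the sketch below shows one common way to turn a ranked list of 0–3 judge scores into NDCG@k and a reciprocal rank. It is an illustration only, not Veritail's code: the exponential gain `2**score - 1`, the relevance threshold of 2 for reciprocal rank, and the choice to build the ideal ranking from the returned results alone are assumptions, and the tool's actual formulas may differ.

```python
import math


def dcg_at_k(scores, k):
    """Discounted cumulative gain over the top-k scores (exponential gain)."""
    return sum(
        (2 ** s - 1) / math.log2(rank + 2)  # rank is 0-based, so +2
        for rank, s in enumerate(scores[:k])
    )


def ndcg_at_k(scores, k):
    """DCG normalized by the DCG of the same scores sorted best-first."""
    ideal = dcg_at_k(sorted(scores, reverse=True), k)
    return dcg_at_k(scores, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(scores, threshold=2):
    """1 / position of the first result scoring at least `threshold`."""
    for position, s in enumerate(scores, start=1):
        if s >= threshold:
            return 1.0 / position
    return 0.0
```

Averaging `reciprocal_rank` over all queries gives MRR. A perfectly ordered list such as `[3, 2, 1, 0]` yields an NDCG of 1.0, while swapping a strong result below a weak one lowers it.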
Quick start
    $ pip install veritail
    # With Anthropic / Gemini support:
    $ pip install "veritail[cloud]"
    # my_adapter.py
    from veritail import SearchAdapter, SearchResult


    class MyAdapter(SearchAdapter):
        def search(self, query: str, **kwargs) -> list[SearchResult]:
            # Call your search engine here
            return [
                SearchResult(
                    title="Product Name",
                    price=29.99,
                    url="https://example.com/product",
                )
            ]
    $ veritail --adapter my_adapter.MyAdapter \
        --queries queries.csv \
        --vertical electronics