Advanced Usage Guide
Metric selection strategies
Use __all__ to compute all registered metrics:
t2s run -d ck25 -j ./datasets/ck25/eval/ -m __all__ -ee http://localhost:8886/
Or run focused subsets depending on your evaluation question:
Structural fidelity:
query_exact_match,token_f1,codebleu
Result quality:
answerset_precision,answerset_recall,answerset_f1
Ranking behavior:
mrr,ndcg,hit@1,p@1
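One way to work with these subsets is to keep each one in a shell variable and generate one command per evaluation question. The following is a minimal sketch, not part of t2s itself: the variable names and the build_cmd helper are hypothetical, and it assumes -m accepts space-separated metric names, as in the other commands in this guide.

```shell
# Sketch: one t2s command per evaluation question.
# Assumption: -m takes space-separated metric names, as in the other examples here.
structural="query_exact_match token_f1 codebleu"
quality="answerset_precision answerset_recall answerset_f1"
ranking="mrr ndcg hit@1 p@1"

# build_cmd is a hypothetical helper: it only prints the command line.
# $1 is left unquoted on purpose so each metric becomes its own argument.
build_cmd() {
  echo t2s run -d ck25 -j ./datasets/ck25/eval/ -m $1 -ee http://localhost:8886/
}

for subset in "$structural" "$quality" "$ranking"; do
  build_cmd "$subset"
done
```

The script only prints the three commands, so they can be reviewed first; piping its output to sh would execute them.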
Parallel execution
Add -p to enable multiprocessing across systems/files:
t2s run -d ck25 -j ./datasets/ck25/eval/ -m query_execution answerset_f1 -ee http://localhost:8886/ -p
Export controls
Useful flags:
-eq to include per-query scores
-ep to write output to a custom location
-s to set explicit system names
Example:
t2s run \
-d ck25 \
-s AIFB DBPEDIA-CG \
-j ./datasets/ck25/eval/AIFB.jsonl ./datasets/ck25/eval/DBPEDIA-CG.jsonl \
-m query_execution answerset_f1 \
-ee http://localhost:8886/ \
-eq \
-ep ./datasets/ck25/results/custom-run.json
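When several runs are exported, it can help to give each one its own file name. The sketch below builds a timestamped path for -ep and prints the resulting command; the run-<timestamp>.json naming scheme and the variable names are assumptions made for illustration, not t2s conventions.

```shell
# Sketch: build a per-run export path and print the resulting command.
results_dir=./datasets/ck25/results
stamp=$(date +%Y%m%d-%H%M%S)              # year-month-day, then hour-minute-second
out_path="$results_dir/run-$stamp.json"   # hypothetical naming scheme, not a t2s convention

echo t2s run -d ck25 -j ./datasets/ck25/eval/ \
  -m query_execution answerset_f1 \
  -ee http://localhost:8886/ \
  -eq -ep "$out_path"
```

Remove the leading echo to execute the command instead of printing it.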
LLM-based metrics
If any selected metric requires LLM support, set the Ollama model with -lo:
t2s run -d ck25 -j ./datasets/ck25/eval/ -m llm_judge -ee http://localhost:8886/ -lo gemma3:4b
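Before an LLM-based run, it may be worth confirming that the model is available locally. This sketch assumes the Ollama CLI is installed on the same machine and that t2s uses that local installation, neither of which this guide states. has_model is a hypothetical helper that scans the output of ollama list.

```shell
# Sketch: make sure the model passed to -lo is present before running t2s.
model="gemma3:4b"

# Hypothetical helper: reads `ollama list` output on stdin and succeeds
# if the first column (NAME) of any non-header row equals the model name.
has_model() {
  awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Only touch Ollama if the CLI is actually on PATH.
if command -v ollama >/dev/null 2>&1; then
  ollama list | has_model "$model" || ollama pull "$model"
fi
```

If the model is already listed, nothing is downloaded; otherwise ollama pull fetches it, which can take a while for larger models.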