Advanced Usage Guide

Metric selection strategies

Pass __all__ to -m to compute every registered metric:

t2s run -d ck25 -j ./datasets/ck25/eval/ -m __all__ -ee http://localhost:8886/

Or run a focused subset that matches your evaluation question:

Structural fidelity: query_exact_match, token_f1, codebleu
Result quality: answerset_precision, answerset_recall, answerset_f1
Ranking behavior: mrr, ndcg, hit@1, p@1
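
A focused run can be assembled from one of the groups above. This is a minimal sketch, assuming -m accepts several space-separated metric names, as the other commands in this guide do. The variable names (STRUCTURAL, RESULT_QUALITY, RANKING, METRICS, CMD) are illustrative and not part of t2s.

```shell
# Metric groups copied from the list above; the variable names are illustrative.
STRUCTURAL="query_exact_match token_f1 codebleu"
RESULT_QUALITY="answerset_precision answerset_recall answerset_f1"
RANKING="mrr ndcg hit@1 p@1"

# Choose the group that matches the evaluation question.
METRICS="$RESULT_QUALITY"

# Assemble the invocation and print it for review before running it.
CMD="t2s run -d ck25 -j ./datasets/ck25/eval/ -m $METRICS -ee http://localhost:8886/"
echo "$CMD"
```

Once the printed command looks right, paste it into the shell or run it with eval "$CMD".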

Parallel execution

Pass -p to enable multiprocessing across systems/files:

t2s run -d ck25 -j ./datasets/ck25/eval/ -m query_execution answerset_f1 -ee http://localhost:8886/ -p

Export controls

Useful flags:

-eq to include per-query scores
-ep to write output to a custom location
-s to set explicit system names

Example:

t2s run \
  -d ck25 \
  -s AIFB DBPEDIA-CG \
  -j ./datasets/ck25/eval/AIFB.jsonl ./datasets/ck25/eval/DBPEDIA-CG.jsonl \
  -m query_execution answerset_f1 \
  -ee http://localhost:8886/ \
  -eq \
  -ep ./datasets/ck25/results/custom-run.json
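
When the same evaluation is repeated, a timestamped -ep target keeps earlier results from being overwritten. The sketch below assumes -ep takes a file path, as in the example above. RESULTS_DIR and OUT are illustrative variable names, and the mkdir call is a precaution rather than a documented requirement.

```shell
# Timestamped results path so that repeated runs do not overwrite each other.
# RESULTS_DIR and OUT are illustrative names, not t2s settings.
RESULTS_DIR="./datasets/ck25/results"
OUT="$RESULTS_DIR/run-$(date +%Y%m%d-%H%M%S).json"
echo "Results will be written to: $OUT"

# Run only when t2s is on PATH; otherwise just report the planned path.
if command -v t2s >/dev/null 2>&1; then
  mkdir -p "$RESULTS_DIR"
  t2s run -d ck25 -j ./datasets/ck25/eval/ \
    -m query_execution answerset_f1 \
    -ee http://localhost:8886/ \
    -eq -ep "$OUT"
fi
```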

LLM-based metrics

If any selected metric requires an LLM, set the Ollama model with -lo:

t2s run -d ck25 -j ./datasets/ck25/eval/ -m llm_judge -ee http://localhost:8886/ -lo gemma3:4b
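
Before a long run, it can help to confirm that the model is available. This is a sketch under two assumptions that this guide does not state: that t2s talks to a local Ollama installation, and that the model must already be pulled. It uses the standard ollama list and ollama pull commands. MODEL is an illustrative variable name.

```shell
# Model tag taken from the example above; MODEL is an illustrative variable name.
MODEL="gemma3:4b"

# Check for a local copy of the model before starting the evaluation.
if ! command -v ollama >/dev/null 2>&1; then
  echo "ollama not found on PATH; install it or adjust this check"
elif ollama list | grep -q "^$MODEL"; then
  echo "$MODEL is available locally"
else
  echo "$MODEL is missing; fetch it with: ollama pull $MODEL"
fi
```

If the Ollama server is not running, ollama list fails and the script reports the model as missing, so treat that message as a prompt to check the server as well.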