LLM Evals

Benchmarks, evaluators, and observability tools for LLM apps.

OpenAI's framework for benchmarking LLMs, paired with an open-source registry of evals. A widely used test harness.

Open-source LLM engineering platform: tracing, prompt management, evaluations, datasets, and a playground.

CLI and library for evaluating, testing, and red-teaming LLM apps. Side-by-side prompt comparisons in your CI.

Open-source LLM observability: traces, evaluation, datasets, and retrieval debugging. OpenTelemetry-native.