AI Skills
LLM Evals
Benchmarks, evaluators, and observability tools for LLM apps.
★ 16.0k OpenAI Evals
by openai
OpenAI's framework for benchmarking LLMs and an open-source registry of evals. Industry-standard test harness.
Python
★ 11.0k Langfuse
by langfuse
Open-source LLM engineering platform: tracing, prompt management, evaluations, datasets, playground.
TypeScript
★ 7.0k promptfoo
by promptfoo
CLI and library for evaluating, testing, and red-teaming LLM apps. Side-by-side prompt comparisons in your CI.
TypeScript
★ 6.0k Arize Phoenix
by Arize-ai
Open-source LLM observability: traces, evaluation, datasets, retrieval debugging. OpenTelemetry-native.
Python