📄️ Testing LLM chains
Learn how to test complex LLM chains and RAG systems with unit tests and end-to-end validation to ensure reliable outputs and catch failures across multi-step prompts
📄️ Evaluating factuality
How to evaluate the factual accuracy of LLM outputs against reference information using promptfoo's factuality assertion
📄️ Evaluating RAG pipelines
Benchmark RAG pipeline performance by evaluating document retrieval accuracy and LLM output quality, using factuality and context-adherence metrics across both the retrieval and generation steps
📄️ HLE Benchmark
Use promptfoo to run evaluations against Humanity's Last Exam, one of the most challenging AI benchmarks, featuring expert-crafted questions across 100+ subjects.
📄️ OpenAI vs Azure benchmark
Compare OpenAI vs Azure OpenAI performance across speed, cost, and model updates with automated benchmarks to optimize your LLM infrastructure decisions
📄️ Red teaming a Chatbase Chatbot
Learn how to test and secure Chatbase RAG chatbots against multi-turn conversation attacks with automated red teaming techniques and security benchmarks
📄️ Choosing the best GPT model
Compare GPT-4o vs GPT-4o-mini performance on your custom data with automated benchmarks to evaluate reasoning capabilities, costs, and response latency metrics
📄️ Claude 3.7 vs GPT-4.1
Learn how to benchmark Claude 3.7 against GPT-4.1 using your own data with promptfoo. Discover which model performs best for your specific use case.
📄️ Cohere Command-R benchmarks
Compare Cohere Command-R vs GPT-4 vs Claude performance with automated benchmarks to evaluate model accuracy on your specific use cases and datasets
📄️ Llama vs GPT benchmark
Compare Llama 3.1 405B vs GPT-4 performance on custom datasets using automated benchmarks and side-by-side evaluations to identify the best model for your use case
📄️ DBRX benchmarks
Compare DBRX vs Mixtral vs GPT-3.5 performance with custom benchmarks to evaluate real-world task accuracy and identify the optimal model for your use case
📄️ DeepSeek benchmark
Compare DeepSeek (671B-parameter MoE) vs GPT-4.1 vs Llama-3 performance with custom benchmarks to evaluate code tasks and choose the optimal model for your needs
📄️ Evaluating LLM safety with HarmBench
Assess LLM vulnerabilities against 400+ adversarial attacks using the HarmBench benchmark to identify and prevent harmful outputs across 6 risk categories
📄️ Red teaming a CrewAI Agent
Evaluate CrewAI agent security and performance with automated red team testing. Compare agent responses across 100+ test cases to identify vulnerabilities.
📄️ Evaluating JSON outputs
Validate and test LLM JSON outputs with automated schema checks and field assertions to ensure reliable, well-formed data structures in your AI applications
📄️ Evaluate LangGraph
Hands-on tutorial (July 2025) on evaluating and red-teaming LangGraph agents with promptfoo, including setup, YAML tests, and security scans.
📄️ Choosing the right temperature for your LLM
Compare LLM temperature settings from 0.1 to 1.0 to balance model creativity against consistency, using automated benchmarks and randomness metrics
📄️ Evaluating OpenAI Assistants
Compare OpenAI Assistant configurations and measure performance across different prompts, models, and tools to optimize your AI application's accuracy and reliability
📄️ Evaluating Replicate Lifeboat
Compare GPT-3.5 vs Llama 2 70B performance on real-world prompts using the Replicate Lifeboat API to benchmark model accuracy and response quality for your specific use case
📄️ Gemini vs GPT
Compare Gemini Pro vs GPT-4 performance on your custom datasets using automated benchmarks and side-by-side analysis to identify the best model for your use case
📄️ Gemma vs Llama
Compare Google Gemma vs Meta Llama performance on custom datasets using automated benchmarks and side-by-side evaluations to select the best model for your use case
📄️ Gemma vs Mistral/Mixtral
Compare Gemma vs Mistral vs Mixtral performance on your custom datasets with automated benchmarks to identify the best model for your specific use case
📄️ GPT-3.5 vs GPT-4
Compare GPT-3.5 vs GPT-4 performance on your custom datasets using automated benchmarks to evaluate costs, latency and reasoning capabilities for your use case
📄️ GPT-4o vs GPT-4o-mini
Compare GPT-4o vs GPT-4o-mini performance on your custom datasets with automated benchmarks to evaluate cost, latency and reasoning capabilities
📄️ GPT-4.1 vs GPT-4o MMLU
Compare GPT-4.1 and GPT-4o performance on MMLU academic reasoning tasks using promptfoo with step-by-step setup and research-backed optimization techniques.
📄️ GPT-4.1 vs o1
Benchmark OpenAI o1 reasoning models against GPT-4.1 for cost, latency, and accuracy to optimize model selection decisions
📄️ Using LangChain PromptTemplate with Promptfoo
Learn how to test LangChain PromptTemplate outputs systematically with Promptfoo's evaluation tools to validate prompt formatting and variable injection
📄️ Uncensored Llama2 benchmark
Compare Llama2 Uncensored vs GPT-3.5 responses on sensitive topics using automated benchmarks to evaluate model safety and content filtering capabilities
📄️ How to red team LLM applications
Protect your LLM applications from prompt injection, jailbreaks, and data leaks with automated red teaming tests that identify 20+ vulnerability types and security risks
📄️ Magistral AIME2024 Benchmark
Reproduce Mistral's Magistral AIME 2024 math benchmark result of 73.6% accuracy, with detailed evaluation methodology and comparisons
📄️ Mistral vs Llama
Compare Mistral 7B, Mixtral 8x7B, and Llama 3.1 performance on custom benchmarks to optimize model selection for your specific LLM application needs
📄️ Mixtral vs GPT
Compare Mixtral vs GPT-4 performance on custom datasets using automated benchmarks and evaluation metrics to identify the optimal model for your use case
📄️ Multi-Modal Red Teaming
Red team multimodal AI systems using adversarial text, images, audio, and video inputs to identify cross-modal vulnerabilities
📄️ Phi vs Llama
Compare Phi 3 vs Llama 3.1 performance on your custom datasets using automated benchmarks and side-by-side evaluations to select the optimal model for your use case
📄️ Preventing hallucinations
Measure and reduce LLM hallucinations using perplexity metrics, RAG, and controlled decoding techniques to achieve 85%+ factual accuracy in AI outputs
📄️ Qwen vs Llama vs GPT
Compare Qwen-2-72B vs GPT-4o vs Llama-3-70B performance on customer support tasks with custom benchmarks to optimize your chatbot's response quality
📄️ Sandboxed Evaluations of LLM-Generated Code
Safely evaluate and benchmark LLM-generated code in isolated Docker containers to prevent security risks and catch errors before production deployment
📄️ Testing Guardrails
Learn how to test guardrails in your AI applications to prevent harmful content, detect PII, and block prompt injections
📄️ Evaluating LLM text-to-SQL performance
Compare text-to-SQL accuracy across GPT-3.5 and GPT-4 using automated test cases and schema validation to optimize database query generation performance