
Non-Functional Testing for AI & LLM Systems: The Complete Guide to Latency, Throughput, Consistency, and Token Efficiency

MLOps · 18 min read


By Gennoor Tech · March 22, 2026


Traditional software QA has always distinguished between functional testing (does it produce the right output?) and non-functional testing (does it perform well under real-world conditions?). For AI and Large Language Model (LLM) systems, this distinction becomes even more critical because the probabilistic nature of LLMs introduces entirely new dimensions of non-functional risk.

Consider this: an LLM endpoint can return a perfectly accurate, well-grounded response and still fail in production if it takes 12 seconds to deliver that response, collapses under 50 concurrent users, produces wildly inconsistent answers to the same question, or burns through tokens at 3x the budgeted rate. These are not quality failures in the traditional sense. They are performance, scalability, reliability, and cost-efficiency failures — the domain of non-functional testing.

This guide covers the four pillars of non-functional testing for AI/LLM systems: Latency, Throughput, Consistency, and Token Efficiency. For each pillar, we cover what to measure, how to measure it, which open-source tools to use, and what AWS, Google Cloud, and Microsoft Azure offer natively.

The Four Pillars of Non-Functional Testing

Non-functional testing for AI/LLM systems can be organized into four core pillars, each addressing a distinct dimension of production readiness:

| Pillar | What It Tests | Key Metrics | Why It Matters |
| --- | --- | --- | --- |
| Latency | Response speed and time to first output | TTFT, ITL, E2E latency, P50/P95/P99 | User trust, UX quality, SLA compliance |
| Throughput | Capacity under concurrent load | RPS, TPS, concurrent users, GPU utilization | Scalability, cost planning, capacity planning |
| Consistency | Stability of outputs across repeat queries | Semantic similarity, output variance, format compliance rate | Reliability, predictability, regression detection |
| Token Efficiency | Verbosity and cost per interaction | Tokens/response, cost/query, output-to-input ratio | Cost control, budget forecasting, optimization |

Pillar 1: Latency Testing

Understanding LLM Latency

LLM latency is fundamentally different from traditional API latency. In a conventional REST API, you send a request and receive a complete response. With LLMs, response generation is incremental and token-by-token, which means latency must be decomposed into several distinct measurements.

Key Latency Metrics

  • Time to First Token (TTFT) — The time between the server receiving a request and the model beginning to stream the first output token. This is the most user-visible metric — it determines how long a user stares at a blank screen. TTFT is heavily influenced by prompt length (longer prompts require more prefill computation), model size, and cold start behavior.
  • Inter-Token Latency (ITL) — The delay between consecutive output tokens during streaming. ITL determines the perceived smoothness of a streaming response. Spiky ITL creates a stuttering experience even if the total time is acceptable.
  • End-to-End Latency (E2E) — The total wall-clock time from request arrival to the completion of the last output token. This includes queuing time, prefill, generation, and any network overhead. E2E latency is the number that matters for non-streaming use cases and batch processing.
  • Percentile Latencies (P50, P95, P99) — Average latency is a misleading metric for LLMs because response times vary enormously based on output length. Percentile measurements reveal the tail-latency story. A system with a 2-second P50 but a 15-second P99 has a serious consistency problem.
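These metrics can be measured directly against any streaming client. The sketch below is a minimal illustration, not a production harness: `fake_stream` is a stand-in for a real streaming API iterator, and the timing logic treats each yielded chunk as one token.

```python
import time
import statistics

def measure_stream_latency(token_stream):
    """Consume a streaming response and decompose its latency.

    token_stream is any iterable yielding output tokens as they arrive,
    e.g. chunks from a streaming API client.
    """
    start = time.perf_counter()
    arrivals = [time.perf_counter() for _ in token_stream]

    itls = [b - a for a, b in zip(arrivals, arrivals[1:])]  # inter-token gaps
    return {
        "ttft_s": arrivals[0] - start,   # Time to First Token
        "e2e_s": arrivals[-1] - start,   # End-to-End latency
        "itl_mean_s": statistics.mean(itls) if itls else 0.0,
        "itl_p95_s": sorted(itls)[int(len(itls) * 0.95)] if itls else 0.0,
    }

# Simulated stream standing in for a real endpoint:
def fake_stream(n_tokens=20, delay_s=0.005):
    for _ in range(n_tokens):
        time.sleep(delay_s)
        yield "tok"

print(measure_stream_latency(fake_stream()))
```

Tools like GenAI-Perf and LLMPerf do exactly this at scale, across many concurrent streams, which is where the percentile story emerges.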

Factors Affecting LLM Latency

  • Model size and architecture — larger models require more GPU compute per token
  • Input prompt length — longer prompts increase prefill time
  • Output length — more tokens mean longer generation phase
  • Hardware — GPU type, memory bandwidth, and interconnects matter
  • Batch size and concurrent request load
  • Cold starts — models scaled to zero must reload before serving
  • Network latency — especially for cloud-hosted and multi-region deployments
  • KV-cache efficiency and prefix caching

Open-Source Tools for Latency Testing

NVIDIA GenAI-Perf is NVIDIA's benchmarking tool specifically designed for LLM inference. It measures TTFT, ITL, E2E latency, and throughput with precise token-level granularity. It integrates natively with TensorRT-LLM and vLLM serving backends and supports sweeping across concurrency levels to map the full latency-throughput curve of a deployment.

LLMPerf (by Anyscale/Ray) spawns configurable concurrent requests, measures inter-token latency and generation throughput per request, and supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Vertex AI, and any OpenAI-compatible endpoint via LiteLLM. It produces detailed JSON reports with per-request and aggregate statistics.

LLM Locust (by TrueFoundry) extends the popular Locust load testing framework with LLM-specific capabilities. It adds native tracking of TTFT and tokens-per-second during streaming responses, works around the Python GIL bottleneck, and provides a customized real-time web UI showing LLM-specific metrics.

GuideLLM (by Red Hat) simulates real-world traffic patterns and provides fine-grained metrics including requests per second, latency distributions, and concurrency analysis. It supports full sweeps to find the latency-throughput saturation point and outputs results in JSON, YAML, and CSV formats.

k6 (by Grafana Labs) is extremely memory-efficient (256 MB for a standard test versus JMeter's 760 MB) and supports tens of thousands of virtual users per instance. The Periscope framework extends k6 with pre-built scripts for OpenAI-compatible endpoints, with Grafana dashboards for visualization.

llm-load-test-azure is a community-maintained tool specifically for load testing LLM endpoints on Azure. It measures TTFT, time-between-tokens, and E2E latency and supports Azure OpenAI, Azure Model Catalog serverless, and managed-compute deployments.

Cloud-Native Latency Monitoring

Microsoft Azure (Foundry) provides built-in model leaderboards that benchmark LLMs across quality, safety, cost, and performance — including time-to-first-token and generated-tokens-per-second. Production monitoring is delivered through Azure Monitor and Application Insights, providing real-time dashboards for token consumption, latency distributions, error rates, and quality scores.

AWS (Bedrock & SageMaker) provides model evaluation jobs that can assess response latency across different models. AWS offers the aws-samples/load-test-llm-with-locust repository for load testing both SageMaker endpoints and Bedrock APIs. SageMaker endpoints expose CloudWatch metrics for invocation latency, model latency, and overhead latency.

Google Cloud (Vertex AI) provides a built-in model observability dashboard that tracks query rates, token throughput, first-token latencies, and error rates. Google reports that GKE Inference Gateway, with its load-aware and content-aware routing, reduced TTFT by over 35% for certain workloads and improved P95 tail latency by 2x.

Pillar 2: Throughput Testing

Understanding LLM Throughput

Throughput measures the total capacity of an LLM deployment — how many requests or tokens it can process per unit of time. Unlike latency (which is per-request), throughput is a system-level metric that determines whether a deployment can handle real-world traffic volumes.

Key Throughput Metrics

  • Requests Per Second (RPS) — The total number of complete requests the system handles per second. This is the most intuitive capacity metric and directly maps to infrastructure cost planning.
  • Tokens Per Second (TPS) — The total output tokens generated per second across all concurrent requests. TPS better reflects actual GPU utilization than RPS because a system handling 10 short responses per second uses less compute than one handling 3 long responses.
  • Concurrent Users — The number of simultaneous requests the system can serve while maintaining acceptable latency. Beyond this threshold, requests queue and latency degrades.
  • GPU/CPU Utilization — Resource consumption during peak load. This determines cost efficiency and helps identify whether a deployment is over-provisioned or under-provisioned.
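RPS and TPS fall out of a fixed-concurrency test quite directly. The sketch below assumes a `send_request` interface (one blocking model call that returns its output token count), with `fake_request` standing in for a real endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_throughput_test(send_request, concurrency=10, total_requests=100):
    """Fire requests at fixed concurrency and report RPS and TPS.

    send_request is an assumed interface: one blocking model call that
    returns the number of output tokens it produced.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(lambda _: send_request(), range(total_requests)))
    elapsed = time.perf_counter() - start
    return {
        "rps": total_requests / elapsed,     # Requests Per Second
        "tps": sum(token_counts) / elapsed,  # output Tokens Per Second
        "elapsed_s": elapsed,
    }

# Stand-in endpoint: sleeps briefly and "generates" 50 tokens.
def fake_request():
    time.sleep(0.01)
    return 50

print(run_throughput_test(fake_request, concurrency=10, total_requests=50))
```

Sweeping `concurrency` upward while watching RPS, TPS, and latency together is how you locate the saturation point of a deployment.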

Load Testing vs Performance Benchmarking

Load testing and performance benchmarking are distinct but complementary approaches. Load testing simulates real-world traffic at scale to identify infrastructure bottlenecks like server capacity, autoscaling behavior, and network latency. Performance benchmarking measures the intrinsic performance of the model itself — throughput, token-level latency, and efficiency under controlled conditions. Both are needed for production readiness.

Open-Source Tools for Throughput Testing

Locust remains one of the most popular open-source load testing frameworks due to its Python-native scripting, lightweight greenlet-based concurrency (thousands of simulated users), and real-time web UI. AWS provides official sample scripts for load testing SageMaker and Bedrock endpoints.

Apache JMeter supports 20+ protocols natively and can simulate complex multi-step workflows. While more resource-intensive than k6 (approximately 760 MB per test versus 256 MB), it is well-suited for organizations with legacy testing infrastructure. JMeter is natively supported in Azure App Testing as a managed cloud-based load testing service.

LLMServingPerfEvaluator (by FriendliAI) generates realistic workloads by simulating requests arriving according to a Poisson distribution, allowing you to stress test at varying request rates. It supports comparing different serving engines (such as vLLM versus TGI) on the same hardware.
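Poisson arrivals are easy to reproduce in a custom harness: draw exponentially distributed inter-arrival gaps with mean 1/rate. The function below is an illustrative sketch, not part of any of the tools named here:

```python
import random

def poisson_arrival_times(rate_rps, duration_s, seed=0):
    """Request start times with exponentially distributed inter-arrival
    gaps, i.e. a Poisson arrival process at rate_rps requests/second."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # mean gap = 1 / rate_rps
        if t >= duration_s:
            return times
        times.append(t)

# ~5 requests/second for 10 seconds gives roughly 50 timestamps
schedule = poisson_arrival_times(rate_rps=5, duration_s=10)
print(len(schedule))
```

Replaying requests at these timestamps produces the bursty, irregular traffic real users generate, which stresses queuing behavior in ways a fixed request rate does not.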

MLPerf Inference (by MLCommons) is the industry-standard benchmark suite for AI inference performance. Version 5.1 introduced benchmarks for DeepSeek-R1, Llama 3.1 8B, and Whisper Large V3, with expanded interactive scenarios testing performance under lower latency constraints for agentic applications.

Cloud-Native Throughput Capabilities

Microsoft Azure — Foundry leaderboards include throughput benchmarks refreshed periodically. Azure Load Testing supports running JMeter and Locust tests as a managed service. Azure OpenAI PTUs (Provisioned Throughput Units) guarantee a specific tokens-per-minute capacity.

AWS — Bedrock provides Provisioned Throughput options for dedicated capacity. SageMaker endpoints expose CloudWatch metrics for invocations, invocations-per-instance, and model errors. LLMPerf integrates natively with AWS Bedrock and SageMaker via LiteLLM.

Google Cloud — Vertex AI's observability dashboard displays token throughput metrics per model and endpoint. GKE Inference Gateway provides load-aware routing that scrapes real-time metrics (KV cache utilization, queue depth) from model servers.

Pillar 3: Consistency Testing

Understanding LLM Consistency

LLMs are probabilistic systems — the same input can produce different outputs across invocations. While some variability is expected and desirable for creative tasks, production systems require a baseline level of consistency in factual accuracy, format compliance, and semantic meaning.

Key Consistency Metrics

  • Semantic Similarity — Measures whether repeated queries produce semantically equivalent responses, even if the exact wording differs. Typically calculated using embedding-based cosine similarity. A high variance indicates an unreliable model.
  • Output Variance — Quantifies the spread of responses when the same prompt is submitted multiple times. Can be measured via token overlap, ROUGE scores, or embedding distance distributions.
  • Format Compliance Rate — Percentage of responses that adhere to the requested output format (JSON, table, bullet list, specific schema). Format inconsistency is a common production failure mode, especially with smaller models.
  • Semantic Robustness — Measures how much the model's output changes when the input is subjected to minor, meaning-preserving perturbations (typos, case changes, whitespace variations). A robust model should produce equivalent outputs regardless of trivial input variations.
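The semantic-similarity measurement can be sketched in a few lines. The version below substitutes a toy bag-of-words "embedding" so it stays self-contained; in practice you would use a real sentence-embedding model:

```python
import math
from collections import Counter
from itertools import combinations

def embed(text):
    """Toy bag-of-words 'embedding'. Swap in a real sentence-embedding
    model (e.g. sentence-transformers) in practice."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def consistency_score(responses):
    """Mean pairwise similarity across repeated responses to one prompt.
    Values near 1.0 mean the model answers the same way each time."""
    sims = [cosine(embed(a), embed(b)) for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims)

# Three responses to the same prompt at temperature > 0:
runs = [
    "The capital of France is Paris.",
    "Paris is the capital of France.",
    "The capital of France is Paris, of course.",
]
print(round(consistency_score(runs), 3))
```

Run each golden prompt several times, compute this score per prompt, and alert when the distribution shifts between model versions.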

Open-Source Tools for Consistency Testing

DeepEval is an open-source LLM testing framework that supports regression testing across model iterations. With its companion platform Confident AI, it provides side-by-side comparison tools for catching regressions. It supports metrics for correctness, hallucination detection, toxicity, and consistency, all configurable via Python test suites that integrate into CI/CD pipelines.

Arize Phoenix is an open-source AI observability platform that excels at detecting when model outputs quietly drift over time. It monitors embedding drift — changes in vector representations that indicate semantic shifts — and provides visual plots for tracking RAG pipeline quality.

TruLens is a semantic evaluation toolkit that provides automated evaluation metrics for coherence, relevance, and groundedness. It supports LLM-as-a-judge workflows for assessing output consistency and integrates with OpenTelemetry.

Robustness Gym is a library specifically designed for stress testing NLP models across various perturbation scenarios. It systematically applies transformations to inputs (synonym substitution, character-level noise, semantic-preserving rephrasing) and measures output stability.

Promptfoo is an open-source CLI tool for evaluating and testing LLM outputs. It supports running the same prompt set across multiple models or prompt variations, with automated assertions for format compliance, factual accuracy, and semantic similarity.

Cloud-Native Consistency Capabilities

Microsoft Azure — The Azure AI Evaluation SDK provides built-in evaluators for coherence, fluency, and relevance. Foundry supports compare runs functionality for side-by-side regression detection. PyRIT framework enables systematic adversarial testing.

AWS — Bedrock's automatic model evaluation includes semantic robustness testing, where prompts are perturbed approximately 5 times each (lowercase conversion, keyboard typos, number-to-word conversion, random case changes, whitespace variations). The robustness metric is calculated as the Delta F1 / F1 ratio.

Google Cloud — Vertex AI Model Monitoring v2 provides data drift and prediction drift detection. The evaluation service supports LLM-as-a-Judge methodology for assessing output quality and consistency at scale.

Pillar 4: Token Efficiency Testing

Understanding Token Efficiency

Token efficiency measures how economically a model uses tokens to deliver useful output. In production AI systems, token usage directly translates to cost — every prompt token processed and every completion token generated incurs a charge. A model that produces the correct answer in 200 tokens is more efficient than one that produces the same answer in 800 tokens with unnecessary elaboration.

Key Token Efficiency Metrics

  • Tokens per Response — Average and P95 output token count for a given task. High variance suggests inconsistent verbosity. Tracking this over time reveals whether prompt changes or model updates affect output length.
  • Cost per Query — The total cost (input tokens + output tokens at the model's pricing) for a single interaction. This is the metric that budget owners care about most.
  • Output-to-Input Ratio — The ratio of completion tokens to prompt tokens. For RAG systems with large context windows, a very low output-to-input ratio (thousands of context tokens retrieved for a short answer) may indicate inefficient retrieval.
  • Task Completion Efficiency — Whether the model achieves the intended goal within a token budget. A model that exceeds the max_tokens limit and truncates its response has failed this metric even if the partial output is correct.
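Cost per query and the output-to-input ratio are simple arithmetic once token counts are logged. The sketch below uses made-up per-1K-token prices and hypothetical model names; substitute your provider's actual rates:

```python
# Assumed per-1K-token prices; substitute your provider's actual rates.
PRICING = {
    "model-a": {"input_per_1k": 0.0005, "output_per_1k": 0.0015},
    "model-b": {"input_per_1k": 0.0030, "output_per_1k": 0.0150},
}

def cost_per_query(model, input_tokens, output_tokens):
    """Total cost of one interaction at the model's pricing."""
    p = PRICING[model]
    return (input_tokens / 1000) * p["input_per_1k"] + \
           (output_tokens / 1000) * p["output_per_1k"]

def efficiency_report(model, input_tokens, output_tokens):
    return {
        "cost_usd": round(cost_per_query(model, input_tokens, output_tokens), 6),
        "output_to_input_ratio": round(output_tokens / input_tokens, 3),
    }

# RAG-style query: large retrieved context, short answer
print(efficiency_report("model-a", input_tokens=4000, output_tokens=250))
```

Aggregating this per query type (and tracking the P95, not just the mean) is what turns token logs into a cost forecast.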

Open-Source Tools for Token Efficiency

Langfuse (MIT license, recently acquired by ClickHouse) is the most widely-adopted open-source LLM engineering platform. It tracks token usage per trace with automatic cost calculation for 100+ model pricing configurations, providing cost breakdowns by model, user, and session.

Helicone is a proxy-based observability tool that captures model calls without any SDK changes — you simply change the base URL. It automatically generates cost reports, tracks token consumption trends, and supports rate limiting and caching.

OpenLLMetry (by Traceloop) is an OpenTelemetry-compliant SDK whose semantic conventions for LLM telemetry were adopted by the OpenTelemetry project itself. It tracks token usage, cost, and latency per call and exports data to any OTLP-compatible backend.

Opik (by Comet) is an open-source platform whose benchmarks report trace logging and evaluation approximately 7-14x faster than comparable tools. It provides token usage tracking, cost estimation, and integrates with both LLM application workflows and model training pipelines.

Cloud-Native Token Efficiency Monitoring

Microsoft Azure — Foundry's observability dashboard tracks token consumption, latency, error rates, and quality scores in real-time. Azure API Management integration provides detailed per-request token logging and cost estimation.

AWS — Bedrock provides CloudWatch metrics for input and output token counts per invocation. The fmeval library supports programmatic evaluation of model efficiency.

Google Cloud — Vertex AI tracks token throughput (input and output) per endpoint. Context caching for Gemini models and Memorystore integration provide architectural approaches to reducing token consumption.

Comprehensive Tool Comparison

| Tool | Latency | Throughput | Consistency | Token Eff. | Best For |
| --- | --- | --- | --- | --- | --- |
| GenAI-Perf | Yes | Yes | — | — | GPU inference benchmarking (NVIDIA stack) |
| LLMPerf | Yes | Yes | — | Yes | Multi-provider API benchmarking |
| LLM Locust | Yes | Yes | — | — | Realistic load testing with streaming |
| GuideLLM | Yes | Yes | — | — | Pre-deployment capacity planning |
| k6 + Periscope | Yes | Yes | — | Yes | CI/CD integrated load testing |
| DeepEval | — | — | Yes | — | LLM regression testing in CI/CD |
| Arize Phoenix | Yes | — | Yes | Yes | Drift detection, RAG observability |
| Langfuse | Yes | — | Yes | Yes | Full-stack LLM observability (self-hosted) |
| Helicone | Yes | — | — | Yes | Zero-code cost tracking (proxy) |
| Promptfoo | — | — | Yes | Yes | Prompt evaluation and comparison |
| MLPerf Inference | Yes | Yes | — | — | Industry-standard hardware benchmarks |

Cloud Platform Comparison

| Capability | Microsoft Azure | AWS | Google Cloud |
| --- | --- | --- | --- |
| Latency Monitoring | Application Insights + Foundry dashboard | CloudWatch metrics, Bedrock invocation latency | Model Observability dashboard, Metrics Explorer |
| Throughput Benchmarks | Foundry Leaderboard, Azure Load Testing | Bedrock Provisioned Throughput, SageMaker scaling | Vertex AI dashboard, GKE Inference Gateway |
| Consistency Testing | AI Evaluation SDK, Compare Runs, PyRIT | Bedrock Semantic Robustness (auto-perturbation) | Model Monitoring v2 (drift detection) |
| Token/Cost Tracking | Foundry Observability, APIM integration | CloudWatch token metrics, Bedrock pricing API | Native cost estimation, context caching |
| Load Testing Service | Azure App Testing (JMeter + Locust managed) | No native managed service | No native managed service |

The LLM Observability Ecosystem

Non-functional testing does not end at pre-deployment. Production LLM systems require continuous observability to catch latency regressions, throughput degradation, consistency drift, and cost anomalies.

Open-Source Observability Platforms

  • Langfuse (MIT License) — The most widely-used open LLMOps platform. Provides tracing, prompt management, evaluation, cost tracking, and session replay. Self-hostable with no feature restrictions.
  • Arize Phoenix (Elastic License 2.0) — Strong on drift detection and RAG pipeline monitoring. Captures multi-step agent traces and provides structured evaluation workflows.
  • Opik (Apache 2.0) — Fastest open-source tracing tool (7-14x faster than alternatives). Bridges LLM application observability and model training workflows.
  • OpenLLMetry/Traceloop — OpenTelemetry-native. Semantic conventions adopted by the OTel project. Works with LangChain, LlamaIndex, Haystack, and native SDKs.
  • Helicone — Proxy-based (zero-code). Cost reports, rate limiting, and caching without SDK changes.

Enterprise/Commercial Platforms

  • LangSmith — Deeply integrated with LangChain/LangGraph ecosystem. Best for teams already using LangChain.
  • Datadog LLM Observability — Enterprise-grade. Integrates with Vertex AI, Azure AI Foundry, and Bedrock. Full APM correlation.
  • Elastic — Integrations with both Azure AI Foundry and Vertex AI for production LLM observability.
  • Dynatrace — Automatic topology discovery and AI-powered anomaly detection for LLM workloads.
  • SigNoz — OpenTelemetry-native unified platform covering both LLM and traditional application observability.

Choosing an Observability Stack

A practical starting pattern: combine a gateway/proxy tool (Helicone or Portkey) for cost tracking with an evaluation tool (Phoenix or DeepEval) for quality metrics, and an OTel-based platform (SigNoz, Langfuse, or your existing APM) for unified observability. Start with cost and latency tracking — they are immediately actionable — then layer in quality metrics as the deployment matures.

Recommended Implementation Strategy

Phase 1: Pre-Deployment Benchmarking

  • Select 2-3 candidate models and run latency + throughput benchmarks using GenAI-Perf, LLMPerf, or GuideLLM
  • Measure TTFT, ITL, E2E latency, and TPS across a range of concurrency levels (1, 10, 50, 100 concurrent requests)
  • Establish baseline token usage per query type using a representative evaluation dataset
  • Compare costs across models using actual token metrics (not published benchmarks alone)
  • Run consistency tests: submit the same 50-100 prompts 3-5 times each and measure semantic similarity variance

Phase 2: Load Testing

  • Use Locust, k6, or LLM Locust to simulate expected production traffic patterns
  • Test at 1x, 2x, and 5x expected peak traffic to identify the saturation point
  • Validate auto-scaling behavior: measure how quickly new instances come online and whether cold starts cause latency spikes
  • Test cache hit rates with realistic query repetition patterns (typically 20-40% cache hits in production)
  • Define SLOs: P95 latency threshold, minimum RPS, maximum error rate, and maximum cost per 1,000 queries
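Those SLOs can be enforced programmatically at the end of a load-test run. The sketch below uses a nearest-rank P95 and illustrative thresholds; the SLO values are placeholders, not recommendations:

```python
import math

# Illustrative SLO thresholds; set these from your own requirements.
SLOS = {"p95_latency_s": 3.0, "max_error_rate": 0.01}

def percentile(samples, p):
    """Nearest-rank percentile (p = 95 gives P95)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def check_slos(latencies_s, errors, total):
    """Evaluate a load-test run against the SLO thresholds."""
    result = {
        "p95_latency_s": percentile(latencies_s, 95),
        "error_rate": errors / total,
    }
    result["pass"] = (
        result["p95_latency_s"] <= SLOS["p95_latency_s"]
        and result["error_rate"] <= SLOS["max_error_rate"]
    )
    return result

# One slow tail request is enough to blow the P95 budget:
latencies = [0.8, 1.1, 1.3, 0.9, 2.4, 1.0, 1.2, 5.9, 1.1, 1.4]
print(check_slos(latencies, errors=0, total=len(latencies)))
```

Wiring a check like this into the load-test pipeline makes the SLOs an automated gate rather than a document nobody reads.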

Phase 3: Production Observability

  • Deploy Langfuse, Helicone, or your cloud's native monitoring for continuous cost and latency tracking
  • Set up alerting on P95 latency breaches, error rate spikes, and token consumption anomalies
  • Implement drift detection (Arize Phoenix or Vertex AI Model Monitoring) to catch silent quality degradation
  • Establish a weekly review cadence for cost-per-query trends and throughput utilization

Phase 4: Continuous Regression Testing

  • Integrate DeepEval or Promptfoo into CI/CD pipelines to catch consistency regressions on every prompt change
  • Maintain a golden dataset of 100+ queries with expected outputs for regression testing
  • Re-run full non-functional test suite before any model upgrade, prompt change, or infrastructure modification
  • Document and version all SLO thresholds alongside application code
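A golden-dataset regression check can be as simple as asserting required phrases per prompt. The sketch below stubs out the model call; in CI you would replace `call_model` with your real client, and the dataset entries are hypothetical examples:

```python
import json

# Hypothetical golden dataset: prompts with required key phrases.
GOLDEN = [
    {"prompt": "What is the capital of France?", "must_contain": ["Paris"]},
    {"prompt": "List two primary colors.", "must_contain": ["red", "blue"]},
]

def call_model(prompt):
    """Stub standing in for the real model client."""
    answers = {
        "What is the capital of France?": "The capital of France is Paris.",
        "List two primary colors.": "Two primary colors are red and blue.",
    }
    return answers[prompt]

def run_regression():
    """Return the list of failing cases (empty means no regressions)."""
    failures = []
    for case in GOLDEN:
        output = call_model(case["prompt"])
        missing = [k for k in case["must_contain"] if k.lower() not in output.lower()]
        if missing:
            failures.append({"prompt": case["prompt"], "missing": missing})
    return failures

print(json.dumps(run_regression()))
```

Frameworks like DeepEval and Promptfoo generalize this pattern with semantic assertions instead of substring checks, but the CI contract is the same: the pipeline fails when any golden case regresses.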

Conclusion

Non-functional testing for AI/LLM systems is not optional — it is a production gate. A model that is accurate but slow, consistent but expensive, or fast but unreliable will fail in production just as surely as one that hallucinates.

The ecosystem of tools has matured significantly. Open-source tools like GenAI-Perf, LLMPerf, LLM Locust, DeepEval, Langfuse, and Arize Phoenix provide production-grade capabilities at zero licensing cost. All three major cloud platforms now offer native monitoring, evaluation, and observability features.

The key insight: non-functional testing for LLMs is a continuous process, not a one-time gate. LLMs change behavior with temperature settings, context changes, and even provider-side model updates. Continuous monitoring and automated regression testing are essential to maintaining production quality.

Organizations building production AI systems should establish clear SLOs for each non-functional pillar, instrument their deployments from day one, and integrate non-functional testing into their CI/CD pipelines alongside functional quality evaluations.

Non-Functional TestingLLM TestingLatencyThroughputAI ObservabilityMLOpsPerformance Testing

Jalal Ahmed Khan

Microsoft Certified Trainer (MCT) · Founder, Gennoor Tech

14+ years in enterprise AI and cloud technologies. Delivered AI transformation programs for Fortune 500 companies across 6 countries including Boeing, Aramco, HDFC Bank, and Siemens. Holds 16 active Microsoft certifications including Azure AI Engineer and Power BI Analyst.

