Load Testing Open-Source LLMs on H100
I have been working as an LLMOps engineer for a few years and am currently on a mission for the French government.
Our team is building an API gateway to provide French administrations with access to open-source large language models (LLMs).
I recently focused on load testing newer models to evaluate whether they could become candidates for production.
This post shares results and methodology for text models.
(Next week I’ll publish the same analysis for audio models.)
Benchmark Setup
All benchmarks were run with the following setup:
- Hardware: 1× NVIDIA H100 (80 GB VRAM), 24 CPU cores, 230 GB RAM
- Framework: vLLM 0.10.2
- Environment: Kubernetes cluster (cloud deployment)
- Dataset: prompts sampled randomly from the Alpaca dataset
- Prefix caching: disabled (to avoid biased results)
- Protocol (a minimal sketch of one run follows below):
  - For each concurrency level, run 3 iterations
  - Average the results to reduce variability due to differences in prompt length and output size
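For readers who want to reproduce the protocol, here is a minimal sketch of one benchmark run against a vLLM server's OpenAI-compatible API. The endpoint URL, model name, and prompt list are placeholders, and counting streamed chunks is only a rough proxy for output tokens; the actual harness used for these results is more elaborate.

```python
# Minimal sketch, assuming a vLLM server exposing the OpenAI-compatible API at
# BASE_URL and a list of prompts sampled from the Alpaca dataset.
# BASE_URL, MODEL, and the prompt list are illustrative placeholders.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

BASE_URL = "http://localhost:8000/v1"   # hypothetical vLLM endpoint
MODEL = "my-model"                      # hypothetical model name
client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")


async def one_request(prompt: str) -> tuple[float | None, int]:
    """Stream one completion; return (TTFT in seconds, number of streamed chunks)."""
    start = time.perf_counter()
    ttft = None
    n_chunks = 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for _chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_chunks += 1
    return ttft, n_chunks


async def run_level(prompts: list[str], concurrency: int, iterations: int = 3):
    """Run `iterations` batches of `concurrency` simultaneous requests and average."""
    ttfts, throughputs = [], []
    for _ in range(iterations):
        batch = prompts[:concurrency]
        start = time.perf_counter()
        results = await asyncio.gather(*(one_request(p) for p in batch))
        elapsed = time.perf_counter() - start
        ttfts.extend(r[0] for r in results if r[0] is not None)
        # chunks / second as a rough throughput proxy (one chunk ≈ one token)
        throughputs.append(sum(r[1] for r in results) / elapsed)
    return statistics.mean(ttfts), statistics.mean(throughputs)
```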
Throughput vs. Latency
📊 [placeholder: graph TTFT comparison across models]
📊 [placeholder: graph throughput comparison across models]
A Note on “Concurrent Requests” vs. “Users”
In practice, the number of concurrent requests is not equal to the number of simultaneous users.
Users do not all send queries at the exact same time.
Example:
- 100 users
- Each sends a request every ~40 s
- Average request takes 20 s to process
On average, each user is “active” (waiting for a response) 20/40 = 50% of the time.
So 100 users ≈ 50 concurrent requests.
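The same back-of-the-envelope estimate can be written as a small helper; the function and parameter names below are purely illustrative.

```python
def estimated_concurrency(users: int,
                          seconds_between_requests: float,
                          avg_processing_seconds: float) -> float:
    """Rough estimate: each user is 'active' for the fraction of time a request is in flight."""
    active_fraction = avg_processing_seconds / seconds_between_requests
    return users * active_fraction


# Example from above: 100 users, one request every ~40 s, ~20 s per request
print(estimated_concurrency(100, 40, 20))  # -> 50.0
```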
This estimation is rough and depends on the type of usage:
- Human interacting with a chatbot → lower concurrency
- Automated classification pipeline → higher concurrency
Key Takeaways
- Benchmarks were run on H100 (80 GB VRAM) with vLLM 0.10.2 in a Kubernetes environment
- Time to first token (TTFT) and throughput vary significantly from one model to another, even on identical hardware
- Concurrency levels in benchmarks do not map directly to the number of end users
- Results provide a baseline for evaluating candidate models for production usage
Next Steps
I will share benchmarks for audio models next week.
In the meantime:
- Have you measured similar metrics?
- Do your results align or differ on other hardware or deployment setups?
👉 If you are working on LLM deployments and want to exchange notes on benchmarking methodology or results, feel free to reach out.