
Optimizing LLM Inference: From 500ms to 50ms

H1Cloud Team · December 15, 2025 · 15 min read

The Latency Problem

Large language models are powerful but slow. A 70B parameter model running on a single A100 GPU generates tokens at roughly 30-40 tokens per second — that is 25-33ms per token. For a 200-token response, the total generation time is 5-7 seconds. Add network overhead, preprocessing, and queueing time, and users experience 8-10 seconds of latency. For interactive applications, this is unacceptable.
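The arithmetic above is worth sanity-checking, since the same back-of-envelope math drives capacity planning later:

```python
# Back-of-envelope check of the latency numbers above.
tokens_per_sec_low, tokens_per_sec_high = 30, 40
ms_per_token_high = 1000 / tokens_per_sec_low   # ~33 ms/token at 30 tok/s
ms_per_token_low = 1000 / tokens_per_sec_high   # 25 ms/token at 40 tok/s
gen_time_slow = 200 * ms_per_token_high / 1000  # ~6.7 s for a 200-token response
gen_time_fast = 200 * ms_per_token_low / 1000   # 5.0 s for a 200-token response
print(gen_time_fast, gen_time_slow)
```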

Over the past year, we have optimized LLM inference pipelines for dozens of clients, reducing end-to-end latency by 10x or more. This post covers the techniques that consistently deliver the biggest improvements, ordered by implementation complexity and impact.

Quantization: The Biggest Win

Quantization reduces model weights from 16-bit floating point to lower precision — 8-bit, 4-bit, or even 2-bit. The impact is dramatic: a 70B model in FP16 requires 140 GB of GPU memory and two A100 GPUs. The same model in 4-bit (AWQ or GPTQ) requires 35 GB and fits on a single GPU, immediately halving inference cost and eliminating inter-GPU communication overhead.
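The memory figures follow directly from parameter count and bit width. A quick sketch (this ignores the small extra overhead AWQ/GPTQ store for scales and zero-points, so real footprints run slightly higher):

```python
def weight_memory_gb(n_params_billions: float, bits: int) -> float:
    # bytes = params * bits / 8; GB here means 10^9 bytes
    return n_params_billions * 1e9 * bits / 8 / 1e9

print(weight_memory_gb(70, 16))  # 140.0 GB in FP16 (needs two 80 GB A100s)
print(weight_memory_gb(70, 4))   # 35.0 GB in 4-bit (fits on one GPU)
```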

Modern quantization methods preserve quality remarkably well:

# AWQ quantization with vLLM
from vllm import LLM, SamplingParams

# Load 4-bit AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # single GPU!
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing:"], params)

In our benchmarks, AWQ 4-bit quantization on Llama 2 70B increases perplexity by less than 0.5% compared to FP16 while increasing throughput by 3-4x. For most production use cases, this quality tradeoff is imperceptible.

Continuous Batching with vLLM

Traditional serving frameworks process requests one at a time or in fixed-size batches. vLLM introduced continuous batching (also called iteration-level batching), which dynamically adds new requests to the batch at every decoding step. This eliminates the waste of waiting for the longest sequence in a batch to finish before starting new requests.

The impact is substantial: continuous batching improves throughput by 2-5x compared to static batching, especially under variable-length workloads. Combined with PagedAttention — vLLM's memory management system that handles KV cache fragmentation — you can serve 3-4x more concurrent users per GPU.

# vLLM server with continuous batching (default)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.92 \
  --port 8000

Speculative Decoding

Speculative decoding is a technique where a small, fast "draft" model generates candidate tokens, and the large "target" model verifies them in parallel. Because verification is much faster than generation (it is a single forward pass for multiple tokens), this can accelerate generation by 2-3x without any quality loss.

The key insight is that for many tokens, the draft model's prediction matches the target model's output. If each draft token is accepted with probability 0.7 and the draft proposes 5 candidate tokens per step, you generate about 2.9 tokens per forward pass of the target model instead of 1. (The count is a geometric sum rather than a simple 0.7 × 5 = 3.5, because verification stops at the first rejected draft token.)
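Under the standard acceptance model — verification stops at the first rejected draft token, and the target model always contributes one token of its own (the correction on rejection, or a bonus token when everything is accepted) — the expected tokens per target forward pass form a geometric sum:

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    # Sum over k = 0..num_draft of accept_prob**k:
    # the target's own token plus each draft token's survival probability.
    # Equivalent closed form: (1 - a**(n+1)) / (1 - a).
    return sum(accept_prob ** k for k in range(num_draft + 1))

print(round(expected_tokens_per_step(0.7, 5), 2))  # 2.94
```

Raising the acceptance rate matters more than proposing longer drafts: the tail terms of the sum decay geometrically, which is why fine-tuning the draft model pays off.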

This technique works best when the draft model is well-aligned with the target model. We typically use a much smaller model from the same family (e.g., Llama 2 7B as the draft for Llama 2 70B) and fine-tune it on representative prompts to maximize the acceptance rate.

KV Cache Optimization

The key-value cache stores attention states from previous tokens, avoiding recomputation during autoregressive generation. For long contexts and large batches, the KV cache can consume more GPU memory than the model weights themselves. A 70B model with full multi-head attention uses roughly 10 GB of KV cache per 4096-token sequence in FP16, so a batch of just four such sequences adds 40 GB on top of the weights.
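A back-of-envelope sizing, using Llama 2 70B's public architecture numbers (80 layers, 128-dim heads, 64 query heads, 8 KV heads under GQA):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Keys and values (factor of 2), per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Full multi-head attention (64 KV heads), 4096 tokens, FP16:
print(kv_cache_bytes(80, 64, 128, 4096) / 2**30)  # 10.0 GiB per sequence
# Grouped Query Attention as Llama 2 70B actually ships (8 KV heads):
print(kv_cache_bytes(80, 8, 128, 4096) / 2**30)   # 1.25 GiB per sequence
```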

Optimization strategies:

  • KV cache quantization: Quantize the KV cache to FP8 or INT8 during inference. This halves KV cache memory with negligible quality impact, allowing longer contexts or more concurrent requests.
  • Grouped Query Attention (GQA): Models using GQA (like Llama 2 70B) share KV heads across query heads, reducing KV cache size by 4-8x compared to multi-head attention.
  • Prefix caching: For applications with common system prompts, cache the KV state of the system prompt and reuse it across requests. This eliminates redundant computation for the shared prefix, reducing time-to-first-token by 50-80% for long system prompts.
  • PagedAttention: vLLM's PagedAttention manages KV cache like virtual memory with paging, eliminating fragmentation and waste. This alone can increase the number of concurrent sequences per GPU by 2-4x.
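Prefix caching in particular can be sketched with a toy in-memory cache, where `_prefill` stands in for the model's expensive prefill pass over the system prompt. The names and structure here are illustrative, not any particular engine's API:

```python
class PrefixKVCache:
    """Toy model of prefix caching: compute the 'KV state' for a shared
    system prompt once, then reuse it across requests."""

    def __init__(self):
        self._cache = {}
        self.prefill_calls = 0  # counts how often the expensive path runs

    def _prefill(self, prefix: str) -> tuple:
        # Stand-in for the real prefill pass, which would return the
        # per-layer attention K/V tensors for the prefix tokens.
        self.prefill_calls += 1
        return ("kv-state-for", prefix)

    def get(self, prefix: str) -> tuple:
        if prefix not in self._cache:
            self._cache[prefix] = self._prefill(prefix)
        return self._cache[prefix]

system_prompt = "You are a helpful assistant."
cache = PrefixKVCache()
for user_msg in ["Hi", "What is vLLM?", "Thanks!"]:
    kv = cache.get(system_prompt)  # prefill runs only on the first request
print(cache.prefill_calls)  # 1
```

In a real engine the cached state is keyed by token-block hashes rather than whole strings, so partial prefix matches can also be reused.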

Infrastructure-Level Optimizations

Beyond model-level techniques, the serving infrastructure itself offers optimization opportunities:

  • Request routing: Route requests to the GPU with the most available KV cache capacity, not just the lowest load. This maximizes batch efficiency and reduces queueing.
  • Model sharding: For models that require tensor parallelism, use NVLink-connected GPUs within a single node. Cross-node tensor parallelism over InfiniBand adds 100-500 µs per layer, which compounds to significant latency at 80+ layers.
  • Streaming responses: Use server-sent events (SSE) to stream tokens to the client as they are generated. While this does not reduce total generation time, it dramatically improves perceived latency — users see the first token within 100-200ms instead of waiting 5+ seconds for the complete response.

Putting It All Together

Combining all these techniques, we achieved the following results for a Llama 2 70B deployment on a single H100 GPU:

# Before optimization (FP16, naive serving)
Time to first token:  480ms
Token generation:     28 tokens/sec
Concurrent users:     4
Cost per 1M tokens:   $2.80

# After optimization (AWQ 4-bit, vLLM, speculative decoding, KV cache optimization)
Time to first token:  45ms
Token generation:     120 tokens/sec
Concurrent users:     32
Cost per 1M tokens:   $0.35

That is a 10x reduction in latency, 4x increase in throughput, 8x more concurrent users, and 8x cost reduction — all on the same hardware. The cumulative effect of these optimizations is multiplicative, not additive. Each technique removes a different bottleneck, and together they unlock the full potential of modern GPU hardware for LLM inference.


Let's Build Your AI-Powered Future

Talk to our team about your infrastructure needs. Custom solutions, transparent pricing, and 24/7 expert support.