Building Low-Latency ML Inference Pipelines: Real-Time Serving for AI Agencies
A fintech client came to a three-person AI agency in Chicago with a problem that sounds simple on paper: they needed their fraud detection model to return a decision in under 50 milliseconds. Their existing system ran batch predictions every hour, which meant fraudulent transactions could process for up to 59 minutes before being flagged. That gap was costing them $340,000 per month in chargebacks.
The agency had built the model. It was a solid gradient-boosted tree with 94% precision. But the model sat in a Jupyter notebook, running on a single EC2 instance, processing CSVs. Getting it to respond in 50ms under load, with 99.9% uptime and the ability to handle 2,000 requests per second during peak hours, was a completely different engineering challenge.
Three months later, the agency had built a real-time inference pipeline whose p99 response time averaged 12ms and that handled 5,000 requests per second with room to spare. The client's monthly chargeback losses dropped to $41,000. The agency parlayed that success into a $600,000 annual platform contract.
Real-time ML serving is where agency work transforms from "we built you a model" to "we built you a system." And systems are where the real money lives.
Why Real-Time Serving Is the Next Agency Capability Gap
Here is the market reality: most AI agencies can train models. A growing number can deploy them as batch pipelines. But very few can deliver low-latency, high-throughput inference systems that run reliably in production.
This matters because the most valuable AI use cases require real-time predictions:
- Fraud detection needs sub-100ms decisions before transactions complete
- Recommendation engines need to personalize in the time between a page request and page render
- Dynamic pricing needs to calculate optimal prices before the customer sees the product
- Content moderation needs to evaluate user submissions before they go live
- Autonomous systems need predictions in single-digit milliseconds
If your agency can only deliver batch predictions, you are locked out of these high-value engagements. The agencies that master real-time serving will capture the most lucrative contracts in the market.
The Anatomy of a Real-Time Inference Pipeline
A production inference pipeline has five distinct layers, each with its own latency budget and optimization strategies.
Layer 1: Request Handling (Target: 1-3ms)
The front door of your inference system. This layer receives prediction requests, validates inputs, and routes them to the appropriate model.
Key components:
- API Gateway: Handles authentication, rate limiting, and request routing. AWS API Gateway, Kong, or a simple nginx reverse proxy all work. The choice depends on the client's existing infrastructure.
- Request validation: Check that all required features are present and within expected ranges. Reject malformed requests early; do not waste compute on garbage input.
- Request serialization: Convert the incoming payload (usually JSON) into the format your model expects. This seems trivial but can add 5-10ms if done carelessly with large payloads.
Optimization tactics:
- Use protocol buffers or FlatBuffers instead of JSON for internal communication. JSON parsing is surprisingly expensive at scale.
- Keep validation logic simple and fast. Complex validation should happen asynchronously, not in the critical path.
- Maintain persistent connections between services. Connection setup overhead adds up at high request rates.
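As a rough sketch of the early-rejection validation described above, the following checks presence and range for each required feature using only the standard library. The feature names and bounds (`amount`, `merchant_id`, `FEATURE_SPEC`) are hypothetical placeholders, not a real client schema.

```python
from typing import Any, Optional

# Hypothetical spec: feature name -> (min, max); None means unbounded on that side.
FEATURE_SPEC: dict[str, tuple[Optional[float], Optional[float]]] = {
    "amount": (0.0, 1_000_000.0),
    "merchant_id": (0, None),
}


def validate_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    for name, (low, high) in FEATURE_SPEC.items():
        if name not in payload:
            errors.append(f"missing feature: {name}")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"non-numeric feature: {name}")
            continue
        if (low is not None and value < low) or (high is not None and value > high):
            errors.append(f"out of range: {name}")
    return errors
```

A request handler would call `validate_request` first and return an error response immediately when the list is non-empty, so that no feature lookup or model call is spent on bad input.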
Layer 2: Feature Retrieval (Target: 2-10ms)
This is where most latency hides. Your model needs features to make predictions, and those features need to come from somewhere at prediction time.
The three types of features by retrieval pattern:
Request features come directly from the prediction request itself. User ID, transaction amount, item being viewed: whatever the client application sends. These cost zero additional latency.
Precomputed features are calculated in batch and stored for fast retrieval. A user's average purchase amount over the last 90 days, a product's return rate, a merchant's fraud history. These live in a feature store and require a key-value lookup at prediction time.
Real-time features are computed on the fly from recent events. The number of transactions in the last 5 minutes, the current session's click count, the time since last login. These require a streaming computation layer.
The feature store is your critical infrastructure here. For online serving, you need lookups by entity key that complete in a few milliseconds at most, and ideally in under one. An in-memory store such as Redis, a managed key-value store such as DynamoDB, or a purpose-built feature store like Feast or Tecton can handle this. The key architectural decisions:
- Cache hot features in memory. Features for recently active users should be in a Redis cache, not fetched from a database on every request.
- Batch feature updates on a schedule. Precomputed features should be refreshed every hour or daily, not computed on demand.
- Use feature vectors, not individual features. Fetch all features for an entity in a single lookup, not one lookup per feature. This is the difference between 1ms and 50ms.
- Handle missing features gracefully. New users will not have historical features. Your pipeline needs default values or a fallback model for cold-start scenarios.
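The single-lookup and cold-start points above can be sketched as follows. This is a minimal illustration that uses a plain dictionary as a stand-in for an online store such as Redis; the key format, feature names, and default values are assumptions made for the example.

```python
# Hypothetical defaults for cold-start entities (values are illustrative only).
DEFAULT_FEATURES = {"avg_purchase_90d": 0.0, "txn_count_90d": 0.0}

# Stand-in for an online store: one entry per entity holding the whole feature vector.
ONLINE_STORE = {
    "user:42": {"avg_purchase_90d": 61.5, "txn_count_90d": 18.0},
}


def get_feature_vector(entity_key: str, store: dict = ONLINE_STORE) -> dict:
    """Fetch all features for an entity in one lookup, filling gaps with defaults."""
    stored = store.get(entity_key, {})  # single lookup, not one per feature
    return {name: stored.get(name, default) for name, default in DEFAULT_FEATURES.items()}
```

Storing the whole vector under one key is what keeps this to one round trip; with a real store, the same shape would map to a single hash or item read per entity.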
Layer 3: Model Inference (Target: 1-20ms)
The actual prediction computation. This is what most people think of when they hear "model serving," but it is often not the latency bottleneck.
Serving frameworks:
- TensorFlow Serving or TorchServe for neural networks. Both handle model versioning, batching, and GPU utilization out of the box.
- ONNX Runtime for cross-framework deployment. Convert your model to ONNX format and serve it with a single runtime regardless of training framework.
- Triton Inference Server for multi-model, multi-framework serving. Supports dynamic batching and can serve TensorFlow, PyTorch, and ONNX models simultaneously.
- Custom serving with FastAPI or similar for simple models (scikit-learn, XGBoost). Sometimes the simplest approach is a Python web server with your model loaded in memory.
Optimization tactics for the inference layer:
- Model quantization. Convert 32-bit floating point weights to 16-bit floats or 8-bit integers. This typically reduces model size by 2-4x and inference time by 30-50% with minimal accuracy loss. For tree-based models, the equivalent is reducing tree depth or number of trees.
- Dynamic batching. Collect multiple requests over a short window (1-5ms) and process them as a single batch. GPUs are much more efficient at batch inference than one-at-a-time processing. Triton and TensorFlow Serving support this natively.
- Model compilation. Tools like TensorRT (for NVIDIA GPUs), Apache TVM, or ONNX Runtime's graph optimization can compile your model into optimized machine code for specific hardware. This alone can provide 2-5x speedups.
- Feature preprocessing in the model graph. If your model requires normalization, encoding, or other preprocessing, include it in the model artifact itself rather than doing it in Python before inference. This eliminates serialization overhead between preprocessing and prediction.
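To make the dynamic batching idea concrete, here is a minimal `asyncio` sketch of a micro-batcher. It is not how Triton or TensorFlow Serving implement batching internally; `MicroBatcher`, `predict_batch`, and the default window of 2ms are hypothetical names and values chosen for illustration, and `predict_batch` is a stand-in for a real batched model call.

```python
import asyncio


def predict_batch(rows: list[list[float]]) -> list[float]:
    """Hypothetical stand-in for a batched model call; here it just sums each row."""
    return [sum(row) for row in rows]


class MicroBatcher:
    def __init__(self, max_batch_size: int = 8, max_wait_s: float = 0.002):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, row: list[float]) -> float:
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((row, future))
        return await future

    async def run(self) -> None:
        """Worker loop: collect requests until the batch is full or the window closes."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # One model call for the whole batch, then fan results back out.
            results = predict_batch([row for row, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo() -> list[float]:
    batcher = MicroBatcher()
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict([i, i + 1.0]) for i in range(5)))
    worker.cancel()
    return results
```

The batching window trades a small amount of added latency per request for higher throughput, so `max_wait_s` should come out of the inference layer's share of the latency budget.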
Layer 4: Post-Processing (Target: 1-3ms)
Raw model outputs rarely go directly to the client application. They need to be transformed into business-meaningful responses.
Common post-processing steps:
- Converting raw probabilities to binary decisions using threshold logic
- Mapping class indices to human-readable labels
- Applying business rules (e.g., "never block transactions under $10 regardless of fraud score")
- Formatting confidence scores for downstream consumption
- Logging predictions for monitoring and retraining
Optimization tactics:
- Keep business logic simple and in-memory. Do not make database calls in the post-processing path.
- Use lookup tables instead of conditional logic where possible. A dictionary lookup is faster than a chain of if-else statements.
- Defer non-critical processing. Logging, analytics, and monitoring can happen asynchronously after the response is sent.
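A minimal sketch of the threshold and business-rule steps, kept entirely in memory. The threshold value and the names `FRAUD_THRESHOLD`, `MIN_BLOCK_AMOUNT`, and `postprocess` are assumptions for illustration; only the "never block under $10" rule comes from the example above.

```python
FRAUD_THRESHOLD = 0.8    # hypothetical decision threshold
MIN_BLOCK_AMOUNT = 10.0  # business rule: never block transactions under $10

# Lookup table mapping the boolean decision to a client-facing label.
LABELS = {True: "block", False: "allow"}


def postprocess(fraud_score: float, amount: float) -> dict:
    """Turn a raw fraud probability into a business decision."""
    should_block = fraud_score >= FRAUD_THRESHOLD and amount >= MIN_BLOCK_AMOUNT
    return {"decision": LABELS[should_block], "score": round(fraud_score, 4)}
```

Logging the prediction would happen after this function returns, off the critical path.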
Layer 5: Response Delivery (Target: 1-2ms)
Serialize the response and send it back. Straightforward, but a few things matter:
- Minimize response payload size. Send only what the client needs. Do not include feature values, intermediate computations, or debug information in production responses.
- Use compression for large responses. If your response includes embeddings or large arrays, gzip compression reduces network transfer time.
- Connection pooling and keep-alive. Reuse connections to avoid TCP handshake overhead on every request.
Latency Budget Planning
Before you write a single line of code, create a latency budget. This is a table that allocates your total latency target across each layer.
Example for a 50ms target:
- Request handling: 2ms
- Feature retrieval: 15ms
- Model inference: 20ms
- Post-processing: 3ms
- Response delivery: 2ms
- Buffer: 8ms
The buffer is essential. It accounts for garbage collection pauses, network jitter, and the general entropy of distributed systems. Without it, you will hit your latency target at p50 but blow past it at p99.
Measure at p99, not p50. A healthy median or average means nothing if 1% of your requests take 500ms. Clients experience tail latency, not average latency. Design and optimize for p99.
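The budget above and the p50 versus p99 distinction can be checked with a few lines of Python. This is a sketch: the layer keys are just labels for the example budget, and `percentile` uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math

TARGET_MS = 50
BUDGET_MS = {
    "request_handling": 2,
    "feature_retrieval": 15,
    "model_inference": 20,
    "post_processing": 3,
    "response_delivery": 2,
    "buffer": 8,
}


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies: 98 fast requests and 2 slow ones.
latencies_ms = [10.0] * 98 + [500.0] * 2
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

In this made-up sample the median sits comfortably inside the 50ms target while the p99 lands on the slow requests, which is exactly the situation the buffer and the p99 rule are meant to catch.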
Scaling Strategies
Real-time inference systems need to handle variable load: peak hours, marketing campaigns, seasonal spikes. Here is how to scale without over-provisioning.
Horizontal scaling with load balancing. Run multiple instances of your inference service behind a load balancer. Each instance holds a copy of the model in memory. This is the simplest and most effective scaling strategy for most workloads.
Autoscaling based on request queue depth. Scale up when the number of pending requests exceeds a threshold. Scale down when the queue is empty. Use request queue depth, not CPU utilization, as your scaling metric, because CPU can be low while requests are waiting for I/O.
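One way to turn queue depth into a scaling decision is a proportional rule with floor and ceiling limits. This is a sketch of that idea only, not the API of any particular autoscaler; `desired_replicas`, `target_per_replica`, and the replica limits are hypothetical.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Replicas needed so each one holds at most target_per_replica pending requests."""
    needed = math.ceil(queue_depth / target_per_replica)
    # Clamp so an empty queue scales down to the floor and a spike cannot exceed the ceiling.
    return max(min_replicas, min(max_replicas, needed))
```

For example, with a target of 50 pending requests per replica, a queue of 400 asks for 8 replicas, and an empty queue falls back to the minimum of 2.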
GPU sharing for neural network workloads. A single GPU can serve multiple models or handle multiple request streams. Use Triton's model-level scheduling or Kubernetes GPU time-slicing to maximize GPU utilization.
Regional deployment for global clients. If your client serves users worldwide, deploy inference endpoints in multiple regions. A request from Tokyo should not route to a server in Virginia, because that adds 150ms of network latency before your pipeline even starts.
Request coalescing for duplicate predictions. If multiple requests ask for the same prediction within a short window (common in recommendation systems), cache the result and serve it to all requesters. A simple in-memory cache with a 1-second TTL can reduce inference load by 20-40%.
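The short-TTL cache described above might look like the following sketch. It covers only the caching half: repeated requests inside the TTL reuse a stored result, while merging requests that are in flight at the same instant would need an additional single-flight mechanism. `TTLCache` and its injectable `clock` parameter are hypothetical, and the sketch never evicts expired entries.

```python
import time
from typing import Callable, Hashable


class TTLCache:
    def __init__(self, ttl_s: float = 1.0, clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so tests can control time
        self._entries: dict = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: Hashable, compute: Callable[[], object]) -> object:
        """Return the cached value for key if still fresh, else compute and store it."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self.ttl_s, value)
        return value
```

The cache key should include everything that affects the prediction (entity ID, model version, and any request features), otherwise callers can receive a result computed for different inputs.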
Monitoring and Observability
Real-time systems need real-time monitoring. Here is what to track:
Latency metrics (per layer):
- p50, p95, p99 latency for each pipeline layer
- End-to-end latency distribution
- Feature retrieval latency by feature source
- Model inference latency by model version
Throughput metrics:
- Requests per second (current vs. capacity)
- Concurrent request count
- Queue depth and wait time
- Rejection rate (requests dropped due to overload)
Model performance metrics:
- Prediction distribution (is the model suddenly predicting all one class?)
- Feature distribution (are input features drifting from training distributions?)
- Confidence score distribution (are predictions becoming less confident?)
- Error rate (how often does the pipeline fail to return a prediction?)
Infrastructure metrics:
- CPU/GPU utilization per inference instance
- Memory usage and garbage collection frequency
- Network I/O between services
- Feature store cache hit rate
Set alerts on these thresholds:
- p99 latency exceeds 2x your target
- Error rate exceeds 0.1%
- Prediction distribution shifts more than 2 standard deviations from baseline
- Feature store cache hit rate drops below 80%
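The four alert thresholds above can be expressed as a single check. This sketch assumes metrics arrive as a plain dictionary with the hypothetical keys shown, and it interprets "prediction distribution shift" as the mean prediction moving more than 2 baseline standard deviations, which is one simple reading among several.

```python
def check_alerts(metrics: dict, p99_target_ms: float) -> list[str]:
    """Return the names of all alert conditions that are currently firing."""
    alerts = []
    if metrics["p99_latency_ms"] > 2 * p99_target_ms:
        alerts.append("p99_latency")
    if metrics["error_rate"] > 0.001:  # 0.1%
        alerts.append("error_rate")
    shift = abs(metrics["prediction_mean"] - metrics["baseline_mean"])
    if shift > 2 * metrics["baseline_std"]:
        alerts.append("prediction_shift")
    if metrics["cache_hit_rate"] < 0.80:
        alerts.append("cache_hit_rate")
    return alerts
```

In practice these rules would usually live in the monitoring system's own alert configuration; the function form is mainly useful for testing the thresholds against recorded metrics.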
Delivery Timeline and Pricing
Real-time inference pipelines are complex engineering projects. Price and timeline accordingly.
Typical delivery phases:
- Week 1-2: Architecture design, latency budget, infrastructure provisioning
- Week 3-4: Feature store setup and feature pipeline implementation
- Week 5-6: Model optimization (quantization, compilation) and serving infrastructure
- Week 7-8: Integration testing, load testing, failover testing
- Week 9-10: Production deployment, monitoring setup, runbook creation
Pricing guidance:
- Simple pipeline (single model, <100 rps, >200ms target): $40,000 - $80,000
- Standard pipeline (1-3 models, <1000 rps, <100ms target): $80,000 - $200,000
- High-performance pipeline (ensemble, >1000 rps, <50ms target): $200,000 - $500,000
Ongoing operations typically run $5,000 - $15,000 per month depending on scale and SLA requirements.
Common Failure Modes and Mitigations
Cold start latency. The first request after a deployment or scale-up event is always slower because the model needs to load into memory and JIT compilation needs to run. Mitigation: implement health checks that include a warm-up prediction before the instance receives traffic.
Feature store timeouts. If the feature store is slow or unavailable, your entire pipeline stalls. Mitigation: implement circuit breakers with fallback to cached features or default values. A prediction with slightly stale features is better than no prediction.
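A circuit breaker with a fallback can be sketched in a few dozen lines. This is a simplified illustration rather than a production implementation: `CircuitBreaker`, its parameters, and the injectable `clock` are hypothetical, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fetch: Callable[[], dict], fallback: Callable[[], dict]) -> dict:
        """Try fetch(); on failure, or while the circuit is open, return fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the feature store entirely
            self.opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fetch()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

Here `fetch` would wrap the feature store lookup with a tight timeout, and `fallback` would return cached or default feature values so the model can still produce a prediction.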
Memory leaks in long-running inference processes. Python-based inference servers are notorious for gradual memory growth. Mitigation: implement periodic worker recycling (restart workers every N hours) and monitor memory usage per process.
Model loading failures during deployment. A new model version might fail to load due to incompatible dependencies or corrupted artifacts. Mitigation: implement blue-green deployments where the new version is fully loaded and validated before receiving traffic.
Thundering herd on cache expiration. When a cached feature expires, all concurrent requests try to recompute it simultaneously. Mitigation: implement staggered cache expiration with jitter, or use a cache-aside pattern with a single-flight mechanism.
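Staggered expiration can be as simple as adding random jitter to each TTL when the entry is written. A minimal sketch, where `jittered_ttl` and the 10% default spread are assumptions for illustration:

```python
import random
from typing import Optional


def jittered_ttl(
    base_ttl_s: float,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Return base_ttl_s shifted by a random amount within +/- jitter_fraction of it."""
    source = rng if rng is not None else random
    spread = base_ttl_s * jitter_fraction
    return base_ttl_s + source.uniform(-spread, spread)
```

With a one-hour base TTL and 10% jitter, entries written in the same batch expire across a 12-minute window instead of all at once, which spreads the recomputation load.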
Your Next Step
Audit your current client deployments. For each one, ask: "Would this client get more value if predictions were available in real-time instead of batch?" If the answer is yes (and it usually is for fraud detection, recommendations, pricing, and personalization use cases), draft a proposal for a real-time serving upgrade. Use the latency budget framework from this post to scope the work and the pricing guidance to quote it. The upgrade from batch to real-time is one of the highest-value upsells in the AI agency business.