Spiralling GPT-4 costs, unacceptable latency, and zero visibility into hallucinations. We rebuilt their LLM stack - and the numbers tell the story.
A legal technology firm had built an internal document Q&A tool on GPT-4. It worked - but not sustainably. Monthly token costs had climbed past $40,000 with no ceiling in sight.
Response latency was 8–12 seconds per query. Legal professionals were abandoning the tool mid-session. The engineering team had no visibility into hallucination rates, answer quality, or which queries were draining the budget.
Worse - the legal team was losing confidence. A few high-profile hallucinated answers had landed in client documents. The tool was becoming a liability, not an asset.
Redesigned chunking strategy and semantic re-ranking replaced naive similarity search. Precision retrieval meant GPT-4 got better context and answered more accurately.
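The retrieve-then-rerank pattern can be sketched in a few lines. This is an illustrative toy, not the firm's pipeline: the scoring functions below use simple term overlap as a stand-in for vector similarity and a cross-encoder re-ranker, and all chunk text is invented.

```python
def term_overlap(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query terms present in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def retrieve(query: str, chunks: list[str], k: int = 10) -> list[str]:
    """Stage 1: broad candidate recall (stand-in for vector similarity search)."""
    return sorted(chunks, key=lambda c: term_overlap(query, c), reverse=True)[:k]

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: re-score candidates with a more precise signal before prompting."""
    def fine_score(chunk: str) -> float:
        # A real cross-encoder jointly encodes (query, chunk); here we
        # approximate "finer-grained" by also rewarding exact phrase hits.
        bonus = 1.0 if query.lower() in chunk.lower() else 0.0
        return term_overlap(query, chunk) + bonus
    return sorted(candidates, key=fine_score, reverse=True)[:top_n]

chunks = [
    "The indemnification clause limits liability to direct damages.",
    "Termination requires 30 days written notice by either party.",
    "Liability is capped at fees paid in the prior twelve months.",
]
context = rerank("capped liability", retrieve("capped liability", chunks))
```

The point of the two stages: the first pass is cheap and broad, the second is expensive and precise, so the model only ever sees the handful of chunks most likely to contain the answer.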
Routine queries (80% of volume) now routed to a QLoRA fine-tuned model. GPT-4 / Azure OpenAI reserved for complex multi-document reasoning only.
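A routing layer like this can be as simple as a few heuristics in front of the model clients. The marker list, thresholds, and backend names below are illustrative assumptions, not the actual routing rules:

```python
# Phrases that suggest multi-document or comparative reasoning (assumed list).
COMPLEX_MARKERS = ("compare", "across", "reconcile", "all documents", "conflict")

def route(query: str, num_docs_referenced: int = 1) -> str:
    """Decide which backend serves the query: fine-tuned model or GPT-4."""
    q = query.lower()
    needs_multi_doc = num_docs_referenced > 1 or any(m in q for m in COMPLEX_MARKERS)
    is_long = len(q.split()) > 40  # very long queries often need deeper reasoning
    if needs_multi_doc or is_long:
        return "gpt-4"             # reserved for complex multi-document reasoning
    return "finetuned-qlora"       # handles the routine ~80% of volume

routine = route("What is the notice period for termination?")
complex_q = route("Compare the liability caps across these contracts")
```

In practice the classifier can start as heuristics like these and later be replaced by a small trained model, without changing the calling code.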
LangSmith deployed for end-to-end tracing: every query, cost, latency, and answer quality score. Hallucination detection on every response.
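The shape of per-query tracing looks roughly like this. This is a stdlib sketch of the idea, not LangSmith's API: the record fields, the per-1K-token prices, and the grounding check are all assumptions for illustration.

```python
import time
from dataclasses import dataclass

@dataclass
class Trace:
    query: str
    model: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    flagged_hallucination: bool

TRACES: list[Trace] = []

# Assumed per-1K-token prices, for illustration only.
PRICES = {"gpt-4": 0.03, "finetuned-qlora": 0.001}

def traced_call(model: str, query: str, llm_fn, hallucination_check) -> str:
    """Wrap an LLM call so every query logs latency, tokens, cost, and a quality flag."""
    start = time.perf_counter()
    answer, p_tok, c_tok = llm_fn(query)
    latency = time.perf_counter() - start
    cost = (p_tok + c_tok) / 1000 * PRICES[model]
    TRACES.append(Trace(query, model, latency, p_tok, c_tok, cost,
                        hallucination_check(query, answer)))
    return answer

# Toy stand-ins: a fake LLM and a trivial grounding check.
fake_llm = lambda q: ("Notice period is 30 days.", 120, 15)
grounded = lambda q, a: "30 days" not in a  # flag if expected grounding is missing

answer = traced_call("finetuned-qlora", "What is the notice period?", fake_llm, grounded)
```

Once every interaction flows through a wrapper like this, questions such as "which queries are draining the budget" become a simple aggregation over the trace log.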
Cost-sensitive query types now served via vLLM on dedicated instances. Response caching via LangChain reduced repeated query costs by a further 30%.
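Exact-match response caching is conceptually simple: normalize the query, hash it, and skip the LLM on a hit. LangChain's LLM cache works on a similar principle; this stdlib version is a minimal sketch, with the normalization rules assumed.

```python
import hashlib

class ResponseCache:
    """Exact-match cache: repeated (normalized) queries never hit the LLM twice."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivial variants share one entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_call(self, query: str, llm_fn) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = llm_fn(query)
        self._store[key] = answer
        return answer

cache = ResponseCache()
llm = lambda q: f"answer to: {q}"
a1 = cache.get_or_call("What is the  liability cap?", llm)
a2 = cache.get_or_call("what is the liability cap?", llm)  # normalized: cache hit
```

Legal Q&A traffic tends to repeat heavily (the same clauses, the same contracts), which is why even exact-match caching pays off before reaching for semantic caching.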
The problem wasn't the model. It was the absence of an intelligent operational layer around it. We built one from the ground up.
Six months after deployment, the results were unambiguous. The legal team went from questioning the tool's value to making it a daily workflow dependency.
52% cost reduction through model routing, fine-tuning, and intelligent caching.
68% latency reduction. Legal professionals can now get answers in real time.
Better retrieval + fine-tuned model = dramatically more accurate RAG answers.
Every LLM interaction now fully traced. Hallucinations flagged automatically.
Legal team confidence restored. Tool went from liability to essential workflow.
Our team holds industry-recognised certifications. You get proven experts - not consultants learning on your project.
If you're spending more than you expected on LLM infrastructure - or you have no visibility into what's actually happening - we should talk.
Book a Free Consultation
No sales pitch. First call is a free diagnostic of your LLM stack.