
How a Legal Tech Firm Cut LLM Costs by 52% While Improving Answer Accuracy to 94%

Spiralling GPT-4 costs, unacceptable latency, and zero visibility into hallucinations. We rebuilt their LLM stack - and the numbers tell the story.

52%
Token Cost Reduction
68%
Latency Reduction
94%
Answer Accuracy (RAG)
100%
Hallucinations Tracked

Costs Spiralling. Latency Killing UX. Zero Visibility.

A legal technology firm had built an internal document Q&A tool on GPT-4. It worked - but not sustainably. Monthly token costs had climbed past $40,000 with no ceiling in sight.

Response latency was 8–12 seconds per query. Legal professionals were abandoning the tool mid-session. The engineering team had no visibility into hallucination rates, answer quality, or which queries were draining the budget.

Worse - the legal team was losing confidence. A few high-profile hallucinated answers had landed in client documents. The tool was becoming a liability, not an asset.

The core problem wasn't the LLM itself - it was the absence of an operational layer. No observability, no cost controls, no retrieval optimisation, and no feedback loop for improving accuracy.
$40K+
Monthly LLM Costs Before
With no ceiling and no visibility into what was driving spend
8–12s
Query Latency
0%
Hallucination Visibility
GPT-4
Every Single Query
Falling
Team Confidence

Optimised RAG Pipeline

A redesigned chunking strategy and semantic re-ranking replaced naive similarity search. Precision retrieval meant GPT-4 received better context and answered more accurately.
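
In practice this is a two-stage retrieve-then-re-rank pattern. Below is a minimal sketch assuming a Pinecone index and a cross-encoder re-ranker; the index name, metadata filter, model choices, and candidate counts are illustrative, not the firm's actual configuration.

    # Sketch of two-stage retrieval: broad vector search, then cross-encoder
    # re-ranking. Index name, filter, and model names are assumptions.
    from pinecone import Pinecone
    from sentence_transformers import CrossEncoder

    pc = Pinecone(api_key="...")
    index = pc.Index("legal-docs")  # hypothetical index name
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(query_text: str, query_embedding: list[float],
                 k_initial: int = 25, k_final: int = 5) -> list[str]:
        # Stage 1: over-fetch candidates via similarity search + metadata filter
        hits = index.query(
            vector=query_embedding,
            top_k=k_initial,
            filter={"doc_type": {"$eq": "contract"}},  # illustrative filter
            include_metadata=True,
        )
        chunks = [m.metadata["text"] for m in hits.matches]
        # Stage 2: score each (query, chunk) pair jointly, keep the best
        scores = reranker.predict([(query_text, c) for c in chunks])
        ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:k_final]]

Over-fetching then re-ranking is what lifts precision: the cross-encoder reads the query and chunk together, which a plain embedding-similarity score cannot.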

Fine-tuned Smaller Model

Routine queries (80% of volume) are now routed to a QLoRA fine-tuned model; GPT-4 via Azure OpenAI is reserved for complex multi-document reasoning only.
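
The routing layer itself can be thin. Here is a hedged sketch using the OpenAI-compatible clients; the complexity heuristic, endpoint URLs, and model/deployment names are all assumptions for illustration.

    # Sketch of query routing: cheap fine-tuned model for routine queries,
    # GPT-4 on Azure for multi-document reasoning. Names and URLs are assumed.
    from openai import AzureOpenAI, OpenAI

    azure = AzureOpenAI(
        azure_endpoint="https://example.openai.azure.com",  # hypothetical
        api_key="...",
        api_version="2024-02-01",
    )
    # vLLM exposes an OpenAI-compatible server, so the same client works
    local = OpenAI(base_url="http://vllm-host:8000/v1", api_key="unused")

    def answer(query: str, context_docs: list[str]) -> str:
        # Stand-in heuristic: a production router would use a classifier or rules
        needs_gpt4 = len(context_docs) > 1 or len(query.split()) > 50
        client = azure if needs_gpt4 else local
        model = "gpt-4" if needs_gpt4 else "legal-qa-qlora"  # deployment / served model
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "Answer from the context:\n\n" + "\n\n".join(context_docs)},
                {"role": "user", "content": query},
            ],
        )
        return resp.choices[0].message.content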

Full LLM Observability

LangSmith deployed for end-to-end tracing: every query traced with its cost, latency, and answer-quality score. Hallucination detection runs on every response.
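
As a sketch, LangSmith tracing can be as light as decorating the pipeline functions; the project name and the stub bodies below are placeholders.

    # Sketch of end-to-end tracing with LangSmith's @traceable decorator.
    import os
    from langsmith import traceable

    os.environ["LANGCHAIN_TRACING_V2"] = "true"   # enable tracing
    os.environ["LANGCHAIN_API_KEY"] = "..."       # LangSmith API key
    os.environ["LANGCHAIN_PROJECT"] = "legal-qa"  # assumed project name

    @traceable(run_type="retriever")
    def retrieve(query: str) -> list[str]:
        return ["...chunk..."]  # stand-in for the Pinecone call

    @traceable(run_type="llm")
    def generate(query: str, docs: list[str]) -> str:
        return "...answer..."   # stand-in for the routed model call

    @traceable(name="legal_qa")
    def legal_qa(query: str) -> str:
        # Nested calls appear as child runs with per-step latency and cost
        return generate(query, retrieve(query))

Hallucination flags and answer-quality scores would typically be attached to these runs as LangSmith feedback or evaluators rather than living in the application code itself.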

vLLM Self-Hosted Inference

Cost-sensitive query types are now served via vLLM on dedicated instances. Response caching via LangChain cut repeated-query costs by a further 30%.
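
A minimal sketch of the combination, assuming vLLM's OpenAI-compatible server is already running and using LangChain's built-in LLM cache; the endpoint URL, served model name, and cache path are illustrative.

    # Sketch: LangChain response cache in front of a self-hosted vLLM endpoint.
    from langchain.globals import set_llm_cache
    from langchain_community.cache import SQLiteCache
    from langchain_community.llms import VLLMOpenAI

    # Identical prompts are answered from SQLite instead of re-hitting the GPU
    set_llm_cache(SQLiteCache(database_path=".langchain_cache.db"))

    llm = VLLMOpenAI(
        openai_api_base="http://vllm-host:8000/v1",  # assumed self-hosted endpoint
        openai_api_key="EMPTY",
        model_name="legal-qa-qlora",                 # assumed served model
    )

    print(llm.invoke("Summarise the indemnity clause in plain English."))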

A Full LLMOps Stack - Not Just a Prompt Tweak

The problem wasn't the model. It was the absence of an intelligent operational layer around it. We built one from the ground up.

  • Redesigned RAG pipeline with optimised chunking and semantic re-ranking
  • Fine-tuned smaller model (QLoRA) for 80% of routine query volume
  • Pinecone vector database with metadata filtering for precise retrieval
  • LangSmith for full observability: traces, costs, latency, hallucination tracking
  • vLLM self-hosted inference for cost-sensitive query types
  • Azure OpenAI reserved for complex multi-document reasoning
  • LangChain orchestration with intelligent caching layer
Stack: LangChain  |  Pinecone  |  vLLM  |  LangSmith  |  Azure OpenAI  |  QLoRA

The Numbers Speak for Themselves

Six months after deployment, the results were unambiguous. The legal team went from questioning the tool's value to making it a daily workflow dependency.

$40K → $19K/month - token costs cut by 52% with no reduction in capability. The fine-tuned model handles 80% of queries at a fraction of GPT-4 cost while delivering better accuracy for routine legal document Q&A.

Token Costs: $40K → $19K/month

52% reduction through model routing, fine-tuning, and intelligent caching.

Latency: 8–12s → 2.6–4s per query

68% latency reduction. Legal professionals can now get answers in real time.

Answer Accuracy: climbed to 94%

Better retrieval + fine-tuned model = dramatically more accurate RAG answers.

100% Hallucination Visibility

Every LLM interaction now fully traced. Hallucinations flagged automatically.

Daily Active Usage: 3x increase

Legal team confidence restored. The tool went from liability to an essential part of the daily workflow.

Certified Expertise You Can Trust

Our team holds industry-recognised certifications. You get proven experts - not consultants learning on your project.

Microsoft AI Cloud Partner Program  |  Partner ID: 7099667
AWS Partner Network Member
100+
Projects Delivered
15+
Industries Served
8
Service Domains

Are Your LLM Costs Under Control?

If you're spending more than you expected on LLM infrastructure - or you have no visibility into what's actually happening - we should talk.

Book a Free Consultation

No sales pitch. First call is a free diagnostic of your LLM stack.
