The right next step is to turn your current metadata lookup system into a routed retrieval system with offline document preparation and grounded answer generation.
Your current design already proves one important thing: the corpus is structured enough to answer exact questions like “What is the Gazette for date X?” The part that is missing is not “more RAG.” It is a second layer that can read inside the matched document, select the right sections, and synthesize them without losing legal structure. Public-sector chatbot guidance points in the same direction: these systems work best when they are narrow, task-oriented, use plain language, keep answers short, stay current, and provide a clear path to human follow-up when they cannot fully resolve the question. (National Center for State Courts)
The architecture I would use
I would split the system into two lanes.
Lane 1: deterministic lookup
Use this for questions such as:
- “What is the Gazette for date X?”
- “Which ordinance was published on day Y?”
- “What is the latest version of rule Z?”
This lane should mostly bypass the LLM. It should hit PostgreSQL or your structured store directly, return the exact record, and optionally let the LLM rewrite it into plain language. This keeps latency low and avoids hallucinations.
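The deterministic lane can be sketched without any LLM in the loop. This is a minimal illustration, assuming a hypothetical in-memory store standing in for the PostgreSQL metadata table; the function names and record shapes are placeholders, not an existing API.

```python
import re
from datetime import date

# Hypothetical in-memory stand-in for the PostgreSQL metadata table.
GAZETTES = {
    date(2024, 3, 15): {"gazette_no": "2024-11", "title": "Municipal Gazette No. 11"},
}

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def lookup_gazette(question: str):
    """Deterministic lane: parse an ISO date out of the question and hit
    the structured store directly. No model call, no hallucination risk."""
    m = DATE_RE.search(question)
    if not m:
        return None  # no exact date found: fall through to the synthesis lane
    key = date(*map(int, m.groups()))
    return GAZETTES.get(key)
```

If the lookup returns a record, the LLM's only optional job is to restate it in plain language; if it returns `None`, the router hands the question to Lane 2.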
Lane 2: grounded synthesis
Use this for questions such as:
- “Summarize the ordinance from Gazette X.”
- “How do I file a complaint?”
- “When did street parking legislation first emerge?”
This lane should do five things in order:
- identify the document set with metadata
- retrieve the relevant text sections
- expand to surrounding legal structure
- synthesize from those sections only
- answer with citations
This is where tool calling helps. OpenAI’s function-calling guide is explicit that models can be connected to external data and actions through structured tools, which is exactly what you need for routing between SQL lookup, document retrieval, summary generation, and procedural extraction. (OpenAI Developers)
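The routing itself can be expressed as tool definitions in the function-calling style. The sketch below uses the OpenAI `tools` schema shape; the tool names (`sql_lookup`, `retrieve_sections`) and their parameters are hypothetical placeholders you would adapt to your own services.

```python
# Hypothetical tool schemas in the OpenAI function-calling style.
# The model picks a tool per turn; your code executes it and returns results.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "sql_lookup",
            "description": "Exact metadata lookup: gazette by date, ordinance by number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO date, e.g. 2024-03-15"},
                    "ordinance_no": {"type": "string"},
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "retrieve_sections",
            "description": "Hybrid retrieval over indexed legal sections.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "topic": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
]
```

In a request you would pass this list as the `tools` parameter and dispatch on the tool call the model emits.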
Orchestration for 50+ concurrent users
For your scale, I would not overbuild. A good production baseline is:
- FastAPI for the HTTP layer
- PostgreSQL for metadata and legal structure
- Qdrant for semantic and hybrid retrieval
- Redis + Celery for background jobs
- object storage for PDFs and derived artifacts
- API-based LLMs first, not self-hosting first
FastAPI’s own deployment docs say that when deploying you typically want replication to take advantage of multiple cores and handle more requests. They also note that in Kubernetes you will usually run a single Uvicorn process per container and scale by replicas instead of cramming many workers into one container. Celery’s docs describe the exact queue model you need: a task queue with workers consuming jobs from a broker, supporting both real-time processing and scheduling. (FastAPI)
The key is not “how many workers?” The key is queue separation.
I would create three execution classes:
A. Interactive queue
For:
- exact lookup
- short procedural answers
- answer generation over already indexed chunks
Target latency here should feel like chat.
B. Heavy synthesis queue
For:
- full-gazette summaries
- multi-document comparisons
- historical timeline reconstruction
These jobs are slower and should not block chat.
C. Ingestion queue
For:
- PDF parsing
- OCR fallback
- embeddings
- summary generation
- document reindexing
That separation is what protects the user experience when several users ask expensive questions at the same time.
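The three execution classes above boil down to a routing table from task name to queue. This is a plain-Python sketch of that mapping; with Celery itself, an equivalent dict would go into `app.conf.task_routes`, and the task names here are hypothetical.

```python
# Hypothetical Celery-style task routing: task name -> queue.
TASK_ROUTES = {
    "chat.lookup":       {"queue": "interactive"},
    "chat.answer":       {"queue": "interactive"},
    "summarize.gazette": {"queue": "heavy"},
    "summarize.compare": {"queue": "heavy"},
    "timeline.build":    {"queue": "heavy"},
    "ingest.parse_pdf":  {"queue": "ingest"},
    "ingest.ocr":        {"queue": "ingest"},
    "ingest.embed":      {"queue": "ingest"},
}

def queue_for(task_name: str) -> str:
    """Resolve which queue a task lands on."""
    return TASK_ROUTES[task_name]["queue"]
```

You would then run dedicated workers per queue (with Celery, something like `celery -A app worker -Q interactive` and separate workers for `heavy` and `ingest`), so a burst of full-gazette summaries never starves chat.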
Context processing for large gazettes and laws
This is your second major problem, and the answer is hierarchical summarization plus parent-child retrieval.
OpenAI’s long-document summarization cookbook shows the correct pattern: split a large document into manageable pieces, summarize the pieces, then combine them into a higher-level summary with controllable detail. That is much more reliable than stuffing one huge document into a single prompt. (OpenAI Developers)
For your corpus, I would precompute four levels:
1. Section summary
For each article, clause, or ordinance block:
- plain-language summary
- legal summary
- key obligations
- dates
- penalties
- responsible office
2. Document summary
For each ordinance or law:
- what it does
- what changed
- who is affected
- current status
- relationship to prior rules
3. Gazette-day summary
For each gazette date:
- all ordinances
- major changes
- repeals
- amendments
- citizen-facing impact
4. Topic timeline summary
For topics like parking, complaints, permits, zoning:
- first appearance
- major amendments
- latest controlling rule
- related procedures
That gives you a very efficient runtime pattern:
- retrieve the record with metadata
- load the precomputed summary
- verify against the source sections
- answer with evidence
This is much faster and cheaper than re-reading the whole PDF every time.
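The hierarchical pattern behind those precomputed levels is a map-reduce over sections. The sketch below stubs the LLM call with a trivial first-sentence extractor so the control flow is visible; in production `summarize` would be a model call and `split_sections` would split on legal structure, not character count.

```python
def split_sections(document: str, max_chars: int = 2000) -> list[str]:
    """Naive splitter for illustration; in practice, split on articles/clauses."""
    return [document[i:i + max_chars] for i in range(0, len(document), max_chars)]

def summarize(text: str) -> str:
    """Stub standing in for an LLM summarization call."""
    return text.split(".")[0].strip() + "."

def hierarchical_summary(document: str) -> str:
    """Map: summarize each section. Reduce: combine, and re-summarize
    the combination if it is still too long for one pass."""
    section_summaries = [summarize(s) for s in split_sections(document)]
    combined = " ".join(section_summaries)
    return summarize(combined) if len(combined) > 2000 else combined
```

The same skeleton produces all four levels: run it per section, per document, per gazette day, and per topic, storing each result so runtime queries never re-read the PDF.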
Document parsing is not optional
Your current failure mode likely starts before retrieval.
If the parser breaks reading order, merges columns, loses tables, or cuts article boundaries badly, the best model in the world will still answer poorly. That is why I would treat parsing as a first-class subsystem.
Docling’s docs say it supports advanced PDF understanding, including page layout, reading order, table structure, formulas, and a unified document representation. PaddleOCR-VL-1.5’s current model card says it is built for document understanding and reaches state-of-the-art accuracy on OmniDocBench v1.5, with strong robustness to skew, warping, and screen-photography artifacts. (Docling Project)
For your project, that means:
- use Docling first for structured extraction
- use OCR/document VLM fallback only for hard scans
- chunk by legal structure, not just by token count
I would chunk at three levels:
- full document
- structural unit like chapter or article range
- small retrieval chunk
At runtime, retrieve small chunks, rerank them, then expand to the parent structural unit before asking the LLM to answer.
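Parent-child expansion is simple once each small chunk stores a pointer to its structural unit. A minimal sketch, assuming a hypothetical chunk store keyed by ID:

```python
# Hypothetical chunk store: small retrieval chunks point at their
# parent structural unit (chapter or article range).
PARENTS = {
    "art-12": "Article 12 - full text of the article ...",
}
CHUNKS = {
    "c1": {"text": "fines for overnight parking", "parent": "art-12"},
    "c2": {"text": "permit renewal deadlines", "parent": "art-12"},
}

def expand_to_parents(hit_ids: list[str]) -> list[str]:
    """After reranking, swap small chunks for their parent units,
    deduplicated and in hit order, before prompting the LLM."""
    seen, parents = set(), []
    for cid in hit_ids:
        pid = CHUNKS[cid]["parent"]
        if pid not in seen:
            seen.add(pid)
            parents.append(PARENTS[pid])
    return parents
```

Retrieval precision comes from the small chunks; answer quality comes from handing the model the whole article instead of a fragment.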
Which vector database I recommend
For your workload, my first recommendation is Qdrant.
Why:
- it supports hybrid retrieval
- it can combine dense and sparse queries
- it documents Reciprocal Rank Fusion for combining them
- it works well with metadata-heavy search patterns
Qdrant’s hybrid query docs show exactly the pattern you need: prefetch sparse and dense candidates, fuse them with RRF, then limit the final set. That is a strong fit for legal corpora because legal search needs both semantic meaning and lexical precision. (qdrant.tech)
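Qdrant performs the fusion server-side, but the underlying Reciprocal Rank Fusion formula is worth seeing in isolation: each document scores the sum of `1 / (k + rank)` across the result lists it appears in. A self-contained sketch:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over ranked ID lists (e.g. dense + sparse).
    score(doc) = sum over lists of 1 / (k + rank); higher is better."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks moderately well in both the dense and the sparse list beats one that tops only a single list, which is exactly the behavior you want when legal phrasing and semantic meaning both matter.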
Why not pure vector search
Your corpus has:
- dates
- gazette numbers
- ordinance numbers
- jurisdictions
- statuses
- exact legal phrases
So the right pattern is:
metadata filter → hybrid retrieval → rerank → synthesize
When I would choose something else
If you want the least operational work, Pinecone is a credible managed alternative. Its own docs recommend a single hybrid index for most cases because it reduces operational overhead and allows single-request hybrid queries. (Pinecone Docs)
If you want the smallest possible stack and your corpus is still moderate, pgvector is a real option. But its own README shows the tradeoff: with approximate indexes, filtering is applied after the index scan, so filtered ANN queries often need tuning with ef_search, iterative scans, partial indexes, or partitioning. That is workable, but it is not as clean for metadata-heavy legal search as a dedicated retrieval system. (GitHub)
So my recommendation is:
- Qdrant + PostgreSQL for the best balance
- Pinecone + PostgreSQL if you want lower ops
- pgvector only if simplicity matters more than retrieval sophistication
API models or local models
For your current stage, I would start with API models.
Not because local serving is bad, but because your true bottlenecks are still:
- parsing
- routing
- retrieval quality
- summary design
- evidence formatting
Local inference becomes attractive later, when you know your real token volume and your workload is steady enough to keep GPUs busy.
Why APIs make sense first
OpenAI’s current API pricing page lists gpt-5.4-mini at $0.75 per 1M input tokens and $4.50 per 1M output tokens, with Batch pricing at half that level. The same docs say prompt caching can reduce latency by up to 80% and input token cost by up to 90% when requests share long prompt prefixes, and the Batch API gives 50% lower costs, a separate higher-rate-limit pool, and async completion within 24 hours. OpenAI’s data-controls guide also says API data is not used to train or improve OpenAI models unless you explicitly opt in. (OpenAI Developers)
Anthropic’s current pricing and Sonnet pages say Claude Sonnet 4.6 starts at $3 per million input tokens and $15 per million output tokens, supports a 1M-token context window, and also offers up to 90% savings with prompt caching and 50% savings with batch processing. (Claude Platform)
That leads to a practical rule:
Use API models for:
- live chat answers
- routing
- short summaries
- high-quality final answer synthesis
Use Batch / offline jobs for:
- embeddings
- document summaries
- gazette summaries
- timeline generation
- nightly reprocessing
This is the highest-leverage cost optimization for your case.
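The arithmetic behind that rule is simple enough to sanity-check. The sketch below computes monthly cost from token volume and per-million-token prices, applying the 50% batch discount to whatever share of traffic you move offline; the volumes in the usage example are hypothetical, and the prices are just the figures quoted above.

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float,
                 batch_share: float = 0.0) -> float:
    """USD cost for a month of traffic.
    input_mtok/output_mtok: millions of tokens; prices are per 1M tokens.
    batch_share: fraction of traffic routed through the 50%-off Batch API."""
    base = input_mtok * in_price + output_mtok * out_price
    return base * (1 - batch_share) + base * batch_share * 0.5
```

For example, 10M input and 2M output tokens at $0.75/$4.50 costs $16.50 fully interactive, and $8.25 if all of it could run as batch; in practice you land in between, which is why pushing summaries and embeddings offline is the first lever to pull.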
When local inference becomes worth it
vLLM is the right self-hosting path when you get there, because it provides an OpenAI-compatible server. That lowers migration friction. But self-hosting only starts to make economic sense when you have one or more of these conditions:
- strict sovereignty or residency needs
- large, steady token volume
- strong in-house ops capability
- predictable, heavy offline workloads
Until then, APIs are usually cheaper in total engineering cost, even if the raw per-token price looks higher. vLLM solves serving. It does not solve parsing, retrieval, routing, monitoring, or concurrency tuning for you. (vLLM)
How your example questions should work
“When did street parking legislation first emerge?”
This should not be treated as plain semantic search.
The system should:
- detect a history/timeline question
- filter candidate documents by topic and jurisdiction
- sort by publication or effective date
- retrieve the earliest candidates
- verify that they actually introduce parking rules
- answer with the earliest verified source and later milestones
That is metadata logic plus retrieval plus synthesis.
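The timeline steps above reduce to a sort-and-verify loop over already-filtered candidates. A minimal sketch, assuming hypothetical records where an upstream checker step has already set an `introduces_rule` flag:

```python
from datetime import date

def earliest_introducing(candidates: list[dict]):
    """Sort topic-filtered candidates by publication date and return the
    first one verified to actually introduce the rule, else None."""
    for rec in sorted(candidates, key=lambda r: r["published"]):
        if rec.get("introduces_rule"):
            return rec
    return None
```

Note that the earliest document by date is not automatically the answer: a gazette may merely mention parking before any ordinance regulates it, which is why the verification flag gates the result.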
“How do I submit a law for floor approval?”
This is a procedure question.
The system should:
- search manuals, workflow rules, forms, and deadlines
- extract a step-by-step procedure
- answer in plain language
- cite the rule and any required forms or offices
“How do I file a complaint?”
This should return:
- who can file
- where to file
- required documents
- deadlines
- online or in-person options
- what happens next
That is a structured service answer, not a chunk dump.
Product design ideas that fit this domain
The strongest product idea is to stop thinking in terms of “chat only” and start thinking in terms of answer cards.
I would render answers as:
Direct answer
One short paragraph.
Legal basis
Gazette, ordinance, section, date.
What to do next
For procedures.
Related resources
Forms, offices, newer version, older version.
Limits
If the evidence is weak, say so.
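The card layout above can be rendered deterministically, which keeps the presentation layer out of the model's hands. A sketch, assuming the synthesis step returns a dict keyed by section name:

```python
def render_answer_card(card: dict) -> str:
    """Render card sections in a fixed order, omitting empty ones.
    Keys are the section names from the answer-card design."""
    order = ["Direct answer", "Legal basis", "What to do next",
             "Related resources", "Limits"]
    lines = [f"{s}: {card[s]}" for s in order if card.get(s)]
    return "\n".join(lines)
```

Because the order and labels are fixed in code, two answers about different ordinances always look structurally identical, which is part of what makes the widget feel like a government service rather than a chat transcript.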
That matches public legal-information guidance well. The NCSC guide explicitly recommends plain language, short responses, clear expectation-setting, and a clear path to follow up with the court when the chatbot cannot answer everything. (National Center for State Courts)
My direct answers to your four questions
1. Orchestration
Use FastAPI + PostgreSQL + Qdrant + Redis/Celery + object storage + API models. Scale FastAPI by replicas. Keep interactive traffic separate from heavy summary and ingestion jobs. (FastAPI)
2. Context processing
Use hierarchical summarization, structural chunking, parent-child retrieval, and precomputed summaries. Do not rely on one huge prompt per gazette. (OpenAI Developers)
3. Vector infrastructure
Use Qdrant first. It is a strong fit for metadata-aware hybrid retrieval. Use Pinecone if you want lower operational burden. Use pgvector only if you deliberately want a smaller, simpler stack and can tune filtered ANN behavior yourself. (qdrant.tech)
4. Cost-benefit
For your scale today, API models are the better first choice. Use prompt caching and batch for the heavy offline work. Revisit vLLM only after you have measured real usage and know that GPU utilization will stay high enough to justify the operational overhead. (OpenAI Developers)
Bottom line
Your best architecture is not “better RAG.” It is:
structured ingestion → metadata routing → hybrid retrieval → reranking → grounded synthesis → citations
That is the design that will let your widget feel fluid like a general AI assistant while still behaving like a trustworthy government information system. The next concrete step is to define four request flows only: lookup, summary, procedure, and history.