Architecture Suggestions for a Chatbot (Website Widget)

Personal Project: My idea is to create a chatbot widget for a state-government-focused website. The goal is for the AI to answer questions based on a robust database (PDFs, legislation, and metadata). Currently, I use a RAG (Retrieval-Augmented Generation) system that handles simple metadata searches (e.g., “What is the Gazette for date X?”). I built this using pre-defined prefixes where the user asks a question, the system searches the database, checks the extracted metadata, and returns it. However, if a user asks, “Summarize the ordinance from Gazette X,” the system fails because it lacks the AI logic for that type of processing.

I am facing technical limitations in two main pillars:

  1. Scalability: How can I support 50+ concurrent users while maintaining performance?

  2. Synthesis Capability: The current system locates the document but cannot “read” or summarize the internal content (e.g., “Summarize the ordinances from day X”) efficiently.

The Challenge

My database is structured, but the current retrieval logic is limited to search filters rather than an LLM operating over the file content. I need to evolve the architecture so the AI doesn’t just find the file but processes the text within it to generate contextual answers.

Specific Questions:

  • Orchestration: For 50 concurrent users, what is the best stack to manage request queues and concurrency?

  • Context Processing: How do you handle extensive documents (Gazettes/Laws) so that summaries fit within the LLM’s context window without losing crucial information?

  • Vector Infrastructure: Which vector database do you recommend for this workload to ensure low latency?

  • Cost-Benefit Ratio: Considering scale, is it more cost-effective to use API-based models (OpenAI/Anthropic) or local instances (e.g., Llama 3 via vLLM) to process these summaries?

Desired Workflow Example:

The user would ask:

  • “When did street parking legislation first emerge?”

  • “How do I submit a law for floor approval?”

  • “How do I file a complaint?”

The AI should answer the question directly, similar to existing AIs (Gemini, ChatGPT, etc.). The user should have a fluid experience—asking about legislative processes or history and receiving a synthesized response based on my database, rather than just a link to a PDF.




The right next step is to turn your current metadata lookup system into a routed retrieval system with offline document preparation and grounded answer generation.

Your current design already proves one important thing: the corpus is structured enough to answer exact questions like “What is the Gazette for date X?” The part that is missing is not “more RAG.” It is a second layer that can read inside the matched document, select the right sections, and synthesize them without losing legal structure. Public-sector chatbot guidance points in the same direction: these systems work best when they are narrow, task-oriented, use plain language, keep answers short, stay current, and provide a clear path to human follow-up when they cannot fully resolve the question. (National Center for State Courts)

The architecture I would use

I would split the system into two lanes.

Lane 1: deterministic lookup

Use this for questions such as:

  • “What is the Gazette for date X?”
  • “Which ordinance was published on day Y?”
  • “What is the latest version of rule Z?”

This lane should mostly bypass the LLM. It should hit PostgreSQL or your structured store directly, return the exact record, and optionally let the LLM rewrite it into plain language. This keeps latency low and avoids hallucinations.

Lane 2: grounded synthesis

Use this for questions such as:

  • “Summarize the ordinance from Gazette X.”
  • “How do I file a complaint?”
  • “When did street parking legislation first emerge?”

This lane should do five things in order:

  1. identify the document set with metadata
  2. retrieve the relevant text sections
  3. expand to surrounding legal structure
  4. synthesize from those sections only
  5. answer with citations

This is where tool calling helps. OpenAI’s function-calling guide is explicit that models can be connected to external data and actions through structured tools, which is exactly what you need for routing between SQL lookup, document retrieval, summary generation, and procedural extraction. (OpenAI Developers)
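The tool layer that guide describes can be sketched as plain schema definitions. The tool names (`sql_lookup`, `retrieve_sections`) and parameter shapes below are hypothetical, and the exact wire format differs slightly between the Chat Completions and Responses APIs, so treat this as a shape to adapt rather than a drop-in definition.

```python
# Hedged sketch of tool definitions in the OpenAI function-calling style
# (Chat Completions `tools` format). Tool names here are hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "sql_lookup",
            "description": "Exact metadata lookup (gazette number, date, status).",
            "parameters": {
                "type": "object",
                "properties": {"gazette_date": {"type": "string"}},
                "required": ["gazette_date"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "retrieve_sections",
            "description": "Hybrid retrieval of document sections for synthesis.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "doc_id": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
]
```

The model picks a tool, your code executes it, and the result goes back into the conversation, which is exactly the routing layer Lane 2 needs.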

Orchestration for 50+ concurrent users

For your scale, I would not overbuild. A good production baseline is:

  • FastAPI for the HTTP layer
  • PostgreSQL for metadata and legal structure
  • Qdrant for semantic and hybrid retrieval
  • Redis + Celery for background jobs
  • object storage for PDFs and derived artifacts
  • API-based LLMs first, not self-hosting first

FastAPI’s own deployment docs say that when deploying you typically want replication to take advantage of multiple cores and handle more requests. They also note that in Kubernetes you will usually run a single Uvicorn process per container and scale by replicas instead of cramming many workers into one container. Celery’s docs describe the exact queue model you need: a task queue with workers consuming jobs from a broker, supporting both real-time processing and scheduling. (FastAPI)

The key is not “how many workers?” The key is queue separation.

I would create three execution classes:

A. Interactive queue

For:

  • exact lookup
  • short procedural answers
  • answer generation over already indexed chunks

Target latency here should feel like chat.

B. Heavy synthesis queue

For:

  • full-gazette summaries
  • multi-document comparisons
  • historical timeline reconstruction

These jobs are slower and should not block chat.

C. Ingestion queue

For:

  • PDF parsing
  • OCR fallback
  • embeddings
  • summary generation
  • document reindexing

That separation is what protects the user experience when several users ask expensive questions at the same time.
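The routing itself is just a mapping from task name to execution class. A minimal sketch, with hypothetical task names:

```python
# Sketch of the three execution classes. In Celery this mapping becomes
# `app.conf.task_routes`; the task names are hypothetical.
TASK_ROUTES = {
    "chat.lookup": "interactive",
    "chat.answer": "interactive",
    "summarize.gazette": "heavy",
    "compare.documents": "heavy",
    "ingest.parse_pdf": "ingestion",
    "ingest.embed": "ingestion",
}

def queue_for(task_name: str) -> str:
    """Route a task to its execution class; unknown tasks default to the slow lane."""
    return TASK_ROUTES.get(task_name, "heavy")
```

With Celery you would express the same mapping via `task_routes` and run a dedicated worker pool per queue (for example `celery -A app worker -Q interactive`), so a burst of gazette summaries can never starve chat responses.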

Context processing for large gazettes and laws

This is your second major problem, and the answer is hierarchical summarization plus parent-child retrieval.

OpenAI’s long-document summarization cookbook shows the correct pattern: split a large document into manageable pieces, summarize the pieces, then combine them into a higher-level summary with controllable detail. That is much more reliable than stuffing one huge document into a single prompt. (OpenAI Developers)
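The split-summarize-combine pattern can be sketched with the LLM call stubbed out as a plain callable, which is also how I would unit-test it. Chunking by character count here is a simplification; for gazettes you would chunk by legal structure instead.

```python
from typing import Callable

def hierarchical_summary(text: str, summarize: Callable[[str], str],
                         chunk_chars: int = 2000) -> str:
    """Map-reduce summarization sketch: split the document, summarize each
    piece, then summarize the concatenated partial summaries.
    `summarize` is whatever LLM call you use; it is injected so the
    pipeline logic stays testable without an API key."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [summarize(c) for c in chunks]
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))
```

For very large gazettes you may need more than two levels (summaries of summaries), which is exactly the precomputed hierarchy described below.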

For your corpus, I would precompute four levels:

1. Section summary

For each article, clause, or ordinance block:

  • plain-language summary
  • legal summary
  • key obligations
  • dates
  • penalties
  • responsible office

2. Document summary

For each ordinance or law:

  • what it does
  • what changed
  • who is affected
  • current status
  • relationship to prior rules

3. Gazette-day summary

For each gazette date:

  • all ordinances
  • major changes
  • repeals
  • amendments
  • citizen-facing impact

4. Topic timeline summary

For topics like parking, complaints, permits, zoning:

  • first appearance
  • major amendments
  • latest controlling rule
  • related procedures

That gives you a very efficient runtime pattern:

  • retrieve the record with metadata
  • load the precomputed summary
  • verify against the source sections
  • answer with evidence

This is much faster and cheaper than re-reading the whole PDF every time.

Document parsing is not optional

Your current failure mode likely starts before retrieval.

If the parser breaks reading order, merges columns, loses tables, or cuts article boundaries badly, the best model in the world will still answer poorly. That is why I would treat parsing as a first-class subsystem.

Docling’s docs say it supports advanced PDF understanding, including page layout, reading order, table structure, formulas, and a unified document representation. PaddleOCR-VL-1.5’s current model card says it is built for document understanding and reaches state-of-the-art accuracy on OmniDocBench v1.5, with strong robustness to skew, warping, and screen-photography artifacts. (Docling Project)

For your project, that means:

  • use Docling first for structured extraction
  • use OCR/document VLM fallback only for hard scans
  • chunk by legal structure, not just by token count

I would chunk at three levels:

  • full document
  • structural unit like chapter or article range
  • small retrieval chunk

At runtime, retrieve small chunks, rerank them, then expand to the parent structural unit before asking the LLM to answer.
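A minimal sketch of structural chunking with parent expansion. The regex assumes articles are introduced by markers like "Art. 1"; that pattern is an assumption about your gazettes' formatting and will need adapting.

```python
import re

def split_articles(doc_text: str) -> list[dict]:
    """Chunk by legal structure: one parent chunk per article.
    The 'Art. N' marker is a hypothetical pattern; adjust the regex
    to however your gazettes label structural units."""
    parts = re.split(r"(?=\bArt\.\s*\d+)", doc_text)
    kept = [p.strip() for p in parts if p.strip()]
    return [{"article_id": i, "text": t} for i, t in enumerate(kept)]

def expand_to_parent(hit_article_id: int, articles: list[dict]) -> str:
    """Parent-child retrieval: after a small chunk matches, hand the LLM
    the full article it belongs to, not just the matched sentence."""
    return articles[hit_article_id]["text"]
```

In practice each small retrieval chunk carries its parent's ID as metadata, so the expansion step is a single lookup rather than a re-parse.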

Which vector database I recommend

For your workload, my first recommendation is Qdrant.

Why:

  • it supports hybrid retrieval
  • it can combine dense and sparse queries
  • it documents Reciprocal Rank Fusion for combining them
  • it works well with metadata-heavy search patterns

Qdrant’s hybrid query docs show exactly the pattern you need: prefetch sparse and dense candidates, fuse them with RRF, then limit the final set. That is a strong fit for legal corpora because legal search needs both semantic meaning and lexical precision. (qdrant.tech)
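Qdrant performs the fusion server-side, but the RRF formula itself is simple enough to sketch in pure Python, which also makes the behavior easy to reason about. The constant k=60 is the commonly used default.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank).
    `rankings` are ordered ID lists, e.g. one from dense retrieval and one
    from sparse/lexical retrieval. Documents ranked well in both lists
    accumulate the highest scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([["a", "b", "c"], ["a", "c", "d"]])
```

In qdrant-client this roughly corresponds to a `query_points` call with two `prefetch` entries (dense and sparse) and an RRF `FusionQuery`, so you get the same behavior in a single request without implementing the math yourself.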

Why not pure vector search

Your corpus has:

  • dates
  • gazette numbers
  • ordinance numbers
  • jurisdictions
  • statuses
  • exact legal phrases

So the right pattern is:

metadata filter → hybrid retrieval → rerank → synthesize

When I would choose something else

If you want the least operational work, Pinecone is a credible managed alternative. Its own docs recommend a single hybrid index for most cases because it reduces operational overhead and allows single-request hybrid queries. (Pinecone Docs)

If you want the smallest possible stack and your corpus is still moderate, pgvector is a real option. But its own README shows the tradeoff: with approximate indexes, filtering is applied after the index scan, so filtered ANN queries often need tuning with ef_search, iterative scans, partial indexes, or partitioning. That is workable, but it is not as clean for metadata-heavy legal search as a dedicated retrieval system. (GitHub)

So my recommendation is:

  • Qdrant + PostgreSQL for the best balance
  • Pinecone + PostgreSQL if you want lower ops
  • pgvector only if simplicity matters more than retrieval sophistication

API models or local models

For your current stage, I would start with API models.

Not because local serving is bad, but because your true bottlenecks are still:

  • parsing
  • routing
  • retrieval quality
  • summary design
  • evidence formatting

Local inference becomes attractive later, when you know your real token volume and your workload is steady enough to keep GPUs busy.

Why APIs make sense first

OpenAI’s current API pricing page lists gpt-5.4-mini at $0.75 per 1M input tokens and $4.50 per 1M output tokens, with Batch pricing at half that level. The same docs say prompt caching can reduce latency by up to 80% and input token cost by up to 90% when requests share long prompt prefixes, and the Batch API gives 50% lower costs, a separate higher-rate-limit pool, and async completion within 24 hours. OpenAI’s data-controls guide also says API data is not used to train or improve OpenAI models unless you explicitly opt in. (OpenAI Developers)

Anthropic’s current pricing and Sonnet pages say Claude Sonnet 4.6 starts at $3 per million input tokens and $15 per million output tokens, supports a 1M-token context window, and also offers up to 90% savings with prompt caching and 50% savings with batch processing. (Claude Platform)

That leads to a practical rule:

Use API models for:

  • live chat answers
  • routing
  • short summaries
  • high-quality final answer synthesis

Use Batch / offline jobs for:

  • embeddings
  • document summaries
  • gazette summaries
  • timeline generation
  • nightly reprocessing

This is the highest-leverage cost optimization for your case.

When local inference becomes worth it

vLLM is the right self-hosting path when you get there, because it provides an OpenAI-compatible server. That lowers migration friction. But self-hosting only starts to make economic sense when you have one or more of these conditions:

  • strict sovereignty or residency needs
  • large, steady token volume
  • strong in-house ops capability
  • predictable, heavy offline workloads

Until then, APIs are usually cheaper in total engineering cost, even if the raw per-token price looks higher. vLLM solves serving. It does not solve parsing, retrieval, routing, monitoring, or concurrency tuning for you. (vLLM)

How your example questions should work

“When did street parking legislation first emerge?”

This should not be treated as plain semantic search.

The system should:

  1. detect a history/timeline question
  2. filter candidate documents by topic and jurisdiction
  3. sort by publication or effective date
  4. retrieve the earliest candidates
  5. verify that they actually introduce parking rules
  6. answer with the earliest verified source and later milestones

That is metadata logic plus retrieval plus synthesis.

“How do I submit a law for floor approval?”

This is a procedure question.

The system should:

  1. search manuals, workflow rules, forms, and deadlines
  2. extract a step-by-step procedure
  3. answer in plain language
  4. cite the rule and any required forms or offices

“How do I file a complaint?”

This should return:

  • who can file
  • where to file
  • required documents
  • deadlines
  • online or in-person options
  • what happens next

That is a structured service answer, not a chunk dump.

Product design ideas that fit this domain

The strongest product idea is to stop thinking in terms of “chat only” and start thinking in terms of answer cards.

I would render answers as:

Direct answer

One short paragraph.

Legal basis

Gazette, ordinance, section, date.

What to do next

For procedures.

Related resources

Forms, offices, newer version, older version.

Limits

If the evidence is weak, say so.

That matches public legal-information guidance well. The NCSC guide explicitly recommends plain language, short responses, clear expectation-setting, and a clear path to follow up with the court when the chatbot cannot answer everything. (National Center for State Courts)

My direct answers to your four questions

1. Orchestration

Use FastAPI + PostgreSQL + Qdrant + Redis/Celery + object storage + API models. Scale FastAPI by replicas. Keep interactive traffic separate from heavy summary and ingestion jobs. (FastAPI)

2. Context processing

Use hierarchical summarization, structural chunking, parent-child retrieval, and precomputed summaries. Do not rely on one huge prompt per gazette. (OpenAI Developers)

3. Vector infrastructure

Use Qdrant first. It is a strong fit for metadata-aware hybrid retrieval. Use Pinecone if you want lower operational burden. Use pgvector only if you deliberately want a smaller, simpler stack and can tune filtered ANN behavior yourself. (qdrant.tech)

4. Cost-benefit

For your scale today, API models are the better first choice. Use prompt caching and batch for the heavy offline work. Revisit vLLM only after you have measured real usage and know that GPU utilization will stay high enough to justify the operational overhead. (OpenAI Developers)

Bottom line

Your best architecture is not “better RAG.” It is:

structured ingestion → metadata routing → hybrid retrieval → reranking → grounded synthesis → citations

That is the design that will let your widget feel fluid like a general AI assistant while still behaving like a trustworthy government information system. The next concrete step is to define four request flows only: lookup, summary, procedure, and history.
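As a starting point, those four flows can be routed with simple heuristics before you invest in LLM-based routing. The keyword patterns below are placeholders; in production this decision would come from function calling or a small classifier.

```python
import re

# Placeholder heuristics for the four request flows. In production the
# routing decision would come from the LLM (function calling) or a
# lightweight trained classifier, not hand-written keywords.
FLOW_PATTERNS = {
    "summary": r"\bsummar",
    "history": r"\b(first|when did|emerge|history)\b",
    "procedure": r"\b(how do i|how to|submit|file)\b",
}

def route(question: str) -> str:
    q = question.lower()
    for flow, pattern in FLOW_PATTERNS.items():
        if re.search(pattern, q):
            return flow
    return "lookup"  # exact metadata questions fall through to the fast lane
```

Even this crude version lets you measure per-flow latency and cost separately from day one, which is the data you need before deciding on self-hosting.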


Excellent explanation, thank you very much.

There were many points I hadn’t yet analyzed from this perspective. Now I have a much clearer direction on how to scale the repository. I’m going to dive deeper into these recommendations. This was very helpful!


I don’t understand. I have made RAG AIs with 2 different services (Google NotebookLM https://notebooklm.google.com and Copilot Studio https://copilotstudio.microsoft.com ) and they can summarize anything that is in their source documents.

Are you writing your own program to implement the RAG? With what language?

There should be several Python libraries that do what you want. I don’t know about JavaScript libraries.
