AGENCYSCRIPT
CoursesEnterpriseBlog
๐Ÿ‘‘FoundersSign inJoin Waitlist
AGENCYSCRIPT

Governed Certification Framework

The operating system for AI-enabled agency building. Certify judgment under constraint. Standards over scale. Governance over shortcuts.

Stay informed

Governance updates, certification insights, and industry standards.

Products

  • Platform
  • Certification
  • Launch Program
  • Vault
  • The Book

Certification

  • Foundation (AS-F)
  • Operator (AS-O)
  • Architect (AS-A)
  • Principal (AS-P)

Resources

  • Blog
  • Verify Credential
  • Enterprise
  • Partners
  • Pricing

Company

  • About
  • Contact
  • Careers
  • Press
ยฉ 2026 Agency Script, Inc.ยท
Privacy PolicyTerms of ServiceCertification AgreementSecurity

Standards over scale. Judgment over volume. Governance over shortcuts.

On This Page

QA System ArchitectureRetrieval-Augmented Generation (RAG)Retrieval ComponentGeneration ComponentGrounding and CitationKnowledge Base ManagementDocument IngestionChunking for QAKnowledge Gaps and CoverageQuality and AccuracyEvaluation FrameworkHallucination Detection and PreventionContinuous Quality MonitoringProduction ConsiderationsLatency OptimizationAccess ControlMulti-Language SupportYour Next Step
Home/Blog/Delivering Enterprise Question Answering Systems โ€” Building AI That Finds Answers in Your Client's Knowledge Base
Delivery

Delivering Enterprise Question Answering Systems โ€” Building AI That Finds Answers in Your Client's Knowledge Base

A

Agency Script Editorial

Editorial Team

ยทMarch 20, 2026ยท12 min read
question answeringragknowledge managemententerprise ai

A knowledge management agency in San Francisco was hired by a 50,000-employee technology company to solve their internal knowledge access problem. Employees spent an average of 23 minutes finding answers to work-related questions โ€” searching through internal wikis, Confluence pages, Slack archives, and shared drives. With 12,000 questions asked daily across the organization, that was 4,600 hours of employee time per day spent searching for information that already existed somewhere in the company's knowledge base. The agency built a question answering system that accepted natural language questions and returned precise answers extracted from the company's internal documents, complete with source citations. The system achieved 91% accuracy on questions where the answer existed in the knowledge base, reduced average answer time from 23 minutes to 45 seconds, and became the most-used internal tool within three months. The estimated productivity savings exceeded $18 million annually.

Enterprise question answering (QA) systems combine information retrieval with natural language understanding to provide direct answers to questions from a knowledge base. Unlike search, which returns a list of potentially relevant documents for the user to read, QA systems extract or generate the specific answer from the source material. For AI agencies, enterprise QA is one of the highest-ROI deliverables because the productivity improvement is immediate, measurable, and affects every employee in the organization.

QA System Architecture

Retrieval-Augmented Generation (RAG)

The dominant architecture for enterprise QA is RAG โ€” Retrieval-Augmented Generation. RAG combines a retrieval system that finds relevant documents with a generation model that produces answers from those documents.

RAG pipeline stages:

  1. Question processing: Parse the user's question, extract key terms, and optionally rephrase for better retrieval
  2. Document retrieval: Search the knowledge base to find the most relevant passages
  3. Context assembly: Select and arrange the retrieved passages to form the context for answer generation
  4. Answer generation: Feed the question and context to a language model that generates a precise answer
  5. Citation and verification: Link the answer back to source documents and verify that the answer is grounded in the retrieved context
  6. Response formatting: Present the answer with citations, confidence indicators, and links to source documents

Retrieval Component

The retrieval component determines the ceiling of your QA system's accuracy โ€” if the relevant document is not retrieved, the generation model cannot produce the correct answer.

Retrieval strategy:

  • Use semantic search (dense retrieval) as the primary retrieval method
  • Supplement with keyword search (sparse retrieval) for exact term matching
  • Combine with reciprocal rank fusion or a learned combination
  • Retrieve 10-20 passages for re-ranking, then pass the top 3-5 to the generation model

Retrieval quality targets:

  • Recall@10: The relevant passage should be in the top 10 results at least 90% of the time
  • Recall@3: The relevant passage should be in the top 3 results at least 80% of the time (because only the top 3-5 passages will be in the generation context)

Generation Component

The generation model reads the retrieved passages and the question, then produces a natural language answer.

Model selection:

  • GPT-4 / GPT-4o: Highest quality answer generation, best for complex questions requiring reasoning. Cost: approximately $0.01-0.03 per question (depending on context length).
  • Claude 3.5 Sonnet / Claude 3 Haiku: Strong quality, good for questions requiring careful reading and citation. Cost competitive with GPT-4.
  • Open-source models (Llama 3, Mistral): Self-hostable, lower per-query cost at scale, suitable when data privacy requires on-premises deployment. Quality slightly below frontier models but improving rapidly.

Generation prompt design:

  • Instruct the model to answer ONLY based on the provided context
  • Instruct the model to say "I don't have enough information" when the context does not contain the answer
  • Instruct the model to cite the specific source document for each claim in the answer
  • Include examples of well-formatted answers with citations
  • Specify the desired answer length and format (concise direct answer vs. detailed explanation)

Grounding and Citation

Enterprise QA systems must ground their answers in source documents. An answer without a verifiable source is not trustworthy in a business context.

Grounding enforcement:

  • Instruct the generation model to produce answers that are directly supported by the retrieved context
  • Implement post-generation verification that checks each claim in the answer against the source passages
  • Flag answers where the model appears to generate information not present in the context (hallucination detection)
  • Present citations inline with the answer text so users can verify each claim

Citation implementation:

  • Each passage in the context is labeled with a source identifier (document title, page number, section)
  • The generation model is instructed to include these identifiers in its answer
  • The UI renders citations as clickable links that open the source document at the relevant passage
  • Track citation click rates to measure user trust and verification behavior

Knowledge Base Management

Document Ingestion

The knowledge base must ingest documents from diverse sources and keep them up to date.

Common enterprise knowledge sources:

  • Internal wikis (Confluence, Notion, SharePoint)
  • Document repositories (Google Drive, SharePoint, Dropbox)
  • Communication archives (Slack, Teams, email)
  • Ticketing systems (Jira, ServiceNow, Zendesk)
  • Code repositories (GitHub, GitLab) for technical documentation
  • CRM notes and customer interaction records

Ingestion pipeline:

  1. Connect to each source via API or file system access
  2. Extract text content and metadata from each document
  3. Track document versions โ€” detect new, updated, and deleted documents
  4. Preprocess text (clean, normalize, chunk)
  5. Generate embeddings and update the vector index
  6. Schedule incremental updates to keep the index current

Freshness requirements:

  • For rapidly changing sources (Slack, ticketing systems): Update every 15-60 minutes
  • For moderately changing sources (wikis, document repositories): Update daily
  • For stable sources (policies, procedures): Update weekly or on change notification

Chunking for QA

QA systems benefit from different chunking strategies than general search.

Optimal chunk sizes for QA:

  • Short chunks (100-200 tokens): Higher precision โ€” each chunk is more likely to contain a focused, specific answer. But may lack context.
  • Medium chunks (200-500 tokens): Good balance of precision and context. The default choice for most QA systems.
  • Long chunks (500-1000 tokens): More context for complex questions. But may include irrelevant information that distracts the generation model.

Context window management:

  • Retrieve more passages than you include in the generation context
  • Use a re-ranker to select the most relevant passages
  • Concatenate selected passages with clear separators indicating the source of each passage
  • Include the question at the beginning and end of the context to help the model stay focused

Knowledge Gaps and Coverage

Identifying knowledge gaps:

  • Track questions where the system responds with "I don't have enough information"
  • Analyze these questions to identify topics not covered by the knowledge base
  • Report knowledge gaps to the client's content team so they can create documentation for uncovered topics
  • Track the gap closure rate over time as an indicator of knowledge base improvement

Quality and Accuracy

Evaluation Framework

QA systems need rigorous evaluation across multiple dimensions.

Evaluation dimensions:

  • Answer accuracy: Is the answer factually correct based on the source documents?
  • Answer completeness: Does the answer address all aspects of the question?
  • Answer relevance: Is the answer focused on what was asked, without unnecessary information?
  • Citation accuracy: Are the cited sources actually the sources of the information in the answer?
  • Hallucination rate: How often does the system generate information not present in the retrieved documents?
  • Abstention accuracy: When the system says it does not know, is it correct? (The relevant information truly is not in the knowledge base.)

Evaluation dataset:

  • Create 200-500 question-answer-source triples
  • Include questions of varying difficulty (factoid questions, multi-hop questions, comparison questions, procedural questions)
  • Include questions where the answer is NOT in the knowledge base (to test abstention behavior)
  • Have domain experts validate the ground truth answers
  • Version the evaluation set and update quarterly

Hallucination Detection and Prevention

Hallucination โ€” generating plausible but unsupported information โ€” is the most critical quality concern for enterprise QA.

Prevention strategies:

  • Strict grounding instructions: Explicitly instruct the model to only use information from the provided context
  • Low temperature: Use a generation temperature of 0.0-0.3 to reduce creative elaboration
  • Context sufficiency check: Before generating an answer, have the model assess whether the context contains sufficient information. If not, abstain.
  • Extractive bias: Instruct the model to prefer quoting directly from the source rather than paraphrasing

Detection strategies:

  • Entailment checking: Use a natural language inference model to verify that each sentence in the answer is entailed by a sentence in the context
  • Claim decomposition: Break the answer into individual claims and verify each claim against the source
  • Consistency checking: Generate the answer multiple times (with temperature > 0) and check for consistency. Inconsistent answers often indicate hallucination.

Continuous Quality Monitoring

Human evaluation loop:

  • Sample 2-5% of production questions and answers for human review
  • Have reviewers rate accuracy, completeness, and citation correctness
  • Track quality scores over time to detect degradation
  • Use reviewer corrections as feedback for system improvement

Automated quality metrics:

  • Track the proportion of questions where the system abstains (too high indicates poor retrieval, too low may indicate over-confidence)
  • Track answer length distribution (sudden changes may indicate generation quality issues)
  • Track citation density (answers without citations may indicate hallucination)
  • Track user feedback signals (thumbs up/down, follow-up questions on the same topic)

Production Considerations

Latency Optimization

Enterprise users expect answers within 2-5 seconds.

Latency breakdown:

  • Query embedding: 20-50ms
  • Retrieval: 20-100ms
  • Re-ranking: 100-300ms
  • Generation: 1-3 seconds (the bottleneck)
  • Total: 1.5-3.5 seconds

Optimization strategies:

  • Use streaming generation to show the answer progressively as it is generated
  • Cache answers for frequent questions (20-30% of questions are repeats)
  • Use a fast re-ranker (ColBERT or a small cross-encoder) to reduce re-ranking latency
  • Pre-compute embeddings for common query patterns

Access Control

Enterprise knowledge bases contain information with different access levels. The QA system must respect these access controls.

Access control implementation:

  • Tag each document with access permissions (which users or groups can see it)
  • At query time, filter the retrieval results to include only documents the querying user has access to
  • Never include restricted documents in the generation context for unauthorized users
  • Audit access patterns to detect unauthorized information exposure

Multi-Language Support

Enterprise knowledge bases often contain documents in multiple languages, and users may ask questions in their preferred language.

Cross-language QA approaches:

  • Use multilingual embedding models (Cohere Embed v3, multilingual-e5) that place documents in the same vector space regardless of language
  • Use a multilingual generation model that can read context in one language and generate answers in another
  • Alternatively, translate the query to the document's language for retrieval, and translate the answer back to the user's language

Your Next Step

Identify the single most common category of internal questions in your client's organization โ€” questions about HR policies, IT procedures, product specifications, or customer account information. Collect 100 real questions from employees in that category. For each question, find the answer in the existing documentation (this manual process proves the answers exist but are hard to find). Build a minimal RAG system using those 100 questions: embed the relevant documents, set up retrieval, and connect a generation model. Test the system on the 100 questions and measure answer accuracy. This proof of concept takes 2-3 days and produces the most compelling demo possible โ€” showing the client that their employees can get instant, accurate answers to questions that currently take 20+ minutes to research. Use the accuracy results and the demo to scope the full production project.

Search Articles

Categories

OperationsSalesDeliveryGovernance

Popular Tags

prompt engineeringai fundamentalsai toolsthe difference between AIMLagency operationsagency growthenterprise sales

Share Article

A

Agency Script Editorial

Editorial Team

The Agency Script editorial team delivers operational insights on AI delivery, certification, and governance for modern agency operators.

Related Articles

Delivery

Real-Time Stream Processing for AI Applications: The Complete Delivery Guide

When your client's AI model needs predictions in milliseconds instead of minutes, batch processing is not an option. Here is how to deliver production-grade stream processing for AI workloads.

A
Agency Script Editorial
March 21, 2026ยท14 min read
Delivery

Delivering Survival Analysis for Customer Retention: The AI Agency Playbook

A SaaS company knew their churn rate was 18 percent annually but could not predict when specific customers would leave. Survival analysis gave them a 90-day early warning system that saved $2.1 million in ARR.

A
Agency Script Editorial
March 21, 2026ยท13 min read
Delivery

Building Synthetic Data Generation Pipelines โ€” Creating Training Data When Real Data Is Scarce, Sensitive, or Biased

A healthcare AI company generated 500,000 synthetic patient records that preserved statistical patterns while eliminating privacy risk, cutting their model development timeline by 60%. Here is how to build synthetic data pipelines.

A
Agency Script Editorial
March 21, 2026ยท12 min read

Ready to certify your AI capability?

Join the professionals building governed, repeatable AI delivery systems.

Explore Certification