A knowledge management agency in San Francisco was hired by a 50,000-employee technology company to solve their internal knowledge access problem. Employees spent an average of 23 minutes finding answers to work-related questions, searching through internal wikis, Confluence pages, Slack archives, and shared drives. With 12,000 questions asked daily across the organization, that was 4,600 hours of employee time per day spent searching for information that already existed somewhere in the company's knowledge base. The agency built a question answering system that accepted natural language questions and returned precise answers extracted from the company's internal documents, complete with source citations. The system achieved 91% accuracy on questions where the answer existed in the knowledge base, reduced average answer time from 23 minutes to 45 seconds, and became the most-used internal tool within three months. The estimated productivity savings exceeded $18 million annually.
Enterprise question answering (QA) systems combine information retrieval with natural language understanding to provide direct answers to questions from a knowledge base. Unlike search, which returns a list of potentially relevant documents for the user to read, QA systems extract or generate the specific answer from the source material. For AI agencies, enterprise QA is one of the highest-ROI deliverables because the productivity improvement is immediate, measurable, and affects every employee in the organization.
QA System Architecture
Retrieval-Augmented Generation (RAG)
The dominant architecture for enterprise QA is Retrieval-Augmented Generation (RAG). RAG combines a retrieval system that finds relevant documents with a generation model that produces answers from those documents.
RAG pipeline stages:
- Question processing: Parse the user's question, extract key terms, and optionally rephrase for better retrieval
- Document retrieval: Search the knowledge base to find the most relevant passages
- Context assembly: Select and arrange the retrieved passages to form the context for answer generation
- Answer generation: Feed the question and context to a language model that generates a precise answer
- Citation and verification: Link the answer back to source documents and verify that the answer is grounded in the retrieved context
- Response formatting: Present the answer with citations, confidence indicators, and links to source documents
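The stages above can be sketched as a single function that wires a retriever and a generator together. This is a minimal illustration, not a production design: `Passage`, `Answer`, `answer_question`, and the two toy stand-ins are hypothetical names, and the bracketed `[source-id]` citation format is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source_id: str  # hypothetical identifier, e.g. document title plus section
    text: str


@dataclass
class Answer:
    text: str
    citations: List[str]


def answer_question(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    generate: Callable[[str, List[Passage]], str],
    max_context_passages: int = 3,
) -> Answer:
    # 1. Question processing: here only whitespace normalization.
    query = " ".join(question.split())
    # 2. Document retrieval: the retriever returns passages, best first.
    candidates = retrieve(query)
    # 3. Context assembly: keep only the top-ranked passages.
    context = candidates[:max_context_passages]
    # 4. Answer generation from the question plus the context.
    text = generate(query, context)
    # 5. Citation: keep only source ids that the answer actually mentions.
    citations = [p.source_id for p in context if f"[{p.source_id}]" in text]
    # 6. Response formatting is left to the caller or UI.
    return Answer(text=text, citations=citations)


# Toy stand-ins so the sketch runs without any external service.
def toy_retrieve(query: str) -> List[Passage]:
    return [Passage("vpn-guide", "Install the VPN client from the IT portal.")]


def toy_generate(query: str, context: List[Passage]) -> str:
    first = context[0]
    return f"{first.text} [{first.source_id}]"


result = answer_question("How do I   install the VPN?", toy_retrieve, toy_generate)
print(result.text)
print(result.citations)
```

In a real system, `retrieve` would wrap the search index and `generate` would call the chosen language model; keeping them as plain callables makes each stage testable on its own.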
Retrieval Component
The retrieval component determines the ceiling of your QA system's accuracy: if the relevant document is not retrieved, the generation model cannot produce the correct answer.
Retrieval strategy:
- Use semantic search (dense retrieval) as the primary retrieval method
- Supplement with keyword search (sparse retrieval) for exact term matching
- Combine the two result lists with reciprocal rank fusion or a learned weighting
- Retrieve 10-20 passages for re-ranking, then pass the top 3-5 to the generation model
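As an illustration of combining dense and sparse results, here is a small sketch of reciprocal rank fusion. Each passage scores the sum of 1 / (k + rank) across the rankings it appears in. The constant `k = 60` is a commonly used default rather than something this article prescribes, and the passage ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of passage ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


dense_results = ["p3", "p1", "p7"]   # hypothetical semantic search output
sparse_results = ["p1", "p9", "p3"]  # hypothetical keyword search output
fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)  # ['p1', 'p3', 'p9', 'p7']
```

Passages that both retrievers rank highly rise to the top, which is why the method needs no score calibration between the two systems.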
Retrieval quality targets:
- Recall@10: The relevant passage should be in the top 10 results at least 90% of the time
- Recall@3: The relevant passage should be in the top 3 results at least 80% of the time (because only the top 3-5 passages will be in the generation context)
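These targets can be checked with a few lines of code. The sketch below assumes one known relevant passage per question, which is a simplification; the question and passage ids are invented for the example.

```python
from typing import Dict, List


def recall_at_k(results: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of questions whose relevant passage appears in the top k results."""
    if not relevant:
        raise ValueError("need at least one labeled question")
    hits = sum(
        1 for question_id, gold in relevant.items()
        if gold in results.get(question_id, [])[:k]
    )
    return hits / len(relevant)


# Hypothetical retrieval output: passage ids per question, best first.
results = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
}
relevant = {"q1": "a", "q2": "w"}  # the gold passage for each question

print(recall_at_k(results, relevant, k=3))   # 0.5
print(recall_at_k(results, relevant, k=10))  # 1.0
```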
Generation Component
The generation model reads the retrieved passages and the question, then produces a natural language answer.
Model selection:
- GPT-4 / GPT-4o: Highest quality answer generation, best for complex questions requiring reasoning. Cost: approximately $0.01-0.03 per question (depending on context length).
- Claude 3.5 Sonnet / Claude 3 Haiku: Strong quality, good for questions requiring careful reading and citation. Cost competitive with GPT-4.
- Open-source models (Llama 3, Mistral): Self-hostable, lower per-query cost at scale, suitable when data privacy requires on-premises deployment. Quality slightly below frontier models but improving rapidly.
Generation prompt design:
- Instruct the model to answer ONLY based on the provided context
- Instruct the model to say "I don't have enough information" when the context does not contain the answer
- Instruct the model to cite the specific source document for each claim in the answer
- Include examples of well-formatted answers with citations
- Specify the desired answer length and format (concise direct answer vs. detailed explanation)
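A prompt that follows these guidelines might be assembled as in the sketch below. The wording, the `build_prompt` helper, the bracketed citation format, and the example answer are all illustrative assumptions; tune them against your own evaluation set.

```python
from typing import List, Tuple

ABSTAIN_PHRASE = "I don't have enough information"


def build_prompt(question: str, passages: List[Tuple[str, str]], detailed: bool = False) -> str:
    """Build a grounded-answer prompt from (source_id, text) passages."""
    length_hint = "a detailed explanation" if detailed else "one or two concise sentences"
    context = "\n\n".join(f"[{source_id}]\n{text}" for source_id, text in passages)
    return (
        "You answer employee questions using internal documents.\n"
        "Rules:\n"
        "1. Answer ONLY from the passages in the context below.\n"
        f'2. If the context does not contain the answer, reply exactly: "{ABSTAIN_PHRASE}"\n'
        "3. Cite the supporting passage after each claim as [source-id].\n"
        f"4. Write {length_hint}.\n\n"
        # The example below is invented purely to show the citation format.
        "Example answer:\n"
        "New hires receive 15 vacation days per year [hr-handbook-3.2].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How do I request leave?",
    [("hr-leave-1", "Submit leave requests in the HR portal.")],
)
print(prompt)
```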
Grounding and Citation
Enterprise QA systems must ground their answers in source documents. An answer without a verifiable source is not trustworthy in a business context.
Grounding enforcement:
- Instruct the generation model to produce answers that are directly supported by the retrieved context
- Implement post-generation verification that checks each claim in the answer against the source passages
- Flag answers where the model appears to generate information not present in the context (hallucination detection)
- Present citations inline with the answer text so users can verify each claim
Citation implementation:
- Each passage in the context is labeled with a source identifier (document title, page number, section)
- The generation model is instructed to include these identifiers in its answer
- The UI renders citations as clickable links that open the source document at the relevant passage
- Track citation click rates to measure user trust and verification behavior
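One lightweight check on the citation step is to confirm that every identifier the model cites was actually in the context it was given. The sketch assumes citations appear as `[source-id]` in the answer text; that format and the ids are assumptions for illustration.

```python
import re
from typing import List, Set, Tuple

# Matches one bracketed identifier such as [hr-handbook-3.2].
CITATION_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, context_ids: Set[str]) -> Tuple[List[str], List[str]]:
    """Split cited ids into those present in the context and those that are not."""
    cited = CITATION_PATTERN.findall(answer)
    valid = [c for c in cited if c in context_ids]
    unknown = [c for c in cited if c not in context_ids]
    return valid, unknown


answer = "Reset it in the IT portal [it-kb-4]. Approval takes a day [it-kb-99]."
valid, unknown = check_citations(answer, context_ids={"it-kb-4", "it-kb-7"})
print(valid)    # ['it-kb-4']
print(unknown)  # ['it-kb-99']
```

An answer with any unknown citation is a strong candidate for suppression or human review, since the model has referenced a source it was never shown.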
Knowledge Base Management
Document Ingestion
The knowledge base must ingest documents from diverse sources and keep them up to date.
Common enterprise knowledge sources:
- Internal wikis (Confluence, Notion, SharePoint)
- Document repositories (Google Drive, SharePoint, Dropbox)
- Communication archives (Slack, Teams, email)
- Ticketing systems (Jira, ServiceNow, Zendesk)
- Code repositories (GitHub, GitLab) for technical documentation
- CRM notes and customer interaction records
Ingestion pipeline:
- Connect to each source via API or file system access
- Extract text content and metadata from each document
- Track document versions: detect new, updated, and deleted documents
- Preprocess text (clean, normalize, chunk)
- Generate embeddings and update the vector index
- Schedule incremental updates to keep the index current
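The version-tracking step can be sketched with content hashes: compare the hash stored at the last indexing run with the hash of the current text. This is one possible approach, not the only one (many source APIs also expose modification timestamps); `detect_changes` and the document ids are hypothetical.

```python
import hashlib
from typing import Dict, List, NamedTuple


class ChangeSet(NamedTuple):
    new: List[str]
    updated: List[str]
    deleted: List[str]


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(indexed_hashes: Dict[str, str], current_docs: Dict[str, str]) -> ChangeSet:
    """Compare stored hashes (doc id -> hash) with current text (doc id -> text)."""
    new = sorted(d for d in current_docs if d not in indexed_hashes)
    updated = sorted(
        d for d, text in current_docs.items()
        if d in indexed_hashes and indexed_hashes[d] != content_hash(text)
    )
    deleted = sorted(d for d in indexed_hashes if d not in current_docs)
    return ChangeSet(new, updated, deleted)


# Hypothetical state from the previous run and the current crawl.
indexed = {
    "wiki/vpn": content_hash("Use client v1."),
    "wiki/old": content_hash("Retired page."),
}
current = {"wiki/vpn": "Use client v2.", "wiki/sso": "Sign in with SSO."}

changes = detect_changes(indexed, current)
print(changes)
```

Only the documents in `new` and `updated` need re-chunking and re-embedding, and the ones in `deleted` should be removed from the vector index so stale answers do not surface.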
Freshness requirements:
- For rapidly changing sources (Slack, ticketing systems): Update every 15-60 minutes
- For moderately changing sources (wikis, document repositories): Update daily
- For stable sources (policies, procedures): Update weekly or on change notification
Chunking for QA
QA systems benefit from different chunking strategies than general search.
Optimal chunk sizes for QA:
- Short chunks (100-200 tokens): Higher precision, because each chunk is more likely to contain a focused, specific answer. But may lack context.
- Medium chunks (200-500 tokens): Good balance of precision and context. The default choice for most QA systems.
- Long chunks (500-1000 tokens): More context for complex questions. But may include irrelevant information that distracts the generation model.
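A simple fixed-size chunker illustrates the trade-off. The sketch below makes two assumptions not stated above: it approximates tokens with whitespace-separated words (a real system would use the embedding model's tokenizer), and it overlaps consecutive chunks so an answer that straddles a boundary still appears whole in one chunk.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> List[str]:
    """Split text into chunks of about chunk_size words, sharing overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```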
Context window management:
- Retrieve more passages than you include in the generation context
- Use a re-ranker to select the most relevant passages
- Concatenate selected passages with clear separators indicating the source of each passage
- Include the question at the beginning and end of the context to help the model stay focused
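The context assembly described above could look like the following sketch. It assumes the re-ranker has already attached a relevance score to each passage; the separator format and the `assemble_context` name are illustrative choices.

```python
from typing import List, Tuple


def assemble_context(
    question: str,
    scored_passages: List[Tuple[float, str, str]],  # (re-rank score, source id, text)
    max_passages: int = 3,
) -> str:
    """Keep the best-scoring passages and wrap them with the question on both sides."""
    top = sorted(scored_passages, key=lambda p: p[0], reverse=True)[:max_passages]
    blocks = [f"--- Source: {source_id} ---\n{text}" for _, source_id, text in top]
    parts = [f"Question: {question}", *blocks, f"Question (repeated): {question}"]
    return "\n\n".join(parts)


# Hypothetical re-ranked passages; more were retrieved than will be used.
scored = [
    (0.91, "hr-leave-1", "Submit leave requests in the HR portal."),
    (0.40, "it-kb-9", "Printers are on floor 2."),
    (0.77, "hr-leave-2", "Managers approve leave within 3 days."),
]
context = assemble_context("How do I request leave?", scored, max_passages=2)
print(context)
```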
Knowledge Gaps and Coverage
Identifying knowledge gaps:
- Track questions where the system responds with "I don't have enough information"
- Analyze these questions to identify topics not covered by the knowledge base
- Report knowledge gaps to the client's content team so they can create documentation for uncovered topics
- Track the gap closure rate over time as an indicator of knowledge base improvement
Quality and Accuracy
Evaluation Framework
QA systems need rigorous evaluation across multiple dimensions.
Evaluation dimensions:
- Answer accuracy: Is the answer factually correct based on the source documents?
- Answer completeness: Does the answer address all aspects of the question?
- Answer relevance: Is the answer focused on what was asked, without unnecessary information?
- Citation accuracy: Are the cited sources actually the sources of the information in the answer?
- Hallucination rate: How often does the system generate information not present in the retrieved documents?
- Abstention accuracy: When the system says it does not know, is the relevant information truly absent from the knowledge base?
Evaluation dataset:
- Create 200-500 question-answer-source triples
- Include questions of varying difficulty (factoid questions, multi-hop questions, comparison questions, procedural questions)
- Include questions where the answer is NOT in the knowledge base (to test abstention behavior)
- Have domain experts validate the ground truth answers
- Version the evaluation set and update quarterly
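A sketch of how such an evaluation set might be represented and used to score abstention behavior. The `EvalItem` structure, the abstention phrase, and the sample data are assumptions; an item with `expected_answer=None` marks a question whose answer is not in the knowledge base.

```python
from dataclasses import dataclass
from typing import List, Optional

ABSTAIN_PHRASE = "I don't have enough information"


@dataclass
class EvalItem:
    question: str
    expected_answer: Optional[str]  # None: the answer is NOT in the knowledge base
    source_id: Optional[str]
    kind: str = "factoid"           # factoid, multi-hop, comparison, procedural


def abstention_precision(items: List[EvalItem], predictions: List[str]) -> Optional[float]:
    """Of the questions the system abstained on, how many were truly unanswerable?"""
    abstained = [i for i, p in zip(items, predictions) if ABSTAIN_PHRASE in p]
    if not abstained:
        return None
    return sum(i.expected_answer is None for i in abstained) / len(abstained)


def abstention_recall(items: List[EvalItem], predictions: List[str]) -> Optional[float]:
    """Of the truly unanswerable questions, how many did the system abstain on?"""
    pairs = [(i, p) for i, p in zip(items, predictions) if i.expected_answer is None]
    if not pairs:
        return None
    return sum(ABSTAIN_PHRASE in p for _, p in pairs) / len(pairs)


# Invented examples for illustration only.
items = [
    EvalItem("What is the VPN client called?", "AcmeVPN", "it-kb-1"),
    EvalItem("Who won the 1998 chili cook-off?", None, None),
    EvalItem("How do I file expenses?", "Use the finance portal.", "fin-2", "procedural"),
]
predictions = ["AcmeVPN [it-kb-1]", ABSTAIN_PHRASE, ABSTAIN_PHRASE]

print(abstention_precision(items, predictions))  # 0.5
print(abstention_recall(items, predictions))     # 1.0
```

Here the system wrongly abstained on the expenses question, so precision drops to 0.5 even though it correctly declined the unanswerable one.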
Hallucination Detection and Prevention
Hallucination (generating plausible but unsupported information) is the most critical quality concern for enterprise QA.
Prevention strategies:
- Strict grounding instructions: Explicitly instruct the model to only use information from the provided context
- Low temperature: Use a generation temperature of 0.0-0.3 to reduce creative elaboration
- Context sufficiency check: Before generating an answer, have the model assess whether the context contains sufficient information. If not, abstain.
- Extractive bias: Instruct the model to prefer quoting directly from the source rather than paraphrasing
Detection strategies:
- Entailment checking: Use a natural language inference model to verify that each sentence in the answer is entailed by a sentence in the context
- Claim decomposition: Break the answer into individual claims and verify each claim against the source
- Consistency checking: Generate the answer multiple times (with temperature > 0) and check for consistency. Inconsistent answers often indicate hallucination.
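To make claim decomposition concrete, the sketch below splits an answer into sentences and flags those with little support in the context. It uses word overlap as a crude stand-in for a natural language inference model, so treat it as a placeholder for the verification step rather than a real entailment check; the threshold of 0.6 is an arbitrary assumption.

```python
import re
from typing import List, Set, Tuple


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_claims(answer: str) -> List[str]:
    """Naive claim decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def unsupported_claims(
    answer: str, passages: List[str], threshold: float = 0.6
) -> List[Tuple[str, float]]:
    """Return claims whose best word overlap with any passage is below threshold."""
    flagged = []
    for claim in split_claims(answer):
        claim_tokens = _tokens(claim)
        if not claim_tokens:
            continue
        # Share of the claim's words found in the best-matching passage.
        best = max(
            (len(claim_tokens & _tokens(p)) / len(claim_tokens) for p in passages),
            default=0.0,
        )
        if best < threshold:
            flagged.append((claim, best))
    return flagged


passages = ["Employees accrue 20 vacation days per year."]
answer = "Employees accrue 20 vacation days per year. Unused days convert to cash bonuses."
flagged = unsupported_claims(answer, passages)
print(flagged)
```

In this invented example the second sentence is flagged because almost none of its words appear in the passage. A production check would replace the overlap score with an entailment model's probability.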
Continuous Quality Monitoring
Human evaluation loop:
- Sample 2-5% of production questions and answers for human review
- Have reviewers rate accuracy, completeness, and citation correctness
- Track quality scores over time to detect degradation
- Use reviewer corrections as feedback for system improvement
Automated quality metrics:
- Track the proportion of questions where the system abstains (too high indicates poor retrieval, too low may indicate over-confidence)
- Track answer length distribution (sudden changes may indicate generation quality issues)
- Track citation density (answers without citations may indicate hallucination)
- Track user feedback signals (thumbs up/down, follow-up questions on the same topic)
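Several of these metrics can be computed directly from logged answers. The sketch assumes answers are stored as plain strings, that abstentions contain a fixed phrase, and that citations use a `[source-id]` format; all three are illustrative assumptions.

```python
import re
from typing import Dict, List

ABSTAIN_PHRASE = "I don't have enough information"
CITATION_PATTERN = re.compile(r"\[[^\[\]]+\]")


def quality_metrics(answers: List[str]) -> Dict[str, float]:
    """Summarize abstention rate, uncited answers, and answer length for a batch."""
    if not answers:
        raise ValueError("need at least one logged answer")
    answered = [a for a in answers if ABSTAIN_PHRASE not in a]
    abstained_count = len(answers) - len(answered)
    uncited = [a for a in answered if not CITATION_PATTERN.search(a)]
    return {
        "abstention_rate": abstained_count / len(answers),
        "uncited_answer_rate": len(uncited) / len(answered) if answered else 0.0,
        "mean_answer_words": (
            sum(len(a.split()) for a in answered) / len(answered) if answered else 0.0
        ),
    }


# Invented log sample.
logged = [
    "Reset it in the IT portal [it-kb-4].",
    ABSTAIN_PHRASE,
    "Expenses are due monthly [fin-2].",
    "The office closes at six.",  # no citation: worth reviewing
]
metrics = quality_metrics(logged)
print(metrics)
```

Computing these per day or per week and alerting on sudden shifts gives an early signal of retrieval or generation regressions.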
Production Considerations
Latency Optimization
Enterprise users expect answers within 2-5 seconds.
Latency breakdown:
- Query embedding: 20-50ms
- Retrieval: 20-100ms
- Re-ranking: 100-300ms
- Generation: 1-3 seconds (the bottleneck)
- Total: roughly 1.1-3.5 seconds (the sum of the stages above, before network overhead)
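A small budget calculation makes it easy to see where time goes. The sketch simply sums the per-stage ranges listed above; network and orchestration overhead are not included, so real end-to-end latency will be somewhat higher.

```python
from typing import Dict, Tuple

# Per-stage latency ranges in milliseconds, taken from the breakdown above.
STAGE_LATENCY_MS: Dict[str, Tuple[int, int]] = {
    "query_embedding": (20, 50),
    "retrieval": (20, 100),
    "re_ranking": (100, 300),
    "generation": (1000, 3000),
}


def total_latency_ms(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (best case, worst case) when the stages run sequentially."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_ms(STAGE_LATENCY_MS)
print(best, worst)  # 1140 3450

# Share of the worst-case budget consumed by generation.
generation_share = STAGE_LATENCY_MS["generation"][1] / worst
print(round(generation_share, 2))  # 0.87
```

Generation accounts for most of the budget in both cases, which is why streaming and caching tend to pay off more than tuning retrieval speed.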
Optimization strategies:
- Use streaming generation to show the answer progressively as it is generated
- Cache answers for frequent questions (20-30% of questions are repeats)
- Use a fast re-ranker (ColBERT or a small cross-encoder) to reduce re-ranking latency
- Pre-compute embeddings for common query patterns
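Answer caching can be sketched as a dictionary keyed by a normalized question. Two assumptions are built in here: entries expire after a time-to-live so cached answers do not outlive document updates, and the key includes a permission scope so a cached answer is never served to a user with different access rights. `AnswerCache` and the scope names are hypothetical.

```python
import re
import time
from typing import Callable, Dict, Optional, Tuple


def normalize_question(question: str) -> str:
    """Lowercase and strip punctuation so trivially different phrasings match."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


class AnswerCache:
    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def put(self, question: str, scope: str, answer: str) -> None:
        key = (scope, normalize_question(question))
        self._entries[key] = (self._clock(), answer)

    def get(self, question: str, scope: str) -> Optional[str]:
        key = (scope, normalize_question(question))
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: force a fresh answer
            return None
        return answer


cache = AnswerCache()
cache.put("How do I reset my password?", "all-staff", "Use the IT portal [it-kb-4].")
print(cache.get("  how do I reset my PASSWORD ", "all-staff"))  # cache hit
print(cache.get("How do I reset my password?", "finance"))      # None: other scope
```

Exact-match caching only catches near-identical phrasings; matching paraphrases would need a similarity lookup over question embeddings, which is a separate design choice.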
Access Control
Enterprise knowledge bases contain information with different access levels. The QA system must respect these access controls.
Access control implementation:
- Tag each document with access permissions (which users or groups can see it)
- At query time, filter the retrieval results to include only documents the querying user has access to
- Never include restricted documents in the generation context for unauthorized users
- Audit access patterns to detect unauthorized information exposure
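The query-time filter can be as simple as a group intersection, applied before re-ranking and generation so restricted text never reaches the model. The sketch assumes group-based permissions stored alongside each passage; the `IndexedPassage` structure and group names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class IndexedPassage:
    source_id: str
    text: str
    allowed_groups: FrozenSet[str]  # groups permitted to read the source document


def filter_by_access(
    passages: List[IndexedPassage], user_groups: FrozenSet[str]
) -> List[IndexedPassage]:
    """Keep only passages whose document shares at least one group with the user."""
    return [p for p in passages if p.allowed_groups & user_groups]


retrieved = [
    IndexedPassage("hr-leave-1", "Submit leave requests in the HR portal.",
                   frozenset({"all-staff"})),
    IndexedPassage("fin-comp-7", "Executive compensation bands for 2024.",
                   frozenset({"finance-leads"})),
]

visible = filter_by_access(retrieved, user_groups=frozenset({"all-staff", "engineering"}))
print([p.source_id for p in visible])  # ['hr-leave-1']
```

Where the vector store supports metadata filters, pushing this check into the retrieval query itself avoids fetching restricted passages at all.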
Multi-Language Support
Enterprise knowledge bases often contain documents in multiple languages, and users may ask questions in their preferred language.
Cross-language QA approaches:
- Use multilingual embedding models (Cohere Embed v3, multilingual-e5) that place documents in the same vector space regardless of language
- Use a multilingual generation model that can read context in one language and generate answers in another
- Alternatively, translate the query to the document's language for retrieval, and translate the answer back to the user's language
Your Next Step
Identify the single most common category of internal questions in your client's organization, such as questions about HR policies, IT procedures, product specifications, or customer account information. Collect 100 real questions from employees in that category. For each question, find the answer in the existing documentation (this manual process proves the answers exist but are hard to find). Build a minimal RAG system using those 100 questions: embed the relevant documents, set up retrieval, and connect a generation model. Test the system on the 100 questions and measure answer accuracy. This proof of concept takes 2-3 days and produces the most compelling demo possible: it shows the client that their employees can get instant, accurate answers to questions that currently take 20+ minutes to research. Use the accuracy results and the demo to scope the full production project.