Delivering Enterprise Question Answering Systems — Building AI That Finds Answers in Your Client's Knowledge Base

A knowledge management agency in San Francisco was hired by a 50,000-employee technology company to solve their internal knowledge access problem. Employees spent an average of 23 minutes finding answers to work-related questions — searching through internal wikis, Confluence pages, Slack archives, and shared drives. With 12,000 questions asked daily across the organization, that was 4,600 hours of employee time per day spent searching for information that already existed somewhere in the company's knowledge base. The agency built a question answering system that accepted natural language questions and returned precise answers extracted from the company's internal documents, complete with source citations. The system achieved 91% accuracy on questions where the answer existed in the knowledge base, reduced average answer time from 23 minutes to 45 seconds, and became the most-used internal tool within three months. The estimated productivity savings exceeded $18 million annually.

Enterprise question answering (QA) systems combine information retrieval with natural language understanding to provide direct answers to questions from a knowledge base. Unlike search, which returns a list of potentially relevant documents for the user to read, QA systems extract or generate the specific answer from the source material. For AI agencies, enterprise QA is one of the highest-ROI deliverables because the productivity improvement is immediate, measurable, and affects every employee in the organization.

QA System Architecture

Retrieval-Augmented Generation (RAG)

The dominant architecture for enterprise QA is RAG — Retrieval-Augmented Generation. RAG combines a retrieval system that finds relevant documents with a generation model that produces answers from those documents.

RAG pipeline stages:

Question processing: Parse the user's question, extract key terms, and optionally rephrase for better retrieval
Document retrieval: Search the knowledge base to find the most relevant passages
Context assembly: Select and arrange the retrieved passages to form the context for answer generation
Answer generation: Feed the question and context to a language model that generates a precise answer
Citation and verification: Link the answer back to source documents and verify that the answer is grounded in the retrieved context
Response formatting: Present the answer with citations, confidence indicators, and links to source documents
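
The six stages above can be sketched as one small function chain. This is a toy illustration, not a production design: the `Passage` class, the two-entry `KNOWLEDGE_BASE`, the keyword-overlap retrieval, and the copy-the-top-passage "generation" step are all hypothetical stand-ins for a real retriever and language model.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical source identifier
    text: str


# Hypothetical two-passage knowledge base, for illustration only.
KNOWLEDGE_BASE = [
    Passage("hr-handbook#pto", "Employees accrue 1.5 days of paid time off per month."),
    Passage("it-guide#vpn", "Install the VPN client from the self-service portal."),
]

ABSTAIN = "I don't have enough information"


def key_terms(text):
    """Lowercase the text and keep words longer than 3 characters."""
    words = (w.strip("?.,!").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def retrieve(terms, k=3):
    """Rank passages by shared terms (a stand-in for real search)."""
    scored = [(len(terms & key_terms(p.text)), p) for p in KNOWLEDGE_BASE]
    scored = [sp for sp in scored if sp[0] > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in scored[:k]]


def answer_question(question):
    terms = key_terms(question)                       # 1. question processing
    passages = retrieve(terms)                        # 2. document retrieval
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)  # 3. context assembly
    # 4. Answer generation: a real system would call a language model here.
    answer = passages[0].text if passages else ABSTAIN
    # 5. Citation and verification: the toy answer is copied from the context,
    #    so the grounding check is a simple substring test.
    grounded = (answer in context) if passages else True
    # 6. Response formatting.
    return {"answer": answer, "citations": [p.source for p in passages], "grounded": grounded}
```

Calling `answer_question("How do I install the VPN client?")` returns the VPN passage with its source, while a question with no matching passage returns the abstention message and an empty citation list.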

Retrieval Component

The retrieval component determines the ceiling of your QA system's accuracy — if the relevant document is not retrieved, the generation model cannot produce the correct answer.

Retrieval strategy:

Use semantic search (dense retrieval) as the primary retrieval method
Supplement with keyword search (sparse retrieval) for exact term matching
Combine with reciprocal rank fusion or a learned combination
Retrieve 10-20 passages for re-ranking, then pass the top 3-5 to the generation model
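
As a rough sketch of the fusion step, reciprocal rank fusion scores each passage by summing 1 / (k + rank) across the ranked lists it appears in. The constant `k=60` is the value commonly used for RRF rather than something specified here, and the passage IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into one ranking.

    Each passage scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Ties are broken by passage ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Hypothetical results from the dense and sparse retrievers.
dense = ["p3", "p1", "p7"]
sparse = ["p1", "p9", "p3"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Passages that both retrievers rank highly (`p1`, `p3`) rise above passages that only one retriever found, without any need to calibrate the two systems' raw scores against each other.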

Retrieval quality targets:

Recall@10: The relevant passage should be in the top 10 results at least 90% of the time
Recall@3: The relevant passage should be in the top 3 results at least 80% of the time (because only the top 3-5 passages will be in the generation context)
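
These targets can be checked with a few lines of code. The sketch below assumes each evaluation question has exactly one relevant passage; the question and passage IDs are invented, and the example uses k=4 in place of k=10 only to keep the data small.

```python
def recall_at_k(results, relevant, k):
    """Fraction of questions whose relevant passage is in the top k results.

    results:  question ID -> ranked list of retrieved passage IDs
    relevant: question ID -> the single relevant passage ID (an assumption)
    """
    hits = sum(1 for qid, ranked in results.items() if relevant[qid] in ranked[:k])
    return hits / len(results)


# Hypothetical retrieval output for four evaluation questions.
results = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
    "q3": ["m", "n", "o", "p"],
    "q4": ["r", "s", "t", "u"],
}
relevant = {"q1": "a", "q2": "w", "q3": "n", "q4": "zz"}

recall_3 = recall_at_k(results, relevant, k=3)
recall_4 = recall_at_k(results, relevant, k=4)
```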

Generation Component

The generation model reads the retrieved passages and the question, then produces a natural language answer.

Model selection:

GPT-4 / GPT-4o: Highest quality answer generation, best for complex questions requiring reasoning. Cost: approximately $0.01-0.03 per question (depending on context length).
Claude 3.5 Sonnet / Claude 3 Haiku: Strong quality, good for questions requiring careful reading and citation. Cost competitive with GPT-4.
Open-source models (Llama 3, Mistral): Self-hostable, lower per-query cost at scale, suitable when data privacy requires on-premises deployment. Quality slightly below frontier models but improving rapidly.

Generation prompt design:

Instruct the model to answer ONLY based on the provided context
Instruct the model to say "I don't have enough information" when the context does not contain the answer
Instruct the model to cite the specific source document for each claim in the answer
Include examples of well-formatted answers with citations
Specify the desired answer length and format (concise direct answer vs. detailed explanation)
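
One way to turn these guidelines into a prompt is sketched below. The `[S1]`-style source labels, the function name `build_prompt`, and the example question and answer are illustrative assumptions; the exact wording should be tuned against the client's evaluation set.

```python
ABSTAIN = "I don't have enough information"

# Hypothetical example of a well-formatted answer with a citation.
FEW_SHOT_EXAMPLE = (
    "Example question: How many vacation days do new hires get?\n"
    "Example answer: New hires receive 15 vacation days per year [S1]."
)


def build_prompt(question, context, answer_style="a concise direct answer"):
    """Assemble a grounded-QA prompt from a question and pre-labeled context."""
    instructions = [
        "Answer ONLY based on the provided context.",
        f'If the context does not contain the answer, reply exactly: "{ABSTAIN}".',
        "Cite the source label (for example [S1]) for each claim in the answer.",
        f"Write {answer_style}.",
    ]
    return "\n\n".join([
        "\n".join(instructions),
        FEW_SHOT_EXAMPLE,
        f"Context:\n{context}",
        f"Question: {question}",
    ])


prompt = build_prompt(
    "What is the PTO policy?",
    "[S1] Employees accrue 1.5 days of paid time off per month.",
)
```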

Grounding and Citation

Enterprise QA systems must ground their answers in source documents. An answer without a verifiable source is not trustworthy in a business context.

Grounding enforcement:

Instruct the generation model to produce answers that are directly supported by the retrieved context
Implement post-generation verification that checks each claim in the answer against the source passages
Flag answers where the model appears to generate information not present in the context (hallucination detection)
Present citations inline with the answer text so users can verify each claim

Citation implementation:

Each passage in the context is labeled with a source identifier (document title, page number, section)
The generation model is instructed to include these identifiers in its answer
The UI renders citations as clickable links that open the source document at the relevant passage
Track citation click rates to measure user trust and verification behavior
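
A minimal sketch of the labeling and lookup steps follows. The `[S1]`, `[S2]` label format and both function names are assumptions made for illustration; any stable identifier scheme works as long as the UI can map a label back to a document location.

```python
import re


def label_passages(passages):
    """Assign [S1], [S2], ... labels to (title, section, text) passages."""
    sources, lines = {}, []
    for i, (title, section, text) in enumerate(passages, start=1):
        label = f"S{i}"
        sources[label] = {"title": title, "section": section}
        lines.append(f"[{label}] {text}")
    return "\n".join(lines), sources


def resolve_citations(answer, sources):
    """Return metadata for cited sources, plus labels that match no passage."""
    labels = list(dict.fromkeys(re.findall(r"\[(S\d+)\]", answer)))
    cited = [sources[lab] for lab in labels if lab in sources]
    unknown = [lab for lab in labels if lab not in sources]
    return cited, unknown


context, sources = label_passages([
    ("HR Handbook", "PTO", "Employees accrue 1.5 days of paid time off per month."),
    ("IT Guide", "VPN", "Install the VPN client from the self-service portal."),
])
cited, unknown = resolve_citations(
    "You accrue 1.5 days per month [S1]. See also [S4].", sources
)
```

Labels that do not correspond to any passage in the context (`S4` here) are a useful warning sign, since the model has cited something it was never shown.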

Knowledge Base Management

Document Ingestion

The knowledge base must ingest documents from diverse sources and keep them up to date.

Common enterprise knowledge sources:

Internal wikis (Confluence, Notion, SharePoint)
Document repositories (Google Drive, SharePoint, Dropbox)
Communication archives (Slack, Teams, email)
Ticketing systems (Jira, ServiceNow, Zendesk)
Code repositories (GitHub, GitLab) for technical documentation
CRM notes and customer interaction records

Ingestion pipeline:

Connect to each source via API or file system access
Extract text content and metadata from each document
Track document versions — detect new, updated, and deleted documents
Preprocess text (clean, normalize, chunk)
Generate embeddings and update the vector index
Schedule incremental updates to keep the index current
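
The version-tracking step can be sketched with content hashes. This assumes the index stores one hash per document ID, which is one possible design; many source APIs also expose modification timestamps or change feeds that serve the same purpose.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshot(indexed, current):
    """Compare the index against a fresh crawl.

    indexed: document ID -> hash stored at the last ingestion
    current: document ID -> text extracted in this crawl
    Returns (new, updated, deleted) lists of document IDs.
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current.items()}
    new = sorted(d for d in current_hashes if d not in indexed)
    updated = sorted(
        d for d in current_hashes if d in indexed and indexed[d] != current_hashes[d]
    )
    deleted = sorted(d for d in indexed if d not in current_hashes)
    return new, updated, deleted


# Hypothetical state: "b" changed, "c" was removed, "d" is new.
indexed = {"a": content_hash("v1"), "b": content_hash("old"), "c": content_hash("x")}
current = {"a": "v1", "b": "new", "d": "fresh"}
new, updated, deleted = diff_snapshot(indexed, current)
```

Only the new and updated documents need re-chunking and re-embedding, and deleted documents should be removed from the vector index so they can no longer be cited.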

Freshness requirements:

For rapidly changing sources (Slack, ticketing systems): Update every 15-60 minutes
For moderately changing sources (wikis, document repositories): Update daily
For stable sources (policies, procedures): Update weekly or on change notification
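
These freshness tiers translate naturally into a per-source schedule. The source-type keys and the `due_for_refresh` helper below are hypothetical, and the intervals are picked from within the ranges above.

```python
from datetime import datetime, timedelta

# Hypothetical source types mapped to refresh intervals.
REFRESH_INTERVALS = {
    "slack": timedelta(minutes=15),
    "ticketing": timedelta(minutes=60),
    "wiki": timedelta(days=1),
    "document_repository": timedelta(days=1),
    "policies": timedelta(weeks=1),
}


def due_for_refresh(source_type, last_synced, now):
    """True when the time since the last sync meets or exceeds the interval."""
    return now - last_synced >= REFRESH_INTERVALS[source_type]


now = datetime(2024, 1, 2, 12, 0)
slack_due = due_for_refresh("slack", datetime(2024, 1, 2, 11, 40), now)
wiki_due = due_for_refresh("wiki", datetime(2024, 1, 2, 11, 40), now)
```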

Chunking for QA

QA systems benefit from different chunking strategies than general search.

Optimal chunk sizes for QA:

Short chunks (100-200 tokens): Higher precision, because each chunk is more likely to contain a focused, specific answer, but a chunk may lack the surrounding context needed to interpret it.
Medium chunks (200-500 tokens): Good balance of precision and context. The default choice for most QA systems.
Long chunks (500-1000 tokens): More context for complex questions, but a chunk may include irrelevant information that distracts the generation model.
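
A simple fixed-size chunker makes these trade-offs easy to experiment with. The sketch below makes two assumptions that are not part of the guidance above: it treats whitespace-separated words as tokens (a real pipeline would use the embedding model's tokenizer), and it overlaps adjacent chunks so that an answer spanning a boundary is not cut in half.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer.
tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

Running the same evaluation set against indexes built with different `chunk_size` values is the most direct way to pick a size for a given knowledge base.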

Context window management:

Retrieve more passages than you include in the generation context
Use a re-ranker to select the most relevant passages
Concatenate selected passages with clear separators indicating the source of each passage
Include the question at the beginning and end of the context to help the model stay focused
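
The assembly step might look like the following sketch. The separator format and the `assemble_context` name are illustrative choices, and the input is assumed to be already sorted by the re-ranker, best passage first.

```python
def assemble_context(question, ranked_passages, top_n=3):
    """Build the generation context from re-ranked (source, text) pairs.

    Keeps only the top_n passages, marks each with its source, and places
    the question both before and after the passages.
    """
    selected = ranked_passages[:top_n]
    blocks = [f"--- Source: {source} ---\n{text}" for source, text in selected]
    return "\n\n".join(
        [f"Question: {question}", *blocks, f"Question (repeated): {question}"]
    )


# Hypothetical re-ranker output: four passages, of which three are kept.
ranked = [
    ("HR Handbook, PTO", "Employees accrue 1.5 days of paid time off per month."),
    ("HR Handbook, Carryover", "Unused days roll over to the next year."),
    ("HR FAQ", "PTO requests are submitted through the HR portal."),
    ("IT Guide, VPN", "Install the VPN client from the self-service portal."),
]
context = assemble_context("How much PTO do I accrue?", ranked)
```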

Knowledge Gaps and Coverage

Identifying knowledge gaps:

Track questions where the system responds with "I don't have enough information"
Analyze these questions to identify topics not covered by the knowledge base
Report knowledge gaps to the client's content team so they can create documentation for uncovered topics
Track the gap closure rate over time as an indicator of knowledge base improvement
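
A sketch of the gap report follows. It assumes each logged question has already been tagged with a topic, by a classifier or by hand, and it defines the gap closure rate as the share of reported topics that now have documentation; both are assumptions for illustration.

```python
from collections import Counter

ABSTAIN = "I don't have enough information"


def knowledge_gaps(query_log):
    """Count abstained questions per topic, most frequent first.

    query_log: list of (topic, answer) pairs
    """
    counts = Counter(topic for topic, answer in query_log if ABSTAIN in answer)
    return counts.most_common()


def gap_closure_rate(reported_topics, documented_topics):
    """Share of reported gap topics that have since been documented."""
    if not reported_topics:
        return 0.0
    return len(reported_topics & documented_topics) / len(reported_topics)


# Hypothetical production log.
log = [
    ("expenses", ABSTAIN),
    ("expenses", ABSTAIN),
    ("vpn", "Install the VPN client from the self-service portal [S1]."),
    ("parking", ABSTAIN),
]
gaps = knowledge_gaps(log)
closure = gap_closure_rate({"expenses", "parking"}, {"expenses"})
```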

Quality and Accuracy

Evaluation Framework

QA systems need rigorous evaluation across multiple dimensions.

Evaluation dimensions:

Answer accuracy: Is the answer factually correct based on the source documents?
Answer completeness: Does the answer address all aspects of the question?
Answer relevance: Is the answer focused on what was asked, without unnecessary information?
Citation accuracy: Are the cited sources actually the sources of the information in the answer?
Hallucination rate: How often does the system generate information not present in the retrieved documents?
Abstention accuracy: When the system says it does not know, is the relevant information truly absent from the knowledge base?

Evaluation dataset:

Create 200-500 question-answer-source triples
Include questions of varying difficulty (factoid questions, multi-hop questions, comparison questions, procedural questions)
Include questions where the answer is NOT in the knowledge base (to test abstention behavior)
Have domain experts validate the ground truth answers
Version the evaluation set and update quarterly
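
An evaluation set of this shape can be scored with a short harness. In the sketch below, `EvalItem`, the substring match used to judge correctness, and the sample data are all illustrative assumptions; in practice, correctness is usually judged by a reviewer or a grading model rather than by string matching.

```python
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "I don't have enough information"


@dataclass
class EvalItem:
    question: str
    expected_answer: Optional[str]  # None: the answer is NOT in the knowledge base
    expected_source: Optional[str]


def evaluate(items, predictions):
    """Score (answer, cited_source) predictions against the evaluation set."""
    answerable = correct = abstained = abstained_ok = 0
    for item, (answer, source) in zip(items, predictions):
        did_abstain = ABSTAIN in answer
        if did_abstain:
            abstained += 1
            abstained_ok += item.expected_answer is None
        if item.expected_answer is not None:
            answerable += 1
            correct += (
                not did_abstain
                and item.expected_answer.lower() in answer.lower()
                and source == item.expected_source
            )
    return {
        "answer_accuracy": correct / answerable if answerable else 0.0,
        "abstention_accuracy": abstained_ok / abstained if abstained else 0.0,
    }


items = [
    EvalItem("How many PTO days per year?", "15 days", "hr"),
    EvalItem("Where do I get the VPN client?", "portal", "it"),
    EvalItem("What is the cafeteria menu?", None, None),
    EvalItem("What are the payment terms?", "30 days", "fin"),
]
predictions = [
    ("You get 15 days per year.", "hr"),  # correct
    ("Ask the helpdesk.", "it"),          # wrong answer
    (ABSTAIN, None),                      # correct abstention
    (ABSTAIN, None),                      # abstained although the answer exists
]
report = evaluate(items, predictions)
```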

Hallucination Detection and Prevention

Hallucination — generating plausible but unsupported information — is the most critical quality concern for enterprise QA.

Prevention strategies:

Strict grounding instructions: Explicitly instruct the model to only use information from the provided context
Low temperature: Use a generation temperature of 0.0-0.3 to reduce creative elaboration
Context sufficiency check: Before generating an answer, have the model assess whether the context contains sufficient information. If not, abstain.
Extractive bias: Instruct the model to prefer quoting directly from the source rather than paraphrasing

Detection strategies:

Entailment checking: Use a natural language inference model to verify that each sentence in the answer is entailed by the retrieved context
Claim decomposition: Break the answer into individual claims and verify each claim against the source
Consistency checking: Generate the answer multiple times (with temperature > 0) and check for consistency. Inconsistent answers often indicate hallucination.
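
The claim-by-claim check can be prototyped before an inference model is wired in. The sketch below splits the answer into sentences and flags any sentence whose words are mostly missing from the context. Word overlap is only a crude stand-in for entailment: it misses paraphrases and cannot detect negation, and the 0.7 threshold is an arbitrary assumption.

```python
import re


def word_set(text):
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_claims(answer, context, threshold=0.7):
    """Return answer sentences whose word coverage by the context is below threshold."""
    context_words = word_set(context)
    flagged = []
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = word_set(sentence)
        if not words:
            continue
        coverage = len(words & context_words) / len(words)
        if coverage < threshold:
            flagged.append(sentence)
    return flagged


context = (
    "Employees accrue 1.5 days of paid time off per month. "
    "Unused days roll over to the next year."
)
answer = (
    "Employees accrue 1.5 days of paid time off per month. "
    "Contractors receive a bonus of 500 dollars."
)
flagged = unsupported_claims(answer, context)
```

Swapping the coverage calculation for a call to an inference model keeps the same structure while giving a far more reliable signal.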

Continuous Quality Monitoring

Human evaluation loop:

Sample 2-5% of production questions and answers for human review
Have reviewers rate accuracy, completeness, and citation correctness
Track quality scores over time to detect degradation
Use reviewer corrections as feedback for system improvement

Automated quality metrics:

Track the proportion of questions where the system abstains (too high indicates poor retrieval, too low may indicate over-confidence)
Track answer length distribution (sudden changes may indicate generation quality issues)
Track citation density (answers without citations may indicate hallucination)
Track user feedback signals (thumbs up/down, follow-up questions on the same topic)
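
Several of these metrics can be computed directly from logged answers. The sketch assumes citations appear as `[S1]`-style labels and that abstentions contain a fixed phrase; both conventions are illustrative.

```python
import re

ABSTAIN = "I don't have enough information"


def quality_metrics(answers):
    """Compute abstention rate, uncited-answer rate, and mean answer length."""
    answered = [a for a in answers if ABSTAIN not in a]
    uncited = [a for a in answered if not re.search(r"\[S\d+\]", a)]
    return {
        "abstention_rate": (len(answers) - len(answered)) / len(answers),
        "uncited_rate": len(uncited) / len(answered) if answered else 0.0,
        "mean_length_words": (
            sum(len(a.split()) for a in answered) / len(answered) if answered else 0.0
        ),
    }


# Hypothetical batch of logged answers.
answers = [
    "You accrue 1.5 days per month [S1].",
    "Install the client from the portal [S2].",
    "Payment terms are probably 30 days.",
    ABSTAIN,
]
metrics = quality_metrics(answers)
```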

Production Considerations

Latency Optimization

Enterprise users expect answers within 2-5 seconds.

Latency breakdown:

Query embedding: 20-50ms
Retrieval: 20-100ms
Re-ranking: 100-300ms
Generation: 1-3 seconds (the bottleneck)
Total: roughly 1.1-3.5 seconds
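
The end-to-end range can be checked by summing the per-stage figures, assuming the stages run one after another with no overlap and no network overhead between them.

```python
# Per-stage latency ranges in milliseconds, taken from the breakdown above.
STAGES_MS = {
    "query_embedding": (20, 50),
    "retrieval": (20, 100),
    "re_ranking": (100, 300),
    "generation": (1000, 3000),
}

best_case_ms = sum(low for low, _ in STAGES_MS.values())
worst_case_ms = sum(high for _, high in STAGES_MS.values())
# Generation's share of the worst case shows why it is the bottleneck.
generation_share = STAGES_MS["generation"][1] / worst_case_ms
```

Generation accounts for well over 80% of the total at both ends of the range, which is why the optimizations that matter most target perceived or avoided generation time.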

Optimization strategies:

Use streaming generation to show the answer progressively as it is generated
Cache answers for frequent questions (20-30% of questions are repeats)
Use a fast re-ranker (ColBERT or a small cross-encoder) to reduce re-ranking latency
Pre-compute embeddings for common query patterns
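
Answer caching can be sketched as a dictionary keyed on a normalized form of the question. The `AnswerCache` class is hypothetical, and it makes one design assumption worth stating: the key includes the user's access scope, so an answer generated from restricted documents is never served to a user who could not have retrieved them.

```python
import re


class AnswerCache:
    """In-memory cache keyed on (access scope, normalized question)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def normalize(question):
        """Lowercase and drop punctuation so trivial variants share a key."""
        return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

    def get(self, question, access_scope):
        return self._store.get((access_scope, self.normalize(question)))

    def put(self, question, access_scope, answer):
        self._store[(access_scope, self.normalize(question))] = answer


cache = AnswerCache()
cache.put("What is the PTO policy?", "all-employees", "1.5 days per month [S1].")
hit = cache.get("what is the pto policy", "all-employees")
miss = cache.get("What is the PTO policy?", "contractors")
```

A production cache would also need expiry tied to document updates, so that a cached answer does not outlive the source it cites.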

Access Control

Enterprise knowledge bases contain information with different access levels. The QA system must respect these access controls.

Access control implementation:

Tag each document with access permissions (which users or groups can see it)
At query time, filter the retrieval results to include only documents the querying user has access to
Never include restricted documents in the generation context for unauthorized users
Audit access patterns to detect unauthorized information exposure
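
The query-time filter can be as simple as a set intersection between a document's allowed groups and the user's groups. The field name `allowed_groups` and the sample documents are assumptions; many vector databases can apply an equivalent metadata filter inside the search itself, which avoids retrieving restricted passages at all.

```python
def filter_by_access(passages, user_groups):
    """Keep only passages whose allowed groups intersect the user's groups."""
    return [p for p in passages if p["allowed_groups"] & user_groups]


# Hypothetical retrieval results with permission tags.
retrieved = [
    {"source": "hr-handbook", "allowed_groups": {"all-employees"}},
    {"source": "exec-comp-plan", "allowed_groups": {"executives", "hr-admins"}},
]
visible = filter_by_access(retrieved, {"all-employees", "engineering"})
```

The filter runs before context assembly, so restricted text never reaches the generation model for an unauthorized user.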

Multi-Language Support

Enterprise knowledge bases often contain documents in multiple languages, and users may ask questions in their preferred language.

Cross-language QA approaches:

Use multilingual embedding models (Cohere Embed v3, multilingual-e5) that place documents in the same vector space regardless of language
Use a multilingual generation model that can read context in one language and generate answers in another
Alternatively, translate the query to the document's language for retrieval, and translate the answer back to the user's language

Your Next Step

Identify the single most common category of internal questions in your client's organization — questions about HR policies, IT procedures, product specifications, or customer account information. Collect 100 real questions from employees in that category. For each question, find the answer in the existing documentation (this manual process proves the answers exist but are hard to find). Build a minimal RAG system using those 100 questions: embed the relevant documents, set up retrieval, and connect a generation model. Test the system on the 100 questions and measure answer accuracy. This proof of concept takes 2-3 days and produces the most compelling demo possible — showing the client that their employees can get instant, accurate answers to questions that currently take 20+ minutes to research. Use the accuracy results and the demo to scope the full production project.

Agency Script Editorial
