Code: Implement RAG engine with semantic search




Overview

The code implements a sophisticated RAG pipeline that:

  1. Processes a user’s natural language query to understand its intent and context.
  2. Retrieves relevant information from a knowledge base (vector database) using semantic search.
  3. Generates a coherent, accurate answer using a Large Language Model (LLM), grounded in the retrieved information.
  4. Returns the answer along with sources and metadata for transparency and traceability.



1. RAGConfig Dataclass

Function: A configuration container that holds all the tunable parameters and service endpoints for the RAG engine. Using a dataclass ensures a clean and organized configuration structure.

Features and Description:

  • Service Endpoints: URLs for connecting to dependent microservices.
    • vector_db_url: Endpoint for the vector database service (e.g., Qdrant, Chroma) for semantic search.
    • llm_service_url: Endpoint for the LLM inference service (e.g., vLLM, Text Generation Inference).
    • data_warehouse_url & data_lake_url: Endpoints for other data sources (though not heavily used in the provided code).
  • Search Configuration: Parameters controlling the retrieval process.
    • max_context_length: The maximum number of characters from retrieved documents to send to the LLM (to avoid exceeding model context windows).
    • top_k_documents: The total number of document chunks to retrieve from the vector database.
    • similarity_threshold: The minimum cosine similarity score a document must have to be considered relevant.
    • context_overlap: (Unused in provided code) Typically used when chunking documents to prevent losing context at chunk boundaries.
  • LLM Configuration: Parameters for the text generation step.
    • default_model: The name of the LLM model to use for generation (e.g., “llama2-7b-semiconductor”).
    • max_tokens: The maximum number of tokens the LLM should generate in its response.
    • temperature: Controls the randomness of the output (lower = more deterministic, higher = more creative).
  • Query Processing: Flags to enable/disable advanced query manipulation.
    • enable_query_expansion: If True, adds related semiconductor-specific terms to the query.
    • enable_query_rewriting: If True, rephrases questions into statements for better retrieval.
    • enable_intent_classification: (Unused in provided code, but implied) Likely a flag for the overall intent classification step.
  • Response Configuration: Parameters for the final output.
    • include_sources: Whether to include source documents in the response.
    • max_sources: The maximum number of sources to return in the response.
    • confidence_threshold: (Unused in provided code) Could be used to filter out low-confidence responses before returning them to the user.
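
For orientation, here is a minimal sketch of what such a dataclass can look like (field names follow the description above; the defaults shown are the ones listed in the field table later in this article):

```python
from dataclasses import dataclass


@dataclass
class RAGConfig:
    # Service endpoints
    vector_db_url: str = "http://localhost:8091"
    llm_service_url: str = "http://localhost:8092"
    data_warehouse_url: str = "http://localhost:8089"
    data_lake_url: str = "http://localhost:8090"
    # Search configuration
    max_context_length: int = 4000
    top_k_documents: int = 10
    similarity_threshold: float = 0.7
    context_overlap: int = 200
    # LLM configuration
    default_model: str = "llama2-7b-semiconductor"
    max_tokens: int = 1024
    temperature: float = 0.3
    # Query processing flags
    enable_query_expansion: bool = True
    enable_query_rewriting: bool = True
    enable_intent_classification: bool = True
    # Response configuration
    include_sources: bool = True
    max_sources: int = 5
    confidence_threshold: float = 0.6
```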



2. QueryIntent Dataclass

Function: A structured data object that stores the results of analyzing the user’s query. It captures both what the user is asking for and what the query is about.

Features and Description:

  • intent_type: Categorizes the query’s purpose (e.g., 'analysis', 'troubleshooting', 'optimization', 'general'). This drives which data collections to search and how to prompt the LLM.
  • confidence: A score (0.0 to 1.0) representing how confident the classifier is in the assigned intent_type.
  • entities: A list of extracted named entities (e.g., “5nm”, “650°C”, “10 torr”) from the query.
  • process_modules: A list of semiconductor process modules mentioned (e.g., ['lithography', 'etch']).
  • equipment_types: A list of equipment types mentioned (e.g., ['scanner', 'etcher']).



3. SearchResult Dataclass

Function: A structured data object to represent a single document chunk retrieved from the vector database, along with its metadata and relevance score.

Features and Description:

  • content: The actual text content of the retrieved document chunk.
  • source: The origin of the document (e.g., “AMAT Centura Manual”, “SEMI E10 Standard”).
  • collection: The name of the vector database collection it was found in (e.g., "equipment_manuals").
  • score: The similarity score (e.g., cosine similarity) between the query and this document.
  • metadata: A dictionary of additional metadata from the vector DB (e.g., doc_id, date, author).
  • rank: The rank/position of this result within its specific collection.



4. RAGResponse Dataclass

Function: The final, comprehensive output of the RAG pipeline. It contains the answer and all supporting information, making the system transparent and auditable.

Features and Description:

  • answer: The final text generated by the LLM.
  • sources: A list of SearchResult objects that were used as context for the generation.
  • query_intent: The QueryIntent object derived from the user’s original query.
  • context_used: The exact string of concatenated source content that was sent to the LLM. Crucial for debugging.
  • confidence: An overall confidence score (0.0 to 1.0) for the response, calculated based on search scores, intent confidence, and response quality.
  • response_time_ms: The total time taken to process the query, in milliseconds. Key for performance monitoring.
  • tokens_generated: The number of tokens the LLM generated.
  • model_used: The name of the LLM model that generated the answer.
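
A minimal sketch of these three result objects, matching the fields described in sections 2–4:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class QueryIntent:
    intent_type: str                # 'analysis' | 'troubleshooting' | 'optimization' | 'general'
    confidence: float               # 0.0 - 1.0
    entities: List[str] = field(default_factory=list)
    process_modules: List[str] = field(default_factory=list)
    equipment_types: List[str] = field(default_factory=list)


@dataclass
class SearchResult:
    content: str                    # retrieved chunk text
    source: str                     # document origin
    collection: str                 # vector DB collection name
    score: float                    # similarity score
    metadata: Dict[str, Any] = field(default_factory=dict)
    rank: int = 0                   # rank within its collection


@dataclass
class RAGResponse:
    answer: str
    sources: List[SearchResult]
    query_intent: QueryIntent
    context_used: str
    confidence: float
    response_time_ms: float
    tokens_generated: int
    model_used: str
```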



5. QueryProcessor Class

Function: The NLP powerhouse of the RAG engine. It is responsible for understanding, enriching, and refining the user’s raw query to maximize the quality of retrieval.

Key Methods and Description:

  • _initialize_nlp(): Downloads NLTK data and loads the spaCy model for advanced NLP tasks like named entity recognition (NER).
  • classify_intent(query: str) -> QueryIntent: The core intent classification method.
    • How it works: Uses regex patterns to match against the query text to identify keywords for analysis, troubleshooting, and optimization.
    • It scores each intent category and picks the one with the highest score.
    • It calls entity and module extraction methods to populate the QueryIntent object.
  • _extract_entities(query: str): Uses spaCy’s NER and custom regex patterns for semiconductor units (nm, °C, torr) to find specific values and terms in the query.
  • _extract_process_modules(query: str) & _extract_equipment_types(query: str): Simple keyword matchers that check the query against predefined sets of domain-specific terms (lithography, scanner, etcher, etc.).
  • expand_query(query: str, intent: QueryIntent) -> str: Query Expansion. Adds related terms to the query to broaden the search and improve recall. For example, a query about “lithography” might be expanded to include “exposure resist mask reticle overlay”.
  • rewrite_query(query: str, intent: QueryIntent) -> str: Query Rewriting. Converts questions into statements, which are often better for semantic search. For example, “How to fix etcher error 123?” becomes “fix etcher error 123 procedure method steps troubleshooting”.
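
To make the regex-scoring idea concrete, here is a condensed sketch of how classify_intent can work (the patterns are illustrative stand-ins, not the full production lists):

```python
import re

# Illustrative keyword patterns per intent (the real lists are more extensive)
INTENT_PATTERNS = {
    "analysis": [r"\banalyz", r"\bwhat causes\b", r"\btrend\b", r"\bcorrelation\b"],
    "troubleshooting": [r"\bfix\b", r"\berror\b", r"\bnot working\b", r"\broot cause\b"],
    "optimization": [r"\boptimiz", r"\bimprove yield\b", r"\badjust recipe\b"],
}


def classify_intent_sketch(query: str):
    """Score each intent by regex hits; fall back to 'general' when nothing matches."""
    q = query.lower()
    scores = {
        intent: sum(len(re.findall(p, q)) for p in patterns)
        for intent, patterns in INTENT_PATTERNS.items()
    }
    best_intent, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score == 0:
        return "general", 0.0
    # Confidence formula described later in this article: score / (total_score + 1)
    return best_intent, best_score / (sum(scores.values()) + 1)
```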



6. SemiconductorRAGEngine Class

Function: The main orchestrator. It ties all components together to execute the end-to-end RAG workflow.

Key Methods and Description:

  • __init__(config: RAGConfig): Initializes the engine with its configuration, creates a QueryProcessor, and sets up an asynchronous HTTP client.
  • process_query(query: str, ...) -> RAGResponse: The primary public method. Its workflow is:
    1. Intent Classification: Calls query_processor.classify_intent().
    2. Query Enhancement: Expands and rewrites the query if enabled.
    3. Semantic Search: Calls _semantic_search() with the processed query.
    4. Context Building: Calls _build_context() to format the retrieved results into a single string for the LLM.
    5. Response Generation: Calls _generate_response() to send the prompt (with context) to the LLM service.
    6. Confidence Calculation: Calls _calculate_confidence().
    7. Metrics & Logging: Records the processing time and logs metrics.
  • _semantic_search(query: str, ...) -> List[SearchResult]: Communicates with the vector database service. It uses the QueryIntent to _select_collections() (e.g., search failure_analysis for troubleshooting intents) and performs a cross-collection search.
  • _build_context(search_results: List[SearchResult], ...) -> str: Intelligently combines the top search results into a single string, ensuring it doesn’t exceed the max_context_length. It adds source attribution ([Source: ...]) and uses _diversify_results() to include results from different sources/collections, avoiding redundant information.
  • _generate_response(original_query: str, context: str, ...) -> Dict[str, Any]: Constructs a detailed prompt for the LLM. It uses _build_system_prompt() to create an expert persona and task instructions tailored to the query intent. It then sends this prompt to the LLM service via HTTP and returns the result.
  • get_health_status() -> Dict[str, Any]: A crucial method for deployment. It performs health checks on all dependent services (vector DB, LLM service) and reports their status, allowing for monitoring and alerting.
  • shutdown(): Properly closes the HTTP client to avoid resource leaks during application shutdown.



7. Global Instance & Getter Function

  • _rag_engine = None: A global variable to hold a singleton instance of the SemiconductorRAGEngine.
  • get_rag_engine(config: Optional[RAGConfig] = None) -> SemiconductorRAGEngine: A factory function that implements the Singleton pattern. It ensures only one instance of the RAG engine exists, which is efficient for resource management (e.g., HTTP connection pooling, loading models once). If no instance exists, it creates one; otherwise, it returns the existing one.



Information Flow Explanation

  1. User Query Initiation:

    • The process starts when a user submits a query through a User Interface (Web/Mobile App)
    • The query is sent to the RAG Engine’s Query Processing module
  2. Query Processing Phase:

    • Intent Classification analyzes the query to determine its purpose (troubleshooting, optimization, analysis, or general)
    • Query Expansion adds relevant semiconductor-specific terms to improve retrieval
    • Query Rewriting reformulates questions into statements for better semantic search
  3. Semantic Search Phase:

    • The processed query is used to search the Vector Database
    • Based on the determined intent, specific knowledge collections are targeted:
      • Troubleshooting → Failure Analysis, Equipment Manuals, SOP Documents
      • Optimization → Process Knowledge, Yield Learning
      • Analysis → Technical Reports, Process Knowledge, Yield Learning
    • Results are ranked by relevance and filtered by similarity threshold
  4. Context Construction:

    • The top search results are compiled into a context string
    • Context is formatted with source attribution and trimmed to fit the LLM’s context window
    • Diversification ensures information comes from multiple sources/collections
  5. Response Generation:

    • A specialized prompt is created combining system instructions, context, and the original query
    • The prompt is sent to the LLM Service for generation
    • The LLM returns a generated response grounded in the provided context
  6. Response Delivery:

    • The final response is packaged with sources, confidence score, and metadata
    • The RAGResponse is returned to the User Interface
    • Logs and metrics are recorded throughout the process



Key Architectural Features

  • Modular Design: Each component has a specific responsibility, making the system maintainable and extensible
  • Semantic Understanding: Goes beyond keyword matching to understand query intent and context
  • Domain Specialization: Tailored specifically for semiconductor manufacturing with specialized knowledge collections
  • Transparency: Provides sources and confidence scores to build trust in the AI’s responses
  • Performance Monitoring: Built-in logging and metrics tracking for operational oversight

This architecture enables the RAG engine to provide accurate, context-aware responses to complex technical queries in the semiconductor manufacturing domain.


The remainder of this article provides a more detailed, component-by-component breakdown of the Python code, which implements a Retrieval-Augmented Generation (RAG) engine tailored for the semiconductor manufacturing AI ecosystem. The system combines semantic search, natural language processing (NLP), and Large Language Model (LLM) generation to deliver accurate, context-aware responses to technical queries.




📦 Overview of the System Architecture

The RAG engine operates in a modular pipeline:

  1. Query Processing: Enhance and classify user queries.
  2. Semantic Search: Retrieve relevant documents from a vector database.
  3. Context Building: Assemble and deduplicate retrieved information.
  4. Response Generation: Use an LLM with context to generate a response.
  5. Confidence Scoring & Response Packaging: Evaluate reliability and return structured output.

It uses:

  • HTTP clients (httpx, aiohttp) to communicate with microservices.
  • NLP libraries (nltk, spacy, sentence-transformers) for query understanding.
  • Dataclasses for clean data modeling.
  • Asynchronous programming (asyncio) for high performance.



🔧 Key Components and Their Functions




✅ 1. RAGConfig (Dataclass)

Purpose: Central configuration container for the RAG engine.

| Field | Type | Default Value | Description |
|---|---|---|---|
| vector_db_url | str | "http://localhost:8091" | Endpoint for the semantic search/vector DB service. |
| llm_service_url | str | "http://localhost:8092" | LLM inference API endpoint. |
| data_warehouse_url | str | "http://localhost:8089" | Not used directly; reserved for structured data integration. |
| data_lake_url | str | "http://localhost:8090" | Reserved for unstructured data access. |
| max_context_length | int | 4000 | Max number of characters allowed in the context passed to the LLM. |
| top_k_documents | int | 10 | Number of top documents to retrieve from search. |
| similarity_threshold | float | 0.7 | Minimum similarity score for a document to be included. |
| context_overlap | int | 200 | Unused in current code; likely intended for overlapping context chunks. |
| default_model | str | "llama2-7b-semiconductor" | Default LLM model name. |
| max_tokens | int | 1024 | Max tokens to generate in the response. |
| temperature | float | 0.3 | Controls randomness in LLM output (low = more deterministic). |
| enable_query_expansion | bool | True | Enables adding domain-specific terms to the query. |
| enable_query_rewriting | bool | True | Rewrites questions into statements for better retrieval. |
| enable_intent_classification | bool | True | Enables intent detection (analysis, troubleshooting, etc.). |
| include_sources | bool | True | Whether to include source attribution in the response. |
| max_sources | int | 5 | Max number of sources to include in the final response. |
| confidence_threshold | float | 0.6 | Minimum confidence to consider a response reliable. |

🛠️ Usage: This config is passed to the main engine and influences every stage of processing.
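
For example, a caller tuning the engine for stricter retrieval might override only a few knobs (building on the dataclass sketched earlier):

```python
# Tighten retrieval and make generation more deterministic
config = RAGConfig(
    similarity_threshold=0.8,  # keep only close semantic matches
    top_k_documents=5,
    temperature=0.1,
)
engine = get_rag_engine(config)  # factory function, covered at the end of this article
```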




✅ 2. QueryIntent (Dataclass)

Purpose: Represents the classified intent and extracted entities from a user query.

| Field | Type | Description |
|---|---|---|
| intent_type | str | One of: analysis, troubleshooting, optimization, general. Guides response strategy. |
| confidence | float | Confidence score (0–1) of the intent classification. |
| entities | List[str] | Extracted technical terms (e.g., “100nm”, “500°C”). |
| process_modules | List[str] | Identified semiconductor processes (e.g., “lithography”, “etch”). |
| equipment_types | List[str] | Equipment mentioned (e.g., “scanner”, “implanter”). |

🎯 Purpose: Enables context-aware retrieval and response generation by tailoring behavior to the query type.




✅ 3. SearchResult (Dataclass)

Purpose: Encapsulates a single document retrieved during semantic search.

| Field | Type | Description |
|---|---|---|
| content | str | The actual text content of the document. |
| source | str | Origin of the document (e.g., “manual_etcher_v2.pdf”). |
| collection | str | Database collection (e.g., “equipment_manuals”). |
| score | float | Similarity score from vector search (0–1). |
| metadata | Dict[str, Any] | Additional metadata (e.g., author, date, process step). |
| rank | int | Rank within its collection. |

🔍 Used to build the context and attribute sources in the final answer.




✅ 4. RAGResponse (Dataclass)

Purpose: Final structured response returned to the caller.

| Field | Type | Description |
|---|---|---|
| answer | str | The generated natural language answer. |
| sources | List[SearchResult] | Top sources used to generate the answer. |
| query_intent | QueryIntent | Classified intent of the original query. |
| context_used | str | Full context string sent to the LLM. |
| confidence | float | Overall confidence in the answer (0–1). |
| response_time_ms | float | Total processing time in milliseconds. |
| tokens_generated | int | Number of tokens in the LLM output. |
| model_used | str | Name of the LLM used. |

📦 This is the final output of the RAG pipeline — rich, traceable, and measurable.




🧠 QueryProcessor Class

Processes and enhances user queries before retrieval.



🔹 Initialization (__init__)

  • Loads NLP models (spaCy, NLTK).
  • Defines domain-specific vocabularies:
    • process_modules: Lithography, etch, deposition, etc.
    • equipment_types: Scanner, etcher, furnace, etc.
    • measurement_types: CD, overlay, thickness, etc.
  • Downloads required NLTK datasets (punkt, stopwords, averaged_perceptron_tagger).
  • Attempts to load en_core_web_sm spaCy model; falls back to regex if unavailable.

⚠️ Error Handling: Logs warnings if NLP models fail to load.




🔹 classify_intent(query: str) -> QueryIntent

Analyzes the query to determine user intent using regex-based pattern matching.



Intent Types:

| Intent | Trigger Keywords/Patterns |
|---|---|
| Analysis | “analyze”, “what causes”, “trend”, “correlation” |
| Troubleshooting | “fix”, “error”, “not working”, “root cause” |
| Optimization | “optimize”, “improve yield”, “adjust recipe” |
| General | Default fallback |



Logic:

  1. Scores each intent based on regex matches.
  2. Selects the highest-scoring intent.
  3. Calculates confidence as `score / (total_score + 1)`.
  4. Extracts entities and domain-specific terms.
  5. Returns a QueryIntent object.

✅ Used to route to appropriate knowledge bases and customize prompts.




🔹 _extract_entities(query: str)

Extracts technical measurements using regex patterns:

  • Dimensions: `\d+nm`, `\d+um`
  • Temperature: `\d+°C`, `\d+K`
  • Pressure: `\d+torr`, `\d+mbar`
  • Percentages: `\d+%`
  • Angstroms: `\d+A`

🧩 If spaCy is available, it uses NER; otherwise, falls back to regex.
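
A sketch of the regex fallback, using the unit patterns listed above (the optional-whitespace and decimal handling are assumptions):

```python
import re

# Measurement patterns mirroring the list above
ENTITY_PATTERNS = [
    r"\d+(?:\.\d+)?\s?(?:nm|um)",      # dimensions
    r"\d+(?:\.\d+)?\s?(?:°C|K)\b",     # temperature
    r"\d+(?:\.\d+)?\s?(?:torr|mbar)",  # pressure
    r"\d+(?:\.\d+)?\s?%",              # percentages
    r"\d+(?:\.\d+)?\s?A\b",            # angstroms
]


def extract_entities_sketch(query: str) -> list:
    entities = []
    for pattern in ENTITY_PATTERNS:
        entities.extend(re.findall(pattern, query))
    return entities


# extract_entities_sketch("etch at 650°C and 10 torr on the 5nm node")
# -> ['5nm', '650°C', '10 torr']
```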




🔹 _extract_process_modules(query: str)

Finds matches from the self.process_modules set (e.g., “litho”, “etch”).

🔎 Case-insensitive substring matching.




🔹 _extract_equipment_types(query: str)

Same logic as above, but for equipment (e.g., “scanner”, “furnace”).




🔹 expand_query(query: str, intent: QueryIntent)

Adds domain-specific synonyms based on detected process modules.



Example:

  • Query: "lithography issue"
  • Expanded: "lithography issue exposure resist mask overlay"

📚 Enhances recall in semantic search by adding related terms.
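
A sketch of how such a hard-coded expansion map can work (the term lists are illustrative):

```python
# Illustrative expansion terms keyed by detected process module
EXPANSION_TERMS = {
    "lithography": "exposure resist mask overlay",
    "etch": "plasma selectivity anisotropy chamber",
    "deposition": "film thickness uniformity precursor",
}


def expand_query_sketch(query: str, process_modules: list) -> str:
    extra = " ".join(
        EXPANSION_TERMS[m] for m in process_modules if m in EXPANSION_TERMS
    )
    return f"{query} {extra}" if extra else query


# expand_query_sketch("lithography issue", ["lithography"])
# -> "lithography issue exposure resist mask overlay"
```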




🔹 rewrite_query(query: str, intent: QueryIntent)

Converts questions into statements for better semantic search matching.



Rewriting Rules:

| Input | Output |
|---|---|
| What is lithography? | lithography definition explanation |
| How to fix etcher error? | fix etcher error troubleshooting root cause solution |

Also appends intent-specific suffixes like:

  • " troubleshooting root cause solution" for troubleshooting queries.

🔁 Improves retrieval accuracy by aligning with how documents are phrased.
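
A sketch of this rule-based rewriting (the question-lead patterns and suffix strings are illustrative):

```python
import re

# Intent-specific suffixes appended after the question-to-statement conversion
INTENT_SUFFIXES = {
    "troubleshooting": " troubleshooting root cause solution",
    "general": " definition explanation",
}


def rewrite_query_sketch(query: str, intent_type: str) -> str:
    q = query.strip().rstrip("?")
    # Strip common question leads so the query reads like a statement
    q = re.sub(r"^(what is|how to|why is|how do i)\s+", "", q, flags=re.IGNORECASE)
    return q + INTENT_SUFFIXES.get(intent_type, "")


# rewrite_query_sketch("How to fix etcher error?", "troubleshooting")
# -> "fix etcher error troubleshooting root cause solution"
```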




🤖 SemiconductorRAGEngine Class

Main orchestrator of the RAG pipeline.




🔹 Initialization (__init__)

  • Accepts a RAGConfig.
  • Initializes QueryProcessor.
  • Sets up httpx.AsyncClient for async HTTP calls.



🔹 process_query(query: str, context: Optional[Dict]) -> RAGResponse

Core method: Executes the full RAG pipeline.



Steps:

  1. Intent Classification: query_processor.classify_intent()
  2. Query Enhancement:

    • Expand query (if enabled)
    • Rewrite query (if enabled)
  3. Semantic Search: _semantic_search()
  4. Context Building: _build_context()
  5. LLM Generation: _generate_response()
  6. Confidence Calculation: _calculate_confidence()
  7. Response Packaging: Create RAGResponse

⏱️ Measures and logs response time and metrics.

❌ On error, returns a safe fallback message.
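
In outline, the orchestration can be sketched like this (simplified; helper names follow this article's descriptions, while the shape of the generation dict is an assumption):

```python
import time


async def process_query_sketch(self, query: str) -> RAGResponse:
    start = time.time()
    intent = self.query_processor.classify_intent(query)                # 1. intent
    processed = query
    if self.config.enable_query_expansion:                              # 2. enhancement
        processed = self.query_processor.expand_query(processed, intent)
    if self.config.enable_query_rewriting:
        processed = self.query_processor.rewrite_query(processed, intent)
    results = await self._semantic_search(processed, intent)            # 3. search
    context = self._build_context(results)                              # 4. context
    generation = await self._generate_response(query, context, intent)  # 5. LLM
    confidence = self._calculate_confidence(results, intent, generation["text"])  # 6.
    return RAGResponse(                                                 # 7. packaging
        answer=generation["text"],
        sources=results[: self.config.max_sources],
        query_intent=intent,
        context_used=context,
        confidence=confidence,
        response_time_ms=(time.time() - start) * 1000,
        tokens_generated=generation.get("tokens_generated", 0),
        model_used=self.config.default_model,
    )
```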




🔹 _semantic_search(query: str, intent: QueryIntent)

Performs cross-collection vector search via HTTP.



Steps:

  1. Select Collections based on intent using _select_collections().
  2. Send a POST request to /search/cross-collection with:

```json
{
  "query": "processed query",
  "collections": ["failure_analysis", "equipment_manuals"],
  "top_k_per_collection": 5
}
```

  3. Parse results into SearchResult objects.
  4. Filter by similarity_threshold.
  5. Return the top top_k_documents results.

🔍 Uses intent to focus search scope and improve relevance.
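
A sketch of that call using httpx (the endpoint path and request payload follow the description above; the response field names are assumptions):

```python
import httpx


async def semantic_search_sketch(
    client: httpx.AsyncClient, config: RAGConfig, query: str, collections: list
) -> list:
    resp = await client.post(
        f"{config.vector_db_url}/search/cross-collection",
        json={"query": query, "collections": collections, "top_k_per_collection": 5},
    )
    resp.raise_for_status()
    results = [
        SearchResult(
            content=hit["content"],
            source=hit["source"],
            collection=hit["collection"],
            score=hit["score"],
            metadata=hit.get("metadata", {}),
            rank=i,
        )
        for i, hit in enumerate(resp.json().get("results", []))
    ]
    # Filter by threshold, then keep the overall top-k by score
    results = [r for r in results if r.score >= config.similarity_threshold]
    results.sort(key=lambda r: r.score, reverse=True)
    return results[: config.top_k_documents]
```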




🔹 _select_collections(intent: QueryIntent)

Maps intent to relevant document collections.

| Intent | Collections Added |
|---|---|
| troubleshooting | failure_analysis, equipment_manuals, sop_documents |
| optimization | process_knowledge, yield_learning |
| analysis | process_knowledge, yield_learning |
| general | All collections |

In addition, detected process modules add process_knowledge, and detected equipment types add equipment_manuals.

🎯 Dynamic collection selection improves efficiency and relevance.
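
A sketch of that mapping (collection names taken from the table above):

```python
ALL_COLLECTIONS = [
    "failure_analysis", "equipment_manuals", "sop_documents",
    "process_knowledge", "yield_learning",
]

INTENT_COLLECTIONS = {
    "troubleshooting": ["failure_analysis", "equipment_manuals", "sop_documents"],
    "optimization": ["process_knowledge", "yield_learning"],
    "analysis": ["process_knowledge", "yield_learning"],
}


def select_collections_sketch(intent: QueryIntent) -> list:
    # 'general' (or any unknown intent) falls back to searching everything
    collections = list(INTENT_COLLECTIONS.get(intent.intent_type, ALL_COLLECTIONS))
    if intent.process_modules and "process_knowledge" not in collections:
        collections.append("process_knowledge")
    if intent.equipment_types and "equipment_manuals" not in collections:
        collections.append("equipment_manuals")
    return collections
```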




🔹 _build_context(search_results: List[SearchResult])

Assembles search results into a context string for the LLM.



Features:

  • Limits total length to max_context_length.
  • Adds source attribution: [Source: manual.pdf | Collection: equipment_manuals].
  • Uses _diversify_results() to avoid redundancy.
  • Inserts --- separators between documents.

📑 Ensures traceability and avoids context overflow.
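
A sketch of the assembly loop (the length accounting is simplified; it reuses the diversification helper sketched in the next subsection):

```python
def build_context_sketch(results: list, max_context_length: int) -> str:
    """Concatenate diversified results with attribution, respecting the length budget."""
    parts, used = [], 0
    for r in diversify_results_sketch(results):
        chunk = f"[Source: {r.source} | Collection: {r.collection}]\n{r.content}"
        if used + len(chunk) > max_context_length:
            break  # stop before overflowing the LLM's context budget
        parts.append(chunk)
        used += len(chunk)
    return "\n---\n".join(parts)
```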




🔹 _diversify_results(results: List[SearchResult])

Prevents redundancy by ensuring diversity across:

  • Sources (e.g., different PDFs)
  • Collections (e.g., manuals vs. reports)



Algorithm:

  1. First pass: Pick top results from unique source+collection pairs.
  2. Second pass: Fill remaining slots with highest-scoring leftovers.

🔄 Avoids “echo chamber” effect where similar documents dominate.
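
A sketch of the two-pass selection:

```python
def diversify_results_sketch(results: list) -> list:
    """First pass keeps the best hit per (source, collection) pair;
    second pass appends the remaining hits in score order."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    seen, picked, leftovers = set(), [], []
    for r in ranked:
        key = (r.source, r.collection)
        if key not in seen:
            seen.add(key)
            picked.append(r)
        else:
            leftovers.append(r)
    return picked + leftovers
```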




🔹 _generate_response(original_query, context, intent)

Sends a prompt to the LLM service.



Prompt Structure:

```text
[SYSTEM PROMPT]
Context Information:
[Source: X] ...
[Source: Y] ...

User Question: What causes CD variation in lithography?
Response:
```



System Prompt:

Built via _build_system_prompt(intent) — includes:

  • Role definition: “Expert semiconductor engineer”
  • Domain knowledge: SEMI standards, process modules, metrology
  • Intent-specific instructions:

    • Troubleshooting → Root cause, steps, solutions
    • Optimization → Trade-offs, recommendations
    • Analysis → Trends, statistical significance

🧠 Tailored prompts lead to higher-quality, domain-appropriate answers.
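
A sketch of how such a prompt builder can look (the instruction strings are illustrative, not the production prompt text):

```python
# Illustrative intent-specific instructions
INTENT_INSTRUCTIONS = {
    "troubleshooting": "Identify likely root causes, verification steps, and corrective actions.",
    "optimization": "Discuss trade-offs and give concrete, prioritized recommendations.",
    "analysis": "Describe trends and comment on statistical significance.",
}


def build_system_prompt_sketch(intent_type: str) -> str:
    base = (
        "You are an expert semiconductor process engineer familiar with SEMI "
        "standards, process modules, and metrology. Answer strictly from the "
        "provided context and cite sources."
    )
    return base + " " + INTENT_INSTRUCTIONS.get(intent_type, "Answer clearly and concisely.")
```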




🔹 _calculate_confidence(...) -> float

Estimates reliability of the response using weighted factors:

| Factor | Weight | Description |
|---|---|---|
| Avg. search score | 40% | Quality of retrieved documents |
| Intent confidence | 30% | How sure we are about the query type |
| Number of sources | 20% | More sources → higher confidence |
| Response length | 10% | Longer = more detailed (proxy) |

📊 Final score clamped to 0–1. Used for quality monitoring and routing decisions.
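
A sketch of the weighted blend (weights from the table above; the normalization constants for source count and response length are assumptions):

```python
def calculate_confidence_sketch(
    search_results: list, intent_confidence: float, answer: str, max_sources: int = 5
) -> float:
    avg_score = (
        sum(r.score for r in search_results) / len(search_results)
        if search_results else 0.0
    )
    source_factor = min(len(search_results) / max_sources, 1.0)
    length_factor = min(len(answer) / 500, 1.0)  # 500-char normalization is an assumption
    confidence = (
        0.4 * avg_score            # quality of retrieved documents
        + 0.3 * intent_confidence  # certainty about the query type
        + 0.2 * source_factor      # more sources -> higher confidence
        + 0.1 * length_factor      # response length as a detail proxy
    )
    return max(0.0, min(confidence, 1.0))  # clamp to 0-1
```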




🔹 get_health_status() -> Dict

Checks availability of dependent services:

| Service | Check Endpoint | Healthy Status |
|---|---|---|
| Vector DB | /health | HTTP 200 |
| LLM Service | /health | HTTP 200 |

Returns:

```json
{
  "status": "healthy|degraded|error",
  "services": {
    "vector_db": "healthy",
    "llm_service": "unreachable"
  }
}
```

🩺 Used for monitoring and auto-recovery.




🔹 shutdown()

Safely closes the HTTP client to prevent resource leaks.

🧼 Ensures clean shutdown in async environments.




🌐 Global Instance & Factory Function

```python
_rag_engine = None


def get_rag_engine(config: Optional[RAGConfig] = None) -> SemiconductorRAGEngine:
    global _rag_engine
    if _rag_engine is None:
        if config is None:
            config = RAGConfig()
        _rag_engine = SemiconductorRAGEngine(config)
    return _rag_engine
```

✅ Implements a singleton pattern to avoid recreating the engine.
💡 Ensures efficient resource usage in web servers or APIs.




📊 Logging and Metrics

  • logger: For operational logs (rag_manager).
  • metrics_logger: For observability (counters, histograms). It records:
    • rag_queries_processed: Counter
    • rag_response_time_ms: Histogram

📈 Enables performance monitoring, debugging, and SLA tracking.




Summary of Key Features

| Feature | Description |
|---|---|
| Domain-Specific NLP | Understands semiconductor terms, units, and processes. |
| Intent-Aware Routing | Tailors search and response based on query type. |
| Query Expansion & Rewriting | Boosts retrieval recall and precision. |
| Multi-Source Context Building | Aggregates and deduplicates knowledge. |
| Confidence Scoring | Quantifies answer reliability. |
| Source Attribution | Transparent, auditable responses. |
| Async & Scalable | Uses asyncio and httpx for high throughput. |
| Health Monitoring | Checks service dependencies. |
| Configurable | All parameters are adjustable via RAGConfig. |



🚀 Use Case Example

User Query:

“Why is my etcher showing high particle counts after cleaning?”

Pipeline:

  1. Intent: troubleshooting (high confidence)
  2. Entities: “particle counts”
  3. Process Module: “etch”
  4. Query Rewritten: "high particle counts after cleaning troubleshooting root cause solution"
  5. Search Collections: failure_analysis, equipment_manuals, sop_documents
  6. Retrieved Docs: SOP for post-cleaning verification, particle fault tree
  7. LLM Prompt: “Root cause analysis for high particle counts after etcher cleaning…”
  8. Response: Lists likely causes (residue, chamber contamination), steps to verify, and corrective actions.
  9. Sources: Cites SOP and failure analysis report.
  10. Confidence: 0.82



🛑 Potential Improvements

| Area | Suggestion |
|---|---|
| Query Expansion | Use embeddings to find semantically similar terms instead of hard-coded lists. |
| Entity Extraction | Fine-tune spaCy on semiconductor documents for better NER. |
| Caching | Cache frequent queries to reduce latency. |
| Fallback Mechanism | If confidence < threshold, escalate to a human expert. |
| Feedback Loop | Allow users to rate responses to improve over time. |



Conclusion

This RAG engine is a robust, domain-optimized AI assistant for semiconductor manufacturing. It combines semantic search, intent classification, context-aware prompting, and confidence estimation to deliver accurate, explainable, and trustworthy responses.

It is well-structured, modular, and production-ready, making it ideal for integration into:

  • Factory support systems
  • Engineer chatbots
  • Yield enhancement platforms
  • Knowledge management portals

📌 Final Note: This code exemplifies enterprise-grade RAG design — combining NLP, search, and LLMs in a reliable, observable, and maintainable way.


