feat(llm-application-dev): modernize to LangGraph and latest models v2.0.0

- Migrate from LangChain 0.x to LangChain 1.x/LangGraph patterns
- Update model references to Claude 4.5 and GPT-5.2
- Add Voyage AI as primary embedding recommendation
- Add structured outputs with Pydantic
- Replace deprecated initialize_agent() with StateGraph
- Fix security: use AST-based safe math instead of unsafe execution
- Add plugin.json and README.md for consistency
- Bump marketplace version to 1.3.3
Seth Hobson
2026-01-19 15:43:25 -05:00
parent e827cc713a
commit 8be0e8ac7a
12 changed files with 1940 additions and 708 deletions


### 1. Vector Databases
**Purpose**: Store and retrieve document embeddings efficiently
**Options:**
- **Pinecone**: Managed, scalable, serverless
- **Weaviate**: Open-source, hybrid search, GraphQL
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use, local development
- **Qdrant**: Fast, filtered search, Rust-based
- **pgvector**: PostgreSQL extension, SQL integration
### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search
**Models (2026):**
| Model | Dimensions | Best For |
|-------|------------|----------|
| **voyage-3-large** | 1024 | Claude apps (Anthropic recommended) |
| **voyage-code-3** | 1024 | Code search |
| **text-embedding-3-large** | 3072 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | Open source, local deployment |
| **multilingual-e5-large** | 1024 | Multi-language support |
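Whichever model you choose, usage follows the same pattern: embed text into vectors, then compare them with cosine similarity. A minimal sketch using the LangChain embeddings interface (voyage-3-large only mirrors the table above; the example texts are made up):
```python
import math

from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(model="voyage-3-large")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = embeddings.embed_query("How do I rotate API keys?")
doc_vecs = embeddings.embed_documents([
    "Rotate keys every 90 days via the security console.",
    "Our office is closed on public holidays.",
])

# Higher score means the chunk is semantically closer to the query
for text, vec in zip(["security doc", "holidays doc"], doc_vecs):
    print(text, round(cosine_similarity(query_vec, vec), 3))
```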
### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse with weighted fusion
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents for better retrieval
### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results
**Methods:**
- **Cross-Encoders**: BERT-based reranking (ms-marco-MiniLM)
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance
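Cross-encoder, Cohere, and MMR reranking are shown later in this document. LLM-based scoring is not, so here is a minimal sketch of the idea, assuming the LLM is asked to rate each candidate on a 0-10 scale (prompt wording and scale are illustrative, not a fixed API):
```python
from pydantic import BaseModel, Field
from langchain_core.documents import Document

class RelevanceScore(BaseModel):
    score: int = Field(description="Relevance of the document to the query, 0-10")

async def llm_rerank(llm, query: str, docs: list[Document], top_k: int = 5) -> list[Document]:
    """Ask the LLM to score each candidate, then keep the top_k."""
    scorer = llm.with_structured_output(RelevanceScore)
    scored = []
    for doc in docs:
        result = await scorer.ainvoke(
            f"Query: {query}\n\nDocument:\n{doc.page_content}\n\n"
            "Rate how relevant this document is to the query on a 0-10 scale."
        )
        scored.append((doc, result.score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]
```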
## Quick Start with LangGraph
```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
```
## Advanced RAG Patterns
### Pattern 1: Hybrid Search with RRF
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
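The ensemble behaves like any other retriever; a short usage sketch (the query string is just an example):
```python
# Results are fused from both retrievers (keyword + semantic)
results = await ensemble_retriever.ainvoke("How do I configure single sign-on?")
for doc in results:
    print(doc.metadata.get("source"), doc.page_content[:80])
```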
### Pattern 2: Multi-Query Retrieval
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")
```
### Pattern 3: Contextual Compression
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")
```
### Pattern 4: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)
# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```
### Pattern 5: HyDE (Hypothetical Document Embeddings)
```python
from langchain_core.prompts import ChatPromptTemplate
class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

Question: {question}

Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}
# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
hyde_rag = builder.compile()
```
## Document Chunking Strategies
### Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)
chunks = splitter.split_documents(documents)
```
### Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```
### Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```
### Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```
## Vector Store Configurations
### Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```
### Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```
### Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"  # local persistence for development
)
```
### pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector
connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```
## Retrieval Optimization
### 1. Metadata Filtering
```python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),  # your own tagging helper
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
### 2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
)
```
### 3. Cross-Encoder Reranking
```python
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
### 4. Cohere Rerank
```python
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```
## Prompt Engineering for RAG
### Contextual Prompt with Citations
```python
prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."
rag_prompt = ChatPromptTemplate.from_template(
"""Answer the question based on the context below. Include citations using [1], [2], etc.
Context:
{context}
If you cannot answer based on the context, say "I don't have enough information."
Question: {question}
Context:
{context}
Answer:"""
Question: {question}
Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty
Answer (with citations):"""
)
```
### Structured Output for RAG
```python
prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.
from pydantic import BaseModel, Field
Context:
{context}
class RAGResponse(BaseModel):
answer: str = Field(description="The answer based on context")
confidence: float = Field(description="Confidence score 0-1")
sources: list[str] = Field(description="Source document IDs used")
reasoning: str = Field(description="Brief reasoning for the answer")
Question: {question}
Answer (with citations):"""
```
### With Confidence
```python
prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.
Context:
{context}
Question: {question}
Answer:
Confidence:"""
# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
```
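One possible way to use the structured model, reusing `rag_prompt` from the section above (the query and the `docs` variable are placeholders for whatever your retriever returned):
```python
# The structured model is invoked like the plain one, but returns a RAGResponse
context_text = "\n\n".join(doc.page_content for doc in docs)  # docs: retrieved documents
response = await structured_llm.ainvoke(
    rag_prompt.format_messages(context=context_text, question="What are the main features?")
)
print(response.answer, response.confidence, response.sources)
```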
## Evaluation Metrics
```python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)
        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
```
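The loop above calls `evaluate_answer_quality`, which is not defined in this skill. A minimal LLM-as-judge sketch, assuming structured output with the same `llm` used elsewhere in this document (field names simply mirror the metrics dictionary; the prompt wording is illustrative):
```python
from pydantic import BaseModel, Field

class JudgeScores(BaseModel):
    relevance: float = Field(description="Answer addresses the question, 0-1")
    faithfulness: float = Field(description="Answer is grounded in the context, 0-1")
    context_relevance: float = Field(description="Context is relevant to the question, 0-1")

judge_llm = llm.with_structured_output(JudgeScores)

judge_prompt = ChatPromptTemplate.from_template(
    """Score this RAG output on three 0-1 scales: relevance, faithfulness, context_relevance.

Question: {question}

Context:
{context}

Answer: {answer}

Reference answer (may be empty): {expected}"""
)

async def evaluate_answer_quality(question, answer, context, expected=None) -> dict:
    """LLM-as-judge scoring for a single RAG result."""
    context_text = "\n\n".join(doc.page_content for doc in context)
    messages = judge_prompt.format_messages(
        question=question,
        answer=answer,
        context=context_text,
        expected=expected or ""
    )
    scores = await judge_llm.ainvoke(messages)
    return scores.model_dump()  # keys match the metrics collected above
```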
## Resources
- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangGraph RAG Examples](https://langchain-ai.github.io/langgraph/tutorials/rag/)
- [Pinecone Best Practices](https://docs.pinecone.io/guides/get-started/overview)
- [Voyage AI Embeddings](https://docs.voyageai.com/)
- [RAG Evaluation Guide](https://docs.ragas.io/)
## Best Practices
1. **Chunk Size**: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best recall
5. **Reranking**: Use cross-encoder reranking for precision-critical applications
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics and latency in production
## Common Issues
- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed, check chunking
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step (see the sketch after this list)
- **Context Too Long**: Use compression or parent document retriever
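For the hallucination issue above, one mitigation is a verification step between generation and the final answer. A minimal sketch reusing `RAGState`, `rag_prompt`-style templating, and `llm` from the Quick Start (the prompt wording and fallback message are illustrative assumptions):
```python
verify_prompt = ChatPromptTemplate.from_template(
    """Context:
{context}

Draft answer: {answer}

Is every claim in the draft answer supported by the context? Reply "yes" or "no"."""
)

async def verify(state: RAGState) -> RAGState:
    """Grounding check: keep the answer only if the model judges it supported."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = verify_prompt.format_messages(
        context=context_text,
        answer=state["answer"]
    )
    verdict = await llm.ainvoke(messages)
    if "yes" not in verdict.content.lower():
        # Fall back to an explicit refusal instead of an ungrounded answer
        return {"answer": "I don't have enough information to answer reliably."}
    return {"answer": state["answer"]}
```
In the Quick Start graph, this node would sit between `generate` and `END`.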