feat(llm-application-dev): modernize to LangGraph and latest models v2.0.0

- Migrate from LangChain 0.x to LangChain 1.x/LangGraph patterns
- Update model references to Claude 4.5 and GPT-5.2
- Add Voyage AI as primary embedding recommendation
- Add structured outputs with Pydantic
- Replace deprecated initialize_agent() with StateGraph
- Fix security: use AST-based safe math instead of unsafe execution
- Add plugin.json and README.md for consistency
- Bump marketplace version to 1.3.3
Seth Hobson
2026-01-19 15:43:25 -05:00
parent e827cc713a
commit 8be0e8ac7a
12 changed files with 1940 additions and 708 deletions


### 1. Vector Databases
**Purpose**: Store and retrieve document embeddings efficiently
**Options:**
- **Pinecone**: Managed, scalable, serverless
- **Weaviate**: Open-source, hybrid search, GraphQL
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use, local development
- **Qdrant**: Fast, filtered search, Rust-based
- **pgvector**: PostgreSQL extension, SQL integration
### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search
**Models (2026):**
| Model | Dimensions | Best For |
|-------|------------|----------|
| **voyage-3-large** | 1024 | Claude apps (Anthropic recommended) |
| **voyage-code-3** | 1024 | Code search |
| **text-embedding-3-large** | 3072 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | Open source, local deployment |
| **multilingual-e5-large** | 1024 | Multi-language support |
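Whichever model you choose, usage follows the same pattern: embed text into vectors, then compare them with cosine similarity. A minimal sketch using the LangChain embeddings interface (voyage-3-large only mirrors the table above; the example texts are made up):
```python
import math

from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(model="voyage-3-large")

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = embeddings.embed_query("How do I rotate API keys?")
doc_vecs = embeddings.embed_documents([
    "Rotate keys every 90 days via the security console.",
    "Our office is closed on public holidays.",
])

# Higher score means the chunk is semantically closer to the query
for text, vec in zip(["security doc", "holidays doc"], doc_vecs):
    print(text, round(cosine_similarity(query_vec, vec), 3))
```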
### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse with weighted fusion
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents for better retrieval
### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results
**Methods:**
- **Cross-Encoders**: BERT-based reranking (ms-marco-MiniLM)
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance
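Cross-encoder, Cohere, and MMR reranking are shown later in this document. LLM-based scoring is not, so here is a minimal sketch of the idea, assuming the LLM is asked to rate each candidate on a 0-10 scale (prompt wording and scale are illustrative, not a fixed API):
```python
from pydantic import BaseModel, Field
from langchain_core.documents import Document

class RelevanceScore(BaseModel):
    score: int = Field(description="Relevance of the document to the query, 0-10")

async def llm_rerank(llm, query: str, docs: list[Document], top_k: int = 5) -> list[Document]:
    """Ask the LLM to score each candidate, then keep the top_k."""
    scorer = llm.with_structured_output(RelevanceScore)
    scored = []
    for doc in docs:
        result = await scorer.ainvoke(
            f"Query: {query}\n\nDocument:\n{doc.page_content}\n\n"
            "Rate how relevant this document is to the query on a 0-10 scale."
        )
        scored.append((doc, result.score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]
```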
## Quick Start with LangGraph
```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
```
## Advanced RAG Patterns
### Pattern 1: Hybrid Search with RRF
```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
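The ensemble behaves like any other retriever; a short usage sketch (the query string is just an example):
```python
# Results are fused from both retrievers (keyword + semantic)
results = await ensemble_retriever.ainvoke("How do I configure single sign-on?")
for doc in results:
    print(doc.metadata.get("source"), doc.page_content[:80])
```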
### Pattern 2: Multi-Query Retrieval
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")
```
### Pattern 3: Contextual Compression
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")
```
### Pattern 4: Parent Document Retriever
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)
# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)
# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```
### Pattern 5: HyDE (Hypothetical Document Embeddings)
```python
from langchain_core.prompts import ChatPromptTemplate
class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

Question: {question}

Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}
# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
hyde_rag = builder.compile()
```
## Document Chunking Strategies
### Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)
chunks = splitter.split_documents(documents)
```
### Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```
### Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```
### Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```
## Vector Store Configurations
### Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```
### Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```
### Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"  # local persistence for development
)
```
### pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector
connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"
vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```
## Retrieval Optimization
### 1. Metadata Filtering
```python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),  # your own tagging helper
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```
### 2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
)
```
### 3. Cross-Encoder Reranking
```python
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
### 4. Cohere Rerank
```python
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```
## Prompt Engineering for RAG
### Contextual Prompt with Citations
```python
prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information."
rag_prompt = ChatPromptTemplate.from_template(
"""Answer the question based on the context below. Include citations using [1], [2], etc.
Context:
{context}
If you cannot answer based on the context, say "I don't have enough information."
Question: {question}
Context:
{context}
Answer:"""
Question: {question}
Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty
Answer (with citations):"""
)
```
### Structured Output for RAG
```python
prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc.
from pydantic import BaseModel, Field
Context:
{context}
class RAGResponse(BaseModel):
answer: str = Field(description="The answer based on context")
confidence: float = Field(description="Confidence score 0-1")
sources: list[str] = Field(description="Source document IDs used")
reasoning: str = Field(description="Brief reasoning for the answer")
Question: {question}
Answer (with citations):"""
```
### With Confidence
```python
prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer.
Context:
{context}
Question: {question}
Answer:
Confidence:"""
# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
```
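One possible way to use the structured model, reusing `rag_prompt` from the section above (the query and the `docs` variable are placeholders for whatever your retriever returned):
```python
# The structured model is invoked like the plain one, but returns a RAGResponse
context_text = "\n\n".join(doc.page_content for doc in docs)  # docs: retrieved documents
response = await structured_llm.ainvoke(
    rag_prompt.format_messages(context=context_text, question="What are the main features?")
)
print(response.answer, response.confidence, response.sources)
```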
## Evaluation Metrics
```python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)
        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
```
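The loop above calls `evaluate_answer_quality`, which is not defined in this skill. A minimal LLM-as-judge sketch, assuming structured output with the same `llm` used elsewhere in this document (field names simply mirror the metrics dictionary; the prompt wording is illustrative):
```python
from pydantic import BaseModel, Field

class JudgeScores(BaseModel):
    relevance: float = Field(description="Answer addresses the question, 0-1")
    faithfulness: float = Field(description="Answer is grounded in the context, 0-1")
    context_relevance: float = Field(description="Context is relevant to the question, 0-1")

judge_llm = llm.with_structured_output(JudgeScores)

judge_prompt = ChatPromptTemplate.from_template(
    """Score this RAG output on three 0-1 scales: relevance, faithfulness, context_relevance.

Question: {question}

Context:
{context}

Answer: {answer}

Reference answer (may be empty): {expected}"""
)

async def evaluate_answer_quality(question, answer, context, expected=None) -> dict:
    """LLM-as-judge scoring for a single RAG result."""
    context_text = "\n\n".join(doc.page_content for doc in context)
    messages = judge_prompt.format_messages(
        question=question,
        answer=answer,
        context=context_text,
        expected=expected or ""
    )
    scores = await judge_llm.ainvoke(messages)
    return scores.model_dump()  # keys match the metrics collected above
```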
## Resources
- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangGraph RAG Examples](https://langchain-ai.github.io/langgraph/tutorials/rag/)
- [Pinecone Best Practices](https://docs.pinecone.io/guides/get-started/overview)
- [Voyage AI Embeddings](https://docs.voyageai.com/)
- [RAG Evaluation Guide](https://docs.ragas.io/)
## Best Practices
1. **Chunk Size**: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best recall
5. **Reranking**: Use cross-encoder reranking for precision-critical applications
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics and latency in production
## Common Issues
- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed, check chunking
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step (see the sketch after this list)
- **Context Too Long**: Use compression or parent document retriever
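For the hallucination issue above, one mitigation is a verification step between generation and the final answer. A minimal sketch reusing `RAGState`, `rag_prompt`-style templating, and `llm` from the Quick Start (the prompt wording and fallback message are illustrative assumptions):
```python
verify_prompt = ChatPromptTemplate.from_template(
    """Context:
{context}

Draft answer: {answer}

Is every claim in the draft answer supported by the context? Reply "yes" or "no"."""
)

async def verify(state: RAGState) -> RAGState:
    """Grounding check: keep the answer only if the model judges it supported."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = verify_prompt.format_messages(
        context=context_text,
        answer=state["answer"]
    )
    verdict = await llm.ainvoke(messages)
    if "yes" not in verdict.content.lower():
        # Fall back to an explicit refusal instead of an ungrounded answer
        return {"answer": "I don't have enough information to answer reliably."}
    return {"answer": state["answer"]}
```
In the Quick Start graph, this node would sit between `generate` and `END`.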