mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00

feat(llm-application-dev): modernize to LangGraph and latest models v2.0.0

- Migrate from LangChain 0.x to LangChain 1.x/LangGraph patterns
- Update model references to Claude 4.5 and GPT-5.2
- Add Voyage AI as primary embedding recommendation
- Add structured outputs with Pydantic
- Replace deprecated initialize_agent() with StateGraph
- Fix security: use AST-based safe math instead of unsafe execution
- Add plugin.json and README.md for consistency
- Bump marketplace version to 1.3.3
Guide to selecting and optimizing embedding models for vector search applications.

## Core Concepts

### 1. Embedding Model Comparison (2026)

| Model | Dimensions | Max Tokens | Best For |
|-------|------------|------------|----------|
| **voyage-3-large** | 1024 | 32000 | Claude apps (Anthropic recommended) |
| **voyage-3** | 1024 | 32000 | Claude apps, cost-effective |
| **voyage-code-3** | 1024 | 32000 | Code search |
| **voyage-finance-2** | 1024 | 32000 | Financial documents |
| **voyage-law-2** | 1024 | 32000 | Legal documents |
| **text-embedding-3-large** | 3072 | 8191 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | 8191 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | 512 | Open source, local deployment |
| **all-MiniLM-L6-v2** | 384 | 256 | Fast, lightweight |
| **multilingual-e5-large** | 1024 | 512 | Multi-language |
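The dimension column drives index cost directly: at float32 precision each dimension costs 4 bytes, so raw vector storage scales linearly with both corpus size and dimensionality. A quick back-of-envelope sketch (raw vectors only; ANN indexes such as HNSW add further overhead):

```python
def index_size_mib(num_vectors: int, dimensions: int, bytes_per_dim: int = 4) -> float:
    """Raw storage for num_vectors float32 embeddings, in MiB (index overhead excluded)."""
    return num_vectors * dimensions * bytes_per_dim / (1024 ** 2)

# 1M chunks: 1024-dim (voyage-3-large) vs 3072-dim (text-embedding-3-large)
print(index_size_mib(1_000_000, 1024))   # 3906.25 MiB
print(index_size_mib(1_000_000, 3072))   # 11718.75 MiB
```

Tripling dimensions triples storage and similarity-computation cost, which is why the 1024-dim models in the table are attractive at scale.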
```
Document → Chunking → Preprocessing → Embedding Model → Vector
```

## Templates

### Template 1: Voyage AI Embeddings (Recommended for Claude)

```python
from langchain_voyageai import VoyageAIEmbeddings
from typing import List
import os

# Initialize Voyage AI embeddings (recommended by Anthropic for Claude)
embeddings = VoyageAIEmbeddings(
    model="voyage-3-large",
    voyage_api_key=os.environ.get("VOYAGE_API_KEY")
)

def get_embeddings(texts: List[str]) -> List[List[float]]:
    """Get embeddings from Voyage AI."""
    return embeddings.embed_documents(texts)

def get_query_embedding(query: str) -> List[float]:
    """Get a single query embedding."""
    return embeddings.embed_query(query)

# Specialized models for specific domains
code_embeddings = VoyageAIEmbeddings(model="voyage-code-3")
finance_embeddings = VoyageAIEmbeddings(model="voyage-finance-2")
legal_embeddings = VoyageAIEmbeddings(model="voyage-law-2")
```

### Template 2: OpenAI Embeddings

```python
from openai import OpenAI
from typing import List, Optional

client = OpenAI()

def get_embeddings(
    texts: List[str],
    model: str = "text-embedding-3-small",
    dimensions: Optional[int] = None
) -> List[List[float]]:
    """Get embeddings from OpenAI with optional dimension reduction."""
    # Handle batching for large lists
    batch_size = 100
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        kwargs = {"input": batch, "model": model}
        if dimensions:
            # Matryoshka dimensionality reduction
            kwargs["dimensions"] = dimensions

        response = client.embeddings.create(**kwargs)
        all_embeddings.extend(item.embedding for item in response.data)

    return all_embeddings

def get_embedding(text: str, **kwargs) -> List[float]:
    """Get a single embedding."""
    return get_embeddings([text], **kwargs)[0]

# Dimension reduction with Matryoshka embeddings
def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
    """Get an embedding with reduced dimensions (Matryoshka)."""
    return get_embedding(text, dimensions=dimensions)
```
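The `dimensions` parameter leans on Matryoshka representation learning: these models are trained so that a prefix of the full embedding is itself a usable embedding. As a rough client-side picture of what the API does when you pass `dimensions`, the operation is truncate-then-renormalize; a minimal sketch with plain Python lists:

```python
import math
from typing import List

def truncate_embedding(embedding: List[float], dimensions: int) -> List[float]:
    """Keep the first `dimensions` values and rescale back to unit length."""
    reduced = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in reduced))
    return [x / norm for x in reduced] if norm > 0 else reduced

# Truncating [3, 4, 12] to 2 dims keeps [3, 4], then normalizes to unit length
print(truncate_embedding([3.0, 4.0, 12.0], 2))  # [0.6, 0.8]
```

This only yields meaningful vectors for models trained with Matryoshka objectives (such as text-embedding-3); truncating arbitrary embeddings this way degrades quality.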

### Template 3: Local Embeddings with Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List

class LocalEmbedder:
    def __init__(
        self,
        model_name: str = "BAAI/bge-large-en-v1.5",
        device: str = "cuda"
    ):
        self.model = SentenceTransformer(model_name, device=device)
        self.model_name = model_name

    def embed(
        self,
        texts: List[str],
        batch_size: int = 32
    ) -> np.ndarray:
        embeddings = self.model.encode(texts, batch_size=batch_size)
        return embeddings

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a query with the appropriate prefix for retrieval models."""
        # BGE and similar models benefit from a query prefix
        if "bge" in self.model_name.lower():
            query = f"Represent this sentence for searching relevant passages: {query}"
        return self.embed([query])[0]


class E5Embedder:
    def __init__(self, model_name: str = "intfloat/multilingual-e5-large"):
        self.model = SentenceTransformer(model_name)

    def embed_query(self, query: str) -> np.ndarray:
        """E5 requires a 'query:' prefix for queries."""
        return self.model.encode(f"query: {query}")

    def embed_document(self, document: str) -> np.ndarray:
        """E5 requires a 'passage:' prefix for documents."""
        return self.model.encode(f"passage: {document}")
```

### Template 4: Chunking Strategies

```python
from typing import List, Tuple

# (fixed-size and sentence-aware splitters elided in this excerpt)

def recursive_character_splitter(text: str, separators: List[str]) -> List[str]:
    # (recursive splitting logic elided in this excerpt)
    return split_text(text, separators)
```
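The recursive splitter's body is elided in this excerpt; the simplest baseline it improves on is a fixed-size sliding window with overlap. A minimal, self-contained sketch of that baseline (window sizes here are illustrative, not the template's defaults):

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(chr(65 + i % 26) for i in range(1000))
chunks = sliding_window_chunks(text, chunk_size=400, overlap=100)
# Three chunks; each consecutive pair shares its 100 boundary characters
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; the recursive splitter improves on this by preferring paragraph and sentence boundaries over arbitrary character offsets.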

### Template 5: Domain-Specific Embedding Pipeline

```python
import re
from typing import List, Optional
from dataclasses import dataclass
from langchain_voyageai import VoyageAIEmbeddings

@dataclass
class EmbeddedDocument:
    id: str
    document_id: str
    chunk_index: int
    text: str
    embedding: List[float]
    metadata: dict

class DomainEmbeddingPipeline:
    """Pipeline for domain-specific embeddings."""

    def __init__(
        self,
        embedding_model: str = "voyage-3-large",
        chunk_size: int = 512,
        chunk_overlap: int = 50,
        preprocessing_fn=None
    ):
        self.embedding_model = embedding_model
        self.embeddings = VoyageAIEmbeddings(model=embedding_model)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.preprocess = preprocessing_fn or self._default_preprocess

    def _default_preprocess(self, text: str) -> str:
        """Default preprocessing."""
        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        # Remove special characters (customize for your domain)
        text = re.sub(r'[^\w\s.,!?-]', '', text)
        return text.strip()

    async def process_documents(
        self,
        documents: List[dict],
        id_field: str = "id",
        content_field: str = "content",
        metadata_fields: Optional[List[str]] = None
    ) -> List[EmbeddedDocument]:
        """Process documents for vector storage."""
        processed = []

        for doc in documents:
            doc_id = doc[id_field]
            text = self.preprocess(doc[content_field])
            # Chunk text (splitter elided in this excerpt; see Template 4)
            chunks = self._split(text)

            # Create embeddings
            embeddings = await self.embeddings.aembed_documents(chunks)

            # Create records
            for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
                metadata = {"document_id": doc_id, "chunk_index": i}

                # Add specified metadata fields
                if metadata_fields:
                    for field in metadata_fields:
                        if field in doc:
                            metadata[field] = doc[field]

                processed.append(EmbeddedDocument(
                    id=f"{doc_id}_chunk_{i}",
                    document_id=doc_id,
                    chunk_index=i,
                    text=chunk,
                    embedding=embedding,
                    metadata=metadata
                ))

        return processed


class CodeEmbeddingPipeline:
    """Specialized pipeline for code embeddings."""

    def __init__(self):
        # Use Voyage's code-specific model
        self.embeddings = VoyageAIEmbeddings(model="voyage-code-3")

    def chunk_code(self, code: str, language: str) -> List[dict]:
        """Chunk code by functions/classes using tree-sitter."""
        try:
            import tree_sitter_languages
            parser = tree_sitter_languages.get_parser(language)
            tree = parser.parse(bytes(code, "utf8"))

            chunks = []
            # Extract function and class definitions
            self._extract_nodes(tree.root_node, code, chunks)
            return chunks
        except ImportError:
            # Fall back to simple chunking
            return [{"text": code, "type": "module"}]

    def _extract_nodes(self, node, source_code: str, chunks: list):
        """Recursively extract function/class definitions."""
        if node.type in ['function_definition', 'class_definition', 'method_definition']:
            text = source_code[node.start_byte:node.end_byte]
            chunks.append({
                "text": text,
                "type": node.type,
                "name": self._get_name(node),
                "start_line": node.start_point[0],
                "end_line": node.end_point[0]
            })
        for child in node.children:
            self._extract_nodes(child, source_code, chunks)

    def _get_name(self, node) -> str:
        """Extract the name from a function/class node."""
        for child in node.children:
            if child.type == 'identifier' or child.type == 'name':
                return child.text.decode('utf8')
        return "unknown"

    async def embed_with_context(
        self,
        chunk: str,
        context: str = ""
    ) -> List[float]:
        """Embed code with surrounding context."""
        if context:
            combined = f"Context: {context}\n\nCode:\n{chunk}"
        else:
            combined = chunk
        return await self.embeddings.aembed_query(combined)
```

### Template 6: Embedding Quality Evaluation

```python
import numpy as np
from scipy.spatial.distance import cdist
from typing import List, Dict

def evaluate_retrieval_quality(
    queries: List[str],
    relevant_docs: List[List[str]],   # relevant doc IDs per query
    retrieved_docs: List[List[str]],  # retrieved doc IDs per query
    k: int = 10
) -> Dict[str, float]:
    """Evaluate embedding quality for retrieval."""

    def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / k if k > 0 else 0

    def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
        retrieved_k = retrieved[:k]
        relevant_retrieved = len(set(retrieved_k) & relevant)
        return relevant_retrieved / len(relevant) if relevant else 0

    # (per-query aggregation elided in this excerpt)
    ...


def compute_embedding_similarity(
    embeddings1: np.ndarray,
    embeddings2: np.ndarray,
    metric: str = "cosine"
) -> np.ndarray:
    """Compute similarity matrix between embedding sets."""
    if metric == "cosine":
        # Normalize and compute dot product
        norm1 = embeddings1 / np.linalg.norm(embeddings1, axis=1, keepdims=True)
        norm2 = embeddings2 / np.linalg.norm(embeddings2, axis=1, keepdims=True)
        return norm1 @ norm2.T
    elif metric == "euclidean":
        # Negated distance so that larger means more similar
        return -cdist(embeddings1, embeddings2, metric='euclidean')
    elif metric == "dot":
        return embeddings1 @ embeddings2.T
    else:
        raise ValueError(f"Unknown metric: {metric}")


def compare_embedding_models(
    texts: List[str],
    models: Dict[str, callable],
    queries: List[str],
    relevant_indices: List[List[int]],
    k: int = 5
) -> Dict[str, Dict[str, float]]:
    """Compare multiple embedding models on retrieval quality."""
    results = {}

    for model_name, embed_fn in models.items():
        # Embed all texts
        doc_embeddings = np.array(embed_fn(texts))

        retrieved_per_query = []
        for query in queries:
            query_embedding = np.array(embed_fn([query])[0])
            # Compute similarities
            similarities = compute_embedding_similarity(
                query_embedding.reshape(1, -1),
                doc_embeddings,
                metric="cosine"
            )[0]
            # Get top-k indices
            top_k_indices = np.argsort(similarities)[::-1][:k]
            retrieved_per_query.append([str(i) for i in top_k_indices])

        # Convert relevant indices to string IDs
        relevant_docs = [[str(i) for i in indices] for indices in relevant_indices]

        results[model_name] = evaluate_retrieval_quality(
            queries, relevant_docs, retrieved_per_query, k
        )

    return results
```
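Precision@k and recall@k weight every rank in the top-k equally; mean reciprocal rank (MRR) is a common companion metric that rewards surfacing the first relevant document early. A sketch in the same doc-ID-list format used by `evaluate_retrieval_quality` above:

```python
from typing import List

def mean_reciprocal_rank(
    relevant_docs: List[List[str]],
    retrieved_docs: List[List[str]]
) -> float:
    """Average of 1/rank of the first relevant hit per query (0 when no hit)."""
    scores = []
    for relevant, retrieved in zip(relevant_docs, retrieved_docs):
        relevant_set = set(relevant)
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_set:
                rr = 1.0 / rank
                break
        scores.append(rr)
    return sum(scores) / len(scores) if scores else 0.0

# First query hits at rank 2 (0.5), second at rank 1 (1.0) -> mean 0.75
mrr = mean_reciprocal_rank([["a"], ["x"]], [["b", "a", "c"], ["x", "y"]])
```

MRR is most informative for single-answer retrieval (question answering); for multi-document tasks, pair it with recall@k.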

## Best Practices

### Do's
- **Match model to use case**: Code vs prose vs multilingual
- **Chunk thoughtfully**: Preserve semantic boundaries
- **Normalize embeddings**: For cosine similarity search
- **Batch requests**: More efficient than one-by-one
- **Cache embeddings**: Avoid recomputing for static content
- **Use Voyage AI for Claude apps**: Recommended by Anthropic

### Don'ts
- **Don't ignore token limits**: Truncation loses information
- **Don't mix embedding models**: Incompatible vector spaces
- **Don't skip preprocessing**: Garbage in, garbage out
- **Don't over-chunk**: Lose important context
- **Don't forget metadata**: Essential for filtering and debugging
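The cache-embeddings advice is cheap to implement: key the cache on the model name plus the exact input text, so switching models can never serve stale vectors. A minimal in-memory sketch (the wrapped `embed_fn` stands in for any of the templates above; swap the dict for Redis or disk storage in production):

```python
import hashlib
from typing import Callable, Dict, List

def cached_embedder(embed_fn: Callable[[str], List[float]], model: str):
    """Wrap a single-text embed function with an in-memory cache keyed on (model, text)."""
    cache: Dict[str, List[float]] = {}

    def embed(text: str) -> List[float]:
        key = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
        if key not in cache:
            cache[key] = embed_fn(text)
        return cache[key]

    embed.cache = cache  # exposed for inspection and eviction
    return embed

# Stand-in embedder that records every real call
calls = []
def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]

embed = cached_embedder(fake_embed, model="voyage-3-large")
embed("hello")
embed("hello")  # served from cache; fake_embed runs only once
```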

## Resources

- [Voyage AI Documentation](https://docs.voyageai.com/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Sentence Transformers](https://www.sbert.net/)
- [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
- [LangChain Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)

---
name: langchain-architecture
description: Design LLM applications using LangChain 1.x and LangGraph for agents, memory, and tool integration. Use when building LangChain applications, implementing AI agents, or creating complex LLM workflows.
---

# LangChain & LangGraph Architecture

Master modern LangChain 1.x and LangGraph for building sophisticated LLM applications with agents, state management, memory, and tool integration.
## When to Use This Skill

- Implementing document processing pipelines
- Building production-grade LLM applications

## Package Structure (LangChain 1.x)

```
langchain (1.2.x)       # High-level orchestration
langchain-core (1.2.x)  # Core abstractions (messages, prompts, tools)
langchain-community     # Third-party integrations
langgraph               # Agent orchestration and state management
langchain-openai        # OpenAI integrations
langchain-anthropic     # Anthropic/Claude integrations
langchain-voyageai      # Voyage AI embeddings
langchain-pinecone      # Pinecone vector store
```
## Core Concepts

### 1. LangGraph Agents
LangGraph is the standard for building agents in 2026. It provides:

**Key Features:**
- **StateGraph**: Explicit state management with typed state
- **Durable Execution**: Agents persist through failures
- **Human-in-the-Loop**: Inspect and modify state at any point
- **Memory**: Short-term and long-term memory across sessions
- **Checkpointing**: Save and resume agent state

**Agent Patterns:**
- **ReAct**: Reasoning + Acting with `create_react_agent`
- **Plan-and-Execute**: Separate planning and execution nodes
- **Multi-Agent**: Supervisor routing between specialized agents
- **Tool-Calling**: Structured tool invocation with Pydantic schemas

### 2. State Management
LangGraph uses TypedDict for explicit state:

```python
from typing import Annotated, TypedDict
from langgraph.graph import MessagesState

# Simple message-based state
class AgentState(MessagesState):
    """Extends MessagesState with custom fields."""
    context: Annotated[list, "retrieved documents"]

# Custom state for complex agents
class CustomState(TypedDict):
    messages: Annotated[list, "conversation history"]
    context: Annotated[dict, "retrieved context"]
    current_step: str
    results: list
```

### 3. Memory Systems
Modern memory implementations:

- **ConversationBufferMemory**: Stores all messages (short conversations)
- **ConversationSummaryMemory**: Summarizes older messages (long conversations)
- **ConversationTokenBufferMemory**: Token-based windowing
- **VectorStoreRetrieverMemory**: Semantic similarity retrieval
- **LangGraph Checkpointers**: Persistent state across sessions

### 4. Document Processing
Loading, transforming, and storing documents:

**Components:**
- **Document Loaders**: Load from various sources
- **Text Splitters**: Chunk documents intelligently
- **Vector Stores**: Store and retrieve embeddings
- **Retrievers**: Fetch relevant documents
- **Indexes**: Organize documents for efficient access

### 5. Callbacks & Tracing
LangSmith is the standard for observability:

- Request/response logging
- Token usage tracking
- Latency monitoring
- Error tracking
- Trace visualization
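The token-windowing idea behind `ConversationTokenBufferMemory` is framework-independent: walk the history from newest to oldest and keep only what fits a token budget. A minimal sketch (token counts approximated here by whitespace-separated words; a real implementation would use the model's tokenizer):

```python
from typing import List, Tuple

def trim_to_token_budget(
    messages: List[Tuple[str, str]],
    max_tokens: int
) -> List[Tuple[str, str]]:
    """Keep the most recent (role, content) messages whose combined size fits the budget."""
    kept: List[Tuple[str, str]] = []
    used = 0
    for role, content in reversed(messages):
        cost = len(content.split())  # crude proxy for real token counts
        if used + cost > max_tokens:
            break
        kept.append((role, content))
        used += cost
    return list(reversed(kept))

history = [("user", "one two three"), ("ai", "four five"), ("user", "six")]
trimmed = trim_to_token_budget(history, max_tokens=3)
# Keeps only the newest messages that fit: [("ai", "four five"), ("user", "six")]
```

Dropping whole messages from the oldest end (rather than truncating mid-message) keeps the remaining conversation coherent for the model.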
## Quick Start

### Modern ReAct Agent with LangGraph

```python
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
import ast
import operator

# Initialize LLM (Claude Sonnet 4.5 recommended)
llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)

# Define tools
@tool
def search_database(query: str) -> str:
    """Search internal database for information."""
    return f"Results for: {query}"

@tool
def calculate(expression: str) -> str:
    """Safely evaluate a mathematical expression.

    Supports: +, -, *, /, **, %, parentheses
    Example: '(2 + 3) * 4' returns '20'
    """
    # Safe math evaluation using ast
    allowed_operators = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Pow: operator.pow,
        ast.Mod: operator.mod,
        ast.USub: operator.neg,
    }

    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        elif isinstance(node, ast.BinOp):
            left = _eval(node.left)
            right = _eval(node.right)
            return allowed_operators[type(node.op)](left, right)
        elif isinstance(node, ast.UnaryOp):
            operand = _eval(node.operand)
            return allowed_operators[type(node.op)](operand)
        else:
            raise ValueError(f"Unsupported operation: {type(node)}")

    try:
        tree = ast.parse(expression, mode='eval')
        return str(_eval(tree.body))
    except Exception as e:
        return f"Error: {e}"

tools = [search_database, calculate]

# Create checkpointer for memory persistence
checkpointer = MemorySaver()

# Create ReAct agent
agent = create_react_agent(
    llm,
    tools,
    checkpointer=checkpointer
)

# Run agent with a thread ID for memory
config = {"configurable": {"thread_id": "user-123"}}
result = await agent.ainvoke(
    {"messages": [("user", "Search for Python tutorials and calculate 25 * 4")]},
    config=config
)
```
## Architecture Patterns

### Pattern 1: RAG with LangGraph

```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: Annotated[list[Document], "retrieved documents"]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Define nodes
async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate an answer from the retrieved context."""
    prompt = ChatPromptTemplate.from_template(
        """Answer based on the context below. If you cannot answer, say so.

Context: {context}

Question: {question}

Answer:"""
    )
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    response = await llm.ainvoke(
        prompt.format(context=context_text, question=state["question"])
    )
    return {"answer": response.content}

# Build graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use the chain
result = await rag_chain.ainvoke({"question": "What is the main topic?"})
```
### Pattern 2: Custom Agent with Structured Tools

```python
from langchain_core.tools import StructuredTool
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    """Input for database search."""
    query: str = Field(description="Search query")
    filters: dict = Field(default={}, description="Optional filters")

class EmailInput(BaseModel):
    """Input for sending email."""
    recipient: str = Field(description="Email recipient")
    subject: str = Field(description="Email subject")
    content: str = Field(description="Email body")

async def search_database(query: str, filters: dict = {}) -> str:
    """Search internal database for information."""
    # Your database search logic
    return f"Results for '{query}' with filters {filters}"

async def send_email(recipient: str, subject: str, content: str) -> str:
    """Send an email to the specified recipient."""
    # Email sending logic
    return f"Email sent to {recipient}"

tools = [
    StructuredTool.from_function(
        coroutine=search_database,
        name="search_database",
        description="Search internal database",
        args_schema=SearchInput
    ),
    StructuredTool.from_function(
        coroutine=send_email,
        name="send_email",
        description="Send an email",
        args_schema=EmailInput
    )
]

agent = create_react_agent(llm, tools)
```
### Pattern 3: Multi-Step Workflow with StateGraph

```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Literal

class WorkflowState(TypedDict):
    text: str
    entities: list
    analysis: str
    summary: str
    current_step: str

async def extract_entities(state: WorkflowState) -> WorkflowState:
    """Extract key entities from text."""
    prompt = f"Extract key entities from: {state['text']}\n\nReturn as JSON list."
    response = await llm.ainvoke(prompt)
    return {"entities": response.content, "current_step": "analyze"}

async def analyze_entities(state: WorkflowState) -> WorkflowState:
    """Analyze extracted entities."""
    prompt = f"Analyze these entities: {state['entities']}\n\nProvide insights."
    response = await llm.ainvoke(prompt)
    return {"analysis": response.content, "current_step": "summarize"}

async def generate_summary(state: WorkflowState) -> WorkflowState:
    """Generate final summary."""
    prompt = f"""Summarize:
Entities: {state['entities']}
Analysis: {state['analysis']}

Provide a concise summary."""
    response = await llm.ainvoke(prompt)
    return {"summary": response.content, "current_step": "complete"}

def route_step(state: WorkflowState) -> Literal["analyze", "summarize", "end"]:
    """Route to the next step based on current state."""
    step = state.get("current_step", "extract")
    if step == "analyze":
        return "analyze"
    elif step == "summarize":
        return "summarize"
    return "end"

# Build workflow
builder = StateGraph(WorkflowState)
builder.add_node("extract", extract_entities)
builder.add_node("analyze", analyze_entities)
builder.add_node("summarize", generate_summary)

builder.add_edge(START, "extract")
builder.add_conditional_edges("extract", route_step, {
    "analyze": "analyze",
    "summarize": "summarize",
    "end": END
})
builder.add_conditional_edges("analyze", route_step, {
    "summarize": "summarize",
|
||||
"end": END
|
||||
})
|
||||
builder.add_edge("summarize", END)
|
||||
|
||||
workflow = builder.compile()
|
||||
```
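
The conditional routing above is ultimately a small state machine. A plain-Python sketch of the same control flow (with a hypothetical `fake_llm` standing in for a real model call) shows the routing logic in isolation, testable without LangGraph:

```python
from typing import TypedDict

class WorkflowState(TypedDict, total=False):
    text: str
    entities: str
    analysis: str
    summary: str
    current_step: str

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in: a real implementation would call an LLM here.
    return f"[llm output for: {prompt[:30]}...]"

def extract(state):
    return {**state, "entities": fake_llm(state["text"]), "current_step": "analyze"}

def analyze(state):
    return {**state, "analysis": fake_llm(state["entities"]), "current_step": "summarize"}

def summarize(state):
    return {**state, "summary": fake_llm(state["analysis"]), "current_step": "complete"}

# Node table and edge table, mirroring add_node / add_conditional_edges
STEPS = {"extract": extract, "analyze": analyze, "summarize": summarize}
ORDER = {"extract": "analyze", "analyze": "summarize", "summarize": "end"}

def run(state: WorkflowState) -> WorkflowState:
    step = "extract"
    while step != "end":
        state = STEPS[step](state)
        step = ORDER[step]
    return state

final = run({"text": "Acme Corp hired Jane Doe in 2024."})
assert final["current_step"] == "complete"
```

The compiled LangGraph workflow does the same walk, but adds persistence, streaming, and interruption on top.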

### Pattern 4: Multi-Agent Orchestration

```python
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import create_react_agent
from typing import Literal, TypedDict

class MultiAgentState(TypedDict):
    messages: list
    next_agent: str

# Create specialized agents
researcher = create_react_agent(llm, research_tools)
writer = create_react_agent(llm, writing_tools)
reviewer = create_react_agent(llm, review_tools)

async def supervisor(state: MultiAgentState) -> MultiAgentState:
    """Route to the appropriate agent based on the task."""
    prompt = f"""Based on the conversation, which agent should handle this?

Options:
- researcher: For finding information
- writer: For creating content
- reviewer: For reviewing and editing
- FINISH: Task is complete

Messages: {state['messages']}

Respond with just the agent name."""

    response = await llm.ainvoke(prompt)
    return {"next_agent": response.content.strip().lower()}

def route_to_agent(state: MultiAgentState) -> Literal["researcher", "writer", "reviewer", "end"]:
    """Route based on supervisor decision."""
    next_agent = state.get("next_agent", "").lower()
    if next_agent == "finish":
        return "end"
    return next_agent if next_agent in ["researcher", "writer", "reviewer"] else "end"

# Build multi-agent graph
builder = StateGraph(MultiAgentState)
builder.add_node("supervisor", supervisor)
builder.add_node("researcher", researcher)
builder.add_node("writer", writer)
builder.add_node("reviewer", reviewer)

builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route_to_agent, {
    "researcher": "researcher",
    "writer": "writer",
    "reviewer": "reviewer",
    "end": END
})

# Each agent returns to the supervisor
for agent in ["researcher", "writer", "reviewer"]:
    builder.add_edge(agent, "supervisor")

multi_agent = builder.compile()
```

## Memory Management

### Conversation Memory with LangGraph Checkpointers

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# In-memory checkpointer (development)
checkpointer = MemorySaver()

# Create agent with persistent memory
agent = create_react_agent(llm, tools, checkpointer=checkpointer)

# Each thread_id maintains a separate conversation
config = {"configurable": {"thread_id": "session-abc123"}}

# Messages persist across invocations with the same thread_id
result1 = await agent.ainvoke({"messages": [("user", "My name is Alice")]}, config)
result2 = await agent.ainvoke({"messages": [("user", "What's my name?")]}, config)
# Agent remembers: "Your name is Alice"
```
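
Thread histories grow without bound, so long-running sessions eventually press against the context window. One option is trimming before each invocation; a minimal sketch, assuming messages are simple `(role, text)` tuples rather than LangChain message objects:

```python
# Cap history to the most recent turns, keeping a leading system message if present.
# Production code would operate on LangChain message objects instead of tuples.

def trim_history(messages: list[tuple[str, str]], max_messages: int = 20) -> list:
    """Keep the first system message (if any) plus the most recent turns."""
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages[:1] if m[0] == "system"]
    recent = messages[-(max_messages - len(system)):]
    return system + recent

history = [("system", "Be concise")] + [("user", f"msg {i}") for i in range(50)]
trimmed = trim_history(history, max_messages=10)
assert len(trimmed) == 10
assert trimmed[0] == ("system", "Be concise")
```

For conversations that must retain older information, summarize trimmed turns into a single message instead of dropping them.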

### Production Memory with PostgreSQL

```python
from langgraph.checkpoint.postgres import PostgresSaver

# Production checkpointer
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/langgraph"
)

agent = create_react_agent(llm, tools, checkpointer=checkpointer)
```

### Vector Store Memory for Long-Term Context

```python
from langchain_community.vectorstores import Chroma
from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(model="voyage-3-large")
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
    persist_directory="./memory_db"
)

async def retrieve_relevant_memory(query: str, k: int = 5) -> list:
    """Retrieve relevant past conversations."""
    docs = await memory_store.asimilarity_search(query, k=k)
    return [doc.page_content for doc in docs]

async def store_memory(content: str, metadata: dict | None = None):
    """Store conversation in long-term memory."""
    await memory_store.aadd_texts([content], metadatas=[metadata or {}])
```

## Callback System & LangSmith

### LangSmith Tracing

```python
import os
from langchain_anthropic import ChatAnthropic

# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# All LangChain/LangGraph operations are automatically traced
llm = ChatAnthropic(model="claude-sonnet-4-5")
```

### Custom Callback Handler

```python
from langchain_core.callbacks import BaseCallbackHandler
from typing import Any, Dict, List

class CustomCallbackHandler(BaseCallbackHandler):
    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs
    ) -> None:
        print(f"LLM started with {len(prompts)} prompts")

    def on_llm_end(self, response, **kwargs) -> None:
        print(f"LLM completed: {len(response.generations)} generations")

    def on_llm_error(self, error: Exception, **kwargs) -> None:
        print(f"LLM error: {error}")

    def on_tool_start(
        self, serialized: Dict[str, Any], input_str: str, **kwargs
    ) -> None:
        print(f"Tool started: {serialized.get('name')}")

    def on_tool_end(self, output: str, **kwargs) -> None:
        print(f"Tool completed: {output[:100]}...")

# Use callbacks
result = await agent.ainvoke(
    {"messages": [("user", "query")]},
    config={"callbacks": [CustomCallbackHandler()]}
)
```

## Streaming Responses

```python
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-5", streaming=True)

# Stream tokens
async for chunk in llm.astream("Tell me a story"):
    print(chunk.content, end="", flush=True)

# Stream agent events
async for event in agent.astream_events(
    {"messages": [("user", "Search and summarize")]},
    version="v2"
):
    if event["event"] == "on_chat_model_stream":
        print(event["data"]["chunk"].content, end="")
    elif event["event"] == "on_tool_start":
        print(f"\n[Using tool: {event['name']}]")
```

## Testing Strategies

```python
import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_agent_tool_selection():
    """Test agent selects correct tool."""
    with patch.object(llm, 'ainvoke') as mock_llm:
        mock_llm.return_value = AsyncMock(content="Using search_database")

        result = await agent.ainvoke({
            "messages": [("user", "search for documents")]
        })

        # Verify the tool was called
        assert "search_database" in str(result)

@pytest.mark.asyncio
async def test_memory_persistence():
    """Test memory persists across invocations."""
    config = {"configurable": {"thread_id": "test-thread"}}

    # First message
    await agent.ainvoke(
        {"messages": [("user", "Remember: the code is 12345")]},
        config
    )

    # Second message should remember
    result = await agent.ainvoke(
        {"messages": [("user", "What was the code?")]},
        config
    )

    assert "12345" in result["messages"][-1].content
```

## Performance Optimization

### 1. Caching with Redis

```python
from langchain_community.cache import RedisCache
from langchain_core.globals import set_llm_cache
import redis

redis_client = redis.Redis.from_url("redis://localhost:6379")
set_llm_cache(RedisCache(redis_client))
```
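
Under the hood, an LLM cache is exact-match memoization on the prompt (plus model parameters). A minimal in-process sketch of the same idea, with a hypothetical `fake_llm_call` standing in for a billable request:

```python
# Exact-match prompt cache: a repeated prompt never hits the model twice.
calls = {"count": 0}

def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real (slow, billable) model call.
    calls["count"] += 1
    return f"response to: {prompt}"

cache: dict[str, str] = {}

def cached_llm(prompt: str) -> str:
    if prompt not in cache:
        cache[prompt] = fake_llm_call(prompt)
    return cache[prompt]

cached_llm("hello")
cached_llm("hello")  # served from cache; no second model call
assert calls["count"] == 1
```

Because matching is exact, any variation in whitespace or parameters misses the cache; for fuzzy reuse, look at semantic caches keyed on embeddings instead.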

### 2. Async Batch Processing

```python
import asyncio
from langchain_core.documents import Document

async def process_documents(documents: list[Document]) -> list:
    """Process documents in parallel."""
    tasks = [process_single(doc) for doc in documents]
    return await asyncio.gather(*tasks)

async def process_single(doc: Document) -> dict:
    """Process a single document."""
    chunks = text_splitter.split_documents([doc])
    embeddings = await embeddings_model.aembed_documents(
        [c.page_content for c in chunks]
    )
    return {"doc_id": doc.metadata.get("id"), "embeddings": embeddings}
```

### 3. Connection Pooling

```python
import os

from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

# Reuse the Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")

# Create vector store with the existing index
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```

## Resources

- **references/agents.md**: Deep dive on agent architectures
- **references/memory.md**: Memory system patterns
- **references/chains.md**: Chain composition strategies
- **references/document-processing.md**: Document loading and indexing
- **references/callbacks.md**: Monitoring and observability
- **assets/agent-template.py**: Production-ready agent template
- **assets/memory-config.yaml**: Memory configuration examples
- **assets/chain-example.py**: Complex chain examples
- [LangChain Documentation](https://python.langchain.com/docs/)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [LangSmith Platform](https://smith.langchain.com/)
- [LangChain GitHub](https://github.com/langchain-ai/langchain)
- [LangGraph GitHub](https://github.com/langchain-ai/langgraph)

## Common Pitfalls

1. **Using Deprecated APIs**: Use LangGraph for agents, not `initialize_agent`
2. **Memory Overflow**: Use checkpointers with TTL for long-running agents
3. **Poor Tool Descriptions**: Clear descriptions help the LLM select the correct tools
4. **Context Window Exceeded**: Use summarization or sliding-window memory
5. **No Error Handling**: Wrap tool functions with try/except
6. **Blocking Operations**: Use async methods (`ainvoke`, `astream`)
7. **Missing Observability**: Always enable LangSmith tracing in production
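
For pitfall 5, one common approach is to wrap each tool so that failures come back to the model as text it can react to, rather than as exceptions that kill the run. A minimal sketch with a hypothetical `lookup_order` tool:

```python
# Wrap tools so errors surface as model-readable strings.
# `lookup_order` is a hypothetical example tool.
def lookup_order(order_id: str) -> str:
    orders = {"A-100": "shipped"}
    return orders[order_id]  # raises KeyError for unknown ids

def safe_tool(fn):
    """Wrap a tool so failures come back as strings instead of exceptions."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return f"Tool error ({fn.__name__}): {exc}"
    return wrapper

safe_lookup = safe_tool(lookup_order)
assert safe_lookup("A-100") == "shipped"
assert safe_lookup("B-999").startswith("Tool error")
```

The model can then read the error message, retry with a corrected argument, or choose a different tool.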

## Production Checklist

- [ ] Use LangGraph StateGraph for agent orchestration
- [ ] Implement async patterns throughout (`ainvoke`, `astream`)
- [ ] Add a production checkpointer (PostgreSQL, Redis)
- [ ] Enable LangSmith tracing
- [ ] Implement structured tools with Pydantic schemas
- [ ] Add timeout limits for agent execution
- [ ] Implement rate limiting
- [ ] Add input validation
- [ ] Monitor token usage and costs
- [ ] Test with edge cases
- [ ] Implement fallback strategies
- [ ] Add comprehensive error handling
- [ ] Set up health checks
- [ ] Version control prompts and configurations
- [ ] Write integration tests for agent workflows

Use stronger LLMs to evaluate weaker model outputs.

## Quick Start

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = await suite.evaluate(model=your_model, test_cases=test_cases)
```
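
`Metric.accuracy()` above references `calculate_accuracy`, which is not defined in this document. A minimal exact-match version, matching the keyword arguments that `evaluate()` passes, might look like this (adapt the comparison to your task):

```python
# Hypothetical exact-match accuracy metric; real tasks often need
# normalization, numeric tolerance, or semantic comparison instead.
def calculate_accuracy(prediction: str, reference: str = None, **kwargs) -> float:
    """1.0 if prediction matches the reference (case/whitespace-insensitive)."""
    if reference is None:
        return 0.0
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

assert calculate_accuracy("Paris", "paris") == 1.0
assert calculate_accuracy("Lyon", "Paris") == 0.0
```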

## Automated Metrics Implementation

### BLEU Score

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference: str, hypothesis: str, **kwargs) -> float:
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```

### ROUGE Score

```python
from rouge_score import rouge_scorer

def calculate_rouge(reference: str, hypothesis: str, **kwargs) -> dict:
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, hypothesis)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
```

### BERTScore

```python
from bert_score import score

def calculate_bertscore(
    references: list[str],
    hypotheses: list[str],
    **kwargs
) -> dict:
    """Calculate BERTScore using a pre-trained model."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang="en"
    )

    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
```

### Custom Metrics

```python
def calculate_groundedness(response: str, context: str, **kwargs) -> float:
    """Check if response is grounded in the provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline

    nli = pipeline(
        "text-classification",
        model="microsoft/deberta-large-mnli"
    )

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that the response is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text: str, **kwargs) -> float:
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim: str, sources: list[str], **kwargs) -> float:
    """Verify factual claims against sources."""
    from transformers import pipeline

    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    scores = []
    for source in sources:
        result = nli(f"{source}</s></s>{claim}")[0]
        if result['label'] == 'entailment':
            scores.append(result['score'])

    return max(scores) if scores else 0.0
```
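
When an NLI model is too heavy for a quick sanity check, a crude lexical-overlap baseline can approximate groundedness. This is a sketch for smoke tests only, not a substitute for entailment-based scoring:

```python
# Crude lexical baseline: fraction of response tokens that appear in the context.
# Ignores paraphrase and negation, so treat it as a lower-bound sanity check.
def lexical_groundedness(response: str, context: str) -> float:
    resp = set(response.lower().split())
    ctx = set(context.lower().split())
    if not resp:
        return 0.0
    return len(resp & ctx) / len(resp)

score = lexical_groundedness("Paris is the capital", "Paris is the capital of France")
assert score == 1.0
```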

## LLM-as-Judge Patterns

### Single Output Evaluation

```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

class QualityRating(BaseModel):
    accuracy: int = Field(ge=1, le=10, description="Factual correctness")
    helpfulness: int = Field(ge=1, le=10, description="Answers the question")
    clarity: int = Field(ge=1, le=10, description="Well-written and understandable")
    reasoning: str = Field(description="Brief explanation")

async def llm_judge_quality(
    response: str,
    question: str,
    context: str | None = None
) -> QualityRating:
    """Use Claude to judge response quality."""
    client = Anthropic()

    system = """You are an expert evaluator of AI responses.
Rate responses on accuracy, helpfulness, and clarity (1-10 scale).
Provide brief reasoning for your ratings."""

    prompt = f"""Rate the following response:

Question: {question}
{f'Context: {context}' if context else ''}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )

    return QualityRating(**json.loads(message.content[0].text))
```

### Pairwise Comparison

```python
from pydantic import BaseModel, Field
from typing import Literal

class ComparisonResult(BaseModel):
    winner: Literal["A", "B", "tie"]
    reasoning: str
    confidence: int = Field(ge=1, le=10)

async def compare_responses(
    question: str,
    response_a: str,
    response_b: str
) -> ComparisonResult:
    """Compare two responses using an LLM judge."""
    client = Anthropic()

    prompt = f"""Compare these two responses and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ComparisonResult(**json.loads(message.content[0].text))
```

### Reference-Based Evaluation

```python
class ReferenceEvaluation(BaseModel):
    semantic_similarity: float = Field(ge=0, le=1)
    factual_accuracy: float = Field(ge=0, le=1)
    completeness: float = Field(ge=0, le=1)
    issues: list[str]

async def evaluate_against_reference(
    response: str,
    reference: str,
    question: str
) -> ReferenceEvaluation:
    """Evaluate response against a gold-standard reference."""
    client = Anthropic()

    prompt = f"""Compare the response to the reference answer.

Question: {question}
Reference Answer: {reference}
Response to Evaluate: {response}

Evaluate:
1. Semantic similarity (0-1): How similar is the meaning?
2. Factual accuracy (0-1): Are all facts correct?
3. Completeness (0-1): Does it cover all key points?
4. List any specific issues or errors.

Respond in JSON:
{{
    "semantic_similarity": <0-1>,
    "factual_accuracy": <0-1>,
    "completeness": <0-1>,
    "issues": ["issue1", "issue2"]
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ReferenceEvaluation(**json.loads(message.content[0].text))
```

## Human Evaluation Frameworks

### Annotation Guidelines

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationTask:
    """Structure for a human annotation task."""
    response: str
    question: str
    context: Optional[str] = None

    def get_annotation_form(self) -> dict:
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
        }
```

### Inter-Rater Agreement

```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(
    rater1_scores: list[int],
    rater2_scores: list[int]
) -> dict:
    """Calculate inter-rater agreement."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```
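
For intuition, the quantity `cohen_kappa_score` computes is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater's label marginals. A from-scratch sketch:

```python
# Cohen's kappa from label counts: kappa = (p_o - p_e) / (1 - p_e).
# Assumes p_e < 1, i.e. both raters use more than one label between them.
from collections import Counter

def cohens_kappa(r1: list[int], r2: list[int]) -> float:
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    labels = set(r1) | set(r2)
    p_e = sum(c1[l] * c2[l] for l in labels) / (n * n)     # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement gives kappa = 1.0
assert cohens_kappa([1, 2, 3, 1], [1, 2, 3, 1]) == 1.0
```

The correction for chance is why kappa is preferred over raw percent agreement when label distributions are skewed.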

@@ -314,23 +447,26 @@ def calculate_agreement(rater1_scores, rater2_scores):
```python
from scipy import stats
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ABTest:
    def __init__(self, variant_a_name="A", variant_b_name="B"):
        self.variant_a = {"name": variant_a_name, "scores": []}
        self.variant_b = {"name": variant_b_name, "scores": []}
    variant_a_name: str = "A"
    variant_b_name: str = "B"
    variant_a_scores: list[float] = field(default_factory=list)
    variant_b_scores: list[float] = field(default_factory=list)

    def add_result(self, variant, score):
    def add_result(self, variant: str, score: float):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a["scores"].append(score)
            self.variant_a_scores.append(score)
        else:
            self.variant_b["scores"].append(score)
            self.variant_b_scores.append(score)

    def analyze(self, alpha=0.05):
    def analyze(self, alpha: float = 0.05) -> dict:
        """Perform statistical analysis."""
        a_scores = self.variant_a["scores"]
        b_scores = self.variant_b["scores"]
        a_scores = np.array(self.variant_a_scores)
        b_scores = np.array(self.variant_b_scores)

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)
@@ -347,12 +483,12 @@ class ABTest:
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self.interpret_cohens_d(cohens_d),
            "winner": "B" if np.mean(b_scores) > np.mean(a_scores) else "A"
            "effect_size": self._interpret_cohens_d(cohens_d),
            "winner": self.variant_b_name if np.mean(b_scores) > np.mean(a_scores) else self.variant_a_name
        }

    @staticmethod
    def interpret_cohens_d(d):
    def _interpret_cohens_d(d: float) -> str:
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
@@ -369,12 +505,22 @@ class ABTest:

### Regression Detection
```python
from dataclasses import dataclass

@dataclass
class RegressionResult:
    metric: str
    baseline: float
    current: float
    change: float
    is_regression: bool

class RegressionDetector:
    def __init__(self, baseline_results, threshold=0.05):
    def __init__(self, baseline_results: dict, threshold: float = 0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results):
    def check_for_regression(self, new_results: dict) -> dict:
        """Detect if new results show regression."""
        regressions = []

@@ -389,39 +535,97 @@ class RegressionDetector:
            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            if relative_change < -self.threshold:
                regressions.append({
                    "metric": metric,
                    "baseline": baseline_score,
                    "current": new_score,
                    "change": relative_change
                })
            is_regression = relative_change < -self.threshold
            if is_regression:
                regressions.append(RegressionResult(
                    metric=metric,
                    baseline=baseline_score,
                    current=new_score,
                    change=relative_change,
                    is_regression=True
                ))

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions
            "regressions": regressions,
            "summary": f"{len(regressions)} metric(s) regressed"
        }
```
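The core of the detector is the relative-change threshold rule; it can be sketched stand-alone in a few lines (metric names and scores below are hypothetical):

```python
def find_regressions(baseline: dict, current: dict, threshold: float = 0.05) -> list[str]:
    """Return metric names whose relative change dropped by more than `threshold`."""
    regressed = []
    for metric, base in baseline.items():
        if metric not in current:
            continue  # metric not measured in the new run
        relative_change = (current[metric] - base) / base
        if relative_change < -threshold:
            regressed.append(metric)
    return regressed

baseline = {"accuracy": 0.90, "f1": 0.80, "latency_score": 0.70}
current = {"accuracy": 0.91, "f1": 0.74, "latency_score": 0.69}
print(find_regressions(baseline, current))
```

With a 5% threshold, only `f1` is flagged (a 7.5% relative drop); `latency_score` fell about 1.4%, which stays within tolerance.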

## LangSmith Evaluation Integration

```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Initialize LangSmith client
client = Client()

# Create dataset
dataset = client.create_dataset("qa_test_cases")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in expected_answers],
    dataset_id=dataset.id
)

# Define evaluators
evaluators = [
    LangChainStringEvaluator("qa"),  # QA correctness
    LangChainStringEvaluator("context_qa"),  # Context-grounded QA
    LangChainStringEvaluator("cot_qa"),  # Chain-of-thought QA
]

# Run evaluation
async def target_function(inputs: dict) -> dict:
    result = await your_chain.ainvoke(inputs)
    return {"answer": result}

experiment_results = await evaluate(
    target_function,
    data=dataset.name,
    evaluators=evaluators,
    experiment_prefix="v1.0.0",
    metadata={"model": "claude-sonnet-4-5", "version": "1.0.0"}
)

print(f"Mean score: {experiment_results.aggregate_metrics['qa']['mean']}")
```

## Benchmarking

### Running Benchmarks
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BenchmarkResult:
    metric: str
    mean: float
    std: float
    min: float
    max: float

class BenchmarkRunner:
    def __init__(self, benchmark_dataset):
    def __init__(self, benchmark_dataset: list[dict]):
        self.dataset = benchmark_dataset

    def run_benchmark(self, model, metrics):
    async def run_benchmark(
        self,
        model,
        metrics: list[Metric]
    ) -> dict[str, BenchmarkResult]:
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = model.predict(example["input"])
            prediction = await model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.calculate(
                score = metric.fn(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
@@ -430,26 +634,24 @@ class BenchmarkRunner:

        # Aggregate results
        return {
            metric: {
                "mean": np.mean(scores),
                "std": np.std(scores),
                "min": min(scores),
                "max": max(scores)
            }
            metric: BenchmarkResult(
                metric=metric,
                mean=np.mean(scores),
                std=np.std(scores),
                min=min(scores),
                max=max(scores)
            )
            for metric, scores in results.items()
        }
```
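The aggregation step can also be mirrored with the stdlib `statistics` module when NumPy isn't available (illustrative scores; `pstdev` is the population standard deviation, matching `np.std`'s default):

```python
from statistics import mean, pstdev

def aggregate(scores: list[float]) -> dict:
    """Summary statistics matching the fields of a per-metric benchmark result."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std, like np.std with ddof=0
        "min": min(scores),
        "max": max(scores),
    }

summary = aggregate([0.8, 0.9, 1.0])
print(summary)
```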

## Resources

- **references/metrics.md**: Comprehensive metric guide
- **references/human-evaluation.md**: Annotation best practices
- **references/benchmarking.md**: Standard benchmarks
- **references/a-b-testing.md**: Statistical testing guide
- **references/regression-testing.md**: CI/CD integration
- **assets/evaluation-framework.py**: Complete evaluation harness
- **assets/benchmark-dataset.jsonl**: Example datasets
- **scripts/evaluate-model.py**: Automated evaluation runner
- [LangSmith Evaluation Guide](https://docs.smith.langchain.com/evaluation)
- [RAGAS Framework](https://docs.ragas.io/)
- [DeepEval Library](https://docs.deepeval.com/)
- [Arize Phoenix](https://docs.arize.com/phoenix/)
- [HELM Benchmark](https://crfm.stanford.edu/helm/)

## Best Practices

@@ -469,3 +671,5 @@ class BenchmarkRunner:
- **Data Contamination**: Testing on training data
- **Ignoring Variance**: Not accounting for statistical uncertainty
- **Metric Mismatch**: Using metrics not aligned with business goals
- **Position Bias**: In pairwise evals, randomize order
- **Overfitting Prompts**: Optimizing for test set instead of real use

@@ -16,6 +16,7 @@ Master advanced prompt engineering techniques to maximize LLM performance, relia
- Creating reusable prompt templates with variable interpolation
- Debugging and refining prompts that produce inconsistent outputs
- Implementing system prompts for specialized AI assistants
- Using structured outputs (JSON mode) for reliable parsing

## Core Capabilities

@@ -33,21 +34,27 @@ Master advanced prompt engineering techniques to maximize LLM performance, relia
- Self-consistency techniques (sampling multiple reasoning paths)
- Verification and validation steps

### 3. Prompt Optimization
### 3. Structured Outputs
- JSON mode for reliable parsing
- Pydantic schema enforcement
- Type-safe response handling
- Error handling for malformed outputs

### 4. Prompt Optimization
- Iterative refinement workflows
- A/B testing prompt variations
- Measuring prompt performance metrics (accuracy, consistency, latency)
- Reducing token usage while maintaining quality
- Handling edge cases and failure modes

### 4. Template Systems
### 5. Template Systems
- Variable interpolation and formatting
- Conditional prompt sections
- Multi-turn conversation templates
- Role-based prompt composition
- Modular prompt components

### 5. System Prompt Design
### 6. System Prompt Design
- Setting model behavior and constraints
- Defining output formats and structure
- Establishing role and expertise
@@ -57,68 +64,385 @@ Master advanced prompt engineering techniques to maximize LLM performance, relia
## Quick Start

```python
from prompt_optimizer import PromptTemplate, FewShotSelector
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Define a structured prompt template
template = PromptTemplate(
    system="You are an expert SQL developer. Generate efficient, secure SQL queries.",
    instruction="Convert the following natural language query to SQL:\n{query}",
    few_shot_examples=True,
    output_format="SQL code block with explanatory comments"
)
# Define structured output schema
class SQLQuery(BaseModel):
    query: str = Field(description="The SQL query")
    explanation: str = Field(description="Brief explanation of what the query does")
    tables_used: list[str] = Field(description="List of tables referenced")

# Configure few-shot learning
selector = FewShotSelector(
    examples_db="sql_examples.jsonl",
    selection_strategy="semantic_similarity",
    max_examples=3
)
# Initialize model with structured output
llm = ChatAnthropic(model="claude-sonnet-4-5")
structured_llm = llm.with_structured_output(SQLQuery)

# Generate optimized prompt
prompt = template.render(
    query="Find all users who registered in the last 30 days",
    examples=selector.select(query="user registration date filter")
)
# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert SQL developer. Generate efficient, secure SQL queries.
Always use parameterized queries to prevent SQL injection.
Explain your reasoning briefly."""),
    ("user", "Convert this to SQL: {query}")
])

# Create chain
chain = prompt | structured_llm

# Use
result = await chain.ainvoke({
    "query": "Find all users who registered in the last 30 days"
})
print(result.query)
print(result.explanation)
```

## Key Patterns

### Progressive Disclosure
### Pattern 1: Structured Output with Pydantic

```python
from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field
from typing import Literal
import json

class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0, le=1)
    key_phrases: list[str]
    reasoning: str

async def analyze_sentiment(text: str) -> SentimentAnalysis:
    """Analyze sentiment with structured output."""
    client = AsyncAnthropic()

    message = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Analyze the sentiment of this text.

Text: {text}

Respond with JSON matching this schema:
{{
  "sentiment": "positive" | "negative" | "neutral",
  "confidence": 0.0-1.0,
  "key_phrases": ["phrase1", "phrase2"],
  "reasoning": "brief explanation"
}}"""
        }]
    )

    return SentimentAnalysis(**json.loads(message.content[0].text))
```

### Pattern 2: Chain-of-Thought with Self-Verification

```python
from langchain_core.prompts import ChatPromptTemplate

cot_prompt = ChatPromptTemplate.from_template("""
Solve this problem step by step.

Problem: {problem}

Instructions:
1. Break down the problem into clear steps
2. Work through each step showing your reasoning
3. State your final answer
4. Verify your answer by checking it against the original problem

Format your response as:
## Steps
[Your step-by-step reasoning]

## Answer
[Your final answer]

## Verification
[Check that your answer is correct]
""")
```

### Pattern 3: Few-Shot with Dynamic Example Selection

```python
from langchain_voyageai import VoyageAIEmbeddings
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_chroma import Chroma

# Create example selector with semantic similarity
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples=[
        {"input": "How do I reset my password?", "output": "Go to Settings > Security > Reset Password"},
        {"input": "Where can I see my order history?", "output": "Navigate to Account > Orders"},
        {"input": "How do I contact support?", "output": "Click Help > Contact Us or email support@example.com"},
    ],
    embeddings=VoyageAIEmbeddings(model="voyage-3-large"),
    vectorstore_cls=Chroma,
    k=2  # Select 2 most similar examples
)

async def get_few_shot_prompt(query: str) -> str:
    """Build prompt with dynamically selected examples."""
    examples = await example_selector.aselect_examples({"input": query})

    examples_text = "\n".join(
        f"User: {ex['input']}\nAssistant: {ex['output']}"
        for ex in examples
    )

    return f"""You are a helpful customer support assistant.

Here are some example interactions:
{examples_text}

Now respond to this query:
User: {query}
Assistant:"""
```

### Pattern 4: Progressive Disclosure

Start with simple prompts, add complexity only when needed:

1. **Level 1**: Direct instruction
   - "Summarize this article"
```python
PROMPT_LEVELS = {
    # Level 1: Direct instruction
    "simple": "Summarize this article: {text}",

2. **Level 2**: Add constraints
   - "Summarize this article in 3 bullet points, focusing on key findings"
    # Level 2: Add constraints
    "constrained": """Summarize this article in 3 bullet points, focusing on:
- Key findings
- Main conclusions
- Practical implications

3. **Level 3**: Add reasoning
   - "Read this article, identify the main findings, then summarize in 3 bullet points"
Article: {text}""",

4. **Level 4**: Add examples
   - Include 2-3 example summaries with input-output pairs
    # Level 3: Add reasoning
    "reasoning": """Read this article carefully.
1. First, identify the main topic and thesis
2. Then, extract the key supporting points
3. Finally, summarize in 3 bullet points

### Instruction Hierarchy
```
[System Context] → [Task Instruction] → [Examples] → [Input Data] → [Output Format]
Article: {text}

Summary:""",

    # Level 4: Add examples
    "few_shot": """Read articles and provide concise summaries.

Example:
Article: "New research shows that regular exercise can reduce anxiety by up to 40%..."
Summary:
• Regular exercise reduces anxiety by up to 40%
• 30 minutes of moderate activity 3x/week is sufficient
• Benefits appear within 2 weeks of starting

Now summarize this article:
Article: {text}

Summary:"""
}
```

### Error Recovery
Build prompts that gracefully handle failures:
- Include fallback instructions
- Request confidence scores
- Ask for alternative interpretations when uncertain
- Specify how to indicate missing information
### Pattern 5: Error Recovery and Fallback

```python
from pydantic import BaseModel, ValidationError
import json

class ResponseWithConfidence(BaseModel):
    answer: str
    confidence: float
    sources: list[str]
    alternative_interpretations: list[str] = []

ERROR_RECOVERY_PROMPT = """
Answer the question based on the context provided.

Context: {context}
Question: {question}

Instructions:
1. If you can answer confidently (>0.8), provide a direct answer
2. If you're somewhat confident (0.5-0.8), provide your best answer with caveats
3. If you're uncertain (<0.5), explain what information is missing
4. Always provide alternative interpretations if the question is ambiguous

Respond in JSON:
{{
  "answer": "your answer or 'I cannot determine this from the context'",
  "confidence": 0.0-1.0,
  "sources": ["relevant context excerpts"],
  "alternative_interpretations": ["if question is ambiguous"]
}}
"""

async def answer_with_fallback(
    context: str,
    question: str,
    llm
) -> ResponseWithConfidence:
    """Answer with error recovery and fallback."""
    prompt = ERROR_RECOVERY_PROMPT.format(context=context, question=question)

    try:
        response = await llm.ainvoke(prompt)
        return ResponseWithConfidence(**json.loads(response.content))
    except (json.JSONDecodeError, ValidationError):
        # Fallback: try to extract answer without structure
        simple_prompt = f"Based on: {context}\n\nAnswer: {question}"
        simple_response = await llm.ainvoke(simple_prompt)
        return ResponseWithConfidence(
            answer=simple_response.content,
            confidence=0.5,
            sources=["fallback extraction"],
            alternative_interpretations=[]
        )
```

### Pattern 6: Role-Based System Prompts

```python
SYSTEM_PROMPTS = {
    "analyst": """You are a senior data analyst with expertise in SQL, Python, and business intelligence.

Your responsibilities:
- Write efficient, well-documented queries
- Explain your analysis methodology
- Highlight key insights and recommendations
- Flag any data quality concerns

Communication style:
- Be precise and technical when discussing methodology
- Translate technical findings into business impact
- Use clear visualizations when helpful""",

    "assistant": """You are a helpful AI assistant focused on accuracy and clarity.

Core principles:
- Always cite sources when making factual claims
- Acknowledge uncertainty rather than guessing
- Ask clarifying questions when the request is ambiguous
- Provide step-by-step explanations for complex topics

Constraints:
- Do not provide medical, legal, or financial advice
- Redirect harmful requests appropriately
- Protect user privacy""",

    "code_reviewer": """You are a senior software engineer conducting code reviews.

Review criteria:
- Correctness: Does the code work as intended?
- Security: Are there any vulnerabilities?
- Performance: Are there efficiency concerns?
- Maintainability: Is the code readable and well-structured?
- Best practices: Does it follow language idioms?

Output format:
1. Summary assessment (approve/request changes)
2. Critical issues (must fix)
3. Suggestions (nice to have)
4. Positive feedback (what's done well)"""
}
```

## Integration Patterns

### With RAG Systems

```python
RAG_PROMPT = """You are a knowledgeable assistant that answers questions based on provided context.

Context (retrieved from knowledge base):
{context}

Instructions:
1. Answer ONLY based on the provided context
2. If the context doesn't contain the answer, say "I don't have information about that in my knowledge base"
3. Cite specific passages using [1], [2] notation
4. If the question is ambiguous, ask for clarification

Question: {question}

Answer:"""
```

### With Validation and Verification

```python
VALIDATED_PROMPT = """Complete the following task:

Task: {task}

After generating your response, verify it meets ALL these criteria:
✓ Directly addresses the original request
✓ Contains no factual errors
✓ Is appropriately detailed (not too brief, not too verbose)
✓ Uses proper formatting
✓ Is safe and appropriate

If verification fails on any criterion, revise before responding.

Response:"""
```

## Performance Optimization

### Token Efficiency
```python
# Before: Verbose prompt (150+ tokens)
verbose_prompt = """
I would like you to please take the following text and provide me with a comprehensive
summary of the main points. The summary should capture the key ideas and important details
while being concise and easy to understand.
"""

# After: Concise prompt (30 tokens)
concise_prompt = """Summarize the key points concisely:

{text}

Summary:"""
```

### Caching Common Prefixes

```python
from anthropic import Anthropic

client = Anthropic()

# Use prompt caching for repeated system prompts
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
```

## Best Practices

1. **Be Specific**: Vague prompts produce inconsistent results
2. **Show, Don't Tell**: Examples are more effective than descriptions
3. **Test Extensively**: Evaluate on diverse, representative inputs
4. **Iterate Rapidly**: Small changes can have large impacts
5. **Monitor Performance**: Track metrics in production
6. **Version Control**: Treat prompts as code with proper versioning
7. **Document Intent**: Explain why prompts are structured as they are
3. **Use Structured Outputs**: Enforce schemas with Pydantic for reliability
4. **Test Extensively**: Evaluate on diverse, representative inputs
5. **Iterate Rapidly**: Small changes can have large impacts
6. **Monitor Performance**: Track metrics in production
7. **Version Control**: Treat prompts as code with proper versioning
8. **Document Intent**: Explain why prompts are structured as they are

## Common Pitfalls

@@ -127,60 +451,8 @@ Build prompts that gracefully handle failures:
- **Context overflow**: Exceeding token limits with excessive examples
- **Ambiguous instructions**: Leaving room for multiple interpretations
- **Ignoring edge cases**: Not testing on unusual or boundary inputs

## Integration Patterns

### With RAG Systems
```python
# Combine retrieved context with prompt engineering
prompt = f"""Given the following context:
{retrieved_context}

{few_shot_examples}

Question: {user_question}

Provide a detailed answer based solely on the context above. If the context doesn't contain enough information, explicitly state what's missing."""
```

### With Validation
```python
# Add self-verification step
prompt = f"""{main_task_prompt}

After generating your response, verify it meets these criteria:
1. Answers the question directly
2. Uses only information from provided context
3. Cites specific sources
4. Acknowledges any uncertainty

If verification fails, revise your response."""
```

## Performance Optimization

### Token Efficiency
- Remove redundant words and phrases
- Use abbreviations consistently after first definition
- Consolidate similar instructions
- Move stable content to system prompts

### Latency Reduction
- Minimize prompt length without sacrificing quality
- Use streaming for long-form outputs
- Cache common prompt prefixes
- Batch similar requests when possible

## Resources

- **references/few-shot-learning.md**: Deep dive on example selection and construction
- **references/chain-of-thought.md**: Advanced reasoning elicitation techniques
- **references/prompt-optimization.md**: Systematic refinement workflows
- **references/prompt-templates.md**: Reusable template patterns
- **references/system-prompts.md**: System-level prompt design
- **assets/prompt-template-library.md**: Battle-tested prompt templates
- **assets/few-shot-examples.json**: Curated example datasets
- **scripts/optimize-prompt.py**: Automated prompt optimization tool
- **No error handling**: Assuming outputs will always be well-formed
- **Hardcoded values**: Not parameterizing prompts for reuse

## Success Metrics

@@ -189,13 +461,12 @@ Track these KPIs for your prompts:
- **Consistency**: Reproducibility across similar inputs
- **Latency**: Response time (P50, P95, P99)
- **Token Usage**: Average tokens per request
- **Success Rate**: Percentage of valid outputs
- **Success Rate**: Percentage of valid, parseable outputs
- **User Satisfaction**: Ratings and feedback
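The latency KPIs above (P50, P95, P99) reduce to nearest-rank percentiles over recorded samples; a stdlib sketch (the latency values below are made up):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value covering p% of the sorted samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 95, 110, 400, 105, 98, 130, 115, 102, 250]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Note how a single slow outlier dominates P95/P99 while leaving P50 unchanged, which is why tail percentiles are tracked alongside the median.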

## Next Steps
## Resources

1. Review the prompt template library for common patterns
2. Experiment with few-shot learning for your specific use case
3. Implement prompt versioning and A/B testing
4. Set up automated evaluation pipelines
5. Document your prompt engineering decisions and learnings
- [Anthropic Prompt Engineering Guide](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering)
- [Claude Prompt Caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
- [OpenAI Prompt Engineering](https://platform.openai.com/docs/guides/prompt-engineering)
- [LangChain Prompts](https://python.langchain.com/docs/concepts/prompts/)

@@ -23,187 +23,276 @@ Master Retrieval-Augmented Generation (RAG) to build LLM applications that provi
**Purpose**: Store and retrieve document embeddings efficiently

**Options:**
- **Pinecone**: Managed, scalable, fast queries
- **Weaviate**: Open-source, hybrid search
- **Pinecone**: Managed, scalable, serverless
- **Weaviate**: Open-source, hybrid search, GraphQL
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use
- **Qdrant**: Fast, filtered search
- **FAISS**: Meta's library, local deployment
- **Chroma**: Lightweight, easy to use, local development
- **Qdrant**: Fast, filtered search, Rust-based
- **pgvector**: PostgreSQL extension, SQL integration

### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search

**Models:**
- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dims
- **all-MiniLM-L6-v2** (Sentence Transformers): Fast, lightweight
- **e5-large-v2**: High quality, multilingual
- **Instructor**: Task-specific instructions
- **bge-large-en-v1.5**: SOTA performance
**Models (2026):**
| Model | Dimensions | Best For |
|-------|------------|----------|
| **voyage-3-large** | 1024 | Claude apps (Anthropic recommended) |
| **voyage-code-3** | 1024 | Code search |
| **text-embedding-3-large** | 3072 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | Open source, local deployment |
| **multilingual-e5-large** | 1024 | Multi-language support |
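Whatever model produces the vectors, "similarity search" almost always means cosine similarity between them; a stdlib sketch with toy 3-dimensional vectors (real embeddings have the hundreds to thousands of dimensions listed above):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the first two point in similar directions, the third does not
query = [0.9, 0.1, 0.0]
doc_same_topic = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]
assert cosine_similarity(query, doc_same_topic) > cosine_similarity(query, doc_unrelated)
```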

### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse
- **Hybrid Search**: Combine dense + sparse with weighted fusion
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents
- **HyDE**: Generate hypothetical documents for better retrieval
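The dense + sparse fusion step is commonly implemented as Reciprocal Rank Fusion; a minimal sketch, assuming the widely used `k = 60` smoothing constant (doc IDs below are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores the sum of 1/(k + rank) over all lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
sparse = ["doc_b", "doc_d", "doc_a"]  # keyword (BM25) ranking
print(reciprocal_rank_fusion([dense, sparse]))
```

`doc_b` wins because it ranks high in both lists, even though neither retriever put it first: RRF rewards agreement across rankings rather than any single score.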

### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results

**Methods:**
- **Cross-Encoders**: BERT-based reranking
- **Cross-Encoders**: BERT-based reranking (ms-marco-MiniLM)
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance
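Of these, MMR is simple enough to sketch over precomputed scores; a stdlib version (doc IDs, relevance values, and the 0.7 trade-off weight below are all illustrative):

```python
def mmr_rerank(
    candidates: list[str],
    relevance: dict[str, float],
    pairwise_sim: dict[tuple[str, str], float],
    lambda_mult: float = 0.7,
    top_k: int = 2,
) -> list[str]:
    """Maximal Marginal Relevance: trade relevance off against redundancy."""
    selected: list[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        def mmr_score(doc: str) -> float:
            # Redundancy = similarity to the closest already-selected doc
            redundancy = max(
                (pairwise_sim.get((doc, s), pairwise_sim.get((s, doc), 0.0))
                 for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance[doc] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

relevance = {"d1": 0.95, "d2": 0.94, "d3": 0.80}
sim = {("d1", "d2"): 0.99, ("d1", "d3"): 0.10, ("d2", "d3"): 0.15}  # d1, d2 near-duplicates
print(mmr_rerank(["d1", "d2", "d3"], relevance, sim))
```

Pure relevance ranking would return the near-duplicates `d1, d2`; MMR instead picks `d1` and then the less relevant but non-redundant `d3`.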
|
||||
|
||||
## Quick Start
|
||||
## Quick Start with LangGraph
|
||||
|
||||
```python
from langchain_community.document_loaders import DirectoryLoader
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict


class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str


# 1. Load documents
loader = DirectoryLoader('./docs', glob="**/*.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Initialize model, embeddings, and vector store
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)


async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}


async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}


# 4. Build RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# 5. Query
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
```
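The compiled graph runs its nodes in sequence: START → retrieve → generate → END. As a dependency-free illustration of that data flow, here is a sketch with stub `retrieve`/`generate` functions standing in for the real retriever and LLM (the stubs are placeholders, not LangChain components):

```python
import asyncio

# Hypothetical stand-ins so the graph's data flow runs without external services.
async def retrieve(state: dict) -> dict:
    docs = [f"doc about {state['question']}"]  # stub retriever
    return {**state, "context": docs}

async def generate(state: dict) -> dict:
    answer = f"Based on {len(state['context'])} document(s): ..."  # stub LLM
    return {**state, "answer": answer}

async def run_pipeline(question: str) -> dict:
    """Execute nodes in the order the compiled graph would."""
    state = {"question": question}
    for node in (retrieve, generate):
        state = await node(state)
    return state

result = asyncio.run(run_pipeline("vector search"))
print(result["answer"])  # → Based on 1 document(s): ...
```

Each node receives the accumulated state and returns an update, which is exactly how LangGraph merges node outputs into the typed state.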

## Advanced RAG Patterns

### Pattern 1: Hybrid Search with RRF

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
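`EnsembleRetriever` fuses the two ranked lists with weighted Reciprocal Rank Fusion. A minimal sketch of the unweighted RRF formula (each appearance at rank *r* contributes 1 / (k + r), with the conventional k = 60):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs by summing 1 / (k + rank) per appearance."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# "a" is ranked first by both retrievers, so it tops the fused ranking
fused = reciprocal_rank_fusion([["a", "b"], ["a", "c"]])
print(fused[0])  # → a
```

The constant k damps the influence of top ranks so that a document appearing mid-list in several retrievers can still beat one that appears high in only one.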

### Pattern 2: Multi-Query Retrieval

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")
```

### Pattern 3: Contextual Compression

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")
```

### Pattern 4: Parent Document Retriever

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```

### Pattern 5: HyDE (Hypothetical Document Embeddings)

```python
from langchain_core.prompts import ChatPromptTemplate

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

Question: {question}

Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using hypothetical document."""
    # Use hypothetical doc for retrieval instead of original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

hyde_rag = builder.compile()
```

## Document Chunking Strategies

### Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
)

chunks = splitter.split_documents(documents)
```

### Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```

### Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```

### Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```

## Vector Store Configurations

### Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```

### Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```

### Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```

### pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```

## Retrieval Optimization

### 1. Metadata Filtering
```python
from datetime import datetime

# Add metadata during indexing (determine_category is a user-defined helper)
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```

### 2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20  # Fetch 20, return top 5 diverse
)
```

### 3. Cross-Encoder Reranking
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial results
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank with query-document pairs
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
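MMR itself can be sketched without the vector-store API: it greedily picks the document that maximizes λ · sim(query, d) − (1 − λ) · max sim(d, already selected). A toy illustration with 2-d vectors and cosine similarity (not the library's implementation; the vectors and λ here are made up for the example):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr(query: list[float], docs: dict[str, list[float]], k: int = 2, lam: float = 0.5) -> list[str]:
    """Greedily pick k docs maximizing lam * relevance - (1 - lam) * redundancy."""
    selected: list[str] = []
    candidates = dict(docs)
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda d: lam * cosine(query, candidates[d])
            - (1 - lam) * max((cosine(candidates[d], docs[s]) for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.pop(best)
    return selected

# "b" is nearly a duplicate of "a"; with diversity-leaning lam, MMR prefers "c"
docs = {"a": [1.0, 0.05], "b": [1.0, 0.3], "c": [0.0, 1.0]}
print(mmr([1.0, 0.0], docs, k=2, lam=0.3))  # → ['a', 'c']
```

Lower `lam` (closer to 0) favors diversity; higher (closer to 1) favors raw relevance.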

### 4. Cohere Rerank
```python
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```

## Prompt Engineering for RAG

### Contextual Prompt with Citations
```python
rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.

If you cannot answer based on the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty

Answer (with citations):"""
)
```

### Structured Output for RAG
```python
from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on context")
    confidence: float = Field(description="Confidence score 0-1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Use with structured output
structured_llm = llm.with_structured_output(RAGResponse)
```

## Evaluation Metrics

```python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant docs / retrieved docs
    retrieval_recall: float     # Retrieved relevant / total relevant
    answer_relevance: float     # Answer addresses question
    faithfulness: float         # Answer grounded in context
    context_relevance: float    # Context relevant to question


async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate RAG system on test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])

        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge for quality metrics
        # (evaluate_answer_quality is a user-supplied judge helper)
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
```

## Resources

- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangGraph RAG Examples](https://langchain-ai.github.io/langgraph/tutorials/rag/)
- [Pinecone Best Practices](https://docs.pinecone.io/guides/get-started/overview)
- [Voyage AI Embeddings](https://docs.voyageai.com/)
- [RAG Evaluation Guide](https://docs.ragas.io/)

## Best Practices

1. **Chunk Size**: Balance between context (larger) and specificity (smaller) - typically 500-1000 tokens
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best recall
5. **Reranking**: Use cross-encoder reranking for precision-critical applications
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics and latency in production
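Practices 1 and 2 in action: a minimal fixed-size chunker (an illustration only, not the LangChain splitter) showing how a 20% overlap repeats boundary text in adjacent chunks:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each repeating the previous chunk's tail."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text)
# Each chunk starts with the previous chunk's last 200 characters,
# so a sentence spanning a boundary appears whole in at least one chunk
assert chunks[1][:200] == chunks[0][-200:]
```

The overlap costs extra storage and embedding calls in proportion to the overlap fraction, which is why 10-20% is the usual compromise.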

## Common Issues

- **Poor Retrieval**: Check embedding quality, chunk size, query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed, check chunking
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step
- **Context Too Long**: Use compression or parent document retriever
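One cheap verification step against hallucinations is a lexical groundedness check: flag answers whose tokens barely appear in the retrieved context. This is a crude heuristic sketch (production systems typically use an LLM judge or an NLI model instead):

```python
def groundedness_score(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    context_tokens = set(context.lower().split())
    return len(answer_tokens & context_tokens) / len(answer_tokens)

context = "the cache is warmed on startup and refreshed hourly"
print(groundedness_score("the cache is refreshed hourly", context))  # → 1.0 (fully grounded)
print(groundedness_score("replicas sync nightly", context))          # → 0.0 (likely hallucinated)
```

Scores below a threshold (say 0.5) can trigger a retry with more context or a fallback "I don't have enough information" response.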