mirror of
https://github.com/wshobson/agents.git
synced 2026-03-18 09:37:15 +00:00
feat(llm-application-dev): modernize to LangGraph and latest models v2.0.0
- Migrate from LangChain 0.x to LangChain 1.x/LangGraph patterns
- Update model references to Claude 4.5 and GPT-5.2
- Add Voyage AI as primary embedding recommendation
- Add structured outputs with Pydantic
- Replace deprecated initialize_agent() with StateGraph
- Fix security: use AST-based safe math instead of unsafe execution
- Add plugin.json and README.md for consistency
- Bump marketplace version to 1.3.3
@@ -7,7 +7,7 @@
   },
   "metadata": {
     "description": "Production-ready workflow orchestration with 67 focused plugins, 99 specialized agents, and 107 skills - optimized for granular installation and minimal token usage",
-    "version": "1.3.2"
+    "version": "1.3.3"
   },
   "plugins": [
     {
@@ -341,8 +341,8 @@
     {
       "name": "llm-application-dev",
       "source": "./plugins/llm-application-dev",
-      "description": "LLM application development, prompt engineering, and AI assistant optimization",
-      "version": "1.2.2",
+      "description": "LLM application development with LangGraph, RAG systems, vector search, and AI agent architectures for Claude 4.5 and GPT-5.2",
+      "version": "2.0.0",
       "author": {
         "name": "Seth Hobson",
         "url": "https://github.com/wshobson"
@@ -355,6 +355,10 @@
         "ai",
         "prompt-engineering",
         "langchain",
+        "langgraph",
+        "rag",
+        "vector-search",
+        "voyage-ai",
         "gpt",
         "claude"
       ],
plugins/llm-application-dev/.claude-plugin/plugin.json (new file, 10 lines)
@@ -0,0 +1,10 @@
{
  "name": "llm-application-dev",
  "version": "2.0.0",
  "description": "LLM application development with LangGraph, RAG systems, vector search, and AI agent architectures. Updated for LangChain 1.x, Claude 4.5, and GPT-5.2.",
  "author": {
    "name": "Seth Hobson",
    "email": "seth@major7apps.com"
  },
  "license": "MIT"
}
plugins/llm-application-dev/README.md (new file, 86 lines)
@@ -0,0 +1,86 @@

# LLM Application Development Plugin for Claude Code

Build production-ready LLM applications, advanced RAG systems, and intelligent agents with modern AI patterns.

## Version 2.0.0 Highlights

- **LangGraph Integration**: Updated from deprecated LangChain patterns to LangGraph StateGraph workflows
- **Modern Model Support**: Claude Opus/Sonnet/Haiku 4.5 and GPT-5.2/GPT-5.2-mini
- **Voyage AI Embeddings**: Recommended embedding models for Claude applications
- **Structured Outputs**: Pydantic-based structured output patterns
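The structured-output pattern in the highlights above can be sketched without any framework: ask the model for JSON matching a schema, then validate before use. The plugin's templates use Pydantic (in LangChain 1.x typically via `model.with_structured_output(...)`); the sketch below is a framework-free stand-in, and the `TicketTriage` schema is a hypothetical example.

```python
import json
from dataclasses import dataclass

@dataclass
class TicketTriage:
    """Hypothetical schema the model is asked to fill."""
    category: str
    priority: int

def parse_structured_output(raw: str) -> TicketTriage:
    """Validate a model's JSON reply against the expected schema."""
    data = json.loads(raw)
    result = TicketTriage(category=str(data["category"]),
                          priority=int(data["priority"]))
    if result.priority not in (1, 2, 3):
        raise ValueError(f"priority out of range: {result.priority}")
    return result

# A well-formed model reply parses cleanly:
triage = parse_structured_output('{"category": "billing", "priority": 2}')
```

The same validate-before-use step applies whether the schema lives in a dataclass or a Pydantic model; the point is that malformed model output fails loudly instead of flowing downstream.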

## Features

### Core Capabilities
- **RAG Systems**: Production retrieval-augmented generation with hybrid search
- **Vector Search**: Pinecone, Qdrant, Weaviate, Milvus, pgvector optimization
- **Agent Architectures**: LangGraph-based agents with memory and tool use
- **Prompt Engineering**: Advanced prompting techniques with model-specific optimization

### Key Technologies
- LangChain 1.x / LangGraph for agent workflows
- Voyage AI, OpenAI, and open-source embedding models
- HNSW, IVF, and Product Quantization index strategies
- Async patterns with checkpointers for durable execution

## Agents

| Agent | Description |
|-------|-------------|
| `ai-engineer` | Production-grade LLM applications, RAG systems, and agent architectures |
| `prompt-engineer` | Advanced prompting techniques, constitutional AI, and model optimization |
| `vector-database-engineer` | Vector search implementation, embedding strategies, and semantic retrieval |

## Skills

| Skill | Description |
|-------|-------------|
| `langchain-architecture` | LangGraph StateGraph patterns, memory, and tool integration |
| `rag-implementation` | RAG systems with hybrid search and reranking |
| `llm-evaluation` | Evaluation frameworks for LLM applications |
| `prompt-engineering-patterns` | Chain-of-thought, few-shot, and structured outputs |
| `embedding-strategies` | Embedding model selection and optimization |
| `similarity-search-patterns` | Vector similarity search implementation |
| `vector-index-tuning` | HNSW, IVF, and quantization optimization |
| `hybrid-search-implementation` | Vector + keyword search fusion |

## Commands

| Command | Description |
|---------|-------------|
| `/llm-application-dev:langchain-agent` | Create LangGraph-based agent |
| `/llm-application-dev:ai-assistant` | Build AI assistant application |
| `/llm-application-dev:prompt-optimize` | Optimize prompts for production |

## Installation

```bash
claude --plugin-dir /path/to/llm-application-dev
```

Or copy to your project's `.claude-plugin/` directory.

## Requirements

- LangChain >= 1.2.0
- LangGraph >= 0.3.0
- Python 3.11+

## Changelog

### 2.0.0 (January 2026)
- **Breaking**: Migrated from LangChain 0.x to LangChain 1.x/LangGraph
- **Breaking**: Updated model references to Claude 4.5 and GPT-5.2
- Added Voyage AI as primary embedding recommendation for Claude apps
- Added LangGraph StateGraph patterns replacing deprecated `initialize_agent()`
- Added structured outputs with Pydantic
- Added async patterns with checkpointers
- Fixed security issue: replaced unsafe code execution with AST-based safe math evaluation
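The AST-based approach mentioned in the changelog entry above can be sketched in a few lines: parse the expression, then walk only whitelisted node types, so arbitrary code can never execute. This is a minimal sketch of the technique; the plugin's actual implementation may differ.

```python
import ast
import operator

# Whitelist of permitted operators; anything else is rejected
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr: str) -> float:
    """Evaluate arithmetic safely by walking the AST."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        # Function calls, attribute access, names, etc. all land here
        raise ValueError(f"Disallowed expression: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))
```

Unlike `eval()`, an input such as `__import__('os')` parses to a call node that the walker rejects, so injection attempts fail closed.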
- Updated hybrid search with modern Pinecone client API

### 1.2.2
- Minor bug fixes and documentation updates

## License

MIT License - See the plugin configuration for details.
@@ -12,9 +12,9 @@ Expert AI engineer specializing in LLM application development, RAG systems, and
 ## Capabilities

 ### LLM Integration & Model Management
-- OpenAI GPT-4o/4o-mini, o1-preview, o1-mini with function calling and structured outputs
-- Anthropic Claude 4.5 Sonnet/Haiku, Claude 4.1 Opus with tool use and computer use
-- Open-source models: Llama 3.1/3.2, Mixtral 8x7B/8x22B, Qwen 2.5, DeepSeek-V2
+- OpenAI GPT-5.2/GPT-5.2-mini with function calling and structured outputs
+- Anthropic Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5 with tool use and computer use
+- Open-source models: Llama 3.3, Mixtral 8x22B, Qwen 2.5, DeepSeek-V3
 - Local deployment with Ollama, vLLM, TGI (Text Generation Inference)
 - Model serving with TorchServe, MLflow, BentoML for production deployment
 - Multi-model orchestration and model routing strategies
@@ -23,7 +23,7 @@ Expert AI engineer specializing in LLM application development, RAG systems, and
 ### Advanced RAG Systems
 - Production RAG architectures with multi-stage retrieval pipelines
 - Vector databases: Pinecone, Qdrant, Weaviate, Chroma, Milvus, pgvector
-- Embedding models: OpenAI text-embedding-3-large/small, Cohere embed-v3, BGE-large
+- Embedding models: Voyage AI voyage-3-large (recommended for Claude), OpenAI text-embedding-3-large/small, Cohere embed-v3, BGE-large
 - Chunking strategies: semantic, recursive, sliding window, and document-structure aware
 - Hybrid search combining vector similarity and keyword matching (BM25)
 - Reranking with Cohere rerank-3, BGE reranker, or cross-encoder models
@@ -32,14 +32,14 @@ Expert AI engineer specializing in LLM application development, RAG systems, and
 - Advanced RAG patterns: GraphRAG, HyDE, RAG-Fusion, self-RAG

 ### Agent Frameworks & Orchestration
-- LangChain/LangGraph for complex agent workflows and state management
+- LangGraph (LangChain 1.x) for complex agent workflows with StateGraph and durable execution
 - LlamaIndex for data-centric AI applications and advanced retrieval
 - CrewAI for multi-agent collaboration and specialized agent roles
 - AutoGen for conversational multi-agent systems
 - OpenAI Assistants API with function calling and file search
-- Agent memory systems: short-term, long-term, and episodic memory
+- Claude Agent SDK for building production Anthropic agents
+- Agent memory systems: checkpointers, short-term, long-term, and vector-based memory
 - Tool integration: web search, code execution, API calls, database queries
-- Agent evaluation and monitoring with custom metrics
+- Agent evaluation and monitoring with LangSmith

 ### Vector Search & Embeddings
 - Embedding model selection and fine-tuning for domain-specific tasks
@@ -111,7 +111,7 @@ Expert AI engineer specializing in LLM application development, RAG systems, and
 - Balances cutting-edge techniques with proven, stable solutions

 ## Knowledge Base
-- Latest LLM developments and model capabilities (GPT-4o, Claude 4.5, Llama 3.2)
+- Latest LLM developments and model capabilities (GPT-5.2, Claude 4.5, Llama 3.3)
 - Modern vector database architectures and optimization techniques
 - Production AI system design patterns and best practices
 - AI safety and security considerations for enterprise deployments
@@ -44,7 +44,7 @@ Expert prompt engineer specializing in advanced prompting methodologies and LLM

 ### Model-Specific Optimization

-#### OpenAI Models (GPT-4o, o1-preview, o1-mini)
+#### OpenAI Models (GPT-5.2, GPT-5.2-mini)
 - Function calling optimization and structured outputs
 - JSON mode utilization for reliable data extraction
 - System message design for consistent behavior
@@ -53,14 +53,14 @@ Expert prompt engineer specializing in advanced prompting methodologies and LLM
 - Multi-turn conversation management
 - Image and multimodal prompt engineering

-#### Anthropic Claude (4.5 Sonnet, Haiku, Opus)
+#### Anthropic Claude (Claude Opus 4.5, Sonnet 4.5, Haiku 4.5)
 - Constitutional AI alignment with Claude's training
 - Tool use optimization for complex workflows
 - Computer use prompting for automation tasks
 - XML tag structuring for clear prompt organization
-- Context window optimization for long documents
+- Context window optimization for long documents (200K tokens)
 - Prompt caching for cost optimization
 - Safety considerations specific to Claude's capabilities
 - Harmlessness and helpfulness balancing

 #### Open Source Models (Llama, Mixtral, Qwen)
 - Model-specific prompt formatting and special tokens
@@ -1,43 +1,107 @@
 ---
 name: vector-database-engineer
 description: Expert in vector databases, embedding strategies, and semantic search implementation. Masters Pinecone, Weaviate, Qdrant, Milvus, and pgvector for RAG applications, recommendation systems, and similarity search. Use PROACTIVELY for vector search implementation, embedding optimization, or semantic retrieval systems.
 model: inherit
 ---

 # Vector Database Engineer

-Expert in vector databases, embedding strategies, and semantic search implementation. Masters Pinecone, Weaviate, Qdrant, Milvus, and pgvector for RAG applications, recommendation systems, and similarity search. Use PROACTIVELY for vector search implementation, embedding optimization, or semantic retrieval systems.
+Expert in vector databases, embedding strategies, and semantic search implementation. Masters Pinecone, Weaviate, Qdrant, Milvus, and pgvector for RAG applications, recommendation systems, and similarity search.

 ## Purpose

 Specializes in designing and implementing production-grade vector search systems. Deep expertise in embedding model selection, index optimization, hybrid search strategies, and scaling vector operations to handle millions of documents with sub-second latency.

 ## Capabilities

-- Vector database selection and architecture
-- Embedding model selection and optimization
-- Index configuration (HNSW, IVF, PQ)
-- Hybrid search (vector + keyword) implementation
-- Chunking strategies for documents
-- Metadata filtering and pre/post-filtering
-- Performance tuning and scaling
+### Vector Database Selection & Architecture
+- **Pinecone**: Managed serverless, auto-scaling, metadata filtering
+- **Qdrant**: High-performance, Rust-based, complex filtering
+- **Weaviate**: GraphQL API, hybrid search, multi-tenancy
+- **Milvus**: Distributed architecture, GPU acceleration
+- **pgvector**: PostgreSQL extension, SQL integration
+- **Chroma**: Lightweight, local development, embeddings built-in

-## When to Use
+### Embedding Model Selection
+- **Voyage AI**: voyage-3-large (recommended for Claude apps), voyage-code-3, voyage-finance-2, voyage-law-2
+- **OpenAI**: text-embedding-3-large (3072 dims), text-embedding-3-small (1536 dims)
+- **Open Source**: BGE-large-en-v1.5, E5-large-v2, multilingual-e5-large
+- **Local**: Sentence Transformers, Hugging Face models
+- Domain-specific fine-tuning strategies

-- Building RAG (Retrieval Augmented Generation) systems
-- Implementing semantic search over documents
-- Creating recommendation engines
-- Building image/audio similarity search
-- Optimizing vector search latency and recall
-- Scaling vector operations to millions of vectors
+### Index Configuration & Optimization
+- **HNSW**: High recall, adjustable M and efConstruction parameters
+- **IVF**: Large-scale datasets, nlist/nprobe tuning
+- **Product Quantization (PQ)**: Memory optimization for billions of vectors
+- **Scalar Quantization**: INT8/FP16 for reduced memory
+- Index selection based on recall/latency/memory tradeoffs
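As a concrete illustration of the scalar-quantization tradeoff listed above, an INT8 scheme can be sketched in a few lines: scale each vector into the signed 8-bit range, store the scale, and dequantize at query time. This is a toy sketch; production databases quantize per segment with SIMD kernels.

```python
from typing import List, Tuple

def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Map floats into [-127, 127] and keep the scale for dequantization."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(qvec: List[int], scale: float) -> List[float]:
    """Recover an approximation of the original vector."""
    return [q * scale for q in qvec]

vec = [0.12, -0.53, 0.99, 0.0]
qvec, scale = quantize_int8(vec)
approx = dequantize_int8(qvec, scale)
# Memory drops 4x (int8 vs float32) at a small reconstruction error
max_err = max(abs(a - b) for a, b in zip(vec, approx))
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why INT8 usually costs only a small amount of recall while quartering memory.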

+### Hybrid Search Implementation
+- Vector + BM25 keyword search fusion
+- Reciprocal Rank Fusion (RRF) scoring
+- Weighted combination strategies
+- Query routing for optimal retrieval
+- Reranking with cross-encoders
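Reciprocal Rank Fusion, listed above, is simple enough to sketch directly: each ranked list contributes `1 / (k + rank)` per document, so documents appearing high in both the vector and keyword lists win. A minimal sketch with the conventional k = 60:

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]
keyword_hits = ["doc1", "doc9", "doc3"]
fused = rrf_fuse([vector_hits, keyword_hits])
```

Because RRF only uses ranks, it needs no score normalization between the vector and BM25 sides, which is why it is a common default before trying weighted combinations.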

+### Document Processing Pipeline
+- Chunking strategies: recursive, semantic, token-based
+- Metadata extraction and enrichment
+- Embedding batching and async processing
+- Incremental indexing and updates
+- Document versioning and deduplication

+### Production Operations
+- Monitoring: latency percentiles, recall metrics
+- Scaling: sharding, replication, auto-scaling
+- Backup and disaster recovery
+- Index rebuilding strategies
+- Cost optimization and resource planning
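For the latency-percentile monitoring mentioned above, a minimal in-process tracker can be sketched with the standard library; production systems would export histogram metrics to Prometheus or similar instead.

```python
import statistics
from typing import List

class LatencyMonitor:
    """Collect query latencies and report percentiles such as P95."""

    def __init__(self) -> None:
        self.samples_ms: List[float] = []

    def record(self, latency_ms: float) -> None:
        self.samples_ms.append(latency_ms)

    def percentile(self, p: int) -> float:
        # quantiles with n=100 yields the 1st..99th percentile cut points
        cuts = statistics.quantiles(self.samples_ms, n=100)
        return cuts[p - 1]

monitor = LatencyMonitor()
for ms in range(1, 101):  # simulated latencies of 1..100 ms
    monitor.record(float(ms))
p95 = monitor.percentile(95)
```

Tracking P95/P99 rather than the mean is what surfaces tail-latency regressions as an index grows.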

 ## Workflow

-1. Analyze data characteristics and query patterns
-2. Select appropriate embedding model
-3. Design chunking and preprocessing pipeline
-4. Choose vector database and index type
-5. Configure metadata schema for filtering
-6. Implement hybrid search if needed
-7. Optimize for latency/recall tradeoffs
-8. Set up monitoring and reindexing strategies
+1. **Analyze requirements**: Data volume, query patterns, latency needs
+2. **Select embedding model**: Match model to use case (general, code, domain)
+3. **Design chunking pipeline**: Balance context preservation with retrieval precision
+4. **Choose vector database**: Based on scale, features, operational needs
+5. **Configure index**: Optimize for recall/latency tradeoffs
+6. **Implement hybrid search**: If keyword matching improves results
+7. **Add reranking**: For precision-critical applications
+8. **Set up monitoring**: Track performance and embedding drift

 ## Best Practices

-- Choose embedding dimensions based on use case (384-1536)
-- Implement proper chunking with overlap
-- Use metadata filtering to reduce search space
+### Embedding Selection
+- Use Voyage AI for Claude-based applications (officially recommended by Anthropic)
+- Match embedding dimensions to use case (512-1024 for most, 3072 for maximum quality)
+- Consider domain-specific models for code, legal, finance
+- Test embedding quality on representative queries

+### Chunking
+- Chunk size 500-1000 tokens for most use cases
+- 10-20% overlap to preserve context boundaries
+- Use semantic chunking for complex documents
+- Include metadata for filtering and debugging
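The chunk-size and overlap guidance above translates directly into a sliding-window splitter. In this sketch a "token" is approximated by whitespace splitting; a real pipeline would use the embedding model's tokenizer.

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 15) -> List[str]:
    """Split text into word-based chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks

doc = " ".join(f"w{i}" for i in range(250))
chunks = chunk_with_overlap(doc, chunk_size=100, overlap=15)
```

With 250 words, a window of 100, and 15% overlap, the splitter emits three chunks, and the last 15 words of each chunk reappear at the start of the next, preserving context across boundaries.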

+### Index Tuning
+- Start with HNSW for most use cases (good recall/latency balance)
+- Use IVF+PQ for >10M vectors with memory constraints
+- Benchmark recall@10 vs latency for your specific queries
+- Monitor and re-tune as data grows

+### Production
-- Plan for index rebuilding
-- Cache frequent queries
-- Test recall vs latency tradeoffs
+- Implement metadata filtering to reduce search space
+- Cache frequent queries and embeddings
+- Plan for index rebuilding (blue-green deployments)
+- Monitor embedding drift over time
 - Set up alerts for latency degradation

+## Example Tasks
+
+- "Design a vector search system for 10M documents with <100ms P95 latency"
+- "Implement hybrid search combining semantic and keyword retrieval"
+- "Optimize embedding costs by selecting the right model and dimensions"
+- "Set up Pinecone with metadata filtering for multi-tenant RAG"
+- "Build a code search system with Voyage code embeddings"
+- "Migrate from Chroma to Qdrant for production workloads"
+- "Configure HNSW parameters for optimal recall/latency tradeoff"
+- "Implement incremental indexing pipeline with async processing"
@@ -113,9 +113,9 @@ Final Response: [Refined]

 ### 5. Model-Specific Optimization

-**GPT-5/GPT-4o**
+**GPT-5.2**
 ```python
-gpt4_optimized = """
+gpt5_optimized = """
 ##CONTEXT##
 {structured_context}
@@ -566,7 +566,7 @@ testing_recommendations:
   metrics: ["accuracy", "satisfaction", "cost"]

 deployment_strategy:
-  model: "GPT-5 for quality, Claude for safety"
+  model: "GPT-5.2 for quality, Claude 4.5 for safety"
   temperature: 0.7
   max_tokens: 2000
   monitoring: "Track success, latency, feedback"
@@ -18,14 +18,18 @@ Guide to selecting and optimizing embedding models for vector search application

 ## Core Concepts

-### 1. Embedding Model Comparison
+### 1. Embedding Model Comparison (2026)

 | Model | Dimensions | Max Tokens | Best For |
 |-------|------------|------------|----------|
-| **text-embedding-3-large** | 3072 | 8191 | High accuracy |
-| **text-embedding-3-small** | 1536 | 8191 | Cost-effective |
-| **voyage-2** | 1024 | 4000 | Code, legal |
-| **bge-large-en-v1.5** | 1024 | 512 | Open source |
+| **voyage-3-large** | 1024 | 32000 | Claude apps (Anthropic recommended) |
+| **voyage-3** | 1024 | 32000 | Claude apps, cost-effective |
+| **voyage-code-3** | 1024 | 32000 | Code search |
+| **voyage-finance-2** | 1024 | 32000 | Financial documents |
+| **voyage-law-2** | 1024 | 32000 | Legal documents |
+| **text-embedding-3-large** | 3072 | 8191 | OpenAI apps, high accuracy |
+| **text-embedding-3-small** | 1536 | 8191 | OpenAI apps, cost-effective |
+| **bge-large-en-v1.5** | 1024 | 512 | Open source, local deployment |
 | **all-MiniLM-L6-v2** | 384 | 256 | Fast, lightweight |
 | **multilingual-e5-large** | 1024 | 512 | Multi-language |
@@ -39,7 +43,34 @@ Document → Chunking → Preprocessing → Embedding Model → Vector

 ## Templates

-### Template 1: OpenAI Embeddings
+### Template 1: Voyage AI Embeddings (Recommended for Claude)
+
+```python
+from langchain_voyageai import VoyageAIEmbeddings
+from typing import List
+import os
+
+# Initialize Voyage AI embeddings (recommended by Anthropic for Claude)
+embeddings = VoyageAIEmbeddings(
+    model="voyage-3-large",
+    voyage_api_key=os.environ.get("VOYAGE_API_KEY")
+)
+
+def get_embeddings(texts: List[str]) -> List[List[float]]:
+    """Get embeddings from Voyage AI."""
+    return embeddings.embed_documents(texts)
+
+def get_query_embedding(query: str) -> List[float]:
+    """Get single query embedding."""
+    return embeddings.embed_query(query)
+
+# Specialized models for domains
+code_embeddings = VoyageAIEmbeddings(model="voyage-code-3")
+finance_embeddings = VoyageAIEmbeddings(model="voyage-finance-2")
+legal_embeddings = VoyageAIEmbeddings(model="voyage-law-2")
+```
+
+### Template 2: OpenAI Embeddings

 ```python
 from openai import OpenAI
@@ -53,7 +84,7 @@ def get_embeddings(
     model: str = "text-embedding-3-small",
     dimensions: int = None
 ) -> List[List[float]]:
-    """Get embeddings from OpenAI."""
+    """Get embeddings from OpenAI with optional dimension reduction."""
     # Handle batching for large lists
     batch_size = 100
     all_embeddings = []
@@ -63,6 +94,7 @@ def get_embeddings(

         kwargs = {"input": batch, "model": model}
         if dimensions:
+            # Matryoshka dimensionality reduction
             kwargs["dimensions"] = dimensions

         response = client.embeddings.create(**kwargs)
@@ -77,7 +109,7 @@ def get_embedding(text: str, **kwargs) -> List[float]:
     return get_embeddings([text], **kwargs)[0]


-# Dimension reduction with OpenAI
+# Dimension reduction with Matryoshka embeddings
 def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
     """Get embedding with reduced dimensions (Matryoshka)."""
     return get_embedding(
@@ -87,7 +119,7 @@ def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
     )
 ```

-### Template 2: Local Embeddings with Sentence Transformers
+### Template 3: Local Embeddings with Sentence Transformers

 ```python
 from sentence_transformers import SentenceTransformer
@@ -103,6 +135,7 @@ class LocalEmbedder:
         device: str = "cuda"
     ):
         self.model = SentenceTransformer(model_name, device=device)
+        self.model_name = model_name

     def embed(
         self,
@@ -120,9 +153,9 @@ class LocalEmbedder:
         return embeddings

     def embed_query(self, query: str) -> np.ndarray:
-        """Embed a query with BGE-style prefix."""
-        # BGE models benefit from query prefix
-        if "bge" in self.model.get_sentence_embedding_dimension():
+        """Embed a query with appropriate prefix for retrieval models."""
+        # BGE and similar models benefit from query prefix
+        if "bge" in self.model_name.lower():
             query = f"Represent this sentence for searching relevant passages: {query}"
         return self.embed([query])[0]

@@ -137,13 +170,15 @@ class E5Embedder:
         self.model = SentenceTransformer(model_name)

     def embed_query(self, query: str) -> np.ndarray:
         """E5 requires 'query:' prefix for queries."""
         return self.model.encode(f"query: {query}")

     def embed_document(self, document: str) -> np.ndarray:
         """E5 requires 'passage:' prefix for documents."""
         return self.model.encode(f"passage: {document}")
 ```

-### Template 3: Chunking Strategies
+### Template 4: Chunking Strategies

 ```python
 from typing import List, Tuple
@@ -288,20 +323,33 @@ def recursive_character_splitter(
     return split_text(text, separators)
 ```

-### Template 4: Domain-Specific Embedding Pipeline
+### Template 5: Domain-Specific Embedding Pipeline

 ```python
 import re
 from typing import List, Optional
 from dataclasses import dataclass

+@dataclass
+class EmbeddedDocument:
+    id: str
+    document_id: str
+    chunk_index: int
+    text: str
+    embedding: List[float]
+    metadata: dict
+
 class DomainEmbeddingPipeline:
     """Pipeline for domain-specific embeddings."""

     def __init__(
         self,
-        embedding_model: str = "text-embedding-3-small",
+        embedding_model: str = "voyage-3-large",
         chunk_size: int = 512,
         chunk_overlap: int = 50,
         preprocessing_fn=None
     ):
         self.embedding_model = embedding_model
+        self.embeddings = VoyageAIEmbeddings(model=embedding_model)
         self.chunk_size = chunk_size
         self.chunk_overlap = chunk_overlap
         self.preprocess = preprocessing_fn or self._default_preprocess
@@ -310,7 +358,7 @@ class DomainEmbeddingPipeline:
         """Default preprocessing."""
         # Remove excessive whitespace
         text = re.sub(r'\s+', ' ', text)
-        # Remove special characters
+        # Remove special characters (customize for your domain)
         text = re.sub(r'[^\w\s.,!?-]', '', text)
         return text.strip()

@@ -319,8 +367,8 @@ class DomainEmbeddingPipeline:
         documents: List[dict],
         id_field: str = "id",
         content_field: str = "content",
-        metadata_fields: List[str] = None
-    ) -> List[dict]:
+        metadata_fields: Optional[List[str]] = None
+    ) -> List[EmbeddedDocument]:
         """Process documents for vector storage."""
         processed = []

@@ -339,25 +387,26 @@ class DomainEmbeddingPipeline:
         )

         # Create embeddings
-        embeddings = get_embeddings(chunks, self.embedding_model)
+        embeddings = await self.embeddings.aembed_documents(chunks)

         # Create records
         for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
-            record = {
-                "id": f"{doc_id}_chunk_{i}",
-                "document_id": doc_id,
-                "chunk_index": i,
-                "text": chunk,
-                "embedding": embedding
-            }
+            metadata = {"document_id": doc_id, "chunk_index": i}

-            # Add metadata
+            # Add specified metadata fields
             if metadata_fields:
                 for field in metadata_fields:
                     if field in doc:
-                        record[field] = doc[field]
+                        metadata[field] = doc[field]

-            processed.append(record)
+            processed.append(EmbeddedDocument(
+                id=f"{doc_id}_chunk_{i}",
+                document_id=doc_id,
+                chunk_index=i,
+                text=chunk,
+                embedding=embedding,
+                metadata=metadata
+            ))

         return processed

@@ -366,42 +415,77 @@ class CodeEmbeddingPipeline:
 class CodeEmbeddingPipeline:
     """Specialized pipeline for code embeddings."""

-    def __init__(self, model: str = "voyage-code-2"):
-        self.model = model
+    def __init__(self):
+        # Use Voyage's code-specific model
+        self.embeddings = VoyageAIEmbeddings(model="voyage-code-3")

     def chunk_code(self, code: str, language: str) -> List[dict]:
-        """Chunk code by functions/classes."""
-        import tree_sitter
+        """Chunk code by functions/classes using tree-sitter."""
+        try:
+            import tree_sitter_languages
+            parser = tree_sitter_languages.get_parser(language)
+            tree = parser.parse(bytes(code, "utf8"))

-        # Parse with tree-sitter
-        # Extract functions, classes, methods
-        # Return chunks with context
-        pass
+            chunks = []
+            # Extract function and class definitions
+            self._extract_nodes(tree.root_node, code, chunks)
+            return chunks
+        except ImportError:
+            # Fallback to simple chunking
+            return [{"text": code, "type": "module"}]

-    def embed_with_context(self, chunk: str, context: str) -> List[float]:
+    def _extract_nodes(self, node, source_code: str, chunks: list):
+        """Recursively extract function/class definitions."""
+        if node.type in ['function_definition', 'class_definition', 'method_definition']:
+            text = source_code[node.start_byte:node.end_byte]
+            chunks.append({
+                "text": text,
+                "type": node.type,
+                "name": self._get_name(node),
+                "start_line": node.start_point[0],
+                "end_line": node.end_point[0]
+            })
+        for child in node.children:
+            self._extract_nodes(child, source_code, chunks)
+
+    def _get_name(self, node) -> str:
+        """Extract name from function/class node."""
+        for child in node.children:
+            if child.type == 'identifier' or child.type == 'name':
+                return child.text.decode('utf8')
+        return "unknown"
+
+    async def embed_with_context(
+        self,
+        chunk: str,
+        context: str = ""
+    ) -> List[float]:
         """Embed code with surrounding context."""
         if context:
             combined = f"Context: {context}\n\nCode:\n{chunk}"
-            return get_embedding(combined, model=self.model)
         else:
             combined = chunk
+        return await self.embeddings.aembed_query(combined)
 ```

-### Template 5: Embedding Quality Evaluation
+### Template 6: Embedding Quality Evaluation

 ```python
 import numpy as np
-from typing import List, Tuple
+from typing import List, Dict

 def evaluate_retrieval_quality(
     queries: List[str],
     relevant_docs: List[List[str]],  # List of relevant doc IDs per query
     retrieved_docs: List[List[str]],  # List of retrieved doc IDs per query
     k: int = 10
-) -> dict:
+) -> Dict[str, float]:
     """Evaluate embedding quality for retrieval."""

     def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
         retrieved_k = retrieved[:k]
         relevant_retrieved = len(set(retrieved_k) & relevant)
-        return relevant_retrieved / k
+        return relevant_retrieved / k if k > 0 else 0

     def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
         retrieved_k = retrieved[:k]
@@ -446,7 +530,7 @@ def compute_embedding_similarity(
 ) -> np.ndarray:
     """Compute similarity matrix between embedding sets."""
     if metric == "cosine":
-        # Normalize
+        # Normalize and compute dot product
         norm1 = embeddings1 / np.linalg.norm(embeddings1, axis=1, keepdims=True)
         norm2 = embeddings2 / np.linalg.norm(embeddings2, axis=1, keepdims=True)
         return norm1 @ norm2.T
@@ -455,25 +539,68 @@ def compute_embedding_similarity(
|
||||
return -cdist(embeddings1, embeddings2, metric='euclidean')
|
||||
elif metric == "dot":
|
||||
return embeddings1 @ embeddings2.T
|
||||
else:
|
||||
raise ValueError(f"Unknown metric: {metric}")
|
||||
|
||||
|
||||
def compare_embedding_models(
|
||||
texts: List[str],
|
||||
models: Dict[str, callable],
|
||||
queries: List[str],
|
||||
relevant_indices: List[List[int]],
|
||||
k: int = 5
|
||||
) -> Dict[str, Dict[str, float]]:
|
||||
"""Compare multiple embedding models on retrieval quality."""
|
||||
results = {}
|
||||
|
||||
for model_name, embed_fn in models.items():
|
||||
# Embed all texts
|
||||
doc_embeddings = np.array(embed_fn(texts))
|
||||
|
||||
retrieved_per_query = []
|
||||
for query in queries:
|
||||
query_embedding = np.array(embed_fn([query])[0])
|
||||
# Compute similarities
|
||||
similarities = compute_embedding_similarity(
|
||||
query_embedding.reshape(1, -1),
|
||||
doc_embeddings,
|
||||
metric="cosine"
|
||||
)[0]
|
||||
# Get top-k indices
|
||||
top_k_indices = np.argsort(similarities)[::-1][:k]
|
||||
retrieved_per_query.append([str(i) for i in top_k_indices])
|
||||
|
||||
# Convert relevant indices to string IDs
|
||||
relevant_docs = [[str(i) for i in indices] for indices in relevant_indices]
|
||||
|
||||
results[model_name] = evaluate_retrieval_quality(
|
||||
queries, relevant_docs, retrieved_per_query, k
|
||||
)
|
||||
|
||||
return results
|
||||
```
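The precision and recall helpers above are easy to sanity-check in isolation. A minimal standalone sketch (plain Python, no embedding model required):

```python
from typing import List

def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are relevant."""
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & relevant) / k if k > 0 else 0.0

def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
    """Fraction of all relevant docs found in the top-k."""
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & relevant) / len(relevant) if relevant else 0.0

# Relevant docs {"a", "b"}; system ranked ["a", "c", "b", "d"]
print(precision_at_k({"a", "b"}, ["a", "c", "b", "d"], 2))  # 0.5
print(recall_at_k({"a", "b"}, ["a", "c", "b", "d"], 3))     # 1.0
```

Checking a model against a tiny hand-labeled set like this catches wiring bugs before running a full benchmark.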

## Best Practices

### Do's
- **Match model to use case**: Code vs prose vs multilingual
- **Chunk thoughtfully**: Preserve semantic boundaries
- **Normalize embeddings**: For cosine similarity search
- **Batch requests**: More efficient than one-by-one
- **Cache embeddings**: Avoid recomputing for static content
- **Use Voyage AI for Claude apps**: Recommended by Anthropic

### Don'ts
- **Don't ignore token limits**: Truncation loses information
- **Don't mix embedding models**: Incompatible vector spaces
- **Don't skip preprocessing**: Garbage in, garbage out
- **Don't over-chunk**: Lose important context
- **Don't forget metadata**: Essential for filtering and debugging

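Normalization in particular takes only a few lines. A minimal stdlib-only sketch (the function names here are illustrative, not from any library):

```python
import math

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a plain dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

print(round(cosine([3.0, 4.0], [3.0, 4.0]), 6))  # 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0
```

In practice you would vectorize this with NumPy, but the invariant is the same: unit-norm vectors make dot-product and cosine rankings identical.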
## Resources

- [Voyage AI Documentation](https://docs.voyageai.com/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Sentence Transformers](https://www.sbert.net/)
- [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
- [LangChain Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)

@@ -1,11 +1,11 @@
---
name: langchain-architecture
description: Design LLM applications using LangChain 1.x and LangGraph for agents, memory, and tool integration. Use when building LangChain applications, implementing AI agents, or creating complex LLM workflows.
---

# LangChain & LangGraph Architecture

Master modern LangChain 1.x and LangGraph for building sophisticated LLM applications with agents, state management, memory, and tool integration.

## When to Use This Skill

@@ -17,126 +17,100 @@ Master the LangChain framework for building sophisticated LLM applications with
- Implementing document processing pipelines
- Building production-grade LLM applications

## Package Structure (LangChain 1.x)

```
langchain (1.2.x)        # High-level orchestration
langchain-core (1.2.x)   # Core abstractions (messages, prompts, tools)
langchain-community      # Third-party integrations
langgraph                # Agent orchestration and state management
langchain-openai         # OpenAI integrations
langchain-anthropic      # Anthropic/Claude integrations
langchain-voyageai       # Voyage AI embeddings
langchain-pinecone       # Pinecone vector store
```

## Core Concepts

### 1. LangGraph Agents
LangGraph is the standard for building agents in 2026. It provides:

**Key Features:**
- **StateGraph**: Explicit state management with typed state
- **Durable Execution**: Agents persist through failures
- **Human-in-the-Loop**: Inspect and modify state at any point
- **Memory**: Short-term and long-term memory across sessions
- **Checkpointing**: Save and resume agent state

**Agent Patterns:**
- **ReAct**: Reasoning + Acting with `create_react_agent`
- **Plan-and-Execute**: Separate planning and execution nodes
- **Multi-Agent**: Supervisor routing between specialized agents
- **Tool-Calling**: Structured tool invocation with Pydantic schemas

### 2. State Management
LangGraph uses TypedDict for explicit state:

```python
from typing import Annotated, TypedDict
from langgraph.graph import MessagesState

# Simple message-based state
class AgentState(MessagesState):
    """Extends MessagesState with custom fields."""
    context: Annotated[list, "retrieved documents"]

# Custom state for complex agents
class CustomState(TypedDict):
    messages: Annotated[list, "conversation history"]
    context: Annotated[dict, "retrieved context"]
    current_step: str
    results: list
```

### 3. Memory Systems
Modern memory implementations:

- **ConversationBufferMemory**: Stores all messages (short conversations)
- **ConversationSummaryMemory**: Summarizes older messages (long conversations)
- **ConversationTokenBufferMemory**: Token-based windowing
- **VectorStoreRetrieverMemory**: Semantic similarity retrieval
- **LangGraph Checkpointers**: Persistent state across sessions

### 4. Document Processing
Loading, transforming, and storing documents:

**Components:**
- **Document Loaders**: Load from various sources
- **Text Splitters**: Chunk documents intelligently
- **Vector Stores**: Store and retrieve embeddings
- **Retrievers**: Fetch relevant documents
- **Indexes**: Organize documents for efficient access

### 5. Callbacks & Tracing
LangSmith is the standard for observability:

**Use Cases:**
- Request/response logging
- Token usage tracking
- Latency monitoring
- Error tracking
- Trace visualization

## Quick Start

### Modern ReAct Agent with LangGraph

```python
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
import ast
import operator

# Initialize LLM (Claude Sonnet 4.5 recommended)
llm = ChatAnthropic(model="claude-sonnet-4-5", temperature=0)

# Define tools with Pydantic schemas
@tool
def search_database(query: str) -> str:
    """Search internal database for information."""
@@ -144,195 +118,541 @@ def search_database(query: str) -> str:
    return f"Results for: {query}"

@tool
def calculate(expression: str) -> str:
    """Safely evaluate a mathematical expression.

    Supports: +, -, *, /, **, %, parentheses
    Example: '(2 + 3) * 4' returns '20'
    """
    # Safe math evaluation using ast
    allowed_operators = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Pow: operator.pow,
        ast.Mod: operator.mod,
        ast.USub: operator.neg,
    }

    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        elif isinstance(node, ast.BinOp):
            left = _eval(node.left)
            right = _eval(node.right)
            return allowed_operators[type(node.op)](left, right)
        elif isinstance(node, ast.UnaryOp):
            operand = _eval(node.operand)
            return allowed_operators[type(node.op)](operand)
        else:
            raise ValueError(f"Unsupported operation: {type(node)}")

    try:
        tree = ast.parse(expression, mode='eval')
        return str(_eval(tree.body))
    except Exception as e:
        return f"Error: {e}"

tools = [search_database, calculate]

# Create checkpointer for memory persistence
checkpointer = MemorySaver()

# Create ReAct agent
agent = create_react_agent(
    llm,
    tools,
    checkpointer=checkpointer
)

# Run agent with thread ID for memory
config = {"configurable": {"thread_id": "user-123"}}
result = await agent.ainvoke(
    {"messages": [("user", "Search for Python tutorials and calculate 25 * 4")]},
    config=config
)
```

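The AST-based evaluator is plain Python and can be exercised on its own before wiring it into an agent. A condensed standalone copy of the same logic:

```python
import ast
import operator

# Whitelist of permitted AST operator nodes
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg,
}

def safe_calc(expression: str) -> str:
    """Evaluate arithmetic without exec/eval; names, calls, and attributes are rejected."""
    def _eval(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported operation: {type(node).__name__}")
    try:
        return str(_eval(ast.parse(expression, mode="eval").body))
    except Exception as e:
        return f"Error: {e}"

print(safe_calc("(2 + 3) * 4"))       # 20
print(safe_calc("__import__('os')"))  # Error: Unsupported operation: Call
```

Because only the whitelisted node types are walked, injection attempts surface as `Call`, `Name`, or `Attribute` nodes and fail closed.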
## Architecture Patterns

### Pattern 1: RAG with LangGraph

```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from typing import TypedDict, Annotated

class RAGState(TypedDict):
    question: str
    context: Annotated[list[Document], "retrieved documents"]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Define nodes
async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate answer from context."""
    prompt = ChatPromptTemplate.from_template(
        """Answer based on the context below. If you cannot answer, say so.

Context: {context}

Question: {question}

Answer:"""
    )
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    response = await llm.ainvoke(
        prompt.format(context=context_text, question=state["question"])
    )
    return {"answer": response.content}

# Build graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use the chain
result = await rag_chain.ainvoke({"question": "What is the main topic?"})
```

### Pattern 2: Custom Agent with Structured Tools

```python
from langchain_core.tools import StructuredTool
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    """Input for database search."""
    query: str = Field(description="Search query")
    filters: dict = Field(default={}, description="Optional filters")

class EmailInput(BaseModel):
    """Input for sending email."""
    recipient: str = Field(description="Email recipient")
    subject: str = Field(description="Email subject")
    content: str = Field(description="Email body")

async def search_database(query: str, filters: dict = {}) -> str:
    """Search internal database for information."""
    # Your database search logic
    return f"Results for '{query}' with filters {filters}"

async def send_email(recipient: str, subject: str, content: str) -> str:
    """Send an email to specified recipient."""
    # Email sending logic
    return f"Email sent to {recipient}"

tools = [
    StructuredTool.from_function(
        coroutine=search_database,
        name="search_database",
        description="Search internal database",
        args_schema=SearchInput
    ),
    StructuredTool.from_function(
        coroutine=send_email,
        name="send_email",
        description="Send an email",
        args_schema=EmailInput
    )
]

agent = create_react_agent(llm, tools)
```

### Pattern 3: Multi-Step Workflow with StateGraph

```python
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Literal

class WorkflowState(TypedDict):
    text: str
    entities: list
    analysis: str
    summary: str
    current_step: str

async def extract_entities(state: WorkflowState) -> WorkflowState:
    """Extract key entities from text."""
    prompt = f"Extract key entities from: {state['text']}\n\nReturn as JSON list."
    response = await llm.ainvoke(prompt)
    return {"entities": response.content, "current_step": "analyze"}

async def analyze_entities(state: WorkflowState) -> WorkflowState:
    """Analyze extracted entities."""
    prompt = f"Analyze these entities: {state['entities']}\n\nProvide insights."
    response = await llm.ainvoke(prompt)
    return {"analysis": response.content, "current_step": "summarize"}

async def generate_summary(state: WorkflowState) -> WorkflowState:
    """Generate final summary."""
    prompt = f"""Summarize:
Entities: {state['entities']}
Analysis: {state['analysis']}

Provide a concise summary."""
    response = await llm.ainvoke(prompt)
    return {"summary": response.content, "current_step": "complete"}

def route_step(state: WorkflowState) -> Literal["analyze", "summarize", "end"]:
    """Route to next step based on current state."""
    step = state.get("current_step", "extract")
    if step == "analyze":
        return "analyze"
    elif step == "summarize":
        return "summarize"
    return "end"

# Build workflow
builder = StateGraph(WorkflowState)
builder.add_node("extract", extract_entities)
builder.add_node("analyze", analyze_entities)
builder.add_node("summarize", generate_summary)

builder.add_edge(START, "extract")
builder.add_conditional_edges("extract", route_step, {
    "analyze": "analyze",
    "summarize": "summarize",
    "end": END
})
builder.add_conditional_edges("analyze", route_step, {
    "summarize": "summarize",
    "end": END
})
builder.add_edge("summarize", END)

workflow = builder.compile()
```

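Because the routing function is pure Python, it can be unit-tested without invoking any model or compiling the graph. A standalone copy of the router sketched above:

```python
def route_step(state: dict) -> str:
    """Pick the next node name from the workflow state."""
    step = state.get("current_step", "extract")
    if step == "analyze":
        return "analyze"
    elif step == "summarize":
        return "summarize"
    return "end"

print(route_step({"current_step": "analyze"}))    # analyze
print(route_step({"current_step": "summarize"}))  # summarize
print(route_step({}))                             # end
```

Keeping routing logic side-effect free like this makes conditional edges trivially testable.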
### Pattern 4: Multi-Agent Orchestration

```python
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage
from typing import Literal, TypedDict

class MultiAgentState(TypedDict):
    messages: list
    next_agent: str

# Create specialized agents
researcher = create_react_agent(llm, research_tools)
writer = create_react_agent(llm, writing_tools)
reviewer = create_react_agent(llm, review_tools)

async def supervisor(state: MultiAgentState) -> MultiAgentState:
    """Route to appropriate agent based on task."""
    prompt = f"""Based on the conversation, which agent should handle this?

Options:
- researcher: For finding information
- writer: For creating content
- reviewer: For reviewing and editing
- FINISH: Task is complete

Messages: {state['messages']}

Respond with just the agent name."""

    response = await llm.ainvoke(prompt)
    return {"next_agent": response.content.strip().lower()}

def route_to_agent(state: MultiAgentState) -> Literal["researcher", "writer", "reviewer", "end"]:
    """Route based on supervisor decision."""
    next_agent = state.get("next_agent", "").lower()
    if next_agent == "finish":
        return "end"
    return next_agent if next_agent in ["researcher", "writer", "reviewer"] else "end"

# Build multi-agent graph
builder = StateGraph(MultiAgentState)
builder.add_node("supervisor", supervisor)
builder.add_node("researcher", researcher)
builder.add_node("writer", writer)
builder.add_node("reviewer", reviewer)

builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route_to_agent, {
    "researcher": "researcher",
    "writer": "writer",
    "reviewer": "reviewer",
    "end": END
})

# Each agent returns to supervisor
for agent in ["researcher", "writer", "reviewer"]:
    builder.add_edge(agent, "supervisor")

multi_agent = builder.compile()
```

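The supervisor's routing decision can likewise be tested in isolation. A standalone copy that normalizes the model's free-text reply and falls back to ending the run on anything unexpected:

```python
def route_to_agent(state: dict) -> str:
    """Map the supervisor's free-text choice onto a graph edge."""
    next_agent = state.get("next_agent", "").lower()
    if next_agent == "finish":
        return "end"
    return next_agent if next_agent in ["researcher", "writer", "reviewer"] else "end"

print(route_to_agent({"next_agent": "Writer"}))  # writer
print(route_to_agent({"next_agent": "FINISH"}))  # end
print(route_to_agent({"next_agent": "banana"}))  # end
```

Defaulting to "end" on unrecognized output is the design choice that keeps a misbehaving supervisor from looping forever.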

## Memory Management

### Checkpointer-Based Memory with LangGraph

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

# In-memory checkpointer (development)
checkpointer = MemorySaver()

# Create agent with persistent memory
agent = create_react_agent(llm, tools, checkpointer=checkpointer)

# Each thread_id maintains a separate conversation
config = {"configurable": {"thread_id": "session-abc123"}}

# Messages persist across invocations with the same thread_id
result1 = await agent.ainvoke({"messages": [("user", "My name is Alice")]}, config)
result2 = await agent.ainvoke({"messages": [("user", "What's my name?")]}, config)
# Agent remembers: "Your name is Alice"
```

### Production Memory with PostgreSQL

```python
from langgraph.checkpoint.postgres import PostgresSaver

# Production checkpointer
checkpointer = PostgresSaver.from_conn_string(
    "postgresql://user:pass@localhost/langgraph"
)

agent = create_react_agent(llm, tools, checkpointer=checkpointer)
```

### Vector Store Memory for Long-Term Context

```python
from langchain_community.vectorstores import Chroma
from langchain_voyageai import VoyageAIEmbeddings

embeddings = VoyageAIEmbeddings(model="voyage-3-large")
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
    persist_directory="./memory_db"
)

async def retrieve_relevant_memory(query: str, k: int = 5) -> list:
    """Retrieve relevant past conversations."""
    docs = await memory_store.asimilarity_search(query, k=k)
    return [doc.page_content for doc in docs]

async def store_memory(content: str, metadata: dict = {}):
    """Store conversation in long-term memory."""
    await memory_store.aadd_texts([content], metadatas=[metadata])
```

## Callback System & LangSmith

### LangSmith Tracing

```python
import os
from langchain_anthropic import ChatAnthropic

# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# All LangChain/LangGraph operations are automatically traced
llm = ChatAnthropic(model="claude-sonnet-4-5")
```

### Custom Callback Handler

```python
from langchain_core.callbacks import BaseCallbackHandler
from typing import Any, Dict, List

class CustomCallbackHandler(BaseCallbackHandler):
    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs
    ) -> None:
        print(f"LLM started with {len(prompts)} prompts")

    def on_llm_end(self, response, **kwargs) -> None:
        print(f"LLM completed: {len(response.generations)} generations")

    def on_llm_error(self, error: Exception, **kwargs) -> None:
        print(f"LLM error: {error}")

    def on_tool_start(
        self, serialized: Dict[str, Any], input_str: str, **kwargs
    ) -> None:
        print(f"Tool started: {serialized.get('name')}")

    def on_tool_end(self, output: str, **kwargs) -> None:
        print(f"Tool completed: {output[:100]}...")

# Use callbacks
result = await agent.ainvoke(
    {"messages": [("user", "query")]},
    config={"callbacks": [CustomCallbackHandler()]}
)
```

## Streaming Responses

```python
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-5", streaming=True)

# Stream tokens
async for chunk in llm.astream("Tell me a story"):
    print(chunk.content, end="", flush=True)

# Stream agent events
async for event in agent.astream_events(
    {"messages": [("user", "Search and summarize")]},
    version="v2"
):
    if event["event"] == "on_chat_model_stream":
        print(event["data"]["chunk"].content, end="")
    elif event["event"] == "on_tool_start":
        print(f"\n[Using tool: {event['name']}]")
```

## Testing Strategies

```python
import pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_agent_tool_selection():
    """Test agent selects correct tool."""
    with patch.object(llm, 'ainvoke') as mock_llm:
        mock_llm.return_value = AsyncMock(content="Using search_database")

        result = await agent.ainvoke({
            "messages": [("user", "search for documents")]
        })

        # Verify tool was called
        assert "search_database" in str(result)

@pytest.mark.asyncio
async def test_memory_persistence():
    """Test memory persists across invocations."""
    config = {"configurable": {"thread_id": "test-thread"}}

    # First message
    await agent.ainvoke(
        {"messages": [("user", "Remember: the code is 12345")]},
        config
    )

    # Second message should remember
    result = await agent.ainvoke(
        {"messages": [("user", "What was the code?")]},
        config
    )

    assert "12345" in result["messages"][-1].content
```

## Performance Optimization
|
||||
|
||||
### 1. Caching
|
||||
```python
|
||||
from langchain.cache import InMemoryCache
|
||||
import langchain
|
||||
### 1. Caching with Redis
|
||||
|
||||
langchain.llm_cache = InMemoryCache()
|
||||
```python
|
||||
from langchain_community.cache import RedisCache
|
||||
from langchain_core.globals import set_llm_cache
|
||||
import redis
|
||||
|
||||
redis_client = redis.Redis.from_url("redis://localhost:6379")
|
||||
set_llm_cache(RedisCache(redis_client))
|
||||
```
|
||||
|
||||
### 2. Async Batch Processing

```python
# Process multiple documents in parallel
import asyncio
from langchain_core.documents import Document

async def process_documents(documents: list[Document]) -> list:
    """Process documents in parallel."""
    tasks = [process_single(doc) for doc in documents]
    return await asyncio.gather(*tasks)

async def process_single(doc: Document) -> dict:
    """Process a single document."""
    chunks = text_splitter.split_documents([doc])
    embeddings = await embeddings_model.aembed_documents(
        [c.page_content for c in chunks]
    )
    return {"doc_id": doc.metadata.get("id"), "embeddings": embeddings}
```

### 3. Connection Pooling

```python
import os

from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

# Reuse a single Pinecone client across requests
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("my-index")

# Create vector store with the existing index
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```

## Resources

- **references/agents.md**: Deep dive on agent architectures
- **references/memory.md**: Memory system patterns
- **references/chains.md**: Chain composition strategies
- **references/document-processing.md**: Document loading and indexing
- **references/callbacks.md**: Monitoring and observability
- **assets/agent-template.py**: Production-ready agent template
- **assets/memory-config.yaml**: Memory configuration examples
- **assets/chain-example.py**: Complex chain examples
- [LangChain Documentation](https://python.langchain.com/docs/)
- [LangGraph Documentation](https://langchain-ai.github.io/langgraph/)
- [LangSmith Platform](https://smith.langchain.com/)
- [LangChain GitHub](https://github.com/langchain-ai/langchain)
- [LangGraph GitHub](https://github.com/langchain-ai/langgraph)

## Common Pitfalls

1. **Using Deprecated APIs**: Use LangGraph for agents, not `initialize_agent`
2. **Memory Overflow**: Use checkpointers with TTL for long-running agents
3. **Poor Tool Descriptions**: Clear descriptions help the LLM select the correct tools
4. **Context Window Exceeded**: Use summarization or sliding-window memory
5. **No Error Handling**: Wrap tool functions with try/except
6. **Blocking Operations**: Use async methods (`ainvoke`, `astream`)
7. **Missing Observability**: Always enable LangSmith tracing in production

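Pitfalls 5 and 6 can be handled with one small wrapper. The sketch below is illustrative, not a LangChain API: `call_tool_safely` and `flaky_search` are hypothetical names. It bounds every async tool call with a timeout and converts exceptions into a string the agent can recover from instead of crashing the graph.

```python
import asyncio

async def call_tool_safely(tool_fn, *args, timeout: float = 10.0, fallback: str = "Tool unavailable"):
    """Run a (possibly slow) async tool with a timeout and a safe fallback."""
    try:
        return await asyncio.wait_for(tool_fn(*args), timeout=timeout)
    except asyncio.TimeoutError:
        return f"{fallback}: timed out after {timeout}s"
    except Exception as exc:  # surface the error to the LLM instead of raising
        return f"{fallback}: {exc}"

async def flaky_search(query: str) -> str:
    # Stand-in for a real tool that may hang or raise
    if not query:
        raise ValueError("empty query")
    return f"results for {query}"

result = asyncio.run(call_tool_safely(flaky_search, "langgraph"))
error = asyncio.run(call_tool_safely(flaky_search, ""))
```

The agent still receives a well-formed tool message on failure, which it can reason about or retry.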
## Production Checklist

- [ ] Use LangGraph StateGraph for agent orchestration
- [ ] Implement async patterns throughout (`ainvoke`, `astream`)
- [ ] Add production checkpointer (PostgreSQL, Redis)
- [ ] Enable LangSmith tracing
- [ ] Implement structured tools with Pydantic schemas
- [ ] Add timeout limits for agent execution
- [ ] Implement rate limiting
- [ ] Add input validation
- [ ] Add request/response logging
- [ ] Monitor token usage and costs
- [ ] Test with edge cases
- [ ] Implement fallback strategies
- [ ] Add comprehensive error handling
- [ ] Set up health checks
- [ ] Version control prompts and configurations
- [ ] Write integration tests for agent workflows

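For the context-window items above, a sliding-window trim can be sketched in plain Python. This is a minimal illustration under stated assumptions: `trim_messages` is a hypothetical helper, and a production version would count tokens with the model's tokenizer rather than characters.

```python
def trim_messages(messages: list[tuple[str, str]], max_chars: int = 2000) -> list[tuple[str, str]]:
    """Keep the most recent messages that fit within a rough character budget."""
    kept: list[tuple[str, str]] = []
    total = 0
    # Walk backwards from the newest message, stopping once the budget is spent
    for role, content in reversed(messages):
        if total + len(content) > max_chars:
            break
        kept.append((role, content))
        total += len(content)
    return list(reversed(kept))

history = [("user", "a" * 1500), ("assistant", "b" * 400), ("user", "c" * 300)]
trimmed = trim_messages(history, max_chars=800)
```

Here the oldest 1500-character message is dropped and the two most recent turns are kept in order.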
Use stronger LLMs to evaluate weaker model outputs.

## Quick Start

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

# Define evaluation suite
@dataclass
class Metric:
    name: str
    fn: Callable

    @staticmethod
    def accuracy():
        return Metric("accuracy", calculate_accuracy)

    @staticmethod
    def bleu():
        return Metric("bleu", calculate_bleu)

    @staticmethod
    def bertscore():
        return Metric("bertscore", calculate_bertscore)

    @staticmethod
    def custom(name: str, fn: Callable):
        return Metric(name, fn)

class EvaluationSuite:
    def __init__(self, metrics: list[Metric]):
        self.metrics = metrics

    async def evaluate(self, model, test_cases: list[dict]) -> dict:
        results = {m.name: [] for m in self.metrics}

        for test in test_cases:
            prediction = await model.predict(test["input"])

            for metric in self.metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=test.get("expected"),
                    context=test.get("context")
                )
                results[metric.name].append(score)

        return {
            "metrics": {k: np.mean(v) for k, v in results.items()},
            "raw_scores": results
        }

# Usage
suite = EvaluationSuite([
    Metric.accuracy(),
    Metric.bleu(),
    Metric.bertscore(),
    Metric.custom("groundedness", check_groundedness)
])

# Prepare test cases
test_cases = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "context": "France is a country in Europe. Paris is its capital."
    },
    # ... more test cases
]

# Run evaluation
results = await suite.evaluate(model=your_model, test_cases=test_cases)
```

## Automated Metrics Implementation

### BLEU Score

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def calculate_bleu(reference: str, hypothesis: str, **kwargs) -> float:
    """Calculate BLEU score between reference and hypothesis."""
    smoothie = SmoothingFunction().method4

    return sentence_bleu(
        [reference.split()],
        hypothesis.split(),
        smoothing_function=smoothie
    )

# Usage
bleu = calculate_bleu(
    reference="The cat sat on the mat",
    hypothesis="A cat is sitting on the mat"
)
```

### ROUGE Score

```python
from rouge_score import rouge_scorer

def calculate_rouge(reference: str, hypothesis: str, **kwargs) -> dict:
    """Calculate ROUGE scores."""
    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, hypothesis)

    return {
        "rouge1": scores['rouge1'].fmeasure,
        "rouge2": scores['rouge2'].fmeasure,
        "rougeL": scores['rougeL'].fmeasure
    }
```

### BERTScore

```python
from bert_score import score

def calculate_bertscore(
    references: list[str],
    hypotheses: list[str],
    **kwargs
) -> dict:
    """Calculate BERTScore using pre-trained model."""
    P, R, F1 = score(
        hypotheses,
        references,
        lang="en"
    )

    return {
        "precision": P.mean().item(),
        "recall": R.mean().item(),
        "f1": F1.mean().item()
    }
```

### Custom Metrics

```python
def calculate_groundedness(response: str, context: str, **kwargs) -> float:
    """Check if response is grounded in provided context."""
    # Use an NLI model to check entailment
    from transformers import pipeline

    nli = pipeline(
        "text-classification",
        model="microsoft/deberta-large-mnli"
    )

    result = nli(f"{context} [SEP] {response}")[0]

    # Return confidence that the response is entailed by the context
    return result['score'] if result['label'] == 'ENTAILMENT' else 0.0

def calculate_toxicity(text: str, **kwargs) -> float:
    """Measure toxicity in generated text."""
    from detoxify import Detoxify

    results = Detoxify('original').predict(text)
    return max(results.values())  # Return highest toxicity score

def calculate_factuality(claim: str, sources: list[str], **kwargs) -> float:
    """Verify factual claims against sources."""
    from transformers import pipeline

    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    scores = []
    for source in sources:
        result = nli(f"{source}</s></s>{claim}")[0]
        if result['label'] == 'entailment':
            scores.append(result['score'])

    return max(scores) if scores else 0.0
```

## LLM-as-Judge Patterns

### Single Output Evaluation

```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

class QualityRating(BaseModel):
    accuracy: int = Field(ge=1, le=10, description="Factual correctness")
    helpfulness: int = Field(ge=1, le=10, description="Answers the question")
    clarity: int = Field(ge=1, le=10, description="Well-written and understandable")
    reasoning: str = Field(description="Brief explanation")

async def llm_judge_quality(
    response: str,
    question: str,
    context: str = None
) -> QualityRating:
    """Use Claude to judge response quality."""
    client = Anthropic()

    system = """You are an expert evaluator of AI responses.
Rate responses on accuracy, helpfulness, and clarity (1-10 scale).
Provide brief reasoning for your ratings."""

    prompt = f"""Rate the following response:

Question: {question}
{f'Context: {context}' if context else ''}
Response: {response}

Provide ratings in JSON format:
{{
    "accuracy": <1-10>,
    "helpfulness": <1-10>,
    "clarity": <1-10>,
    "reasoning": "<brief explanation>"
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        system=system,
        messages=[{"role": "user", "content": prompt}]
    )

    return QualityRating(**json.loads(message.content[0].text))
```

### Pairwise Comparison

```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
from typing import Literal
import json

class ComparisonResult(BaseModel):
    winner: Literal["A", "B", "tie"]
    reasoning: str
    confidence: int = Field(ge=1, le=10)

async def compare_responses(
    question: str,
    response_a: str,
    response_b: str
) -> ComparisonResult:
    """Compare two responses using an LLM judge."""
    client = Anthropic()

    prompt = f"""Compare these two responses and determine which is better.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Consider accuracy, helpfulness, and clarity.

Answer with JSON:
{{
    "winner": "A" or "B" or "tie",
    "reasoning": "<explanation>",
    "confidence": <1-10>
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ComparisonResult(**json.loads(message.content[0].text))
```

### Reference-Based Evaluation

```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
import json

class ReferenceEvaluation(BaseModel):
    semantic_similarity: float = Field(ge=0, le=1)
    factual_accuracy: float = Field(ge=0, le=1)
    completeness: float = Field(ge=0, le=1)
    issues: list[str]

async def evaluate_against_reference(
    response: str,
    reference: str,
    question: str
) -> ReferenceEvaluation:
    """Evaluate response against gold standard reference."""
    client = Anthropic()

    prompt = f"""Compare the response to the reference answer.

Question: {question}
Reference Answer: {reference}
Response to Evaluate: {response}

Evaluate:
1. Semantic similarity (0-1): How similar is the meaning?
2. Factual accuracy (0-1): Are all facts correct?
3. Completeness (0-1): Does it cover all key points?
4. List any specific issues or errors.

Respond in JSON:
{{
    "semantic_similarity": <0-1>,
    "factual_accuracy": <0-1>,
    "completeness": <0-1>,
    "issues": ["issue1", "issue2"]
}}"""

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )

    return ReferenceEvaluation(**json.loads(message.content[0].text))
```

## Human Evaluation Frameworks

### Annotation Guidelines

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationTask:
    """Structure for human annotation task."""
    response: str
    question: str
    context: Optional[str] = None

    def get_annotation_form(self) -> dict:
        return {
            "question": self.question,
            "context": self.context,
            "response": self.response,
        }
```

### Inter-Rater Agreement

```python
from sklearn.metrics import cohen_kappa_score

def calculate_agreement(
    rater1_scores: list[int],
    rater2_scores: list[int]
) -> dict:
    """Calculate inter-rater agreement."""
    kappa = cohen_kappa_score(rater1_scores, rater2_scores)

    if kappa < 0:
        interpretation = "Poor"
    elif kappa < 0.2:
        interpretation = "Slight"
    elif kappa < 0.4:
        interpretation = "Fair"
    elif kappa < 0.6:
        interpretation = "Moderate"
    elif kappa < 0.8:
        interpretation = "Substantial"
    else:
        interpretation = "Almost Perfect"

    return {
        "kappa": kappa,
        "interpretation": interpretation
    }
```

## A/B Testing

```python
from scipy import stats
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ABTest:
    variant_a_name: str = "A"
    variant_b_name: str = "B"
    variant_a_scores: list[float] = field(default_factory=list)
    variant_b_scores: list[float] = field(default_factory=list)

    def add_result(self, variant: str, score: float):
        """Add evaluation result for a variant."""
        if variant == "A":
            self.variant_a_scores.append(score)
        else:
            self.variant_b_scores.append(score)

    def analyze(self, alpha: float = 0.05) -> dict:
        """Perform statistical analysis."""
        a_scores = np.array(self.variant_a_scores)
        b_scores = np.array(self.variant_b_scores)

        # T-test
        t_stat, p_value = stats.ttest_ind(a_scores, b_scores)

        # Effect size (Cohen's d with pooled standard deviation)
        pooled_std = np.sqrt((np.var(a_scores) + np.var(b_scores)) / 2)
        cohens_d = (np.mean(b_scores) - np.mean(a_scores)) / pooled_std

        return {
            "mean_a": float(np.mean(a_scores)),
            "mean_b": float(np.mean(b_scores)),
            "p_value": p_value,
            "statistically_significant": p_value < alpha,
            "cohens_d": cohens_d,
            "effect_size": self._interpret_cohens_d(cohens_d),
            "winner": self.variant_b_name if np.mean(b_scores) > np.mean(a_scores) else self.variant_a_name
        }

    @staticmethod
    def _interpret_cohens_d(d: float) -> str:
        """Interpret Cohen's d effect size."""
        abs_d = abs(d)
        if abs_d < 0.2:
            return "negligible"
        elif abs_d < 0.5:
            return "small"
        elif abs_d < 0.8:
            return "medium"
        else:
            return "large"
```

### Regression Detection

```python
from dataclasses import dataclass

@dataclass
class RegressionResult:
    metric: str
    baseline: float
    current: float
    change: float
    is_regression: bool

class RegressionDetector:
    def __init__(self, baseline_results: dict, threshold: float = 0.05):
        self.baseline = baseline_results
        self.threshold = threshold

    def check_for_regression(self, new_results: dict) -> dict:
        """Detect if new results show regression."""
        regressions = []

        for metric, new_score in new_results.items():
            baseline_score = self.baseline.get(metric)
            if baseline_score is None:
                continue

            relative_change = (new_score - baseline_score) / baseline_score

            # Flag if significant decrease
            is_regression = relative_change < -self.threshold
            if is_regression:
                regressions.append(RegressionResult(
                    metric=metric,
                    baseline=baseline_score,
                    current=new_score,
                    change=relative_change,
                    is_regression=True
                ))

        return {
            "has_regression": len(regressions) > 0,
            "regressions": regressions,
            "summary": f"{len(regressions)} metric(s) regressed"
        }
```

## LangSmith Evaluation Integration

```python
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Initialize LangSmith client
client = Client()

# Create dataset
dataset = client.create_dataset("qa_test_cases")
client.create_examples(
    inputs=[{"question": q} for q in questions],
    outputs=[{"answer": a} for a in expected_answers],
    dataset_id=dataset.id
)

# Define evaluators
evaluators = [
    LangChainStringEvaluator("qa"),          # QA correctness
    LangChainStringEvaluator("context_qa"),  # Context-grounded QA
    LangChainStringEvaluator("cot_qa"),      # Chain-of-thought QA
]

# Run evaluation
async def target_function(inputs: dict) -> dict:
    result = await your_chain.ainvoke(inputs)
    return {"answer": result}

experiment_results = await evaluate(
    target_function,
    data=dataset.name,
    evaluators=evaluators,
    experiment_prefix="v1.0.0",
    metadata={"model": "claude-sonnet-4-5", "version": "1.0.0"}
)

print(f"Mean score: {experiment_results.aggregate_metrics['qa']['mean']}")
```

## Benchmarking

### Running Benchmarks

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BenchmarkResult:
    metric: str
    mean: float
    std: float
    min: float
    max: float

class BenchmarkRunner:
    def __init__(self, benchmark_dataset: list[dict]):
        self.dataset = benchmark_dataset

    async def run_benchmark(
        self,
        model,
        metrics: list[Metric]
    ) -> dict[str, BenchmarkResult]:
        """Run model on benchmark and calculate metrics."""
        results = {metric.name: [] for metric in metrics}

        for example in self.dataset:
            # Generate prediction
            prediction = await model.predict(example["input"])

            # Calculate each metric
            for metric in metrics:
                score = metric.fn(
                    prediction=prediction,
                    reference=example["reference"],
                    context=example.get("context")
                )
                results[metric.name].append(score)

        # Aggregate results
        return {
            metric: BenchmarkResult(
                metric=metric,
                mean=np.mean(scores),
                std=np.std(scores),
                min=min(scores),
                max=max(scores)
            )
            for metric, scores in results.items()
        }
```

## Resources

- **references/metrics.md**: Comprehensive metric guide
- **references/human-evaluation.md**: Annotation best practices
- **references/benchmarking.md**: Standard benchmarks
- **references/a-b-testing.md**: Statistical testing guide
- **references/regression-testing.md**: CI/CD integration
- **assets/evaluation-framework.py**: Complete evaluation harness
- **assets/benchmark-dataset.jsonl**: Example datasets
- **scripts/evaluate-model.py**: Automated evaluation runner
- [LangSmith Evaluation Guide](https://docs.smith.langchain.com/evaluation)
- [RAGAS Framework](https://docs.ragas.io/)
- [DeepEval Library](https://docs.deepeval.com/)
- [Arize Phoenix](https://docs.arize.com/phoenix/)
- [HELM Benchmark](https://crfm.stanford.edu/helm/)

## Best Practices

### Common Pitfalls

- **Data Contamination**: Testing on training data
- **Ignoring Variance**: Not accounting for statistical uncertainty
- **Metric Mismatch**: Using metrics not aligned with business goals
- **Position Bias**: In pairwise evals, randomize response order
- **Overfitting Prompts**: Optimizing for the test set instead of real use

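To counter position bias concretely, a common approach is to randomize which response appears first and map the judge's verdict back afterwards. A minimal sketch with plain Python (the helper names are illustrative, not part of any library):

```python
import random

def randomized_pair(response_a: str, response_b: str, rng: random.Random) -> tuple[str, str, bool]:
    """Randomize presentation order for a pairwise eval to counter position bias.

    Returns (first, second, swapped); the caller shows `first` as "A" to the judge.
    """
    swapped = rng.random() < 0.5
    return (response_b, response_a, True) if swapped else (response_a, response_b, False)

def unswap_verdict(verdict: str, swapped: bool) -> str:
    """Translate the judge's positional verdict back to the original labels."""
    if verdict == "tie" or not swapped:
        return verdict
    return "A" if verdict == "B" else "B"

rng = random.Random(42)
first, second, swapped = randomized_pair("resp-a", "resp-b", rng)
```

Record only the unswapped verdict so position effects average out across the dataset.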
Master advanced prompt engineering techniques to maximize LLM performance and reliability.

- Creating reusable prompt templates with variable interpolation
- Debugging and refining prompts that produce inconsistent outputs
- Implementing system prompts for specialized AI assistants
- Using structured outputs (JSON mode) for reliable parsing

## Core Capabilities

- Self-consistency techniques (sampling multiple reasoning paths)
- Verification and validation steps

### 3. Structured Outputs
- JSON mode for reliable parsing
- Pydantic schema enforcement
- Type-safe response handling
- Error handling for malformed outputs

### 4. Prompt Optimization
- Iterative refinement workflows
- A/B testing prompt variations
- Measuring prompt performance metrics (accuracy, consistency, latency)
- Reducing token usage while maintaining quality
- Handling edge cases and failure modes

### 5. Template Systems
- Variable interpolation and formatting
- Conditional prompt sections
- Multi-turn conversation templates
- Role-based prompt composition
- Modular prompt components

### 6. System Prompt Design
- Setting model behavior and constraints
- Defining output formats and structure
- Establishing role and expertise

## Quick Start

```python
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Define structured output schema
class SQLQuery(BaseModel):
    query: str = Field(description="The SQL query")
    explanation: str = Field(description="Brief explanation of what the query does")
    tables_used: list[str] = Field(description="List of tables referenced")

# Initialize model with structured output
llm = ChatAnthropic(model="claude-sonnet-4-5")
structured_llm = llm.with_structured_output(SQLQuery)

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an expert SQL developer. Generate efficient, secure SQL queries.
Always use parameterized queries to prevent SQL injection.
Explain your reasoning briefly."""),
    ("user", "Convert this to SQL: {query}")
])

# Create chain
chain = prompt | structured_llm

# Use
result = await chain.ainvoke({
    "query": "Find all users who registered in the last 30 days"
})
print(result.query)
print(result.explanation)
```

## Key Patterns

### Pattern 1: Structured Output with Pydantic

```python
from anthropic import Anthropic
from pydantic import BaseModel, Field
from typing import Literal
import json

class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0, le=1)
    key_phrases: list[str]
    reasoning: str

async def analyze_sentiment(text: str) -> SentimentAnalysis:
    """Analyze sentiment with structured output."""
    client = Anthropic()

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Analyze the sentiment of this text.

Text: {text}

Respond with JSON matching this schema:
{{
    "sentiment": "positive" | "negative" | "neutral",
    "confidence": 0.0-1.0,
    "key_phrases": ["phrase1", "phrase2"],
    "reasoning": "brief explanation"
}}"""
        }]
    )

    return SentimentAnalysis(**json.loads(message.content[0].text))
```

### Pattern 2: Chain-of-Thought with Self-Verification

```python
from langchain_core.prompts import ChatPromptTemplate

cot_prompt = ChatPromptTemplate.from_template("""
Solve this problem step by step.

Problem: {problem}

Instructions:
1. Break down the problem into clear steps
2. Work through each step showing your reasoning
3. State your final answer
4. Verify your answer by checking it against the original problem

Format your response as:
## Steps
[Your step-by-step reasoning]

## Answer
[Your final answer]

## Verification
[Check that your answer is correct]
""")
```

### Pattern 3: Few-Shot with Dynamic Example Selection
|
||||
|
||||
```python
|
||||
from langchain_voyageai import VoyageAIEmbeddings
|
||||
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
|
||||
from langchain_chroma import Chroma
|
||||
|
||||
# Create example selector with semantic similarity
|
||||
example_selector = SemanticSimilarityExampleSelector.from_examples(
|
||||
examples=[
|
||||
{"input": "How do I reset my password?", "output": "Go to Settings > Security > Reset Password"},
|
||||
{"input": "Where can I see my order history?", "output": "Navigate to Account > Orders"},
|
||||
{"input": "How do I contact support?", "output": "Click Help > Contact Us or email support@example.com"},
|
||||
],
|
||||
embeddings=VoyageAIEmbeddings(model="voyage-3-large"),
|
||||
vectorstore_cls=Chroma,
|
||||
k=2 # Select 2 most similar examples
|
||||
)
|
||||
|
||||
async def get_few_shot_prompt(query: str) -> str:
|
||||
"""Build prompt with dynamically selected examples."""
|
||||
examples = await example_selector.aselect_examples({"input": query})
|
||||
|
||||
examples_text = "\n".join(
|
||||
f"User: {ex['input']}\nAssistant: {ex['output']}"
|
||||
for ex in examples
|
||||
)
|
||||
|
||||
return f"""You are a helpful customer support assistant.
|
||||
|
||||
Here are some example interactions:
|
||||
{examples_text}
|
||||
|
||||
Now respond to this query:
|
||||
User: {query}
|
||||
Assistant:"""
|
||||
```

### Pattern 4: Progressive Disclosure

Start with simple prompts, add complexity only when needed:

```python
PROMPT_LEVELS = {
    # Level 1: Direct instruction
    "simple": "Summarize this article: {text}",

    # Level 2: Add constraints
    "constrained": """Summarize this article in 3 bullet points, focusing on:
- Key findings
- Main conclusions
- Practical implications

Article: {text}""",

    # Level 3: Add reasoning
    "reasoning": """Read this article carefully.
1. First, identify the main topic and thesis
2. Then, extract the key supporting points
3. Finally, summarize in 3 bullet points

Article: {text}

Summary:""",

    # Level 4: Add examples
    "few_shot": """Read articles and provide concise summaries.

Example:
Article: "New research shows that regular exercise can reduce anxiety by up to 40%..."
Summary:
• Regular exercise reduces anxiety by up to 40%
• 30 minutes of moderate activity 3x/week is sufficient
• Benefits appear within 2 weeks of starting

Now summarize this article:
Article: {text}

Summary:"""
}
```
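One way to drive these levels from code is to escalate only when a cheaper prompt fails a quality check. A minimal sketch, where `generate` and `validate` are hypothetical stand-ins for an LLM call and a quality gate, and the level dict is abridged to two entries:

```python
# Abridged levels; the full dict lives in PROMPT_LEVELS above.
LEVELS = {
    "simple": "Summarize this article: {text}",
    "constrained": "Summarize this article in 3 bullet points:\n\n{text}",
}

def escalate(text: str, generate, validate) -> tuple[str, str]:
    """Try levels from cheapest to richest; return the first passing output."""
    output = ""
    for level in ("simple", "constrained"):
        output = generate(LEVELS[level].format(text=text))
        if validate(output):
            return level, output
    return "constrained", output  # richest level's output as the fallback

# Toy stand-ins: this "LLM" only produces bullets for the constrained prompt.
fake_llm = lambda p: "• point" if "bullet" in p else "blob"
level, out = escalate("some text", fake_llm, lambda o: o.startswith("•"))
print(level, out)  # constrained • point
```

The same loop generalizes to any number of levels; the key design choice is that validation cost stays far below generation cost.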

### Pattern 5: Error Recovery and Fallback

Build prompts that gracefully handle failures: include fallback instructions, request confidence scores, ask for alternative interpretations when uncertain, and specify how to indicate missing information.

```python
from pydantic import BaseModel, ValidationError
import json

class ResponseWithConfidence(BaseModel):
    answer: str
    confidence: float
    sources: list[str]
    alternative_interpretations: list[str] = []

ERROR_RECOVERY_PROMPT = """
Answer the question based on the context provided.

Context: {context}
Question: {question}

Instructions:
1. If you can answer confidently (>0.8), provide a direct answer
2. If you're somewhat confident (0.5-0.8), provide your best answer with caveats
3. If you're uncertain (<0.5), explain what information is missing
4. Always provide alternative interpretations if the question is ambiguous

Respond in JSON:
{{
    "answer": "your answer or 'I cannot determine this from the context'",
    "confidence": 0.0-1.0,
    "sources": ["relevant context excerpts"],
    "alternative_interpretations": ["if question is ambiguous"]
}}
"""

async def answer_with_fallback(
    context: str,
    question: str,
    llm
) -> ResponseWithConfidence:
    """Answer with error recovery and fallback."""
    prompt = ERROR_RECOVERY_PROMPT.format(context=context, question=question)

    try:
        response = await llm.ainvoke(prompt)
        return ResponseWithConfidence(**json.loads(response.content))
    except (json.JSONDecodeError, ValidationError):
        # Fallback: try to extract an answer without structure
        simple_prompt = f"Based on: {context}\n\nAnswer: {question}"
        simple_response = await llm.ainvoke(simple_prompt)
        return ResponseWithConfidence(
            answer=simple_response.content,
            confidence=0.5,
            sources=["fallback extraction"],
            alternative_interpretations=[]
        )
```
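The three confidence bands the prompt asks the model to report can then be routed in plain Python. A small sketch; the thresholds mirror the prompt instructions, and the delivery strategies are illustrative:

```python
def route_by_confidence(answer: str, confidence: float) -> str:
    """Map the model's self-reported confidence to a delivery strategy,
    mirroring the >0.8 / 0.5-0.8 / <0.5 bands used in the prompt."""
    if confidence > 0.8:
        return answer                                        # direct answer
    if confidence >= 0.5:
        return f"{answer} (low confidence - please verify)"  # hedged answer
    return "Insufficient information to answer reliably."    # decline

print(route_by_confidence("Paris", 0.95))  # Paris
print(route_by_confidence("Paris", 0.6))   # Paris (low confidence - please verify)
print(route_by_confidence("Paris", 0.2))   # Insufficient information to answer reliably.
```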

### Pattern 6: Role-Based System Prompts

```python
SYSTEM_PROMPTS = {
    "analyst": """You are a senior data analyst with expertise in SQL, Python, and business intelligence.

Your responsibilities:
- Write efficient, well-documented queries
- Explain your analysis methodology
- Highlight key insights and recommendations
- Flag any data quality concerns

Communication style:
- Be precise and technical when discussing methodology
- Translate technical findings into business impact
- Use clear visualizations when helpful""",

    "assistant": """You are a helpful AI assistant focused on accuracy and clarity.

Core principles:
- Always cite sources when making factual claims
- Acknowledge uncertainty rather than guessing
- Ask clarifying questions when the request is ambiguous
- Provide step-by-step explanations for complex topics

Constraints:
- Do not provide medical, legal, or financial advice
- Redirect harmful requests appropriately
- Protect user privacy""",

    "code_reviewer": """You are a senior software engineer conducting code reviews.

Review criteria:
- Correctness: Does the code work as intended?
- Security: Are there any vulnerabilities?
- Performance: Are there efficiency concerns?
- Maintainability: Is the code readable and well-structured?
- Best practices: Does it follow language idioms?

Output format:
1. Summary assessment (approve/request changes)
2. Critical issues (must fix)
3. Suggestions (nice to have)
4. Positive feedback (what's done well)"""
}
```

## Integration Patterns

### With RAG Systems

```python
RAG_PROMPT = """You are a knowledgeable assistant that answers questions based on provided context.

Context (retrieved from knowledge base):
{context}

Instructions:
1. Answer ONLY based on the provided context
2. If the context doesn't contain the answer, say "I don't have information about that in my knowledge base"
3. Cite specific passages using [1], [2] notation
4. If the question is ambiguous, ask for clarification

Question: {question}

Answer:"""
```

### With Validation and Verification

```python
VALIDATED_PROMPT = """Complete the following task:

Task: {task}

After generating your response, verify it meets ALL these criteria:
✓ Directly addresses the original request
✓ Contains no factual errors
✓ Is appropriately detailed (not too brief, not too verbose)
✓ Uses proper formatting
✓ Is safe and appropriate

If verification fails on any criterion, revise before responding.

Response:"""
```

## Performance Optimization

### Token Efficiency

```python
# Before: verbose prompt (150+ tokens)
verbose_prompt = """
I would like you to please take the following text and provide me with a comprehensive
summary of the main points. The summary should capture the key ideas and important details
while being concise and easy to understand.
"""

# After: concise prompt (~30 tokens)
concise_prompt = """Summarize the key points concisely:

{text}

Summary:"""
```

### Caching Common Prefixes

```python
from anthropic import Anthropic

client = Anthropic()

# Use prompt caching for repeated system prompts
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
```

## Best Practices

1. **Be Specific**: Vague prompts produce inconsistent results
2. **Show, Don't Tell**: Examples are more effective than descriptions
3. **Use Structured Outputs**: Enforce schemas with Pydantic for reliability
4. **Test Extensively**: Evaluate on diverse, representative inputs
5. **Iterate Rapidly**: Small changes can have large impacts
6. **Monitor Performance**: Track metrics in production
7. **Version Control**: Treat prompts as code with proper versioning
8. **Document Intent**: Explain why prompts are structured as they are

## Common Pitfalls

- **Context overflow**: Exceeding token limits with excessive examples
- **Ambiguous instructions**: Leaving room for multiple interpretations
- **Ignoring edge cases**: Not testing on unusual or boundary inputs
- **No error handling**: Assuming outputs will always be well-formed
- **Hardcoded values**: Not parameterizing prompts for reuse

## Success Metrics

Track these KPIs for your prompts:
- **Consistency**: Reproducibility across similar inputs
- **Latency**: Response time (P50, P95, P99)
- **Token Usage**: Average tokens per request
- **Success Rate**: Percentage of valid, parseable outputs
- **User Satisfaction**: Ratings and feedback
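P50/P95/P99 latency, for instance, falls out of a sorted sample with the nearest-rank method. A sketch; a production service would use a streaming estimator rather than retaining raw samples:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over raw latency samples (p in 0-100)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

latencies_ms = [120, 95, 310, 101, 98, 1450, 105, 99, 102, 97]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Note how the single 1450 ms outlier dominates the tail percentiles while leaving P50 untouched, which is exactly why the tail is worth tracking separately.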

## Resources

- [Anthropic Prompt Engineering Guide](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering)
- [Claude Prompt Caching](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching)
- [OpenAI Prompt Engineering](https://platform.openai.com/docs/guides/prompt-engineering)
- [LangChain Prompts](https://python.langchain.com/docs/concepts/prompts/)

**Purpose**: Store and retrieve document embeddings efficiently

**Options:**
- **Pinecone**: Managed, scalable, serverless
- **Weaviate**: Open-source, hybrid search, GraphQL
- **Milvus**: High performance, on-premise
- **Chroma**: Lightweight, easy to use, local development
- **Qdrant**: Fast, filtered search, Rust-based
- **pgvector**: PostgreSQL extension, SQL integration

### 2. Embeddings
**Purpose**: Convert text to numerical vectors for similarity search

**Models (2026):**

| Model | Dimensions | Best For |
|-------|------------|----------|
| **voyage-3-large** | 1024 | Claude apps (Anthropic recommended) |
| **voyage-code-3** | 1024 | Code search |
| **text-embedding-3-large** | 3072 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | Open source, local deployment |
| **multilingual-e5-large** | 1024 | Multi-language support |
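Whatever the model, similarity search over these vectors usually reduces to cosine similarity between the query embedding and document embeddings. A dependency-free sketch of the math, using tiny hand-made vectors in place of real embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 means same
    direction (very similar), 0.0 orthogonal (unrelated), -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Vector databases implement the same comparison with approximate nearest-neighbor indexes so it scales past brute force.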

### 3. Retrieval Strategies
**Approaches:**
- **Dense Retrieval**: Semantic similarity via embeddings
- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
- **Hybrid Search**: Combine dense + sparse with weighted fusion
- **Multi-Query**: Generate multiple query variations
- **HyDE**: Generate hypothetical documents for better retrieval

### 4. Reranking
**Purpose**: Improve retrieval quality by reordering results

**Methods:**
- **Cross-Encoders**: BERT-based reranking (ms-marco-MiniLM)
- **Cohere Rerank**: API-based reranking
- **Maximal Marginal Relevance (MMR)**: Diversity + relevance
- **LLM-based**: Use LLM to score relevance

## Quick Start with LangGraph

```python
from langgraph.graph import StateGraph, START, END
from langchain_anthropic import ChatAnthropic
from langchain_voyageai import VoyageAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import TypedDict

class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str

# Initialize components
llm = ChatAnthropic(model="claude-sonnet-4-5")
embeddings = VoyageAIEmbeddings(model="voyage-3-large")
vectorstore = PineconeVectorStore(index_name="docs", embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """Answer based on the context below. If you cannot answer, say so.

Context:
{context}

Question: {question}

Answer:"""
)

async def retrieve(state: RAGState) -> RAGState:
    """Retrieve relevant documents."""
    docs = await retriever.ainvoke(state["question"])
    return {"context": docs}

async def generate(state: RAGState) -> RAGState:
    """Generate an answer from the retrieved context."""
    context_text = "\n\n".join(doc.page_content for doc in state["context"])
    messages = rag_prompt.format_messages(
        context=context_text,
        question=state["question"]
    )
    response = await llm.ainvoke(messages)
    return {"answer": response.content}

# Build the RAG graph
builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

rag_chain = builder.compile()

# Use
result = await rag_chain.ainvoke({"question": "What are the main features?"})
print(result["answer"])
```

## Advanced RAG Patterns

### Pattern 1: Hybrid Search with RRF

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Sparse retriever (BM25 for keyword matching)
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 10

# Dense retriever (embeddings for semantic search)
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Combine with Reciprocal Rank Fusion weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]  # 30% keyword, 70% semantic
)
```
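The fusion step itself is small enough to sketch directly: weighted Reciprocal Rank Fusion gives each document `weight / (k + rank)` credit per list it appears in, then sorts by the fused score. A self-contained sketch with document IDs standing in for documents:

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Weighted Reciprocal Rank Fusion over ranked lists of doc IDs:
    each list contributes weight / (k + rank); higher fused score wins."""
    scores: dict[str, float] = {}
    for ranked, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # keyword ranking
dense_hits = ["d1", "d5", "d3"]  # semantic ranking
print(weighted_rrf([bm25_hits, dense_hits], [0.3, 0.7]))
```

The constant `k` (60 by convention) damps the influence of top ranks so that a document appearing in both lists beats one that tops only a single list.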

### Pattern 2: Multi-Query Retrieval

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Generate multiple query perspectives for better recall
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=llm
)

# Single query → multiple variations → combined results
results = await multi_query_retriever.ainvoke("What is the main topic?")
```

### Pattern 3: Contextual Compression

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Compressor extracts only the relevant portions
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only the relevant parts of documents
compressed_docs = await compression_retriever.ainvoke("specific query")
```

### Pattern 4: Parent Document Retriever

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval, large chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Store for parent documents
docstore = InMemoryStore()

parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

# Add documents (splits children, stores parents)
await parent_retriever.aadd_documents(documents)

# Retrieval returns parent documents with full context
results = await parent_retriever.ainvoke("query")
```

### Pattern 5: HyDE (Hypothetical Document Embeddings)

```python
from langchain_core.prompts import ChatPromptTemplate

class HyDEState(TypedDict):
    question: str
    hypothetical_doc: str
    context: list[Document]
    answer: str

hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed passage that would answer this question:

Question: {question}

Passage:"""
)

async def generate_hypothetical(state: HyDEState) -> HyDEState:
    """Generate a hypothetical document for better retrieval."""
    messages = hyde_prompt.format_messages(question=state["question"])
    response = await llm.ainvoke(messages)
    return {"hypothetical_doc": response.content}

async def retrieve_with_hyde(state: HyDEState) -> HyDEState:
    """Retrieve using the hypothetical document."""
    # Use the hypothetical doc for retrieval instead of the original query
    docs = await retriever.ainvoke(state["hypothetical_doc"])
    return {"context": docs}

# Build the HyDE RAG graph
builder = StateGraph(HyDEState)
builder.add_node("hypothetical", generate_hypothetical)
builder.add_node("retrieve", retrieve_with_hyde)
builder.add_node("generate", generate)
builder.add_edge(START, "hypothetical")
builder.add_edge("hypothetical", "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)

hyde_rag = builder.compile()
```

## Document Chunking Strategies

### Recursive Character Text Splitter
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]  # Try in order
)

chunks = splitter.split_documents(documents)
```
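The separator-priority idea is easy to see in isolation: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. A simplified model of that recursion, ignoring chunk overlap and re-merging of small pieces:

```python
def recursive_split(text: str, max_len: int, seps: list[str]) -> list[str]:
    """Split on the first separator, recursing into any piece that is
    still longer than max_len with the remaining, finer separators."""
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    pieces = text.split(head) if head else list(text)
    out: list[str] = []
    for piece in pieces:
        out.extend(recursive_split(piece, max_len, rest))
    return [p for p in out if p]

doc = "Intro paragraph.\n\nSecond paragraph that is quite a bit longer than the limit."
print(recursive_split(doc, 40, ["\n\n", ". ", " "]))
```

The real splitter additionally merges adjacent small pieces back up toward `chunk_size` and applies `chunk_overlap`, but the fallback ordering is the same.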

### Token-Based Splitting
```python
from langchain_text_splitters import TokenTextSplitter

splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    encoding_name="cl100k_base"  # OpenAI tiktoken encoding
)
```

### Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker

splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
```
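The percentile rule means: embed consecutive sentences, measure the distance between neighbors, and break wherever a gap reaches the 95th percentile of all gaps. A toy sketch over precomputed distances rather than real embeddings:

```python
def breakpoints(distances: list[float], percentile: float = 95.0) -> list[int]:
    """Indices where the gap between consecutive sentence embeddings is
    at or above the given percentile of all gaps, i.e. likely topic shifts."""
    ordered = sorted(distances)
    rank = max(1, round(percentile / 100 * len(ordered)))  # nearest rank
    threshold = ordered[rank - 1]
    return [i for i, d in enumerate(distances) if d >= threshold]

# Gaps between sentences 0-1, 1-2, ...; the 0.9 spike marks a topic change.
gaps = [0.12, 0.10, 0.15, 0.90, 0.11, 0.13]
print(breakpoints(gaps))  # [3]
```

Raising `breakpoint_threshold_amount` produces fewer, larger chunks; lowering it splits more aggressively.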

### Markdown Header Splitter
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False
)
```

## Vector Store Configurations

### Pinecone (Serverless)
```python
import os

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize the Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index if needed
if "my-index" not in pc.list_indexes().names():
    pc.create_index(
        name="my-index",
        dimension=1024,  # voyage-3-large dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Create the vector store
index = pc.Index("my-index")
vectorstore = PineconeVectorStore(index=index, embedding=embeddings)
```

### Weaviate
```python
import weaviate
from langchain_weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()

vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",
    text_key="content",
    embedding=embeddings
)
```

### Chroma (Local Development)
```python
from langchain_chroma import Chroma

vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
```

### pgvector (PostgreSQL)
```python
from langchain_postgres.vectorstores import PGVector

connection_string = "postgresql+psycopg://user:pass@localhost:5432/vectordb"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=connection_string,
)
```

## Retrieval Optimization

### 1. Metadata Filtering
```python
from datetime import datetime

from langchain_core.documents import Document

# Add metadata during indexing
docs_with_metadata = []
for doc in documents:
    doc.metadata.update({
        "source": doc.metadata.get("source", "unknown"),
        "category": determine_category(doc.page_content),
        "date": datetime.now().isoformat()
    })
    docs_with_metadata.append(doc)

# Filter during retrieval
results = await vectorstore.asimilarity_search(
    "query",
    filter={"category": "technical"},
    k=5
)
```

### 2. Maximal Marginal Relevance (MMR)
```python
# Balance relevance with diversity
results = await vectorstore.amax_marginal_relevance_search(
    "query",
    k=5,
    fetch_k=20,  # Fetch 20, return top 5 diverse
)
```
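MMR itself greedily trades off relevance to the query against similarity to documents already selected. A dependency-free sketch with toy similarity scores in place of embedding math:

```python
def mmr(query_sim: dict[str, float], doc_sim: dict[tuple[str, str], float],
        k: int, lambda_mult: float = 0.5) -> list[str]:
    """Greedy MMR: repeatedly pick the doc maximizing
    lambda * relevance - (1 - lambda) * max similarity to docs already picked."""
    selected: list[str] = []
    remaining = set(query_sim)
    while remaining and len(selected) < k:
        def score(d: str) -> float:
            redundancy = max(
                (doc_sim[tuple(sorted((d, s)))] for s in selected), default=0.0
            )
            return lambda_mult * query_sim[d] - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# "a" and "b" are near-duplicates; MMR picks "a" then the diverse "c".
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
pairwise = {("a", "b"): 0.95, ("a", "c"): 0.1, ("b", "c"): 0.1}
print(mmr(relevance, pairwise, k=2))  # ['a', 'c']
```

`lambda_mult` near 1 favors pure relevance; near 0 it favors diversity, which is the same knob the vector-store call exposes.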

### 3. Cross-Encoder Reranking
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def retrieve_and_rerank(query: str, k: int = 5) -> list[Document]:
    # Get initial candidates
    candidates = await vectorstore.asimilarity_search(query, k=20)

    # Rerank with the cross-encoder
    pairs = [[query, doc.page_content] for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by score and take the top k
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:k]]
```
### 4. Cohere Rerank
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)

# Wrap retriever with reranking
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20})
)
```
## Prompt Engineering for RAG

### Contextual Prompt with Citations
```python
from langchain_core.prompts import ChatPromptTemplate

rag_prompt = ChatPromptTemplate.from_template(
    """Answer the question based on the context below. Include citations using [1], [2], etc.

If you cannot answer based on the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Instructions:
1. Use only information from the context
2. Cite sources with [1], [2] format
3. If uncertain, express uncertainty

Answer (with citations):"""
)
```

### Structured Output for RAG
```python
from pydantic import BaseModel, Field

class RAGResponse(BaseModel):
    answer: str = Field(description="The answer based on the context")
    confidence: float = Field(description="Confidence score from 0 to 1")
    sources: list[str] = Field(description="Source document IDs used")
    reasoning: str = Field(description="Brief reasoning for the answer")

# Bind the schema to the model for validated, typed responses
structured_llm = llm.with_structured_output(RAGResponse)
```
## Evaluation Metrics

```python
from typing import TypedDict

class RAGEvalMetrics(TypedDict):
    retrieval_precision: float  # Relevant retrieved docs / retrieved docs
    retrieval_recall: float     # Relevant retrieved docs / total relevant docs
    answer_relevance: float     # How well the answer addresses the question
    faithfulness: float         # How well the answer is grounded in context
    context_relevance: float    # How relevant the context is to the question

async def evaluate_rag_system(
    rag_chain,
    test_cases: list[dict]
) -> RAGEvalMetrics:
    """Evaluate a RAG system on a set of test cases."""
    metrics = {k: [] for k in RAGEvalMetrics.__annotations__}

    for test in test_cases:
        result = await rag_chain.ainvoke({"question": test["question"]})

        # Retrieval metrics
        retrieved_ids = {doc.metadata["id"] for doc in result["context"]}
        relevant_ids = set(test["relevant_doc_ids"])

        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

        metrics["retrieval_precision"].append(precision)
        metrics["retrieval_recall"].append(recall)

        # Use LLM-as-judge (user-supplied helper) for quality metrics
        quality = await evaluate_answer_quality(
            question=test["question"],
            answer=result["answer"],
            context=result["context"],
            expected=test.get("expected_answer")
        )
        metrics["answer_relevance"].append(quality["relevance"])
        metrics["faithfulness"].append(quality["faithfulness"])
        metrics["context_relevance"].append(quality["context_relevance"])

    return {k: sum(v) / len(v) for k, v in metrics.items()}
```
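
As a quick sanity check of the set-based precision/recall arithmetic used in `evaluate_rag_system` (the document IDs here are hypothetical):

```python
retrieved_ids = {"d1", "d2", "d3", "d4"}   # IDs the retriever returned
relevant_ids = {"d2", "d4", "d7"}          # ground-truth relevant IDs

precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)  # 2/4 = 0.5
recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)      # 2/3

print(precision, round(recall, 3))  # 0.5 0.667
```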
## Resources

- **references/vector-databases.md**: Detailed comparison of vector DBs
- **references/embeddings.md**: Embedding model selection guide
- **references/retrieval-strategies.md**: Advanced retrieval techniques
- **references/reranking.md**: Reranking methods and when to use them
- **references/context-window.md**: Managing context limits
- **assets/vector-store-config.yaml**: Configuration templates
- **assets/retriever-pipeline.py**: Complete RAG pipeline
- **assets/embedding-models.md**: Model comparison and benchmarks
- [LangChain RAG Tutorial](https://python.langchain.com/docs/tutorials/rag/)
- [LangGraph RAG Examples](https://langchain-ai.github.io/langgraph/tutorials/rag/)
- [Pinecone Best Practices](https://docs.pinecone.io/guides/get-started/overview)
- [Voyage AI Embeddings](https://docs.voyageai.com/)
- [RAG Evaluation Guide](https://docs.ragas.io/)
## Best Practices

1. **Chunk Size**: Balance between context (larger) and specificity (smaller); typically 500-1000 tokens
2. **Overlap**: Use 10-20% overlap to preserve context at boundaries
3. **Metadata**: Include source, page, and timestamp for filtering and debugging
4. **Hybrid Search**: Combine semantic and keyword search for best recall
5. **Reranking**: Use cross-encoder reranking for precision-critical applications
6. **Citations**: Always return source documents for transparency
7. **Evaluation**: Continuously test retrieval quality and answer accuracy
8. **Monitoring**: Track retrieval metrics and latency in production
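
The chunk-size and overlap guidance can be sketched without a framework. Below is a minimal character-based chunker with roughly 15% overlap; this is an illustration only, since production code would split on tokens or natural separators the way LangChain's text splitters do:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so context survives at boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# A 2000-character text with 800/120 settings yields three chunks,
# each sharing its last 120 characters with the next chunk's first 120.
chunks = chunk_text("x" * 2000, chunk_size=800, overlap=120)
```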
## Common Issues

- **Poor Retrieval**: Check embedding quality, chunk size, and query formulation
- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank
- **Missing Information**: Ensure documents are properly indexed; check chunking
- **Slow Queries**: Optimize vector store, use caching, reduce k
- **Hallucinations**: Improve grounding prompt, add verification step
- **Context Too Long**: Use compression or parent document retriever
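
When hybrid search is the fix for irrelevant results, the keyword and semantic rankings still need to be merged. Reciprocal rank fusion (RRF) is a common choice; here is a minimal sketch using hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists (e.g. BM25 and vector search) into one.
    Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d5"]    # keyword ranking
vector_hits = ["d1", "d2", "d3"]  # semantic ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# "d1" wins: ranked highly in both lists
```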