feat(llm-application-dev): modernize to LangGraph and latest models v2.0.0

- Migrate from LangChain 0.x to LangChain 1.x/LangGraph patterns
- Update model references to Claude 4.5 and GPT-5.2
- Add Voyage AI as primary embedding recommendation
- Add structured outputs with Pydantic
- Replace deprecated initialize_agent() with StateGraph
- Fix security: use AST-based safe math instead of unsafe execution
- Add plugin.json and README.md for consistency
- Bump marketplace version to 1.3.3
Seth Hobson
2026-01-19 15:43:25 -05:00
parent e827cc713a
commit 8be0e8ac7a
12 changed files with 1940 additions and 708 deletions


@@ -18,14 +18,18 @@ Guide to selecting and optimizing embedding models for vector search application
## Core Concepts
### 1. Embedding Model Comparison
### 1. Embedding Model Comparison (2026)
| Model | Dimensions | Max Tokens | Best For |
|-------|------------|------------|----------|
| **text-embedding-3-large** | 3072 | 8191 | High accuracy |
| **text-embedding-3-small** | 1536 | 8191 | Cost-effective |
| **voyage-2** | 1024 | 4000 | Code, legal |
| **bge-large-en-v1.5** | 1024 | 512 | Open source |
| **voyage-3-large** | 1024 | 32000 | Claude apps (Anthropic recommended) |
| **voyage-3** | 1024 | 32000 | Claude apps, cost-effective |
| **voyage-code-3** | 1024 | 32000 | Code search |
| **voyage-finance-2** | 1024 | 32000 | Financial documents |
| **voyage-law-2** | 1024 | 32000 | Legal documents |
| **text-embedding-3-large** | 3072 | 8191 | OpenAI apps, high accuracy |
| **text-embedding-3-small** | 1536 | 8191 | OpenAI apps, cost-effective |
| **bge-large-en-v1.5** | 1024 | 512 | Open source, local deployment |
| **all-MiniLM-L6-v2** | 384 | 256 | Fast, lightweight |
| **multilingual-e5-large** | 1024 | 512 | Multi-language |
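Model choice can be captured in configuration rather than scattered through code. A minimal sketch, assuming hypothetical use-case labels (the mapping simply mirrors the table above):
```python
# Illustrative mapping from use case to embedding model, based on the
# comparison table above. The use-case keys are hypothetical labels.
EMBEDDING_MODEL_BY_USE_CASE = {
    "claude_general": "voyage-3-large",
    "claude_budget": "voyage-3",
    "code_search": "voyage-code-3",
    "finance": "voyage-finance-2",
    "legal": "voyage-law-2",
    "openai_general": "text-embedding-3-large",
    "local": "bge-large-en-v1.5",
    "multilingual": "multilingual-e5-large",
}

def pick_embedding_model(use_case: str) -> str:
    """Return a model name for a use case, defaulting to voyage-3."""
    return EMBEDDING_MODEL_BY_USE_CASE.get(use_case, "voyage-3")
```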
@@ -39,7 +43,34 @@ Document → Chunking → Preprocessing → Embedding Model → Vector
## Templates
### Template 1: OpenAI Embeddings
### Template 1: Voyage AI Embeddings (Recommended for Claude)
```python
from langchain_voyageai import VoyageAIEmbeddings
from typing import List
import os
# Initialize Voyage AI embeddings (recommended by Anthropic for Claude)
embeddings = VoyageAIEmbeddings(
model="voyage-3-large",
voyage_api_key=os.environ.get("VOYAGE_API_KEY")
)
def get_embeddings(texts: List[str]) -> List[List[float]]:
"""Get embeddings from Voyage AI."""
return embeddings.embed_documents(texts)
def get_query_embedding(query: str) -> List[float]:
"""Get single query embedding."""
return embeddings.embed_query(query)
# Specialized models for domains
code_embeddings = VoyageAIEmbeddings(model="voyage-code-3")
finance_embeddings = VoyageAIEmbeddings(model="voyage-finance-2")
legal_embeddings = VoyageAIEmbeddings(model="voyage-law-2")
```
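The LangChain wrapper above maps `embed_documents`/`embed_query` onto Voyage's `input_type` parameter. A minimal sketch of the same calls through the raw `voyageai` client, assuming `VOYAGE_API_KEY` is set in the environment:
```python
import voyageai

# Direct client usage; the embed_documents/embed_query split above
# corresponds to input_type="document" / input_type="query".
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

docs_result = vo.embed(
    ["First document.", "Second document."],
    model="voyage-3-large",
    input_type="document",
)
query_result = vo.embed(
    ["What does the first document say?"],
    model="voyage-3-large",
    input_type="query",
)
doc_vectors = docs_result.embeddings    # list of float lists
query_vector = query_result.embeddings[0]
```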
### Template 2: OpenAI Embeddings
```python
from openai import OpenAI
@@ -53,7 +84,7 @@ def get_embeddings(
model: str = "text-embedding-3-small",
dimensions: int = None
) -> List[List[float]]:
"""Get embeddings from OpenAI."""
"""Get embeddings from OpenAI with optional dimension reduction."""
# Handle batching for large lists
batch_size = 100
all_embeddings = []
@@ -63,6 +94,7 @@ def get_embeddings(
kwargs = {"input": batch, "model": model}
if dimensions:
# Matryoshka dimensionality reduction
kwargs["dimensions"] = dimensions
response = client.embeddings.create(**kwargs)
@@ -77,7 +109,7 @@ def get_embedding(text: str, **kwargs) -> List[float]:
return get_embeddings([text], **kwargs)[0]
# Dimension reduction with OpenAI
# Dimension reduction with Matryoshka embeddings
def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
"""Get embedding with reduced dimensions (Matryoshka)."""
return get_embedding(
@@ -87,7 +119,7 @@ def get_reduced_embedding(text: str, dimensions: int = 512) -> List[float]:
)
```
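The API-side `dimensions` parameter returns re-normalized vectors, but truncating cached full-size embeddings yourself does not. A small sketch of client-side Matryoshka truncation with the required re-normalization:
```python
import numpy as np

def truncate_matryoshka(embedding: list[float], dimensions: int = 512) -> np.ndarray:
    """Client-side Matryoshka truncation of a full-size embedding.

    Unlike the API's `dimensions` parameter, a manual slice is no
    longer unit-length, so re-normalize before cosine similarity.
    """
    vec = np.asarray(embedding[:dimensions], dtype=np.float32)
    return vec / np.linalg.norm(vec)
```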
### Template 2: Local Embeddings with Sentence Transformers
### Template 3: Local Embeddings with Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
@@ -103,6 +135,7 @@ class LocalEmbedder:
device: str = "cuda"
):
self.model = SentenceTransformer(model_name, device=device)
self.model_name = model_name
def embed(
self,
@@ -120,9 +153,9 @@ class LocalEmbedder:
return embeddings
def embed_query(self, query: str) -> np.ndarray:
"""Embed a query with BGE-style prefix."""
# BGE models benefit from query prefix
if "bge" in self.model.get_sentence_embedding_dimension():
"""Embed a query with appropriate prefix for retrieval models."""
# BGE and similar models benefit from query prefix
if "bge" in self.model_name.lower():
query = f"Represent this sentence for searching relevant passages: {query}"
return self.embed([query])[0]
@@ -137,13 +170,15 @@ class E5Embedder:
self.model = SentenceTransformer(model_name)
def embed_query(self, query: str) -> np.ndarray:
"""E5 requires 'query:' prefix for queries."""
return self.model.encode(f"query: {query}")
def embed_document(self, document: str) -> np.ndarray:
"""E5 requires 'passage:' prefix for documents."""
return self.model.encode(f"passage: {document}")
```
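A short usage sketch for the embedders above (assumes `embed` takes a list of texts, as `embed_query` implies; use `device="cpu"` when no GPU is available):
```python
# Usage sketch for LocalEmbedder; the model downloads on first use.
embedder = LocalEmbedder(model_name="BAAI/bge-large-en-v1.5", device="cpu")
doc_vecs = embedder.embed(["Paris is the capital of France."])
query_vec = embedder.embed_query("capital of France")

# Dot product equals cosine similarity only if embed() returns
# normalized vectors; otherwise divide by the norms first.
score = float(doc_vecs[0] @ query_vec)
```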
### Template 3: Chunking Strategies
### Template 4: Chunking Strategies
```python
from typing import List, Tuple
@@ -288,20 +323,33 @@ def recursive_character_splitter(
return split_text(text, separators)
```
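For comparison with the recursive splitter, a naive fixed-window chunker with character overlap (a baseline sketch, not part of the template above):
```python
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Baseline chunker: fixed windows with character overlap.

    Ignores semantic boundaries, so it mainly serves as a baseline
    against the boundary-aware recursive splitter above.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```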
### Template 4: Domain-Specific Embedding Pipeline
### Template 5: Domain-Specific Embedding Pipeline
```python
import re
from typing import List, Optional
from dataclasses import dataclass
@dataclass
class EmbeddedDocument:
id: str
document_id: str
chunk_index: int
text: str
embedding: List[float]
metadata: dict
class DomainEmbeddingPipeline:
"""Pipeline for domain-specific embeddings."""
def __init__(
self,
embedding_model: str = "text-embedding-3-small",
embedding_model: str = "voyage-3-large",
chunk_size: int = 512,
chunk_overlap: int = 50,
preprocessing_fn=None
):
self.embedding_model = embedding_model
self.embeddings = VoyageAIEmbeddings(model=embedding_model)
self.chunk_size = chunk_size
self.chunk_overlap = chunk_overlap
self.preprocess = preprocessing_fn or self._default_preprocess
@@ -310,7 +358,7 @@ class DomainEmbeddingPipeline:
"""Default preprocessing."""
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text)
# Remove special characters
# Remove special characters (customize for your domain)
text = re.sub(r'[^\w\s.,!?-]', '', text)
return text.strip()
@@ -319,8 +367,8 @@ class DomainEmbeddingPipeline:
documents: List[dict],
id_field: str = "id",
content_field: str = "content",
metadata_fields: List[str] = None
) -> List[dict]:
metadata_fields: Optional[List[str]] = None
) -> List[EmbeddedDocument]:
"""Process documents for vector storage."""
processed = []
@@ -339,25 +387,26 @@ class DomainEmbeddingPipeline:
)
# Create embeddings
embeddings = get_embeddings(chunks, self.embedding_model)
embeddings = await self.embeddings.aembed_documents(chunks)
# Create records
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
record = {
"id": f"{doc_id}_chunk_{i}",
"document_id": doc_id,
"chunk_index": i,
"text": chunk,
"embedding": embedding
}
metadata = {"document_id": doc_id, "chunk_index": i}
# Add metadata
# Add specified metadata fields
if metadata_fields:
for field in metadata_fields:
if field in doc:
record[field] = doc[field]
metadata[field] = doc[field]
processed.append(record)
processed.append(EmbeddedDocument(
id=f"{doc_id}_chunk_{i}",
document_id=doc_id,
chunk_index=i,
text=chunk,
embedding=embedding,
metadata=metadata
))
return processed
@@ -366,42 +415,77 @@ class DomainEmbeddingPipeline:
class CodeEmbeddingPipeline:
"""Specialized pipeline for code embeddings."""
def __init__(self, model: str = "voyage-code-2"):
self.model = model
def __init__(self):
# Use Voyage's code-specific model
self.embeddings = VoyageAIEmbeddings(model="voyage-code-3")
def chunk_code(self, code: str, language: str) -> List[dict]:
"""Chunk code by functions/classes."""
import tree_sitter
"""Chunk code by functions/classes using tree-sitter."""
try:
import tree_sitter_languages
parser = tree_sitter_languages.get_parser(language)
tree = parser.parse(bytes(code, "utf8"))
# Parse with tree-sitter
# Extract functions, classes, methods
# Return chunks with context
pass
chunks = []
# Extract function and class definitions
self._extract_nodes(tree.root_node, code, chunks)
return chunks
except ImportError:
# Fallback to simple chunking
return [{"text": code, "type": "module"}]
def embed_with_context(self, chunk: str, context: str) -> List[float]:
def _extract_nodes(self, node, source_code: str, chunks: list):
"""Recursively extract function/class definitions."""
if node.type in ['function_definition', 'class_definition', 'method_definition']:
text = source_code[node.start_byte:node.end_byte]
chunks.append({
"text": text,
"type": node.type,
"name": self._get_name(node),
"start_line": node.start_point[0],
"end_line": node.end_point[0]
})
for child in node.children:
self._extract_nodes(child, source_code, chunks)
def _get_name(self, node) -> str:
"""Extract name from function/class node."""
for child in node.children:
if child.type == 'identifier' or child.type == 'name':
return child.text.decode('utf8')
return "unknown"
async def embed_with_context(
self,
chunk: str,
context: str = ""
) -> List[float]:
"""Embed code with surrounding context."""
combined = f"Context: {context}\n\nCode:\n{chunk}"
return get_embedding(combined, model=self.model)
if context:
combined = f"Context: {context}\n\nCode:\n{chunk}"
else:
combined = chunk
return await self.embeddings.aembed_query(combined)
```
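Usage sketches for the two pipelines above. This assumes `process_documents` is now a coroutine (it awaits `aembed_documents`); the document fields and code sample are illustrative:
```python
import asyncio

async def demo():
    # DomainEmbeddingPipeline: chunk, embed, and collect EmbeddedDocument
    # records. The "source" field here is a hypothetical metadata field.
    pipeline = DomainEmbeddingPipeline(chunk_size=512, chunk_overlap=50)
    docs = [{"id": "doc-1", "content": "Quarterly revenue grew 12%.", "source": "10-Q"}]
    records = await pipeline.process_documents(docs, metadata_fields=["source"])
    print(records[0].id, len(records[0].embedding))

    # CodeEmbeddingPipeline: chunk_code falls back to a single
    # whole-module chunk when tree-sitter grammars are unavailable.
    code_pipeline = CodeEmbeddingPipeline()
    chunks = code_pipeline.chunk_code("def add(a, b):\n    return a + b\n", "python")
    vector = await code_pipeline.embed_with_context(chunks[0]["text"], context="math utils")

asyncio.run(demo())
```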
### Template 5: Embedding Quality Evaluation
### Template 6: Embedding Quality Evaluation
```python
import numpy as np
from typing import List, Tuple
from typing import List, Dict
def evaluate_retrieval_quality(
queries: List[str],
relevant_docs: List[List[str]], # List of relevant doc IDs per query
retrieved_docs: List[List[str]], # List of retrieved doc IDs per query
k: int = 10
) -> dict:
) -> Dict[str, float]:
"""Evaluate embedding quality for retrieval."""
def precision_at_k(relevant: set, retrieved: List[str], k: int) -> float:
retrieved_k = retrieved[:k]
relevant_retrieved = len(set(retrieved_k) & relevant)
return relevant_retrieved / k
return relevant_retrieved / k if k > 0 else 0
def recall_at_k(relevant: set, retrieved: List[str], k: int) -> float:
retrieved_k = retrieved[:k]
@@ -446,7 +530,7 @@ def compute_embedding_similarity(
) -> np.ndarray:
"""Compute similarity matrix between embedding sets."""
if metric == "cosine":
# Normalize
# Normalize and compute dot product
norm1 = embeddings1 / np.linalg.norm(embeddings1, axis=1, keepdims=True)
norm2 = embeddings2 / np.linalg.norm(embeddings2, axis=1, keepdims=True)
return norm1 @ norm2.T
@@ -455,25 +539,68 @@ def compute_embedding_similarity(
return -cdist(embeddings1, embeddings2, metric='euclidean')
elif metric == "dot":
return embeddings1 @ embeddings2.T
else:
raise ValueError(f"Unknown metric: {metric}")
def compare_embedding_models(
texts: List[str],
models: Dict[str, callable],
queries: List[str],
relevant_indices: List[List[int]],
k: int = 5
) -> Dict[str, Dict[str, float]]:
"""Compare multiple embedding models on retrieval quality."""
results = {}
for model_name, embed_fn in models.items():
# Embed all texts
doc_embeddings = np.array(embed_fn(texts))
retrieved_per_query = []
for query in queries:
query_embedding = np.array(embed_fn([query])[0])
# Compute similarities
similarities = compute_embedding_similarity(
query_embedding.reshape(1, -1),
doc_embeddings,
metric="cosine"
)[0]
# Get top-k indices
top_k_indices = np.argsort(similarities)[::-1][:k]
retrieved_per_query.append([str(i) for i in top_k_indices])
# Convert relevant indices to string IDs
relevant_docs = [[str(i) for i in indices] for indices in relevant_indices]
results[model_name] = evaluate_retrieval_quality(
queries, relevant_docs, retrieved_per_query, k
)
return results
```
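A small end-to-end check for the comparison harness, reusing the `embeddings` object from Template 1 (illustrative data; any callable mapping `List[str]` to `List[List[float]]` works as a model entry):
```python
texts = ["The cat sat on the mat.", "Stocks fell sharply today."]
queries = ["Where did the cat sit?"]
relevant_indices = [[0]]  # query 0 should retrieve texts[0]

models = {
    # embed_documents has the List[str] -> List[List[float]] shape
    # that compare_embedding_models expects.
    "voyage-3-large": embeddings.embed_documents,
}
scores = compare_embedding_models(texts, models, queries, relevant_indices, k=1)
print(scores)  # per-model retrieval metrics, e.g. precision/recall at k
```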
## Best Practices
### Do's
- **Match model to use case** - Code vs prose vs multilingual
- **Chunk thoughtfully** - Preserve semantic boundaries
- **Normalize embeddings** - For cosine similarity
- **Batch requests** - More efficient than one-by-one
- **Cache embeddings** - Avoid recomputing
- **Match model to use case**: Code vs prose vs multilingual
- **Chunk thoughtfully**: Preserve semantic boundaries
- **Normalize embeddings**: For cosine similarity search
- **Batch requests**: More efficient than one-by-one
- **Cache embeddings**: Avoid recomputing for static content (see the caching sketch after this list)
- **Use Voyage AI for Claude apps**: Recommended by Anthropic
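A minimal in-memory caching wrapper covering the batching and caching advice above (illustrative; `embed_fn` is any batch embedding callable, such as `embeddings.embed_documents` from Template 1):
```python
import hashlib
from typing import Callable, Dict, List

class CachedEmbedder:
    """In-memory embedding cache keyed by a (model, text) hash."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]], model_name: str):
        self.embed_fn = embed_fn
        self.model_name = model_name
        self._cache: Dict[str, List[float]] = {}

    def _key(self, text: str) -> str:
        return hashlib.sha256(f"{self.model_name}:{text}".encode()).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Embed only the cache misses, in a single batched call.
        misses = [t for t in texts if self._key(t) not in self._cache]
        if misses:
            for text, vector in zip(misses, self.embed_fn(misses)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]
```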
### Don'ts
- **Don't ignore token limits** - Truncation loses info
- **Don't mix embedding models** - Incompatible spaces
- **Don't skip preprocessing** - Garbage in, garbage out
- **Don't over-chunk** - Lose context
- **Don't ignore token limits**: Truncation loses information
- **Don't mix embedding models**: Incompatible vector spaces
- **Don't skip preprocessing**: Garbage in, garbage out
- **Don't over-chunk**: Lose important context
- **Don't forget metadata**: Essential for filtering and debugging
## Resources
- [OpenAI Embeddings](https://platform.openai.com/docs/guides/embeddings)
- [Voyage AI Documentation](https://docs.voyageai.com/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [Sentence Transformers](https://www.sbert.net/)
- [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard)
- [LangChain Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/)