Mirror of https://github.com/wshobson/agents.git, synced 2026-03-18 09:37:15 +00:00
fix(skills): remove phantom resource references and fix CoC links (#447)
Remove references to non-existent resource files (references/, assets/, scripts/, examples/) from 115 skill SKILL.md files. These sections pointed to directories and files that were never created, causing confusion when users install skills. Also fix broken Code of Conduct links in issue templates to use absolute GitHub URLs instead of relative paths that 404.
````diff
@@ -664,32 +664,3 @@ class BenchmarkRunner:
             for metric, scores in results.items()
         }
 ```
-
-## Resources
-
-- [LangSmith Evaluation Guide](https://docs.smith.langchain.com/evaluation)
-- [RAGAS Framework](https://docs.ragas.io/)
-- [DeepEval Library](https://docs.deepeval.com/)
-- [Arize Phoenix](https://docs.arize.com/phoenix/)
-- [HELM Benchmark](https://crfm.stanford.edu/helm/)
-
-## Best Practices
-
-1. **Multiple Metrics**: Use diverse metrics for comprehensive view
-2. **Representative Data**: Test on real-world, diverse examples
-3. **Baselines**: Always compare against baseline performance
-4. **Statistical Rigor**: Use proper statistical tests for comparisons
-5. **Continuous Evaluation**: Integrate into CI/CD pipeline
-6. **Human Validation**: Combine automated metrics with human judgment
-7. **Error Analysis**: Investigate failures to understand weaknesses
-8. **Version Control**: Track evaluation results over time
-
-## Common Pitfalls
-
-- **Single Metric Obsession**: Optimizing for one metric at the expense of others
-- **Small Sample Size**: Drawing conclusions from too few examples
-- **Data Contamination**: Testing on training data
-- **Ignoring Variance**: Not accounting for statistical uncertainty
-- **Metric Mismatch**: Using metrics not aligned with business goals
-- **Position Bias**: In pairwise evals, randomize order
-- **Overfitting Prompts**: Optimizing for test set instead of real use
````
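Among the practices the diff removes, "Statistical Rigor: Use proper statistical tests for comparisons" is the one most often skipped in practice. A minimal sketch of one common approach, a paired bootstrap over per-example score differences; the function name and the score data are hypothetical, not from the skill itself:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate the probability that system A outscores system B on
    average, by resampling paired per-example score differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = sum(
        sum(rng.choice(diffs) for _ in range(n)) / n > 0
        for _ in range(n_resamples)
    )
    return wins / n_resamples

# Hypothetical per-example metric scores for a baseline prompt and a
# candidate revision, evaluated on the same eight examples.
baseline = [0.62, 0.71, 0.55, 0.80, 0.66, 0.59, 0.73, 0.68]
candidate = [0.70, 0.74, 0.61, 0.79, 0.72, 0.65, 0.77, 0.70]

p_better = paired_bootstrap(candidate, baseline)
```

Pairing matters here: resampling the differences (rather than the two score lists independently) keeps each example's baseline and candidate scores together, which is what makes the comparison sensitive on small eval sets.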
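The "Position Bias: In pairwise evals, randomize order" pitfall from the removed list is worth a concrete illustration: an LLM judge comparing two answers often favors whichever appears first, so the standard mitigation is to randomize the presentation order and map the verdict back. A minimal sketch under the assumption that `judge(first, second)` returns the string `"first"` or `"second"`; the names are hypothetical:

```python
import random

def unbiased_pairwise(judge, answer_a, answer_b, rng=random):
    """Randomize which answer the judge sees first, then translate the
    positional verdict back to the original "a"/"b" labels."""
    flipped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if flipped else (answer_a, answer_b)
    verdict = judge(first, second)
    if verdict == "first":
        return "b" if flipped else "a"
    return "a" if flipped else "b"
```

With a perfectly position-biased judge (one that always answers `"first"`), this wrapper returns `"a"` and `"b"` roughly half the time each, turning a systematic bias into noise that averages out over the eval set.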