Showing 181-200 of 222 projects
An open-source, end-to-end observability tool for LLM applications with real-time tracing, evaluations, and metrics.
An open-source toolkit for building, evaluating, and deploying sophisticated AI agents with flexibility and control.
A Python framework for sequence labeling evaluation, useful for named-entity recognition and POS tagging.
High-fidelity performance metrics for generative models in PyTorch.
Distribute and run AI workloads on Kubernetes using a Python-based infrastructure toolkit with a PyTorch-like interface.
A Python framework for comprehensive diagnosis and optimization of AI agents using simulated, realistic synthetic interactions.
Prompty is a Python library that makes it easy to create, manage, debug, and evaluate LLM prompts for AI applications.
An open-source Mathematica Kernel written in Python with built-in functions, variables, and a parser/evaluator.
A comprehensive Python library for evaluating object detection models using various metrics like mAP, AR, and STT-AP.
A functional programming library for TypeScript/JavaScript developers with concurrency, lazy evaluation, and other FP features.
A benchmark library for evaluating correlation filter-based visual tracking algorithms.
A.S.E (AICGSecEval) is a repository-level AI-generated code security evaluation benchmark developed by Tencent Wukong Code Security Team.
An open-source framework for building, evaluating, and training general multi-agent assistance systems using AI tools.
Collection of Chinese safety prompts for evaluating and improving the safety of large language models (LLMs).
An R package for evaluating the quality and performance of statistical models.
Tau-Bench is a Python library for benchmarking and evaluating AI language models and tools.
A Python package for uncertainty quantification and hallucination detection in large language models (LLMs).
A Python library that provides a single interface to use and evaluate different AI agent frameworks.
LongBench is a benchmark for evaluating large language models on long-context tasks.
A Python benchmark suite for evaluating text-to-3D generation models and techniques.