Explore Projects

Discover 19 open source projects

Active filters (1):
Search: unstructured-data

Showing 1-19 of 19 projects

treeverse/dvc

DVC is a data versioning and ML experiment tool that helps developers manage and track changes to data and models.

15.4K
Active
Python
ETL & Pipelines
Python
#data-versioning #machine-learning #reproducibility

voxel51/fiftyone

Refine high-quality datasets and visual AI models with this Python library for active learning and data curation.

10.4K
Active
Python
Computer Vision
Python
#active-learning #data-curation #data-quality

neo4j-labs/llm-graph-builder

Builds a Neo4j graph from unstructured data using LLMs.

4.5K
Active
Jupyter Notebook
LLM Frameworks
AI Tool Connectors
React
#graph-construction #LLM #Neo4j

ucbepic/docetl

A system for agentic LLM-powered data processing and ETL workflows for unstructured data analysis.

3.7K
Active
Python
Agents & Orchestration
ETL & Pipelines
Python
#agents #data-pipelines #document-processing

towhee-io/towhee

A fast and simple framework for building neural data processing pipelines using Python.

3.5K
Archived
Python
LLM Frameworks
Computer Vision
Python
#machine-learning #computer-vision #embeddings

milvus-io/bootcamp

A bootcamp for working with unstructured data, covering tasks such as reverse image search, audio search, and NLP.

2.4K
Active
Jupyter Notebook
Embeddings
Semantic Search
Python
#audio-search #image-search #nlp

instill-ai/instill-core

Instill Core is an open-source AI infrastructure tool for orchestrating data, models, and pipelines to build AI-powered applications.

2.3K
Active
Python
LLM Frameworks
Agents & Orchestration
Golang
#ai #generative-ai #llm

nomic-ai/nomic

Nomic Developer API SDK is a Python library that provides tools for clustering, duplicate detection, embeddings, and topic modeling on unstructured data.

1.9K
Stable
Python
LLM Wrappers & SDKs
Databases
Python
#clustering #embeddings #text-processing

NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit.

1.9K
Stable
Python
Computer Vision
API Frameworks
Python
#document-analysis #document-data-extraction #ocr-benchmark

shcherbak-ai/contextgem

A Python library for extracting data and LLM outputs from various document types with ease.

1.8K
Stable
Python
LLM Frameworks
Data Extraction
#llm #data-extraction #document-intelligence

dingodb/dingo

A high-performance, MySQL-compatible vector database that supports structured and unstructured data for AI-driven applications.

1.7K
Active
Java
Vector Databases
API Frameworks
#vector-database #mysql-compatibility #structured-data

yobix-ai/extractous

Powerful, fast, and efficient unstructured data extraction library written in Rust with language bindings.

1.7K
Archived
Rust
ETL & Pipelines
Rust
#data-extraction #unstructured-data #etl

lotus-data/lotus

A Python library that uses LLMs and embeddings to process datasets with up to 1000x speedups.

1.6K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#ai-data-processing #llm #semantic-search

emcf/thepipe

A Python library that helps developers extract structured data from tricky documents using vision-language models.

1.5K
Stable
Python
LLM Frameworks
ETL & Pipelines
Python
#document-processing #large-language-models #multimodal

tstanislawek/awesome-document-understanding

A curated list of resources for Document Understanding (DU) related to machine learning and natural language processing.

1.5K
Archived
Computer Vision
Natural Language Processing
#document-understanding #pdf-processing #ocr

amphi-ai/amphi-etl

A visual data preparation tool powered by Python, designed for data analysis and ETL tasks.

1.4K
Active
TypeScript
ETL & Pipelines
Data Analysis
TypeScript
#data-analysis #data-pipelines #data-transformation

Renumics/spotlight

Interactively explore unstructured datasets like audio, images, and video using this TypeScript library.

1.3K
Active
TypeScript
Computer Vision
Caching
React
#data-visualization #exploratory-data-analysis #unstructured-data

Open-Source-Legal/OpenContracts

An enterprise-grade, API-first LLM workspace for unstructured document processing, with features like data extraction, redaction, and prompt engineering.

1.2K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#llm #prompt-engineering #etl

databricks/lilac

An open-source Python library that helps curate better data for large language models (LLMs).

1.1K
Archived
Python
LLM Frameworks
Data Analysis
Python
#data-curation #unstructured-data #dataset-analysis
