Explore Projects

Discover 18 open source projects

Active filters (1):
Search: data-processing

Showing 1-18 of 18 projects

pathwaycom/pathway

Python ETL framework for real-time analytics and LLM pipelines

59.5K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#etl #real-time #llm

onceupon/Bash-Oneliner

A collection of handy Bash one-liners and terminal tricks for data processing and Linux system maintenance.

10.7K
Active
CLI Tools
#bash #one-liners #data-processing

johnkerl/miller

Miller is a powerful CLI tool for processing tabular data like CSV, TSV, and JSON, similar to awk, sed, and other Unix utilities.

9.8K
Active
Go
CLI Tools
#csv #json #data-processing

TomWright/dasel

A Go tool for selecting, updating, and deleting data from various file formats like JSON, YAML, and XML.

7.9K
Active
Go
CLI Tools
Data Processing
#data-processing #configuration #cli-tool

cocoindex-io/cocoindex

Data transformation framework for AI with ultra-fast, incremental processing capabilities.

6.3K
Active
Rust
LLM Frameworks
ETL & Pipelines
Rust
#ai #data-engineering #data-transformation

datajuicer/data-juicer

A Python library for processing and analyzing data with foundation models and large language models.

6.0K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#data-processing #data-analysis #foundation-models

NVIDIA/DALI

A GPU-accelerated data loading and preprocessing library for deep learning training and inference applications.

5.6K
Active
C++
GPU
Data Processing
PyTorch
#gpu #data-processing #deep-learning

deepseek-ai/smallpond

A lightweight data processing framework built on DuckDB and 3FS for vibe coders working with AI tools.

4.9K
Experimental
Python
Databases
LLM Frameworks
Python
#data-processing #duckdb #llm

OpenDCAI/DataFlow

LLM-based operators and pipelines for data preparation

2.9K
Active
Python
AI Coding Tools
Gradio
#data-science #data-agent #data-cleaning

numaproj/numaflow

Numaflow is a Kubernetes-native platform to run massively parallel data/streaming jobs.

2.4K
Active
Rust
API Frameworks
Pipelines
#kubernetes #data-processing #streaming

microsoft/DialoGPT

A large-scale pretrained dialogue model for building conversational AI applications.

2.4K
Archived
Python
LLM Frameworks
API Frameworks
PyTorch
#dialogue #language-model #text-generation

asyml/texar

Texar is a toolkit for machine learning, NLP, and text generation in TensorFlow, part of the CASL project.

2.4K
Archived
Python
LLM Frameworks
API Frameworks
TensorFlow
#machine-learning #natural-language-processing #text-generation

bytewax/bytewax

Bytewax is a Python library for building scalable, fault-tolerant, and low-latency data processing pipelines.

2.0K
Experimental
Python
ETL & Pipelines
API Frameworks
Python
#streaming #data-engineering #data-processing

pyper-dev/pyper

Concurrent Python made simple, with support for asyncio, multiprocessing, and threading.

1.5K
Experimental
Python
API Frameworks
CLI Tools
Python
#asyncio #concurrency #multiprocessing

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for large language models (LLMs)

1.4K
Active
Python
Python
#data-curation #large-language-models #data-preparation

allenai/dolma

A Python library and tools for generating and inspecting data for pre-training large language models (LLMs).

1.4K
Stable
Python
LLM Frameworks
Data Processing
Python
#large-language-models #data-processing #natural-language-processing

GoogleCloudPlatform/data-science-on-gcp

A repository providing data science tools and examples for the Google Cloud Platform.

1.4K
Stable
Jupyter Notebook
React
#data-science #cloud-computing #google-cloud

run-house/kubetorch

A Python-based infrastructure toolkit, in the style of PyTorch, for distributing and running AI workloads on Kubernetes.

1.2K
Active
Python
ML Ops
Containerization
PyTorch
#kubernetes #distributed-computing #data-science
