Explore Projects

Discover 18 open source projects

Search: data-processing

Showing 1-18 of 18 projects

pathwaycom/pathway

Python ETL framework for real-time analytics and LLM pipelines

59.5K

Active

Python

LLM Frameworks

ETL & Pipelines

Python

#etl #real-time #llm

onceupon/Bash-Oneliner

A collection of handy Bash one-liners and terminal tricks for data processing and Linux system maintenance.

10.7K

Active

CLI Tools

#bash #one-liners #data-processing

johnkerl/miller

Miller is a powerful CLI tool for processing tabular data like CSV, TSV, and JSON, similar to awk, sed, and other Unix utilities.

9.8K

Active

CLI Tools

#csv #json #data-processing

TomWright/dasel

A Go tool for selecting, updating, and deleting data from various file formats like JSON, YAML, and XML.

7.9K

Active

CLI Tools

Data Processing

#data-processing #configuration #cli-tool

cocoindex-io/cocoindex

Data transformation framework for AI with ultra-fast, incremental processing capabilities.

6.3K

Active

Rust

LLM Frameworks

ETL & Pipelines

Rust

#ai #data-engineering #data-transformation

datajuicer/data-juicer

A Python library for processing and analyzing data with foundation models and large language models.

6.0K

Active

Python

LLM Frameworks

ETL & Pipelines

Python

#data-processing #data-analysis #foundation-models

NVIDIA/DALI

A GPU-accelerated library of highly optimized data loading and preprocessing building blocks for deep learning training and inference applications.

5.6K

Active

C++

GPU

Data Processing

PyTorch

#gpu #data-processing #deep-learning

deepseek-ai/smallpond

A lightweight data processing framework built on DuckDB and 3FS for vibe coders working with AI tools.

4.9K

Experimental

Python

Databases

LLM Frameworks

Python

#data-processing #duckdb #llm

OpenDCAI/DataFlow

LLM-based operators and pipelines for data preparation.

2.9K

Active

Python

AI Coding Tools

Gradio

#data-science #data-agent #data-cleaning

numaproj/numaflow

Numaflow is a Kubernetes-native platform to run massively parallel data/streaming jobs.

2.4K

Active

Rust

API Frameworks

Pipelines

#kubernetes #data-processing #streaming

microsoft/DialoGPT

A large-scale pretrained dialogue model for building conversational AI applications.

2.4K

Archived

Python

LLM Frameworks

API Frameworks

PyTorch

#dialogue #language-model #text-generation

asyml/texar

Texar is a toolkit for machine learning, NLP, and text generation in TensorFlow, part of the CASL project.

2.4K

Archived

Python

LLM Frameworks

API Frameworks

TensorFlow

#machine-learning #natural-language-processing #text-generation

bytewax/bytewax

Bytewax is a Python library for building scalable, fault-tolerant, and low-latency data processing pipelines.

2.0K

Experimental

Python

ETL & Pipelines

API Frameworks

Python

#streaming #data-engineering #data-processing

pyper-dev/pyper

Concurrent Python made simple, with support for asyncio, multiprocessing, and threading.

1.5K

Experimental

Python

API Frameworks

CLI Tools

Python

#asyncio #concurrency #multiprocessing

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for large language models (LLMs).

1.4K

Active

Python

#data-curation #large-language-models #data-preparation

allenai/dolma

A Python library and tools for generating and inspecting data for pre-training large language models (LLMs).

1.4K

Stable

Python

LLM Frameworks

Data Processing

Python

#large-language-models #data-processing #natural-language-processing

GoogleCloudPlatform/data-science-on-gcp

A repository providing data science tools and examples for the Google Cloud Platform.

1.4K

Stable

Jupyter Notebook

React

#data-science #cloud-computing #google-cloud

run-house/kubetorch

A Python-based infrastructure toolkit for distributing and running AI workloads on Kubernetes, with a PyTorch-like feel.

1.2K

Active

Python

ML Ops

Containerization

PyTorch

#kubernetes #distributed-computing #data-science
