Explore Projects

Discover 39 open source projects

Active filters (1):
Search: data-pipelines

Showing 1-20 of 39 projects

pathwaycom/pathway

Python ETL framework for real-time analytics and LLM pipelines

59.5K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#etl #real-time #llm

apache/airflow

Apache Airflow for workflow orchestration

44.5K
Active
Python
ETL & Pipelines
Background Jobs
Python
#airflow #data-pipelines #workflow-orchestration

airbytehq/airbyte

Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes

20.8K
Active
Python
ETL & Pipelines
#data-integration #elt #etl

apache/shardingsphere

Distributed SQL database middleware for sharding, scalability, and security

20.7K
Active
Java
Databases
Java
#distributed-sql #database-sharding #data-encryption

dagster-io/dagster

An open-source data orchestration platform for developing, running, and observing data pipelines and workflows.

15.1K
Active
Python
ETL & Pipelines
Python
#data-engineering #data-orchestration #workflow-automation

apache/dolphinscheduler

Apache DolphinScheduler is a modern data orchestration platform for creating high-performance workflows with low-code.

14.2K
Active
Java
Realtime
#workflow-orchestration #job-scheduler #data-pipelines

Unstructured-IO/unstructured

Unstructured is an open-source ETL solution for transforming complex documents into structured data for language models.

14.1K
Active
HTML
Document Processing
#document-processing #data-pipelines #natural-language-processing

debezium/debezium

An open-source framework for change data capture from various databases using Apache Kafka.

12.5K
Active
Java
ETL & Pipelines
Apache Kafka
#change-data-capture #event-streaming #database

mage-ai/mage-ai

mage-ai is a Python-based platform for building, running, and managing data pipelines and integrating/transforming data.

8.7K
Active
Python
ETL & Pipelines
ML Ops
Python
#data-pipelines #data-transformation #data-integration

snowplow/snowplow

A powerful customer data pipeline for collecting, processing, and analyzing user events and behavior.

7.0K
Experimental
Scala
ETL & Pipelines
API Frameworks
#data-pipeline #analytics #marketing-analytics

apache/flink-cdc

Flink CDC is a streaming data integration tool that enables real-time data pipelines and change data capture.

6.4K
Active
Java
ETL & Pipelines
Realtime
#streaming #cdc #change-data-capture

datajuicer/data-juicer

A Python library for processing and analyzing data with foundation models and large language models.

6.0K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#data-processing #data-analysis #foundation-models

fluvio-community/fluvio

Fluvio is an event stream processing engine for developers to build responsive data-intensive apps.

5.2K
Active
Rust
Data Pipelines
Realtime
Rust
#streaming #real-time #data-processing

rudderlabs/rudder-server

Rudder Server is a privacy-focused, Segment-alternative customer data platform written in Go and React.

4.4K
Active
Go
Customer Data Platform
ETL & Pipelines
React
#customer-data-platform #customer-data-pipeline #data-integration

StructuredLabs/preswald

Preswald is a WASM packager for Python-based interactive data apps that can be run completely in-browser.

4.3K
Experimental
Python
LLM Frameworks
ETL & Pipelines
Python
#data-applications #data-visualization #data-pipelines

adilkhash/Data-Engineering-HowTo

A list of resources to learn Data Engineering from scratch

4.0K
Archived
React
#data-engineering #data-pipeline #distributed-systems

Netflix/maestro

Maestro is Netflix's workflow orchestrator for building data pipelines and batch processing workflows.

3.7K
Active
Java
ETL & Pipelines
Background Jobs
Java
#data-engineering #batch-processing #workflow-orchestration

ucbepic/docetl

A system for agentic LLM-powered data processing and ETL workflows for unstructured data analysis.

3.7K
Active
Python
Agents & Orchestration
ETL & Pipelines
Python
#agents #data-pipelines #document-processing

superstreamlabs/memphis

Memphis.dev is a highly scalable and effortless data streaming platform

3.4K
Archived
Go
Go
#streaming #data-engineering #golang

bruin-data/ingestr

ingestr is a CLI tool that seamlessly copies data between any databases with a single command.

3.4K
Active
Python
API Frameworks
ETL & Pipelines
Python
#data-ingestion #data-integration #data-pipeline