ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 101-120 of 310 projects

sdv-dev/SDV

Generates synthetic tabular data for machine learning and AI applications.

3.4K
Active
Python
AI Code Generation
Next.js
#synthetic-data-generation #tabular-data #machine-learning

bruin-data/ingestr

ingestr is a CLI tool that seamlessly copies data between any databases with a single command.

3.4K
Active
Python
API Frameworks
ETL & Pipelines
Python
#data-ingestion #data-integration #data-pipeline

databricks/koalas

Koalas is a pandas-like API for Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

3.4K
Archived
Python
ORMs & Query Builders
Databases
Spark
#big-data #data-science #dataframe

WeBankFinTech/DataSphereStudio

DataSphereStudio is a one-stop data application development and management portal covering data exchange, analysis, and visualization.

3.3K
Stable
Java
ETL & Pipelines
API Frameworks
Spark
#data-management #data-analysis #data-visualization

lakesoul-io/LakeSoul

LakeSoul is a cloud-native, real-time Lakehouse framework for fast data ingestion and analytics on cloud storage.

3.2K
Active
Java
API Frameworks
Databases
#big-data #lakehouse #streaming

apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark.

3.2K
Active
Java
ETL & Pipelines
Realtime
#big-data #data-ingestion #flink

internetarchive/heritrix3

Heritrix is an open-source, extensible web crawler for archiving websites at scale.

3.2K
Active
Java
Backend Frameworks
ETL & Pipelines
#web-crawling #warc #java

pydata/pandas-datareader

A Python library for extracting data from a wide range of internet sources into a pandas DataFrame.

3.2K
Experimental
Python
Databases
ETL & Pipelines
Python
#data-analysis #data-extraction #pandas

delta-io/delta-rs

A Rust library for interacting with Delta Lake, a data lake storage format, with Python bindings.

3.2K
Active
Rust
ETL & Pipelines
API Frameworks
#delta-lake #etl #data-engineering

spark-notebook/spark-notebook

An interactive and reactive data science platform powered by Scala and Apache Spark.

3.2K
Archived
JavaScript
Databases
ETL & Pipelines
Scala
#data-science #interactive #reactive

blockchain-etl/ethereum-etl

Python scripts for extracting, transforming and loading Ethereum blockchain data into Google BigQuery.

3.1K
Active
Python
ETL & Pipelines
API Frameworks
#blockchain-analytics #erc20 #erc721

gunnarmorling/awesome-opensource-data-engineering

An Awesome List of open-source data engineering projects for developers.

3.0K
Archived
ETL & Pipelines
CLI Tools
#data-engineering #data-pipeline #etl

webdataset/webdataset

A high-performance I/O system for large deep learning problems with strong PyTorch support.

3.0K
Experimental
Python
ML Ops
ETL & Pipelines
PyTorch
#data-augmentation #deep-learning #pytorch

PeerDB-io/peerdb

Fast, cost-effective data replication tool from Postgres to data warehouses, queues, and storage.

3.0K
Active
Go
ETL & Pipelines
Realtime
#postgres #data-replication #etl

datafold/data-diff

A Python library for comparing data across databases, supporting various database engines.

3.0K
Archived
Python
Databases
ETL & Pipelines
#data-diffing #data-quality #data-engineering

alldatacenter/alldata

An open-source platform for building data-driven applications and AI-powered solutions with a focus on vibe coders.

3.0K
Active
Java
LLM Frameworks
MCP Frameworks
Spring Cloud
#data-platform #ai-tools #etl

x4nth055/pythoncode-tutorials

A collection of Python tutorials covering a wide range of topics from computer vision to network security.

3.0K
Stable
Jupyter Notebook
Tutorials & Courses
ETL & Pipelines
#python #tutorials #machine-learning

apache/incubator-devlake

An open-source dev data platform to ingest, analyze, and visualize data from DevOps tools for engineering insights.

2.9K
Active
Go
ETL & Pipelines
CLI Tools
Go
#devops #data-analysis #data-engineering

TobikoData/sqlmesh

Scalable and efficient data transformation framework with backwards compatibility for dbt.

2.9K
Active
Python
ETL & Pipelines
Databases
Python
#data-engineering #dataops #dbt

huggingface/datatrove

A Python library that provides a set of customizable pipeline processing blocks for data processing tasks.

2.9K
Active
Python
ETL & Pipelines
CLI Tools
Python
#data-processing #pipeline #customizable