ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 41-60 of 310 projects

kedro-org/kedro

Kedro is a Python toolkit for building production-ready data science and machine learning pipelines.

10.8K
Active
Python
ETL & Pipelines
Python
#machine-learning #data-engineering #pipeline

PRQL/prql

PRQL is a modern, powerful, and pipelined SQL replacement for transforming data.

10.7K
Active
Rust
ETL & Pipelines
#data-transformation #sql-alternative #pipeline

drivendataorg/cookiecutter-data-science

A flexible and standardized cookiecutter template for doing and sharing data science work in Python.

9.7K
Active
Python
ETL & Pipelines
Python
#data-science #machine-learning #cookiecutter

apache/seatunnel

A high-performance, distributed data integration tool for batch, streaming, and CDC use cases.

9.1K
Active
Java
ETL & Pipelines
Realtime
#data-integration #batch #streaming

blue-yonder/tsfresh

Automatic feature extraction from time series data for data science and machine learning applications.

9.1K
Stable
Jupyter Notebook
Feature Extraction
ETL & Pipelines
Python
#time-series #feature-engineering #data-science

saulpw/visidata

A terminal spreadsheet multitool for discovering and arranging data.

8.9K
Active
Python
CLI Tools
Databases
Python
#cli #csv #datawrangling

risingwavelabs/risingwave

An open-source, Rust-based event streaming platform for real-time data processing and analytics.

8.8K
Active
Rust
API Frameworks
Databases
Rust
#event-streaming #real-time #data-processing

mage-ai/mage-ai

mage-ai is a Python-based platform for building, running, and managing data pipelines that integrate and transform data.

8.7K
Active
Python
ETL & Pipelines
ML Ops
Python
#data-pipelines #data-transformation #data-integration

delta-io/delta

An open-source data lakehouse framework that enables building data pipelines with leading big data compute engines.

8.6K
Active
Scala
ETL & Pipelines
API Frameworks
Spark
#big-data #data-engineering #data-lakehouse

redpanda-data/connect

A highly configurable, production-ready stream processing platform for building real-time data pipelines.

8.6K
Active
Go
Realtime
ETL & Pipelines
Go
#stream-processing #message-queue #data-engineering

apache/beam

Apache Beam is a unified programming model for batch and streaming data processing.

8.5K
Active
Java
ETL & Pipelines
API Frameworks
#batch #streaming #big-data

vaexio/vaex

A high-performance Python library for working with large tabular datasets, offering efficient data manipulation and visualization.

8.5K
Stable
Python
Databases
Caching
Python
#bigdata #data-science #dataframe

apache/datafusion

Apache DataFusion is a powerful SQL query engine written in Rust, designed for big data processing and analysis.

8.5K
Active
Rust
Databases
ETL & Pipelines
#big-data #dataframe #olap

pentaho/pentaho-kettle

Pentaho Data Integration (ETL) is a Java-based tool for building data integration and ETL pipelines.

8.3K
Active
Java
ETL & Pipelines
#etl #data-integration #pentaho

kangvcar/InfoSpider

INFO-SPIDER is an open-source web scraping toolkit that helps users retrieve data from various sources like email, e-commerce, and social platforms.

8.2K
Active
Python
Backend Frameworks
ETL & Pipelines
Python
#web-scraping #data-extraction #open-source

lorien/awesome-web-scraping

A comprehensive list of libraries, tools, and APIs for web scraping and data processing.

7.8K
Active
Makefile
Backend Frameworks
ETL & Pipelines
#web-scraping #crawling #data-processing

turbot/steampipe

Steampipe is a zero-ETL, SQL-powered platform for live querying cloud APIs and infrastructure.

7.7K
Active
Go
API Frameworks
ETL & Pipelines
#cloud #etl #sql

fluent/fluent-bit

Fluent Bit is a fast and lightweight log, metrics, and traces processor for Linux, BSD, OSX, and Windows.

7.7K
Active
C
API Frameworks
ETL & Pipelines
#logging #metrics #traces

alteryx/featuretools

An open-source Python library for automated feature engineering in machine learning.

7.6K
Active
Python
Automated Machine Learning
ETL & Pipelines
Python
#automated-feature-engineering #machine-learning #data-science

tabulapdf/tabula

Tabula is a tool for extracting data from PDF files, allowing developers to easily parse and extract tables.

7.3K
Experimental
CSS
API Frameworks
ETL & Pipelines
#pdf #scraping #data-extraction
