ETL & Pipelines

Explore 310 open-source projects in ETL & Pipelines

Showing 61-80 of 310 projects

scikit-learn-contrib/imbalanced-learn

A Python package to tackle the curse of imbalanced datasets in machine learning

7.1K
Stable
Python
Python
#machine-learning #imbalanced-datasets #python-package

snowplow/snowplow

A powerful customer data pipeline for collecting, processing, and analyzing user events and behavior.

7.0K
Experimental
Scala
ETL & Pipelines
API Frameworks
#data-pipeline #analytics #marketing-analytics

flyteorg/flyte

A flexible workflow orchestration platform that seamlessly integrates data, ML, and analytics stacks.

6.8K
Active
Go
ML Ops
API Frameworks
Go
#workflow-orchestration #data-integration #machine-learning

nteract/papermill

Papermill is a Python library that allows you to parameterize, execute, and analyze Jupyter notebooks.

6.4K
Active
Python
CLI Tools
Documentation
Jupyter
#notebooks #jupyter #python

apache/flink-cdc

Flink CDC is a streaming data integration tool that enables real-time data pipelines and change data capture.

6.4K
Active
Java
ETL & Pipelines
Realtime
#streaming #cdc #change-data-capture

cloudquery/cloudquery

Data pipelines for cloud config and security data, enabling CSPM, FinOps, and vulnerability management solutions.

6.3K
Active
Go
API Frameworks
ETL & Pipelines
Go
#cloud #security #data-engineering

cocoindex-io/cocoindex

Data transformation framework for AI with ultra-fast, incremental processing capabilities.

6.3K
Active
Rust
LLM Frameworks
ETL & Pipelines
Rust
#ai #data-engineering #data-transformation

pachyderm/pachyderm

Pachyderm is a data-centric pipeline and data versioning platform for building and scaling data-intensive applications.

6.3K
Experimental
Go
ETL & Pipelines
Containerization
Go
#data-pipelines #data-versioning #distributed-systems

datajuicer/data-juicer

A Python library for processing and analyzing data with foundation models and large language models.

6.0K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#data-processing #data-analysis #foundation-models

apache/nifi

Apache NiFi is a powerful data flow management system that enables developers to build complex data pipelines.

6.0K
Active
Java
API Frameworks
ETL & Pipelines
#data-pipeline #etl #streaming

WeiYe-Jing/datax-web

DataX-Web is a visual data integration platform that supports RDBMS, Hive, HBase, ClickHouse, MongoDB and other data sources.

6.0K
Archived
Java
BaaS Platforms
ETL & Pipelines
Java
#data-integration #etl #rdbms

DropsDevopsOrg/ECommerceCrawlers

A collection of Python-based web crawlers for scraping data from various e-commerce and online platforms.

5.4K
Archived
Python
Backend Frameworks
ETL & Pipelines
Scrapy
#web-scraping #data-extraction #e-commerce

Eventual-Inc/Daft

High-performance data engine for AI and multimodal workloads, processing images, audio, video, and structured data at scale.

5.3K
Active
Rust
ML Ops
ETL & Pipelines
Rust
#ai-engineering #data-engineering #distributed

TurboWay/bigdata_analyse

A Python project for big data analysis, focusing on HQL, SQL, and data processing.

5.0K
Archived
Python
Databases
ETL & Pipelines
#big-data #data-processing #data-analysis

dlt-hub/dlt

An open-source Python library that simplifies the process of loading data into data lakes and warehouses.

5.0K
Active
Python
ETL & Pipelines
CLI Tools
Python
#data-engineering #data-loading #data-pipelines

Alfred1984/interesting-python

A collection of Python web scraping and data analysis projects.

5.0K
Archived
Jupyter Notebook
Backend Frameworks
ETL & Pipelines
#web-scraping #data-analysis #python

jitsucom/jitsu

Open-source data pipeline engine for real-time ETL, connecting data sources to warehouses like BigQuery, Snowflake, Redshift.

4.7K
Active
TypeScript
ETL & Pipelines
API Frameworks
TypeScript
#data-ingestion #etl #segment-alternative

makcedward/nlpaug

A data augmentation library for natural language processing (NLP) tasks, enabling developers to improve model performance.

4.6K
Archived
Jupyter Notebook
Computer Vision
ML Ops
Python
#natural-language-processing #data-augmentation #computer-vision

deanmalmgren/textract

A Python library that provides a simple and unified interface for extracting text from any document format.

4.5K
Archived
HTML
ETL & Pipelines
CLI Tools
Python
#text-extraction #pdf #docx

rudderlabs/rudder-server

Rudder Server is a privacy-focused, Segment-alternative customer data platform written in Go and React.

4.4K
Active
Go
Customer Data Platform
ETL & Pipelines
React
#customer-data-platform #customer-data-pipeline #data-integration
