ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 281-300 of 310 projects

NVIDIA-Merlin/NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data used in recommender systems.

1.1K
Stable
Python
ML Ops
ETL & Pipelines
Python
#deep-learning #feature-engineering #feature-selection

spring-attic/spring-cloud-dataflow

A microservices-based platform for cloud-native streaming and batch data processing on Cloud Foundry and Kubernetes.

1.1K
Experimental
Java
API Frameworks
ETL & Pipelines
Spring
#batch-processing #cloud-native #datapipelines

lvgalvao/data-engineering-roadmap

Comprehensive roadmap for data engineering and AI development in Python

1.1K
Active
Python
ETL & Pipelines
ML Ops
Python
#data-engineering #machine-learning #python

DataExpert-io/llm-driven-data-engineering

A public repository for exploring LLM-driven data engineering concepts and tools.

1.1K
Archived
Python
LLM Frameworks
ETL & Pipelines
Python
#llm #data-engineering #etl

xillwillx/skiptracer

An open-source Python web scraping framework for OSINT and data collection tasks.

1.1K
Archived
Python
CLI Tools
Backend Frameworks
#web-scraping #osint #data-collection

ucarGroup/DataLink

DataLink is a real-time and offline data exchange platform that supports synchronization between heterogeneous data sources.

1.1K
Archived
Java
ETL & Pipelines
Realtime
Java
#data-exchange #data-replication #realtime

scratchdata/scratchdata

A Swiss army knife for big data, enabling seamless integration with popular data warehousing solutions.

1.1K
Archived
Go
Databases
CLI Tools
#bigquery #clickhouse #data-warehouse

apache/amoro

Apache Amoro is an open-source Lakehouse management system built on open table formats like Hudi and Iceberg, working with compute engines such as Flink.

1.1K
Active
Java
Databases
ETL & Pipelines
Flink
#big-data #data-lake #lakehouse

Teradata/kylo

Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.

1.1K
Archived
Java
ETL & Pipelines
Realtime
#data-lake #hadoop #spark

moby/datakit

Connect processes into powerful data pipelines with a simple git-like filesystem interface

1.1K
Archived
OCaml
ETL & Pipelines
CLI Tools
#data-flow #database #docker

patrickhulce/third-party-web

A comprehensive dataset on third-party entities and their impact on the web, useful for web performance analysis.

1.1K
Stable
JavaScript
Backend & APIs
CLI Tools
JavaScript
#web-performance #http-archive #javascript

oeljeklaus-you/UserActionAnalyzePlatform

A big data platform for analyzing e-commerce user behavior using Hadoop, Spark, and Java.

1.1K
Archived
Java
API Frameworks
Databases
Spark
#big-data #data-analytics #e-commerce

apachecn/pyda-2e-zh

A Chinese translation of the book 'Python for Data Analysis', 2nd Edition, covering NumPy, pandas, and other data analysis tools.

1.1K
Archived
CSS
Databases
ETL & Pipelines
#data-analysis #numpy #pandas

OpenGeoscience/geonotebook

A Jupyter notebook extension for geospatial visualization and analysis, focused on open geoscience.

1.1K
Archived
Python
Charts & Visualization
ETL & Pipelines
Jupyter
#geospatial #data-visualization #jupyter-notebook

mahmoudparsian/data-algorithms-book

This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.

1.1K
Archived
Java
Databases
ETL & Pipelines
Apache Hadoop
#data-algorithms #mapreduce #spark

jf-tech/omniparser

A native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, and more.

1.1K
Experimental
Go
API Frameworks
ETL & Pipelines
#csv #edi #etl

rhiever/datacleaner

A Python tool that automatically cleans and preprocesses data for analysis and machine learning.

1.1K
Archived
Python
ETL & Pipelines
CLI Tools
Python
#data-cleaning #data-preprocessing #machine-learning

crazyhottommy/RNA-seq-analysis

Notes and code for analyzing RNA-seq data using Python and Snakemake.

1.1K
Archived
Python
Databases
ETL & Pipelines
Python
#rna-seq #bioinformatics #data-analysis

metachris/pdfx

A Python library to extract text, metadata, and references from PDF files, including downloading referenced PDFs.

1.1K
Archived
Python
API Clients & Testing
APIs
#pdf #text-extraction #metadata-extraction

Mrkuhuo/data-warehouse-learning

Open-source data warehouse learning project with examples and code for building real-time and offline data pipelines.

1.1K
Stable
Java
ETL & Pipelines
API Frameworks
Flink
#data-engineering #etl #pipelines
