ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 201-220 of 310 projects

paradigmxyz/cryo

cryo is a Rust library for extracting blockchain data to parquet, CSV, JSON, or Python dataframes.

1.5K
Archived
Rust
ETL & Pipelines
API Frameworks
#blockchain#ethereum#parquet

aws-samples/aws-glue-samples

AWS Glue code samples for building data integration and ETL pipelines on AWS.

1.5K
Stable
Python
ETL & Pipelines
#aws#glue#etl

combust/mleap

MLeap is a library for deploying machine learning pipelines to production using Scala, Python, and Spark.

1.5K
Active
Scala
ML Ops
API Frameworks
Scala
#machine-learning#pipeline#production

OBenner/data-engineering-interview-questions

This GitHub repository contains over 2,000 data engineering interview questions to help developers prepare.

1.5K
Active
Python
Interview Prep
ETL & Pipelines
#data-engineering#interview-questions#interview-prep

meta-llama/synthetic-data-kit

Tool for generating high-quality synthetic datasets

1.5K
Stable
Python
React
#synthetic-data-kit#data-generation#llm

emcf/thepipe

A Python library that helps developers extract structured data from tricky documents using vision-language models.

1.5K
Stable
Python
LLM Frameworks
ETL & Pipelines
Python
#document-processing#large-language-models#multimodal

fossasia/event-collect

A Python-based scraper and converter for event websites to the Open Event format.

1.5K
Archived
Python
API Frameworks
ETL & Pipelines
#event-scraping#open-event-format#data-conversion

pvlib/pvlib-python

A Python library for simulating the performance of photovoltaic energy systems.

1.5K
Active
Python
API Frameworks
ETL & Pipelines
#photovoltaic#renewable-energy#solar-energy

superlinked/superlinked

Superlinked is a Python framework for building high-performance search & recommendation apps with structured and unstructured data.

1.5K
Stable
Jupyter Notebook
LLM Frameworks
RAG & Vector
Python
#data-pipeline#embeddings#information-retrieval

san089/goodreads_etl_pipeline

An end-to-end data pipeline for building a data lake, data warehouse, and analytics platform from GoodReads data.

1.5K
Archived
Python
ETL & Pipelines
Background Jobs
Apache Airflow
#data-engineering#etl-pipeline#data-lake

pyjanitor-devs/pyjanitor

A Python library for cleaning and transforming data, inspired by the R package Janitor.

1.5K
Active
Python
ETL & Pipelines
CLI Tools
#cleaning-data#data-transformation#pandas-extension

Factual/drake

A data workflow tool for data engineers and analysts that works like a 'Make for data'.

1.5K
Archived
Clojure
ETL & Pipelines
#data-pipelines#etl#workflow

DataBrewery/cubes

A lightweight Python OLAP framework for multi-dimensional data analysis and reporting.

1.5K
Archived
Python
ORMs & Query Builders
Databases
#olap#data-analysis#multidimensional-data

compose/transporter

Transporter is a powerful ETL tool that allows developers to sync data between various persistence engines.

1.4K
Archived
Go
ETL & Pipelines
API Frameworks
Go
#etl#data-sync#persistence-engine

bruin-data/bruin

A data platform that enables building data pipelines with SQL, Python, and ingesting from various sources.

1.4K
Active
Go
ETL & Pipelines
API Frameworks
Go
#data-pipelines#data-ingestion#data-transformation

NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for Large Language Models (LLMs)

1.4K
Active
Python
Python
#data-curation#large-language-models#data-preparation

tidyverse/tidyr

tidyr is an R package that provides a set of functions to tidy messy data into a format suitable for analysis.

1.4K
Active
R
ETL & Pipelines
CLI Tools
#data-transformation#data-cleaning#tidy-data

AlexTheAnalyst/PortfolioProjects

This repository contains a collection of portfolio projects for a data analyst.

1.4K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#data-analysis#portfolio#tutorials

astronomer/dag-factory

Declaratively construct Apache Airflow DAGs with YAML configuration files, simplifying complex data pipeline management.

1.4K
Active
Python
API Frameworks
ETL & Pipelines
Python
#airflow#data-pipelines#etl

4lex4/scantailor-advanced

ScanTailor Advanced is a C++ library for processing scanned documents, including binarization, book scanning, and digitalization.

1.4K
Archived
C++
Backend Frameworks
ETL & Pipelines
#binarization#book-scanning#digitalization