ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 181-200 of 310 projects

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

This GitHub repository contains SQL data analysis and visualization projects using various tools and databases.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#sql #data-analysis #data-visualization

MarkPDFdown/markpdfdown

A high-quality PDF to Markdown conversion tool powered by large language model visual recognition.

1.7K
Active
Python
LLM Wrappers & SDKs
ETL & Pipelines
Python
#pdf-converter #markdown-generation #llm-integration

jadianes/spark-py-notebooks

Apache Spark and Python tutorials for big data analysis and machine learning as Jupyter notebooks.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
Jupyter Notebook
#big-data #data-analysis #data-science

ClimbsRocks/auto_ml

Automated machine learning for analytics & production use cases powered by popular ML libraries.

1.7K
Archived
Python
ML Ops
ETL & Pipelines
scikit-learn
#automated-machine-learning #data-science #production-ready

jasonwei20/eda_nlp

Easy data augmentation (EDA) techniques for NLP text classification, presented at EMNLP 2019.

1.6K
Archived
Python
Python
#data-augmentation #nlp #text-classification

google/UIforETW

UIforETW is a user interface for recording and managing Event Tracing for Windows (ETW) traces, aimed at developers.

1.6K
Experimental
C++
CLI Tools
Realtime
#etw #tracing #diagnostics

Hiflylabs/awesome-dbt

A curated list of awesome resources for the data transformation tool dbt, focused on analytics engineering.

1.6K
Active
ETL & Pipelines
#analytics-engineering #data-engineering #dbt

Multiwoven/multiwoven

Open-source reverse ETL tool for data activation and customer data platform integration.

1.6K
Active
Ruby
API Frameworks
ETL & Pipelines
React
#data-activation #customer-data-platform #reverse-etl

tansu-io/tansu

Apache Kafka-compatible broker with support for S3, PostgreSQL, SQLite, Apache Iceberg, and Delta Lake.

1.6K
Active
Rust
API Frameworks
Databases
#apache-kafka #s3 #postgresql

saermart/DouyinLiveWebFetcher

A Python library for scraping real-time data from Douyin (the Chinese counterpart of TikTok) live streams, including comments and metadata.

1.6K
Stable
Python
API Frameworks
Backend Frameworks
#web-scraping #live-streaming #comments

stripe-archive/mosql

A Ruby library that enables streaming replication from MongoDB to PostgreSQL databases.

1.6K
Archived
Ruby
API Frameworks
Databases
#mongodb #postgresql #streaming

cgarciae/pypeln

Concurrent data pipelines in Python for building efficient and scalable data processing workflows.

1.6K
Archived
Python
ETL & Pipelines
CLI Tools
Python
#data-processing #concurrent-pipelines #etl

probberechts/soccerdata

A Python library for scraping soccer data from various sources for sports analytics and data science.

1.6K
Active
Python
ETL & Pipelines
CLI Tools
#soccer-analytics #data-scraping #sports-data

srx-2000/spider_collection

A collection of Python web scraping scripts for various websites and platforms, including music, video, and real estate data.

1.6K
Archived
Python
Backend Frameworks
ETL & Pipelines
#web-scraping #data-extraction #python-scripts

getdozer/dozer

Dozer is a real-time data movement tool that leverages CDC to move data between various sources and sinks.

1.6K
Archived
Rust
ETL & Pipelines
Realtime
Rust
#realtime #data-movement #etl

re-data/re-data

A data quality and observability tool for monitoring and fixing data issues before they become problems.

1.6K
Archived
HTML
ETL & Pipelines
CLI Tools
dbt
#data-quality #data-observability #data-monitoring

lotus-data/lotus

A Python library that uses LLMs and embeddings to process datasets, with reported speedups of up to 1000x.

1.6K
Active
Python
LLM Frameworks
ETL & Pipelines
Python
#ai-data-processing #llm #semantic-search

ArchiveTeam/grab-site

A web crawler tool that outputs WARC files and provides a dashboard for managing crawls.

1.6K
Experimental
Python
CLI Tools
Backend Frameworks
Python
#archiving #crawling #web-scraping

capitalone/DataProfiler

A Python library for extracting schema, statistics, and entities from datasets, useful for data profiling and privacy analysis.

1.5K
Stable
Python
ETL & Pipelines
CLI Tools
Python
#data-profiling #data-analysis #privacy

hi-primus/optimus

Agile data preparation workflows made easy with popular Python data science libraries.

1.5K
Archived
Python
ETL & Pipelines
API Frameworks
#big-data-cleaning #data-analysis #data-cleaning