ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 121-140 of 310 projects

ptwobrussell/Mining-the-Social-Web-2nd-Edition

An official compendium for the book 'Mining the Social Web' focused on web scraping and data analysis.

2.9K
Archived
HTML
Backend Frameworks
ETL & Pipelines
#web-scraping #data-analysis #book-companion

apache/gravitino

An open-source data catalog platform for building a high-performance, federated metadata lake.

2.9K
Active
Java
Databases
ETL & Pipelines
Java
#data-catalog #datalake #federated-query

microsoft/table-transformer

Deep learning model for extracting and analyzing table structures from PDFs and images, with accompanying datasets.

2.9K
Archived
Python
Computer Vision
ETL & Pipelines
PyTorch
#table-extraction #computer-vision #document-processing

ciur/papermerge

Self-hosted document management system with OCR for scanning and archiving papers digitally.

2.9K
Stable
Python
API Frameworks
ETL & Pipelines
Django
#document-management-system #ocr-scanning #paperless

whylabs/whylogs

An open-source data logging library for machine learning models and data pipelines.

2.8K
Archived
Jupyter Notebook
React
#data-pipeline #machine-learning #open-source

susanli2016/NLP-with-Python

A collection of Jupyter notebooks demonstrating various NLP techniques and libraries such as scikit-learn, NLTK, spaCy, and Gensim.

2.8K
Archived
Jupyter Notebook
ML Ops
ETL & Pipelines
Jupyter Notebook
#natural-language-processing #machine-learning #data-analysis

mars-project/mars

A unified framework for large-scale data computation that scales popular Python data tools such as NumPy, pandas, and scikit-learn.

2.7K
Archived
Python
ML Ops
Caching
Dask
#machine-learning #data-processing #scale

datachain-ai/datachain

Comprehensive analytics, versioning, and ETL toolkit for multimodal data (video, audio, PDFs, images).

2.7K
Active
Python
Computer Vision
ETL & Pipelines
Python
#data-analytics #data-wrangling #embeddings

snakemake/snakemake

Snakemake is a workflow management system for reproducible and scalable data analysis.

2.7K
Active
Python
CLI Tools
ETL & Pipelines
#reproducibility #workflow-management #data-pipelines

jae-jae/QueryList

A progressive PHP crawler framework for building elegant web scrapers and crawlers.

2.7K
Experimental
PHP
Backend Frameworks
ETL & Pipelines
#crawler #scraper #spider

oxylabs/how-to-scrape-amazon-product-data

A Python-based web scraper for extracting Amazon product data such as titles, ratings, prices, images, and descriptions.

2.6K
Stable
Backend Frameworks
API Clients & Testing
Python
#amazon #web-scraping #data-extraction

justinzm/gopup

A Python data interface for various APIs, including economic and news data.

2.6K
Archived
Python
React
#data-interface #APIs #economic-data

veb-101/Data-Science-Projects

A collection of data science projects in Python, built with Jupyter notebooks.

2.6K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#data-science #python #jupyter-notebook

rilldata/rill

Rill is a tool for turning data sets into dashboards using SQL, enabling BI-as-code.

2.5K
Active
Go
Databases
ETL & Pipelines
#data-analysis #data-visualization #sql

duckdb/ducklake

DuckLake is an integrated data lake and catalog format written in C++.

2.5K
Active
C++
Databases
ETL & Pipelines
#data-lake #data-catalog #database

The-Japan-DataScientist-Society/100knocks-preprocess

A repository for the 100 Knocks of Data Science Preprocessing, focused on structured data processing.

2.5K
Experimental
HTML
ETL & Pipelines
#data-science #preprocessing #structured-data

EntilZha/PyFunctional

A Python library for creating data processing pipelines using functional programming principles.

2.5K
Experimental
Python
ETL & Pipelines
CLI Tools
Python
#data-pipeline #functional-programming #python-library

taspinar/twitterscraper

A Python library for scraping tweets from Twitter, useful for data analysis and social media monitoring.

2.5K
Archived
Python
Backend & APIs
ETL & Pipelines
#twitter #data-scraping #social-media

neilotoole/sq

sq is a Go-based data wrangling tool that supports a variety of data formats and databases.

2.5K
Active
Go
Databases
ETL & Pipelines
Go
#data-wrangling #csv #json

timhutton/twitter-archive-parser

A Python script for parsing Twitter archive data and exporting it in various formats.

2.4K
Archived
Python
CLI Tools
ETL & Pipelines
Python
#twitter #data-parsing #data-export