ETL & Pipelines

Explore 310 open source projects in ETL & Pipelines

Showing 261-280 of 310 projects

ptwobrussell/Mining-the-Social-Web

Example source code accompanying the book Mining the Social Web, covering data mining from social media platforms like Twitter, Facebook, and Reddit.

1.2K
Archived
JavaScript
Backend Frameworks
ETL & Pipelines
Node.js
#social-media #data-mining #web-scraping

2ndQuadrant/pglogical

A high-performance logical replication extension for PostgreSQL that enables fast, cross-version database replication.

1.2K
Stable
C
ETL & Pipelines
API Frameworks
#database-replication #etl #logical-decoding

kkyon/botflow

A Python framework for building data pipelines, web crawlers, and quantitative trading applications.

1.2K
Archived
Python
API Frameworks
ETL & Pipelines
Python
#data-pipeline #web-crawler #quantitative-trading

sryza/spark-timeseries

A library for time series analysis on Apache Spark, enabling efficient large-scale time series processing.

1.2K
Archived
Scala
Databases
ETL & Pipelines
Spark
#time-series #large-scale #data-processing

langchain-ai/langchain-extract

A LangChain-based framework for extracting data from various sources using LLMs and APIs.

1.2K
Stable
Rich Text Format
LLM Frameworks
API Clients & Testing
FastAPI
#extraction #data-extraction #llms

marsupialtail/quokka

A scalable, distributed ETL framework for building data lake analytics pipelines.

1.2K
Archived
Python
ETL & Pipelines
API Frameworks
Python
#data-lake #analytics #distributed

react-csv/react-csv

A React component library for generating CSV files on the fly from data arrays or objects.

1.2K
Archived
JavaScript
Component Libraries (React)
ETL & Pipelines
React
#csv #data-export #data-processing

predict-idlab/plotly-resampler

A Python library that helps visualize large time series data using the Plotly data visualization library.

1.2K
Stable
Python
Charts & Visualization
ETL & Pipelines
Python
#data-visualization #time-series #plotly

lakehq/sail

LakeSail is a Rust-based computation framework that unifies batch processing, stream processing, and AI workloads.

1.2K
Active
Rust
ML Ops
ETL & Pipelines
#distributed-computing #data-engineering #big-data

pytroll/satpy

A Python package for processing earth-observing satellite data with support for common data formats and tools.

1.2K
Active
Python
Databases
ETL & Pipelines
Python
#satellite #weather #climate

apache/incubator-xtable

Apache XTable is a cross-table converter for lakehouse table formats that facilitates interoperability across data processing systems and query engines.

1.2K
Active
Java
ETL & Pipelines
#interoperability #lakehouse #data-processing

ChawlaAvi/Daily-Dose-of-Data-Science

A collection of code snippets and tutorials for data science and data analysis in Python.

1.2K
Experimental
Jupyter Notebook
Databases
ETL & Pipelines
Jupyter
#data-analysis #data-science #jupyter-notebook

zinggAI/zingg

Scalable identity resolution, entity resolution, data mastering, and deduplication using ML.

1.2K
Active
Java
ETL & Pipelines
ML Ops
#identity-resolution #entity-resolution #data-deduplication

thinh-vu/vnstock

A beginner-friendly Python toolkit for financial data extraction, analysis, and automation.

1.2K
Active
Python
ETL & Pipelines
Backend Frameworks
Python
#data-extraction #quantitative-analysis #stock-market

KEV0143/Parser-Chitai-Gorod

A high-speed, intelligent web scraper for the Chitai-Gorod book catalog, enabling structured data collection.

1.2K
Experimental
Python
Backend & APIs
CLI Tools
Python
#web-scraping #data-extraction #book-catalog

astronomer/astronomer-cosmos

Run your dbt Core or dbt Fusion projects as Apache Airflow DAGs and Task Groups with a few lines of code.

1.2K
Active
Python
API Frameworks
ETL & Pipelines
Python
#airflow #dbt #workflow

lensacom/sparkit-learn

A Python library that integrates Scikit-learn into the Apache Spark distributed computing framework.

1.2K
Archived
Python
ML Ops
ETL & Pipelines
#apache-spark #scikit-learn #distributed-computing

abhishek-ch/around-dataengineering

A comprehensive knowledge hub for data engineering, machine learning, and MLOps tools and practices.

1.1K
Archived
Python
ETL & Pipelines
ML Ops
Python
#data-engineering #machine-learning #mlops

petewarden/dstk

A collection of open data sets and tools for data science and machine learning tasks.

1.1K
Archived
Ruby
Databases
ETL & Pipelines
#data-science #machine-learning #open-data

opensemanticsearch/open-semantic-search

Open-source search and text analytics platform for exploring large document collections with semantic search and NLP.

1.1K
Experimental
Shell
Search-as-a-Service
Search
#search #semantic-search #text-analytics