NVTabular is a feature engineering and preprocessing library for tabular data used in recommender systems.
A microservices-based platform for cloud-native streaming and batch data processing on Cloud Foundry and Kubernetes.
A comprehensive roadmap for data engineering and AI development in Python.
A public repository for exploring LLM-driven data engineering concepts and tools.
An open-source Python web scraping framework for OSINT and data collection tasks.
DataLink is a real-time and offline data exchange platform that supports synchronization between heterogeneous data sources.
A Swiss army knife for big data, enabling seamless integration with popular data warehousing solutions.
Apache Amoro is an open-source Lakehouse management system built on open table formats like Hudi and Iceberg, with support for processing engines like Flink.
Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.
Connect processes into powerful data pipelines with a simple Git-like filesystem interface.
A comprehensive dataset on third-party entities and their impact on the web, useful for web performance analysis.
A big data platform for analyzing e-commerce user behavior using Hadoop, Spark, and Java.
A Chinese translation of the book 'Python for Data Analysis' 2nd Edition, covering NumPy, Pandas, and other data analysis tools.
A Jupyter notebook extension for geospatial visualization and analysis, focused on open geoscience.
This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.
A native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, and more.
A Python tool that automatically cleans and preprocesses data for analysis and machine learning.
This GitHub repository contains notes and code for analyzing RNA-seq data using Python and Snakemake.
A Python library to extract text, metadata, and references from PDF files, including downloading referenced PDFs.
Open-source data warehouse learning project with examples and code for building real-time and offline data pipelines.