Showing 501-550 of 897 trending projects
A data workflow tool for data engineers and analysts, akin to a 'Make for data'.
An end-to-end data engineering project example showcasing tools and technologies for building data pipelines.
A simple, fast and versatile Datalog database written in Clojure.
A fast, flexible, ocean-flavored fluid dynamics library for climate and ocean modeling on CPUs and GPUs.
Java client library for connecting to the InfluxDB time series database.
Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.
An open-source, scalable, and fault-tolerant NoSQL database with a focus on reliability and offline-first design.
ArcticDB is a high-performance, serverless DataFrame database for the Python data science ecosystem.
A fast, in-memory B-tree implementation for sorted collections in Swift.
A Python library that provides a simple and unified interface for extracting text from any document format.
Rill is a tool for transforming data sets into powerful dashboards using SQL, enabling BI-as-code.
A real-time Postgres data replication and streaming library built in Rust for building CDC pipelines.
A comprehensive guide to big data technologies like Hadoop, Spark, Kafka, and more for developers.
Pentaho Data Integration (ETL) is a Java-based tool for building data integration and ETL pipelines.
An R package for Bayesian generalized multivariate non-linear multilevel models using Stan.
A collection of PySpark examples covering RDD, DataFrame, and Dataset operations in Python.
An ultra-lightweight database that supports key-value and time series data for embedded and IoT applications.
A C# in-memory document database with source generator-based embedded typed readonly data.
Flink CDC is a streaming data integration tool that enables real-time data pipelines and change data capture.
Biopython is a set of Python modules that provide a wide range of functionality for bioinformatics, including DNA/RNA/protein sequence analysis, phylogenetics, and more.
A fast and versatile ETL tool that can transfer data between RDBMS and NoSQL databases seamlessly.
Fastest open-source data pipeline tool for replicating databases to data lakes in Apache Iceberg format.
Presto is an open-source distributed SQL query engine for big data, allowing fast analysis of large datasets.
Pandas Cookbook is a collection of recipes for using Python's powerful data analysis library, Pandas.
A distributed, scalable Prometheus-compatible time series database written in Scala.
A Python tool that generates Entity Relationship Diagrams (ERDs) from SQLAlchemy models.
Mycelite is a SQLite extension that enables replication between SQLite instances.
A curated list of resources for machine learning-based algorithmic trading and quantitative finance.
Poisson Surface Reconstruction is a C++ library for reconstructing surfaces from point cloud data.
A comprehensive Python library for modeling and forecasting financial time series data using ARCH models.
A comprehensive Go library for working with Cassandra/Scylla databases, providing a query builder, ORM, and migration tool.
sq is a Go-based data wrangling tool that supports a variety of data formats and databases.
A community-driven wiki for learning data engineering, covering topics like data modeling, pipelines, and databases.
An interactive tutorial for the Dask distributed computing library, focused on data analysis and manipulation.
The LevelDB key-value database in the Go programming language.
Azure/AzurePublicDataset is a repository of Microsoft Azure traces, provided with Jupyter Notebooks.
Scripts to download genomes from the NCBI FTP servers for bioinformatics and genomics research.
A Python tool to convert CAJ (China Academic Journals) files to PDF for developers who work with academic literature.
A Python library that provides a set of customizable pipeline processing blocks for data processing tasks.
A DICOM to NIfTI converter for medical imaging research and neuroimaging applications.
A comprehensive collection of resources and learning materials for big data technologies like Flink, Spark, Hadoop, and Hive.
SQLDelight - Generates type-safe Kotlin APIs from SQL, enabling easier database management in Kotlin projects.
An efficient and compressed N-dimensional array library for Python, useful for data scientists and ML engineers.
A scalable, SQL-based streaming analytics platform from Uber, built on top of Apache Flink.
A dataset of Borg cluster traces from Google, useful for researchers and developers working on distributed systems and cloud infrastructure.
ArangoDB is a multi-model database supporting documents, graphs, and key-values for high-performance applications.
A book on data science, covering topics from basic math to machine learning using Python and Jupyter Notebooks.
Apache Druid is a high-performance real-time analytics database for data-intensive applications.
A curated list of awesome database tools and resources to make working with databases easier.
The Auron accelerator framework leverages vectorized execution to speed up distributed computing on big data platforms like Spark.