Showing 401-450 of 897 trending projects
A Rust-based graph database for developers who need to store and query connected data.
A collection of data science questions and answers for developers.
Hamilton is an open-source ETL framework that helps data scientists and engineers build modular, testable dataflows with lineage and metadata.
Malloy is an open-source language for describing data relationships and transformations.
A lightweight key-value store built with C++ using a skiplist data structure.
A fast numerical array expression evaluator for Python, NumPy, Pandas, PyTables and more.
Meltano is a declarative, code-first data integration engine for building and scaling data and ML-powered products.
Starter code for working with the YouTube-8M dataset, a large-scale video understanding dataset.
Open-source BI platform for engineers to explore and model large-scale data pipelines.
A Python library that generates fake data for custom test databases.
PyWavelets is a Python library for wavelet transform algorithms and techniques, useful for image and signal processing.
A composable data framework for building ambitious web applications using TypeScript.
Fast in-memory cache library for Go with low GC overhead, optimized for a large number of entries.
A Python library for creating beautiful visualizations of language differences across document types.
A simple Python wrapper for the Tabula Java library, which extracts tables from PDF files into Pandas DataFrames.
Apache Parquet Format, a columnar data storage format used in the Apache Hadoop ecosystem.
A Python library to access historical market data from the Binance cryptocurrency exchange.
A Python library for accessing the HDF5 binary data format, a popular format for scientific and numerical data.
ArcticDB is a high-performance, serverless DataFrame database for the Python data science ecosystem.
A real-time Postgres data replication and streaming library built in Rust for building CDC pipelines.
A data warehouse for COVID-19 time series data, useful for data analysis and visualization.
ggstatsplot is an R library that enhances ggplot2 visualizations with statistical analysis and hypothesis testing.
Scalable, low-latency vector search in Postgres.
Fast, single-binary C++ SQL ETL pipeline for stream processing, observability, analytics, and AI/ML.
A distributed SQL database built from scratch, not focused on vibe coders or AI tools.
Open-source time series library for Python, useful for statistical analysis and modeling.
A unified interface for distributed computing on Spark, Dask and Ray without any rewrites.
Framework for collecting and analyzing prediction market data with comprehensive Polymarket/Kalshi datasets.
A simple Python library for creating dataclasses from dictionaries.
A collection of Python code, notebooks, and examples for practical business data analysis and visualization.
Fast, accurate, and scalable probabilistic data linkage with support for multiple SQL backends.
MMseqs2 is an ultra-fast and sensitive bioinformatics tool for sequence search and clustering.
Apache BookKeeper is a scalable, fault-tolerant, low-latency storage service optimized for append-only workloads.
GlobalBuildingAtlas is an open, global, and complete dataset of building polygons, heights, and LoD1 3D models.
Apache DataFusion Ballista is a distributed query engine for big data analysis, built with Rust and Arrow.
A dataset of cluster data collected from Alibaba's production clusters for cluster management research.
Converts MySQL database dumps to SQLite3 compatible formats for easier migration and data portability.
A collection of stock analysis tools across various programming languages and platforms.
A JavaScript library that converts CSV and tab-delimited data to web-friendly formats like JSON and XML.
Bytewax is a Python library for building scalable, fault-tolerant, and low-latency data processing pipelines.
A powerful C library for analyzing complex networks and graph-based data structures.
Powerful plotting and data visualization library for the Julia programming language.
An efficient and compressed N-dimensional array library for Python, useful for data scientists and ML engineers.
A space-efficient trie data structure in Go with fast lookup performance.
A versatile ORM for multiple databases including MySQL, SQLite, MariaDB, PostgreSQL, and MongoDB in Deno.
WebAssembly version of the DuckDB analytical database, enabling fast in-browser analytics and SQL queries.
Zui is a powerful desktop app for exploring and working with data, with support for CSV, JSON, and the Zed data format.
Irmin is a distributed database that follows the same design principles as Git, allowing for distributed version control of data.
This repository provides a comprehensive guide on optimizing MySQL performance and solving common database problems.
A high-performance, embeddable key-value storage engine written in Rust for developers building data-intensive applications.