Showing 401-450 of 897 trending projects
An educational distributed SQL database written in Rust.
A Python library that provides a simple and unified interface for extracting text from any document format.
The Auron accelerator framework leverages vectorized execution to speed up distributed computing on big data platforms like Spark.
Build vector tilesets from large collections of GeoJSON features.
Hazelcast is a high-performance, distributed in-memory data platform for real-time insights and stream processing.
A library for time series analysis on Apache Spark, enabling efficient large-scale time series processing.
TrailDB is an efficient database for storing and querying series of events.
An ultra-lightweight database that supports key-value and time series data for embedded and IoT applications.
MMseqs2 is an ultra-fast and sensitive bioinformatics tool for sequence search and clustering.
Notebooks for financial economics, including analyses of Federal Reserve, GDP, inflation, and more.
A dataset of cluster data collected from Alibaba's production clusters for cluster management research.
A data workflow tool for data engineers and analysts, like a 'Make for data'.
A Python library for financial data visualization using Matplotlib, focused on candlestick and OHLC charts.
Code repository for a book on practical statistics for data scientists.
OpenMapTiles is an open-source vector tile schema implementation for creating custom map tiles.
A Python library for creating easy-to-use, visually appealing data tables and summaries.
A comprehensive guide to feature engineering and feature selection techniques in Python, with examples.
AWS Glue code samples for building data integration and ETL pipelines on AWS.
DiceDB is an open-source, fast, reactive, in-memory database optimized for modern hardware.
A comprehensive collection of resources and learning materials for big data technologies like Flink, Spark, Hadoop, and Hive.
A Rust library for serializing and deserializing data in the Rusty Object Notation (RON) format.
A curated list of software packages and data resources for single-cell analysis, including RNA-seq and ATAC-seq.
A C# library for reading and writing metadata in media files, useful for audio and video processing applications.
Apache Druid is a high-performance real-time analytics database for data-intensive applications.
Pandas Cookbook is a collection of recipes for using Python's powerful data analysis library, Pandas.
Flink CDC is a streaming data integration tool that enables real-time data pipelines and change data capture.
A book on data science, covering topics from basic math to machine learning using Python and Jupyter Notebooks.
Malloy is an open-source language for describing data relationships and transformations.
A Python package for interactive geospatial analysis and visualization with Google Earth Engine.
Mycelite is a SQLite extension that enables replication between SQLite instances.
A comprehensive Go library for working with Cassandra/Scylla databases, providing a query builder, ORM, and migration tool.
A Python library for extracting tabular data from PDF files, useful for data processing and analysis.
A collection of data science take-home challenges and solutions implemented in Jupyter Notebooks.
A grammar of graphics library for creating highly customizable and publication-quality plots in Python.
A distributed, Redis-compatible NoSQL database that provides high performance and scalability.
Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.
Pongo is a MongoDB-compatible database that runs on top of PostgreSQL, offering strong consistency benefits.
Useful scripts, UDFs, views, and other utilities for migration and data warehouse operations in BigQuery.
This repo contains a list of the 10,000 most common English words, useful for NLP and language modeling tasks.
An efficient and compressed N-dimensional array library for Python, useful for data scientists and ML engineers.
HyperLogLog data structure library with space-efficient sparse and LogLog-Beta implementations.
A type-safe, Swift-language layer over SQLite3 for building database-backed Swift applications.
OrbitDB is a peer-to-peer database for the decentralized web, enabling developers to build offline-first, distributed applications.
Comprehensive collection of city and administrative region data for China, with features like CSV export, JS code generation, and web scraping.
A Python library for common data analysis and machine learning tasks.
sq is a Go-based data wrangling tool that supports a variety of data formats and databases.
A community-driven wiki for learning data engineering, covering topics like data modeling, pipelines, and databases.
A fast, flexible, ocean-flavored fluid dynamics library for climate and ocean modeling on CPUs and GPUs.
A curated list of awesome R packages, frameworks and software for data analysis and data science.
Apache HBase is a distributed, scalable, fault-tolerant database for large datasets built on top of HDFS.