Trending Projects

Discover the fastest growing open source projects

Showing 501-550 of 897 trending projects

#501

Factual/drake

A data workflow tool for data engineers and analysts, like a 'Make for data'.

+63

+4.4%

1.5K

total stars

Clojure

#502

damklis/DataEngineeringProject

An end-to-end data engineering project example showcasing tools and technologies for building data pipelines.

+63

+4.8%

1.4K

total stars

Python

#503

datalevin/datalevin

A simple, fast and versatile Datalog database written in Clojure for vibe coders.

+63

+4.8%

1.4K

total stars

Clojure

#504

CliMA/Oceananigans.jl

A fast, flexible, ocean-flavored fluid dynamics library for climate and ocean modeling on CPUs and GPUs.

+63

+5.2%

1.3K

total stars

Julia

#505

influxdata/influxdb-java

Java client library for connecting to the InfluxDB time series database.

+63

+5.5%

1.2K

total stars

Java

#506

Teradata/kylo

Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.

+63

+6.0%

1.1K

total stars

Java

#507

apache/couchdb

An open-source, scalable, and fault-tolerant NoSQL database with a focus on reliability and offline-first design.

+62

+0.9%

6.8K

total stars

Erlang

#508

man-group/ArcticDB

ArcticDB is a high-performance, serverless DataFrame database for the Python data science ecosystem.

+62

+2.9%

2.2K

total stars

C++

#509

attaswift/BTree

A fast, in-memory B-tree implementation for sorted collections in Swift.

+62

+4.9%

1.3K

total stars

Swift

#510

deanmalmgren/textract

A Python library that provides a simple and unified interface for extracting text from any document format.

+61

+1.4%

4.5K

total stars

HTML

#511

rilldata/rill

Rill is a tool for transforming data sets into powerful dashboards using SQL, enabling BI-as-code.

+61

+2.5%

2.5K

total stars

#512

supabase/etl

A real-time Postgres data replication and streaming library built in Rust for building CDC pipelines.

+61

+2.9%

2.2K

total stars

Rust

#513

heibaiying/BigData-Notes

A comprehensive guide to big data technologies like Hadoop, Spark, Kafka, and more for developers.

+60

+0.4%

16.9K

total stars

Java

#514

pentaho/pentaho-kettle

Pentaho Data Integration (ETL) is a Java-based tool for building data integration and ETL pipelines.

+60

+0.7%

8.3K

total stars

Java

#515

paul-buerkner/brms

R package for Bayesian generalized multivariate non-linear multilevel models using Stan.

+60

+4.5%

1.4K

total stars

#516

spark-examples/pyspark-examples

A collection of PySpark examples covering RDD, DataFrame, and Dataset operations in Python.

+60

+4.7%

1.3K

total stars

Python

#517

armink/FlashDB

An ultra-lightweight database that supports key-value and time series data for embedded and IoT applications.

+59

+2.5%

2.4K

total stars

#518

Cysharp/MasterMemory

An embedded, typed, read-only in-memory document database for C#, built with source generators.

+59

+3.4%

1.8K

total stars

#519

apache/flink-cdc

Flink CDC is a streaming data integration tool that enables real-time data pipelines and change data capture.

+58

+0.9%

6.4K

total stars

Java

#520

biopython/biopython

Biopython is a set of Python modules that provide a wide range of functionality for bioinformatics, including DNA/RNA/protein sequence analysis, phylogenetics, and more.

+58

+1.2%

4.9K

total stars

Python

#521

wgzhao/Addax

A fast and versatile ETL tool that transfers data between RDBMS and NoSQL databases.

+58

+4.3%

1.4K

total stars

Java

#522

datazip-inc/olake

Fastest open-source data pipeline tool for replicating databases to data lakes in Apache Iceberg format.

+58

+4.7%

1.3K

total stars

#523

prestodb/presto

Presto is an open-source distributed SQL query engine for big data, allowing fast analysis of large datasets.

+57

+0.3%

16.7K

total stars

Java

#524

jvns/pandas-cookbook

Pandas Cookbook is a collection of recipes for using Python's powerful data analysis library, Pandas.

+57

+0.8%

7.0K

total stars

Jupyter Notebook

#525

filodb/FiloDB

A distributed, scalable Prometheus-compatible time series database written in Scala.

+57

+4.1%

1.5K

total stars

Scala

#526

eralchemy/eralchemy

A Python tool that generates Entity Relationship Diagrams (ERDs) from SQLAlchemy models.

+57

+4.2%

1.4K

total stars

Python

#527

mycelial/mycelite

Mycelite is a SQLite extension that enables replication between SQLite instances.

+57

+5.5%

1.1K

total stars

Rust

#528

cbailes/awesome-deep-trading

A curated list of resources for machine learning-based algorithmic trading and quantitative finance.

+56

+3.1%

1.8K

total stars

#529

mkazhdan/PoissonRecon

Poisson Surface Reconstruction is a C++ library for reconstructing surfaces from point cloud data.

+56

+3.2%

1.8K

total stars

C++

#530

bashtage/arch

A comprehensive Python library for modeling and forecasting financial time series data using ARCH models.

+55

+3.8%

1.5K

total stars

Python

#531

scylladb/gocqlx

A comprehensive Go library for working with Cassandra/Scylla databases, providing a query builder, ORM, and migration tool.

+55

+5.7%

1.0K

total stars

#532

neilotoole/sq

sq is a Go-based data wrangling tool that supports a variety of data formats and databases.

+54

+2.3%

2.5K

total stars

#533

data-engineering-community/data-engineering-wiki

A community-driven wiki for learning data engineering, covering topics like data modeling, pipelines, and databases.

+54

+2.9%

1.9K

total stars

CSS

#534

dask/dask-tutorial

An interactive tutorial for the Dask distributed computing library, focused on data analysis and manipulation.

+54

+3.0%

1.9K

total stars

Jupyter Notebook

#535

golang/leveldb

The LevelDB key-value database in the Go programming language.

+54

+4.9%

1.2K

total stars

#536

Azure/AzurePublicDataset

A repository of Microsoft Azure traces, with Jupyter notebooks.

+54

+5.2%

1.1K

total stars

Jupyter Notebook

#537

kblin/ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers for bioinformatics and genomics research.

+54

+5.3%

1.1K

total stars

Python

#538

caj2pdf/caj2pdf

A Python tool to convert CAJ (China Academic Journals) files to PDF for developers who work with academic literature.

+52

+1.7%

3.2K

total stars

Python

#539

huggingface/datatrove

A Python library that provides a set of customizable pipeline processing blocks for data processing tasks.

+52

+1.8%

2.9K

total stars

Python

#540

rordenlab/dcm2niix

A DICOM to NIfTI converter for medical imaging research and neuroimaging applications.

+52

+4.8%

1.1K

total stars

C++

#541

wangzhiwubigdata/God-Of-BigData

A comprehensive collection of resources and learning materials for big data technologies like Flink, Spark, Hadoop, and Hive.

+51

+0.5%

10.4K

total stars

#542

sqldelight/sqldelight

SQLDelight generates type-safe Kotlin APIs from SQL, enabling easier database management in Kotlin projects.

+51

+0.8%

6.8K

total stars

Kotlin

#543

zarr-developers/zarr-python

An efficient and compressed N-dimensional array library for Python, useful for data scientists and ML engineers.

+51

+2.7%

1.9K

total stars

Python

#544

uber-archive/AthenaX

A scalable, SQL-based streaming analytics platform from Uber, built on top of Apache Flink.

+51

+4.3%

1.2K

total stars

Java

#545

google/cluster-data

A dataset of Borg cluster traces from Google, useful for researchers and developers working on distributed systems and cloud infrastructure.

+51

+5.2%

1.0K

total stars

TeX

#546

arangodb/arangodb

ArangoDB is a multi-model database supporting documents, graphs, and key-values for high-performance applications.

+50

+0.4%

14.1K

total stars