Explore Projects

Discover 85 open source projects

Active filters (1):
Search: big-data

Showing 21-40 of 85 projects

apache/beam

Apache Beam is a unified programming model for batch and streaming data processing.

8.5K
Active
Java
ETL & Pipelines
API Frameworks
#batch #streaming #big-data

apache/datafusion

Apache DataFusion is a powerful SQL query engine written in Rust, designed for big data processing and analysis.

8.5K
Active
Rust
Databases
ETL & Pipelines
#big-data #dataframe #olap

h2oai/h2o-3

An open-source, distributed machine learning platform with support for various algorithms and autoML.

7.5K
Active
Jupyter Notebook
ML Ops
Databases
#machine-learning #automl #distributed

arkime/arkime

Arkime is an open-source packet capture and network monitoring system for security and network analysis.

7.3K
Active
Vue
API Frameworks
Databases
#network-monitoring #packet-capture #security

apache/couchdb

An open-source, scalable, and fault-tolerant NoSQL database with a focus on reliability and offline-first design.

6.8K
Active
Erlang
Databases
API Frameworks
#database #nosql #offline-first

vespa-engine/vespa

Vespa is an AI-powered search and recommendation engine for building data-driven, scalable applications.

6.8K
Active
Java
Search-as-a-Service
Search
#search #recommendation #vector-database

feast-dev/feast

An open-source feature store for AI/ML applications.

6.8K
Active
Python
React
#feature-store #open-source #AI/ML

apache/zeppelin

Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents.

6.6K
Active
Java
Databases
API Frameworks
#big-data #database #data-analytics

hazelcast/hazelcast

Hazelcast is a high-performance, distributed in-memory data platform for real-time insights and stream processing.

6.6K
Active
Java
Caching
Realtime
#big-data #distributed #in-memory

pachyderm/pachyderm

Pachyderm is a data-centric pipeline and data versioning platform for building and scaling data-intensive applications.

6.3K
Experimental
Go
ETL & Pipelines
Containerization
#data-pipelines #data-versioning #distributed-systems

apache/hive

Apache Hive is data warehouse software built on top of Apache Hadoop for querying and managing large datasets.

6.0K
Active
Java
Databases
API Frameworks
#apache #big-data #database

Eventual-Inc/Daft

High-performance data engine for AI and multimodal workloads, processing images, audio, video, and structured data at scale.

5.3K
Active
Rust
ML Ops
ETL & Pipelines
#ai-engineering #data-engineering #distributed

microsoft/SynapseML

SynapseML is a simple and distributed machine learning library for building and deploying AI models at scale.

5.2K
Active
Scala
ML Ops
Big Data
Apache Spark
#machine-learning #distributed-computing #big-data

tschellenbach/Stream-Framework

A Python library for building scalable news feeds, activity streams, and notification systems using Cassandra and Redis.

4.7K
Stable
Python
API Frameworks
Databases
#activity-stream #news-feed #big-data

rom1504/img2dataset

Easily convert large sets of image URLs into a dataset for AI/ML training and experimentation.

4.4K
Stable
Python
Computer Vision
Databases
#dataset #image-processing #big-data

crate/crate

CrateDB is a distributed, scalable SQL database for storing and analyzing massive amounts of data in near real-time.

4.4K
Active
Java
Databases
API Frameworks
#database #distributed #scalable

alibaba/fastjson2

A high-performance Java JSON library for fast serialization and deserialization.

4.3K
Active
Java
API Frameworks
ORMs & Query Builders
#json #serialization #deserialization

databricks/koalas

Koalas is a pandas-like API for Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

3.4K
Archived
Python
ORMs & Query Builders
Databases
Spark
#big-data #data-science #dataframe

lakesoul-io/LakeSoul

LakeSoul is a cloud-native, real-time Lakehouse framework for fast data ingestion and analytics on cloud storage.

3.2K
Active
Java
API Frameworks
Databases
#big-data #lakehouse #streaming

apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark.

3.2K
Active
Java
ETL & Pipelines
Realtime
#big-data #data-ingestion #flink
