Explore Projects

Discover 85 open source projects

Active filters (1):

Search: big-data

Showing 21-40 of 85 projects

apache/beam

Apache Beam is a unified programming model for batch and streaming data processing.

8.5K

Active

Java

ETL & Pipelines

API Frameworks

#batch #streaming #big-data

apache/datafusion

Apache DataFusion is a powerful SQL query engine written in Rust, designed for big data processing and analysis.

8.5K

Active

Rust

Databases

ETL & Pipelines

#big-data #dataframe #olap

h2oai/h2o-3

An open-source, distributed machine learning platform with support for various algorithms and AutoML.

7.5K

Active

Jupyter Notebook

ML Ops

Databases

#machine-learning #automl #distributed

arkime/arkime

Arkime is an open-source packet capture and network monitoring system for security and network analysis.

7.3K

Active

Vue

API Frameworks

Databases

#network-monitoring #packet-capture #security

apache/couchdb

An open-source, scalable, and fault-tolerant NoSQL database with a focus on reliability and offline-first design.

6.8K

Active

Erlang

Databases

API Frameworks

#database #nosql #offline-first

vespa-engine/vespa

Vespa is an AI-powered search and recommendation engine for building data-driven, scalable applications.

6.8K

Active

Java

Search-as-a-Service

#search #recommendation #vector-database

feast-dev/feast

An open-source feature store for AI/ML applications.

6.8K

Active

Python

React

#feature-store #open-source #AI/ML

apache/zeppelin

Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents.

6.6K

Active

Java

Databases

API Frameworks

#big-data #database #data-analytics

hazelcast/hazelcast

Hazelcast is a high-performance, distributed in-memory data platform for real-time insights and stream processing.

6.6K

Active

Java

Caching

Realtime

#big-data #distributed #in-memory

pachyderm/pachyderm

Pachyderm is a data-centric pipeline and data versioning platform for building and scaling data-intensive applications.

6.3K

Experimental

ETL & Pipelines

Containerization

#data-pipelines #data-versioning #distributed-systems

apache/hive

Apache Hive is data warehouse software built on top of Apache Hadoop for querying and managing large datasets.

6.0K

Active

Java

Databases

API Frameworks

#apache #big-data #database

Eventual-Inc/Daft

High-performance data engine for AI and multimodal workloads, processing images, audio, video, and structured data at scale.

5.3K

Active

Rust

ML Ops

ETL & Pipelines

#ai-engineering #data-engineering #distributed

microsoft/SynapseML

SynapseML is a simple and distributed machine learning library for building and deploying AI models at scale.

5.2K

Active

Scala

ML Ops

Big Data

Apache Spark

#machine-learning #distributed-computing #big-data

tschellenbach/Stream-Framework

A Python library for building scalable news feeds, activity streams, and notification systems using Cassandra and Redis.

4.7K

Stable

Python

API Frameworks

Databases

#activity-stream #news-feed #big-data

rom1504/img2dataset

Easily convert large sets of image URLs into a dataset for AI/ML training and experimentation.

4.4K

Stable

Python

Computer Vision

Databases

#dataset #image-processing #big-data

crate/crate

CrateDB is a distributed, scalable SQL database for storing and analyzing massive amounts of data in near real-time.

4.4K

Active

Java

Databases

API Frameworks

#database #distributed #scalable

alibaba/fastjson2

A high-performance Java JSON library for fast serialization and deserialization.

4.3K

Active

Java

API Frameworks

ORMs & Query Builders

#json #serialization #deserialization

databricks/koalas

Koalas is a pandas-like API for Apache Spark, enabling data scientists to work with big data using familiar pandas syntax.

3.4K

Archived

Python

ORMs & Query Builders

Databases

Spark

#big-data #data-science #dataframe

lakesoul-io/LakeSoul

LakeSoul is a cloud-native, real-time Lakehouse framework for fast data ingestion and analytics on cloud storage.

3.2K

Active

Java

API Frameworks

Databases

#big-data #lakehouse #streaming

apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark.

3.2K

Active

Java

ETL & Pipelines

Realtime

#big-data #data-ingestion #flink
