Explore Projects

Discover 140 open source projects

Active filter: search for "spark"

Showing 121-140 of 140 projects

lensacom/sparkit-learn

A Python library that integrates Scikit-learn into the Apache Spark distributed computing framework.

1.2K

Archived

Python

ML Ops

ETL & Pipelines

#apache-spark #scikit-learn #distributed-computing

apache/datafusion-comet

An Apache Spark accelerator built on Apache DataFusion, a SQL query engine written in Rust.

1.1K

Active

Scala

LLM Frameworks

Databases

Spark

#spark #rust #data-processing

HariSekhon/Nagios-Plugins

A comprehensive collection of Nagios plugins for monitoring AWS, Hadoop, Cloud, Kafka, and other popular technologies.

1.1K

Active

Python

Monitoring

CLI Tools

#monitoring #cloud #devops

abhishek-ch/around-dataengineering

A comprehensive knowledge hub for data engineering, machine learning, and MLOps tools and practices.

1.1K

Archived

Python

ETL & Pipelines

ML Ops

Python

#data-engineering #machine-learning #mlops

graphframes/graphframes

GraphFrames provides DataFrame-based Graphs for Apache Spark, enabling scalable graph analysis and algorithms.

1.1K

Active

Scala

Databases

Caching

#apache-spark #big-data #graph-analysis

apache/amoro

Apache Amoro is an open-source Lakehouse management system built on open data lake formats such as Hudi and Iceberg, working with compute engines like Flink.

1.1K

Active

Java

Databases

ETL & Pipelines

Flink

#big-data #data-lake #lakehouse

Teradata/kylo

Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.

1.1K

Archived

Java

ETL & Pipelines

Realtime

#data-lake #hadoop #spark

oeljeklaus-you/UserActionAnalyzePlatform

A big data platform for analyzing e-commerce user behavior using Hadoop, Spark, and Java.

1.1K

Archived

Java

API Frameworks

Databases

Spark

#big-data #data-analytics #e-commerce

jacksu/utils4s

A collection of Scala and Spark usage examples and related resources for developers.

1.1K

Archived

Scala

API Frameworks

Databases

Scala

#scala #spark #akka

mahmoudparsian/data-algorithms-book

This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.

1.1K

Archived

Java

Databases

ETL & Pipelines

Apache Hadoop

#data-algorithms #mapreduce #spark

JohnSnowLabs/spark-nlp-workshop

Public runnable examples of using John Snow Labs' NLP for Apache Spark, a popular open-source library for natural language processing.

1.1K

Active

Jupyter Notebook

ML Ops

API Frameworks

Spark

#natural-language-processing #machine-learning #data-processing

databricks/spark-sklearn

Deprecated Scikit-learn integration package for Apache Spark, useful for machine learning on big data.

1.1K

Archived

Python

ML Ops

Databases

#machine-learning #scikit-learn #apache-spark

databricks/spark-csv

CSV Data Source for Apache Spark 1.x, a Scala library for working with structured data.

1.1K

Archived

Scala

Databases

API Frameworks

#apache-spark #csv #data-processing

bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Spark and Apache Parquet.

1.0K

Experimental

Scala

ETL & Pipelines

API Frameworks

Spark

#bioinformatics #genomics #big-data

josonle/Coding-Now

A collection of study notes, ebooks, and resources on big data, machine learning, Linux, and more for developers.

1.0K

Archived

Python

Databases

CLI Tools

#big-data #machine-learning #data-analysis

pixiedust/pixiedust

A Python helper library for enhancing Jupyter Notebooks with data visualization and analysis capabilities.

1.0K

Archived

Jupyter Notebook

Visualization

IDE Extensions

Jupyter Notebook

#data-science #jupyter-notebook #python-notebook

apache/celeborn

Apache Celeborn is a high-performance shuffle and spilled data service for big data applications.

1.0K

Active

Java

Caching

Realtime

#bigdata #shuffle #spark

TIBCOSoftware/snappydata

SnappyData is a memory-optimized analytics database based on Apache Spark and Apache Geode, enabling real-time stream processing, transactions, and predictive analytics.

1.0K

Archived

Scala

Databases

API Frameworks

Spark

#analytics #memory-database #scale

twosigma/flint

A time series library for Apache Spark that provides a high-level API for working with time series data.

1.0K

Archived

Scala

Databases

API Frameworks

Spark

#time-series #spark #scala

cloudera/livy

Livy is an open-source REST interface for interacting with Apache Spark from anywhere.

1.0K

Archived

Scala

API Frameworks

Databases

#spark #rest-api #data-processing
