Explore Projects

Discover 140 open source projects

Search: spark

Showing 121-140 of 140 projects

lensacom/sparkit-learn

A Python library that integrates Scikit-learn into the Apache Spark distributed computing framework.

1.2K
Archived
Python
ML Ops
ETL & Pipelines
#apache-spark #scikit-learn #distributed-computing

apache/datafusion-comet

A Spark accelerator built on Apache DataFusion, a SQL query engine written in Rust.

1.1K
Active
Scala
LLM Frameworks
Databases
Spark
#spark #rust #data-processing

HariSekhon/Nagios-Plugins

A comprehensive collection of Nagios plugins for monitoring AWS, Hadoop, Cloud, Kafka, and other popular technologies.

1.1K
Active
Python
Monitoring
CLI Tools
#monitoring #cloud #devops

abhishek-ch/around-dataengineering

A comprehensive knowledge hub for data engineering, machine learning, and MLOps tools and practices.

1.1K
Archived
Python
ETL & Pipelines
ML Ops
Python
#data-engineering #machine-learning #mlops

graphframes/graphframes

GraphFrames provides DataFrame-based Graphs for Apache Spark, enabling scalable graph analysis and algorithms.

1.1K
Active
Scala
Databases
Caching
#apache-spark #big-data #graph-analysis

apache/amoro

Apache Amoro is an open-source Lakehouse management system built on open data lake formats like Hudi and Iceberg, with support for engines such as Flink.

1.1K
Active
Java
Databases
ETL & Pipelines
Flink
#big-data #data-lake #lakehouse

Teradata/kylo

Kylo is an enterprise-grade data lake management platform built on big data technologies like Spark and Hadoop.

1.1K
Archived
Java
ETL & Pipelines
Realtime
#data-lake #hadoop #spark

oeljeklaus-you/UserActionAnalyzePlatform

A big data platform for analyzing e-commerce user behavior using Hadoop, Spark, and Java.

1.1K
Archived
Java
API Frameworks
Databases
Spark
#big-data #data-analytics #e-commerce

jacksu/utils4s

A collection of Scala and Spark usage examples and related resources for developers.

1.1K
Archived
Scala
API Frameworks
Databases
Scala
#scala #spark #akka

mahmoudparsian/data-algorithms-book

This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.

1.1K
Archived
Java
Databases
ETL & Pipelines
Apache Hadoop
#data-algorithms #mapreduce #spark

JohnSnowLabs/spark-nlp-workshop

Public runnable examples of using John Snow Labs' NLP for Apache Spark, a popular open-source library for natural language processing.

1.1K
Active
Jupyter Notebook
ML Ops
API Frameworks
Spark
#natural-language-processing #machine-learning #data-processing

databricks/spark-sklearn

Deprecated Scikit-learn integration package for Apache Spark, useful for machine learning on big data.

1.1K
Archived
Python
ML Ops
Databases
#machine-learning #scikit-learn #apache-spark

databricks/spark-csv

CSV Data Source for Apache Spark 1.x, a Scala library for working with structured data.

1.1K
Archived
Scala
Databases
API Frameworks
#apache-spark #csv #data-processing

bigdatagenomics/adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Spark and Apache Parquet.

1.0K
Experimental
Scala
ETL & Pipelines
API Frameworks
Spark
#bioinformatics #genomics #big-data

josonle/Coding-Now

A collection of study notes, ebooks, and resources on big data, machine learning, Linux, and more for developers.

1.0K
Archived
Python
Databases
CLI Tools
#big-data #machine-learning #data-analysis

pixiedust/pixiedust

A Python helper library for enhancing Jupyter Notebooks with data visualization and analysis capabilities.

1.0K
Archived
Jupyter Notebook
Visualization
IDE Extensions
Jupyter Notebook
#data-science #jupyter-notebook #python-notebook

apache/celeborn

Apache Celeborn is a high-performance shuffle and spilled data service for big data applications.

1.0K
Active
Java
Caching
Realtime
#bigdata #shuffle #spark

TIBCOSoftware/snappydata

SnappyData is a memory-optimized analytics database based on Apache Spark and Apache Geode, enabling real-time stream processing, transactions, and predictive analytics.

1.0K
Archived
Scala
Databases
API Frameworks
Spark
#analytics #memory-database #scale

twosigma/flint

A time series library for Apache Spark that provides a high-level API for working with time series data.

1.0K
Archived
Scala
Databases
API Frameworks
Spark
#time-series #spark #scala

cloudera/livy

Livy is an open source REST interface for interacting with Apache Spark from anywhere.

1.0K
Archived
Scala
API Frameworks
Databases
#spark #rest-api #data-processing
