Explore Projects

Discover 140 open-source projects

Active filters (1):
Search: spark

Showing 81-100 of 140 projects

zhonghuasheng/Tutorial

A comprehensive tutorial covering a wide range of backend technologies like Java, Go, MySQL, Redis, and more.

1.7K
Experimental
Shell
API Frameworks
Databases
Spring
#java#go#backend

collabH/bigdata-growth

A comprehensive repository covering big data knowledge, including data warehouse modeling, real-time computing, Hadoop, Spark, and more.

1.7K
Stable
Shell
Databases
ETL & Pipelines
#bigdata#hadoop#spark

apache/auron

The Auron accelerator framework leverages vectorized execution to speed up distributed computing on big data platforms like Spark.

1.7K
Active
Rust
Databases
API Frameworks
Spark
#big-data#distributed-computing#vectorized-execution

strapdata/elassandra

Elassandra is a distributed search and analytics platform that combines Elasticsearch and Apache Cassandra for developers building mission-critical applications.

1.7K
Experimental
Java
API Frameworks
Databases
#cassandra#elasticsearch#nosql

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

This GitHub repository contains SQL data analysis and visualization projects using various tools and databases.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#sql#data-analysis#data-visualization

jadianes/spark-py-notebooks

Apache Spark and Python tutorials for big data analysis and machine learning as Jupyter notebooks.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
Jupyter Notebook
#big-data#data-analysis#data-science

almond-sh/almond

A Scala kernel for Jupyter, allowing developers to use Scala in Jupyter Notebooks.

1.6K
Active
Scala
API Frameworks
IDE Extensions
#jupyter#scala#repl

maxpumperla/elephas

Distributed deep learning library for Keras and Spark, enabling scalable training of neural networks.

1.6K
Archived
Python
LLM Frameworks
Databases
Keras
#deep-learning#distributed-computing#neural-networks

holdenk/spark-testing-base

A base library for writing tests with Apache Spark in Scala.

1.6K
Stable
Scala
Testing
Scala
#testing#spark#scala

japila-books/apache-spark-internals

This repository provides an in-depth look at the internals of the popular Apache Spark data processing framework.

1.5K
Experimental
API Frameworks
Databases
#apache-spark#data-processing#distributed-computing

hi-primus/optimus

Agile data preparation workflows made easy with popular Python data science libraries.

1.5K
Archived
Python
ETL & Pipelines
API Frameworks
#big-data-cleaning#data-analysis#data-cleaning

combust/mleap

MLeap is a library for deploying machine learning pipelines to production using Scala, Python, and Spark.

1.5K
Active
Scala
ML Ops
API Frameworks
Scala
#machine-learning#pipeline#production

OBenner/data-engineering-interview-questions

This GitHub repository contains over 2,000 data engineering interview questions to help developers prepare.

1.5K
Active
Python
Interview Prep
ETL & Pipelines
#data-engineering#interview-questions#interview-prep

sryza/aas

Code to accompany the book Advanced Analytics with Spark, focused on Scala-based big data and machine learning.

1.5K
Archived
Scala
API Frameworks
ORMs & Query Builders
#spark#big-data#machine-learning

apache/incubator-gluten

Gluten is a Scala library that offloads JVM-based SQL engines' execution to native engines for improved performance.

1.5K
Active
Scala
API Frameworks
Databases
Scala
#spark-sql#clickhouse#simd

san089/goodreads_etl_pipeline

An end-to-end data pipeline for building a data lake, data warehouse, and analytics platform from GoodReads data.

1.5K
Archived
Python
ETL & Pipelines
Background Jobs
Apache Airflow
#data-engineering#etl-pipeline#data-lake

SeldonIO/seldon-server

A machine learning platform and recommendation engine built on Kubernetes for deployment on cloud platforms.

1.5K
Archived
Java
ML Ops
API Frameworks
Kubernetes
#machine-learning#recommendation-engine#kubernetes

apache/carbondata

CarbonData is a high-performance data store solution for big data analytics on Hadoop and Spark.

1.4K
Active
Scala
Databases
API Frameworks
Spark
#big-data#hadoop#spark

projectnessie/nessie

Nessie is a transactional data catalog for data lakes that provides Git-like semantics and functionality.

1.4K
Active
Java
Databases
API Frameworks
#data-catalog#data-lakes#git-semantics

mesos/spark

Lightning-fast cluster computing in Java, Scala and Python.

1.4K
Archived
Scala
API Frameworks
ORMs & Query Builders
Scala
#cluster-computing#big-data#distributed-systems
