Explore Projects

Discover 19 open source projects

Active filters (1):

Search: pyspark

Showing 1-19 of 19 projects

ibis-project/ibis

Portable Python dataframe library for data analysis and manipulation

6.4K

Active

Python

React

#dataframe #python #analysis

microsoft/SynapseML

SynapseML is a simple and distributed machine learning library for building and deploying AI models at scale.

5.2K

Active

Scala

ML Ops

Big Data

Apache Spark

#machine-learning #distributed-computing #big-data

apache/linkis

Apache Linkis provides a computation middleware layer to connect, govern, and orchestrate applications with data engines.

3.4K

Active

Java

MCP Servers

BaaS Platforms

#application-manager #engine #jdbc

uber/petastorm

Petastorm enables training and evaluation of deep learning models from Apache Parquet datasets.

1.9K

Active

Python

ML Ops

Databases

PyTorch

#deep-learning #machine-learning #data-processing

awesome-spark/awesome-spark

A curated list of awesome Apache Spark packages and resources for developers.

1.9K

Archived

Shell

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

This GitHub repository contains SQL data analysis and visualization projects using various tools and databases.

1.7K

Archived

Jupyter Notebook

Databases

ETL & Pipelines

#sql #data-analysis #data-visualization

jadianes/spark-py-notebooks

Apache Spark and Python tutorials for big data analysis and machine learning as Jupyter notebooks.

1.7K

Archived

Jupyter Notebook

Databases

ETL & Pipelines

#big-data #data-analysis #data-science

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between popular dataframe libraries like Pandas, Dask, and PySpark.

1.5K

Active

Python

Databases

CLI Tools

#dataframes #compatibility #pandas

hi-primus/optimus

Agile data preparation workflows made easy with popular Python data science libraries.

1.5K

Archived

Python

ETL & Pipelines

API Frameworks

#big-data-cleaning #data-analysis #data-cleaning

jupyter-incubator/sparkmagic

Provides Jupyter magics and kernels for working with remote Spark clusters, enabling data scientists to easily interact with Spark from Jupyter Notebooks.

1.4K

Stable

Python

API Frameworks

Databases

Jupyter

#spark #jupyter-notebook #pyspark

spark-examples/pyspark-examples

A collection of PySpark examples covering RDD, DataFrame, and Dataset operations in Python.

1.3K

Stable

Python

Databases

API Frameworks

#pyspark #spark #big-data

logicalclocks/hopsworks

Hopsworks is a feature store and MLOps platform for data-intensive AI and machine learning applications.

1.3K

Experimental

Java

ML Ops

Feature Store

#feature-store #mlops #machine-learning

mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark for big data analytics and data processing.

1.3K

Experimental

Jupyter Notebook

Databases

ETL & Pipelines

#big-data #data-algorithms #dataframes

palantir/pyspark-style-guide

This is a style guide for PySpark code, providing best practices for common situations in PySpark repos.

1.2K

Stable

Python

API Frameworks

Linters & Formatters

#pyspark #code-style #best-practices

kavgan/nlp-in-practice

Starter code for solving real-world text data problems using NLP techniques like Gensim Word2Vec and text classification.

1.2K

Archived

Jupyter Notebook

LLM Frameworks

API Frameworks

#gensim #machine-learning #natural-language-processing

lakehq/sail

LakeSail is a Rust-based computation framework that unifies batch processing, stream processing, and AI workloads.

1.2K

Active

Rust

ML Ops

ETL & Pipelines

#distributed-computing #data-engineering #big-data

lensacom/sparkit-learn

A Python library that integrates Scikit-learn into the Apache Spark distributed computing framework.

1.2K

Archived

Python

ML Ops

ETL & Pipelines

#apache-spark #scikit-learn #distributed-computing

graphframes/graphframes

GraphFrames provides DataFrame-based Graphs for Apache Spark, enabling scalable graph analysis and algorithms.

1.1K

Active

Scala

Databases

Caching

#apache-spark #big-data #graph-analysis

mahmoudparsian/data-algorithms-book

This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.

1.1K

Archived

Java

Databases

ETL & Pipelines

Apache Hadoop

#data-algorithms #mapreduce #spark
