Explore Projects

Discover 19 open source projects

Active filters (1):
Search: pyspark

ibis-project/ibis

Portable Python dataframe library for data analysis and manipulation

6.4K
Active
Python
React
#dataframe #python #analysis

microsoft/SynapseML

SynapseML is a simple and distributed machine learning library for building and deploying AI models at scale.

5.2K
Active
Scala
ML Ops
Big Data
Apache Spark
#machine-learning #distributed-computing #big-data

apache/linkis

Apache Linkis provides a computation middleware layer to connect, govern, and orchestrate applications with data engines.

3.4K
Active
Java
MCP Servers
BaaS Platforms
#application-manager #engine #jdbc

uber/petastorm

Petastorm enables training and evaluation of deep learning models from Apache Parquet datasets.

1.9K
Active
Python
ML Ops
Databases
PyTorch
#deep-learning #machine-learning #data-processing

awesome-spark/awesome-spark

A curated list of awesome Apache Spark packages and resources for developers.

1.9K
Archived
Shell

ptyadana/SQL-Data-Analysis-and-Visualization-Projects

This GitHub repository contains SQL data analysis and visualization projects using various tools and databases.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#sql #data-analysis #data-visualization

jadianes/spark-py-notebooks

Apache Spark and Python tutorials for big data analysis and machine learning as Jupyter notebooks.

1.7K
Archived
Jupyter Notebook
Databases
ETL & Pipelines
#big-data #data-analysis #data-science

narwhals-dev/narwhals

Lightweight and extensible compatibility layer between popular dataframe libraries like Pandas, Dask, and PySpark.

1.5K
Active
Python
Databases
CLI Tools
#dataframes #compatibility #pandas

hi-primus/optimus

Agile data preparation workflows made easy with popular Python data science libraries.

1.5K
Archived
Python
ETL & Pipelines
API Frameworks
#big-data-cleaning #data-analysis #data-cleaning

jupyter-incubator/sparkmagic

Provides Jupyter magics and kernels for working with remote Spark clusters, enabling data scientists to easily interact with Spark from Jupyter Notebooks.

1.4K
Stable
Python
API Frameworks
Databases
Jupyter
#spark #jupyter-notebook #pyspark

spark-examples/pyspark-examples

A collection of PySpark examples covering RDD, DataFrame, and Dataset operations in Python.

1.3K
Stable
Python
Databases
API Frameworks
#pyspark #spark #big-data

logicalclocks/hopsworks

Hopsworks is a feature store and MLOps platform for data-intensive AI and machine learning applications.

1.3K
Experimental
Java
ML Ops
Feature Store
#feature-store #mlops #machine-learning

mahmoudparsian/pyspark-tutorial

PySpark-Tutorial provides basic algorithms using PySpark for big data analytics and data processing.

1.3K
Experimental
Jupyter Notebook
Databases
ETL & Pipelines
#big-data #data-algorithms #dataframes

palantir/pyspark-style-guide

This is a style guide for PySpark code, providing best practices for common situations in PySpark repos.

1.2K
Stable
Python
API Frameworks
Linters & Formatters
#pyspark #code-style #best-practices

kavgan/nlp-in-practice

Starter code for solving real-world text data problems using NLP techniques like Gensim Word2Vec and text classification.

1.2K
Archived
Jupyter Notebook
LLM Frameworks
API Frameworks
#gensim #machine-learning #natural-language-processing

lakehq/sail

LakeSail is a Rust-based computation framework that unifies batch processing, stream processing, and AI workloads.

1.2K
Active
Rust
ML Ops
ETL & Pipelines
#distributed-computing #data-engineering #big-data

lensacom/sparkit-learn

A Python library that integrates Scikit-learn into the Apache Spark distributed computing framework.

1.2K
Archived
Python
ML Ops
ETL & Pipelines
#apache-spark #scikit-learn #distributed-computing

graphframes/graphframes

GraphFrames provides DataFrame-based Graphs for Apache Spark, enabling scalable graph analysis and algorithms.

1.1K
Active
Scala
Databases
Caching
#apache-spark #big-data #graph-analysis

mahmoudparsian/data-algorithms-book

This repository provides a comprehensive guide and implementations for data algorithms using MapReduce, Spark, Java, and Scala.

1.1K
Archived
Java
Databases
ETL & Pipelines
Apache Hadoop
#data-algorithms #mapreduce #spark
