
Apache Spark

This is an extension of my 2025 Learning Log.

Reviewing Spark (PySpark) through the course Taming Big Data With Apache Spark.

Setting up

Apache Spark 3.x is only compatible with Java 8, Java 11, or Java 17, while Apache Spark 4 requires Java 17 or newer.

At the time of writing, PySpark is not compatible with Python 3.12 or newer.

So I needed to install older versions of Java and Python alongside the ones already on my system.

Install Java 11

brew install openjdk@11

Make sure that it is the default Java in the system

# disable the newer JDK (jdk-23 on my machine) so it is not picked up as the default
cd /Library/Java/JavaVirtualMachines/jdk-23.jdk/Contents
sudo mv Info.plist Info.plist.disabled

# link the Homebrew OpenJDK 11 install and point JAVA_HOME at it
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk
export JAVA_HOME=`/usr/libexec/java_home -v 11`
java -version

Downgrade Python

brew install python@3.10

Create a virtual environment

python3.10 -m pip install virtualenv
python3.10 -m virtualenv .venv
source .venv/bin/activate

# test
pyspark
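
If the pyspark command is not found inside the virtual environment, it likely needs to be installed first (my assumption; the course may cover this step separately):

# install PySpark into the active virtual environment
pip install pyspark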

To test, run a script with spark-submit:

spark-submit test.py
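
The course supplies its own scripts; as a stand-in, here is a minimal sketch of what a test.py could contain (the file name, app name, and data are assumptions):

from pyspark import SparkConf, SparkContext

# run locally under an app name that shows up in the Spark UI
conf = SparkConf().setMaster("local").setAppName("TestJob")
sc = SparkContext(conf=conf)

# build a small RDD and run an action to confirm the install works
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.map(lambda x: x * x).collect())

sc.stop()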

Spark 3.5 and Spark 4 features

Here are some of the new features in Spark 3.5 and Spark 4:

  • Spark Connect
    • Client / Server architecture for Apache Spark
    • allows control of a remote cluster (see the sketch after this list)
  • Expanded SQL functionality
    • e.g. UDTFs, optimized UDFs, more SQL functions
  • English SDK
    • e.g.
transformed_df = revenue_df.ai.transform('What are the best-selling and the second best-selling products in every category')
  • Debugging features
    • e.g. enhanced error messages, testing API
  • DeepSpeed Distributor
    • distributed training for PyTorch models
  • ANSI mode by default
    • i.e. better error handling in SQL-style queries
  • Variant Data type
    • better support for semi-structured data
  • Collation support
    • for sorting and comparisons, e.g. case sensitivity, Unicode support
  • Data source APIs (Python)
    • read from custom data sources and write to custom sinks
  • The RDD interface is now legacy; the DataFrame API is emphasized instead
  • Delta Lake 4.0 support
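
As an example of Spark Connect, connecting to a remote cluster from PySpark looks roughly like this (a sketch; the host is a placeholder, with 15002 being Spark Connect's default port):

from pyspark.sql import SparkSession

# build a session against a remote Spark Connect server instead of a local JVM
spark = SparkSession.builder.remote("sc://spark-host:15002").getOrCreate()

# DataFrame operations are sent to the remote cluster for execution
df = spark.range(10)
print(df.count())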

Introduction to Spark

Spark is a fast and general engine for large-scale data processing.

It is characterized by its scalability and fault-tolerance features.

Lazy evaluation - Spark doesn't do anything until it is asked to produce results. It builds a DAG (directed acyclic graph) of the steps needed to produce those results and figures out the optimal path, which is a big reason why Spark is fast.
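
A quick illustration of lazy evaluation (my own example, not from the course): the transformations below only describe the DAG, and nothing is computed until the final action runs.

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local").setAppName("LazyDemo"))

rdd = sc.parallelize(range(1_000_000))
squared = rdd.map(lambda x: x * x)              # transformation: nothing runs yet
evens = squared.filter(lambda x: x % 2 == 0)    # still nothing runs

print(evens.count())                            # action: the DAG is built, optimized, and executed here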

Components of Spark

  • Spark Core, with libraries built on top of it:

    • Spark Streaming
    • Spark SQL
    • MLlib
    • GraphX

Introduction to RDDs

The RDD (Resilient Distributed Dataset) is Spark's original dataset structure - under the hood, it is the core object that everything in Spark revolves around.

  • Dataset - an abstraction for a giant set of data
  • Distributed - processing is spread out across a cluster of computers, which may or may not be local
  • Resilient - handles failures and redistributes the load when a failure occurs

Use the RDD object to do actions on the dataset

SparkContext - responsible for making RDDs resilient and distributed; it is the entry point used to create RDDs.
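
A typical starting point (a sketch using an arbitrary app name and made-up data):

from pyspark import SparkConf, SparkContext

# the SparkContext is the entry point that creates RDDs and distributes the work
conf = SparkConf().setMaster("local").setAppName("RDDIntro")
sc = SparkContext(conf=conf)

# create an RDD from a Python list and run an action against it
ratings = sc.parallelize([3, 5, 4, 5, 1, 3])
print(ratings.countByValue())   # tallies how often each value appears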

This post is licensed under CC BY 4.0 by the author.