PySpark

Set up Spark

  1. Download Spark from the official website: https://spark.apache.org/downloads.html

  2. Set up the environment

Download Spark

Connect to Spark

Spark Architecture

Concept

Lazy Evaluation

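An evaluation strategy that delays evaluating an expression until its value is needed. In Spark, transformations only add steps to an execution plan; nothing runs until an action such as count() or collect() asks for a result.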
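The idea can be sketched in plain Python with generators, which are also evaluated lazily. This is only an analogy, not the Spark API: the names `lazy_map`, `lazy_filter`, and `log` are made up for illustration.

```python
# Analogy only: Python generators stand in for Spark transformations.
log = []  # records when work actually happens


def lazy_map(func, items):
    """Yield func(x) for each item, only when a consumer asks for it."""
    for x in items:
        log.append(f"map {x}")
        yield func(x)


def lazy_filter(pred, items):
    """Yield the items that satisfy pred, only on demand."""
    for x in items:
        if pred(x):
            yield x


# "Transformations": building the pipeline does no work yet.
pipeline = lazy_filter(lambda v: v % 2 == 0, lazy_map(lambda v: v + 1, range(5)))
work_before_action = len(log)

# "Action": consuming the pipeline forces evaluation of every step.
result = list(pipeline)
work_after_action = len(log)
```

Nothing is logged while the pipeline is being built; the map step only runs once `list()` consumes it, which mirrors how a Spark action triggers the pending transformations.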

Transformations

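  • Narrow: each output partition depends on a single input partition, so no data moves between partitions, e.g. map(), filter(), union()

  • Wide: each output partition may need data from many input partitions, which requires a shuffle, e.g. groupByKey(), reduceByKey(), join()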
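The difference can be illustrated with a small plain-Python simulation in which each inner list stands in for one partition. This is a sketch, not Spark code: `narrow_map_values`, `wide_group_by_key`, and the toy partitioner are hypothetical helpers chosen for illustration.

```python
# Simulation only: each inner list stands in for one partition.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map_values(parts, func):
    """Narrow: every output partition is built from exactly one input partition."""
    return [[(k, func(v)) for k, v in part] for part in parts]


def wide_group_by_key(parts, num_partitions):
    """Wide: records are redistributed (shuffled) so equal keys meet in one partition."""
    shuffled = [{} for _ in range(num_partitions)]
    for part in parts:
        for k, v in part:
            # Deterministic toy partitioner (an assumption, not Spark's hash partitioner).
            target = sum(k.encode()) % num_partitions
            shuffled[target].setdefault(k, []).append(v)
    return [sorted(p.items()) for p in shuffled]


doubled = narrow_map_values(partitions, lambda v: v * 2)
grouped = wide_group_by_key(partitions, num_partitions=2)
```

In the narrow case the partition layout is unchanged. In the wide case the two values for key "a" start in different partitions and only end up together after the redistribution step.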

Catalyst Optimizer

  • Analysis

  • Logical optimization

  • Physical planning

  • Code generation

Tungsten Binary Format (Unsafe Row)

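Tungsten Binary Format is a compact binary row representation (UnsafeRow). It lets Spark exchange data between stages in the shuffle step without converting rows to and from JVM objects.

  • A data shuffle creates a stage boundary

  • Spark plans execution backwards from the action, so it can skip stages whose shuffle files already exist

  • Cached data works like shuffle files: Spark reuses it instead of recomputing the upstream steps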

Application

Job, Stage, and Task

  • Job: the work triggered by an action, made up of the sequence of transformations needed to produce its result, e.g. count(), first(), collect(), save(), read(), write(), take()

  • Stage: a job is divided into stages at shuffle boundaries

  • Task: the smallest unit of work. Each stage is divided into tasks that run in parallel on executors; each task operates on a single partition of data
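As a rough illustration of how these three levels relate, the sketch below splits a linear pipeline into stages at each wide operation and counts one task per partition. It is a simplified model, not how Spark's scheduler is implemented: it ignores joins, map-side combining, and skipped stages, and `plan_job` and its parameters are hypothetical names.

```python
# Simplified model of a linear pipeline (no joins, no cached or skipped stages).
def plan_job(operations, input_partitions, shuffle_partitions):
    """Split operations into stages at each wide operation.

    Returns a list of (operation_names, task_count) tuples, one per stage.
    """
    stages = []
    current_ops, current_parts = [], input_partitions
    for name, kind in operations:
        if kind == "wide":
            # A shuffle closes the current stage and starts a new one.
            stages.append((current_ops, current_parts))
            current_ops, current_parts = [], shuffle_partitions
        current_ops.append(name)
    stages.append((current_ops, current_parts))
    return stages


ops = [
    ("map", "narrow"),
    ("filter", "narrow"),
    ("reduceByKey", "wide"),
    ("map", "narrow"),
]
stages = plan_job(ops, input_partitions=4, shuffle_partitions=2)
total_tasks = sum(tasks for _, tasks in stages)
```

With one wide operation the job has two stages: the first runs four tasks (one per input partition) and the second runs two (one per shuffle partition).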

Stage and Shuffle

Memory

Type of Memory

  • Driver Memory: the memory assigned to the driver process, used for creating execution plans, tracking data, and gathering results

  • Executor Memory: the amount of memory allocated to each executor process

    • Reserved Memory: 300 MB reserved by the system; it cannot be changed. If executor memory is less than 1.5 times the reserved memory (450 MB), the executor fails to start

    • User Memory: stores user-defined data structures and UDFs

    • Spark Memory: stores intermediate state during task execution, plus cached/persisted data

      • Storage Memory: used for cached data and broadcast variables

      • Execution Memory: used for objects created while tasks execute

Calculation

Reserved Memory = 300 MB

User Memory = (Java Heap - Reserved Memory) * (1 - spark.memory.fraction)

Spark Memory = (Java Heap - Reserved Memory) * spark.memory.fraction

Storage Memory = Spark Memory * spark.memory.storageFraction

Execution Memory = Spark Memory * (1 - spark.memory.storageFraction)
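The formulas above can be worked through with a small helper. This is a minimal sketch: `executor_memory_regions` is a hypothetical function name, and the default values 0.6 and 0.5 are assumed to match the defaults of `spark.memory.fraction` and `spark.memory.storageFraction` in recent Spark releases, so check them against the version you run.

```python
RESERVED_MB = 300  # fixed reserved memory, as stated above


def executor_memory_regions(java_heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return the size in MB of each memory region for a given executor heap."""
    if java_heap_mb < 1.5 * RESERVED_MB:
        raise ValueError("heap must be at least 1.5 x reserved memory (450 MB)")
    usable = java_heap_mb - RESERVED_MB
    spark_memory = usable * memory_fraction
    return {
        "reserved": RESERVED_MB,
        "user": usable * (1 - memory_fraction),
        "spark": spark_memory,
        "storage": spark_memory * storage_fraction,
        "execution": spark_memory * (1 - storage_fraction),
    }


# Example: a 4 GB (4096 MB) executor heap with the assumed default fractions.
regions = executor_memory_regions(4096)
```

For a 4096 MB heap this leaves 3796 MB after the reserved memory, split into roughly 1518 MB of User Memory and 2278 MB of Spark Memory, and the latter is divided evenly between storage and execution.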

Coding

Quick Start

Data Type

Functions

Transform

Write

Insert into

  • overwrite (static): replaces the entire table, including partitions that receive no new data

  • overwrite (dynamic): replaces only the partitions that appear in the incoming data
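In Spark the mode is selected with the `spark.sql.sources.partitionOverwriteMode` setting (`static` or `dynamic`). The plain-Python simulation below only illustrates the difference in outcome; the table is modelled as a dict of partitions, and `insert_overwrite` and the sample rows are hypothetical.

```python
# Simulation only: a table is a dict mapping partition value -> list of rows.
def insert_overwrite(table, new_rows, partition_key, mode):
    """Return the table contents after an overwrite insert in the given mode."""
    incoming = {}
    for row in new_rows:
        incoming.setdefault(row[partition_key], []).append(row)
    if mode == "static":
        # Static: all existing partitions are dropped, only the new data remains.
        return incoming
    if mode == "dynamic":
        # Dynamic: only partitions that appear in the new data are replaced.
        return {**table, **incoming}
    raise ValueError(f"unknown mode: {mode}")


table = {
    "2024-01-01": [{"dt": "2024-01-01", "v": 1}],
    "2024-01-02": [{"dt": "2024-01-02", "v": 2}],
}
new_rows = [{"dt": "2024-01-02", "v": 20}]

static_result = insert_overwrite(table, new_rows, "dt", "static")
dynamic_result = insert_overwrite(table, new_rows, "dt", "dynamic")
```

With static mode the untouched 2024-01-01 partition disappears; with dynamic mode it survives and only 2024-01-02 is replaced.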

saveAsTable

Overwrite table

Create table if not exists

Window Analytics

Configuration

Dynamic Partition

Memory

References
