Pyspark
Setup spark
Download Spark from official website https://spark.apache.org/downloads.html
Setup environment

Connect Spark
Spark Architecture

Concept
Lazy Evaluation
Evaluation strategy that delays the evaluation of an expression until its value is needed
Transformations
Narrow: can be done within a single partition ex. map(), filter(), union()
Wide: need to be combined data from all partitions ex. groupByKey(), reduceByKey(), join()
Catalyst Optimizer

Analysis
Logical optimization
Physical planning
Code generation
Tungsten Binary Format (Unsafe Row)
Tungsten Binary Format is data sharing optimization in shuffle step
Data shuffle creates stage boundary
Spark can execute a process backwards
The cache process works like shuffle files
Application

Job: a sequence of transformations on data. ex count(), first(), collect(), save(), read(), write(), take()
Stage: a job is divided into stages by shuffling
Task: smallest unit of work. Each stage is divided into tasks. Execute on executors parallely. Operation over a single partition of data

Memory

Type of Memory
Driver Memory: the memory assigned to the driver process used for creating execution plans, tracking data, and gathering results.
Execurtor Memory: the amount of memory allocated to each executor process
Reserved Memory: the memory reserved by the system 300MB that cannot be changed. If executor memory less than 1.5 times of reserved memory (450MB), it will be failed
User Memory: stores all the user defined data structures, UDFs
Spark Memory: This is responsible for storing intermediate state while doing task execution, cache/persisted data
Storage Memory: used for storing all of the cached data, broadcast variable
Execution Memory: used by Spark for objects created during execution of task
Calulation
Reserved Memory = 300 MB
User Memory = (java heap - Reserved Memory) * (1- spark.memory.fraction)
Spark Memory = (java heap - Reserved Memory) * spark.memory.fraction
Storage Memory = SparkMemory * spark.memory.storageFraction
Execution Memory = SparkMemory * (1-spark.memory.storageFraction)
Coding
quick start
Data Type
Functions
Transform
Write
Insert into
overwrite (static): overwrite table
overwrite (dynamic): overwrite by partition
saveAsTable
overwrite table
Create if not exists table
WIndow Analytics
Configuration
Dynamic Partition
Memory
References
Last updated