Evaluation strategy that delays the evaluation of an expression until its value is needed
Transformations
Narrow: can be done within a single partition ex. map(), filter(), union()
Wide: need to be combined data from all partitions ex. groupByKey(), reduceByKey(), join()
Catalyst Optimizer
Analysis
Logical optimization
Physical planning
Code generation
Tungsten Binary Format (Unsafe Row)
Tungsten Binary Format is data sharing optimization in shuffle step
Data shuffle creates stage boundary
Spark can execute a process backwards
The cache process works like shuffle files
Application
Job: a sequence of transformations on data. ex count(), first(), collect(), save(), read(), write(), take()
Stage: a job is divided into stages byshuffling
Task: smallest unit of work. Each stage is divided into tasks. Execute on executors parallely. Operation over a single partition of data
Memory
Type of Memory
Driver Memory: the memory assigned to the driver process used for creating execution plans, tracking data, and gathering results.
Execurtor Memory: the amount of memory allocated to each executor process
Reserved Memory: the memory reserved by the system 300MB that cannot be changed. If executor memory less than 1.5 times of reserved memory (450MB), it will be failed
User Memory: stores all the user defined data structures, UDFs
Spark Memory: This is responsible for storing intermediate state while doing task execution, cache/persisted data
Storage Memory: used for storing all of the cached data, broadcast variable
Execution Memory: used by Spark for objects created during execution of task