Spark
Create Project
Open VS Code >> File >> Open Folder >> Select an empty folder
Create a build.sbt file following the code below (see the sketch at the end of this section)
VS Code menu >> Terminal >> New Terminal >> Command Prompt >> run
sbt
Create the folder and file
src/main/scala/<package>/<filename>.scala
If using Windows, download winutils to <your-path>/winutils/hadoop-3.3.1/bin/winutils.exe
Set the environment variable
HADOOP_HOME=<your-path>/winutils/hadoop-3.3.1
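The build.sbt referenced above is not reproduced here; what follows is a minimal sketch, assuming Spark 3.x with Scala 2.13 (the project name and versions are placeholders, adjust to your environment):

```scala
// build.sbt -- minimal sketch; project name and versions are assumptions
name := "spark-example"
version := "0.1.0"
scalaVersion := "2.13.12"   // use a Scala version supported by your Spark release

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.1",
  "org.apache.spark" %% "spark-sql"  % "3.5.1"
)
```

And a minimal entry point for src/main/scala/<package>/<filename>.scala (package and object names are placeholders):

```scala
// src/main/scala/example/Main.scala -- package and file name are placeholders
package example

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the sbt JVM using all local cores
    val spark = SparkSession.builder().appName("spark-example").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10)
    println(rdd.sum()) // 55.0
    spark.stop()
  }
}
```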
Transformations and Actions
Transformations
Return new RDDs as results. They are lazy: their result RDD is not immediately computed
Transformation | Definition | Description |
---|---|---|
map | map[B](f: A => B): RDD[B] | Apply function to each element in the RDD and return an RDD of the result |
flatMap | flatMap[B](f: A => TraversableOnce[B]): RDD[B] | Apply a function to each element in the RDD and return an RDD of the contents of the iterators returned |
filter | filter(pred: A => Boolean): RDD[A] | Apply predicate function to each element in the RDD and return an RDD of elements that have passed the predicate condition |
distinct | distinct(): RDD[T] | Return an RDD with duplicates removed |
union | union(other: RDD[T]): RDD[T] | Return an RDD containing the elements of both RDDs |
intersection | intersection(other: RDD[T]): RDD[T] | Return an RDD containing elements only found in both RDDs |
subtract | subtract(other: RDD[T]): RDD[T] | Return an RDD with the contents of the other RDD removed |
cartesian | cartesian[U](other: RDD[U]): RDD[(T, U)] | Cartesian product with the other RDD |
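A short sketch of a few of these transformations; `sc` is assumed to be an existing SparkContext:

```scala
// `sc` is assumed to be an existing SparkContext
val lines = sc.parallelize(Seq("spark is fast", "spark is lazy"))

val words   = lines.flatMap(_.split(" "))   // RDD[String]: one element per word
val lengths = words.map(_.length)           // RDD[Int]: length of each word
val long    = lengths.filter(_ > 3)         // keep only lengths greater than 3
val unique  = words.distinct()              // duplicates removed
val all     = words.union(unique)           // contents of both RDDs

// Nothing has been computed yet -- transformations are lazy.
```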
Actions
Compute a result based on an RDD; the result is either returned to the driver program or saved to an external storage system (e.g. HDFS). They are eager: their result is immediately computed
Action | Definition | Description |
---|---|---|
collect | collect(): Array[T] | Return all elements from RDD |
count | count(): Long | Return the number of elements in the RDD |
take | take(num: Int): Array[T] | Return the first num elements of the RDD |
reduce | reduce(op: (A, A) => A): A | Combine the elements in the RDD together using op function and return result |
foreach | foreach(f: T => Unit): Unit | Apply function to each element in the RDD |
takeSample | takeSample(withRepl: Boolean, num: Int): Array[T] | Return an array with a random sample of num elements of the dataset, with or without replacement |
takeOrdered | takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] | Return the first n elements of the RDD using either their natural order or a custom comparator |
saveAsTextFile | saveAsTextFile(path: String): Unit | Write the elements of the dataset as a text file in the local filesystem or HDFS |
saveAsSequenceFile | saveAsSequenceFile(path: String): Unit | Write the elements of the dataset as a Hadoop SequenceFile in the local filesystem or HDFS |
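A sketch of a few actions, continuing the transformations example above (these are eager and trigger the computation):

```scala
// Continuing the example above -- actions are eager and trigger the computation
val allWords   = words.collect()        // Array[String] with every element
val n          = words.count()          // number of elements (Long)
val firstTwo   = words.take(2)          // first 2 elements
val totalChars = lengths.reduce(_ + _)  // sum of all word lengths
val sample     = words.takeSample(false, 3) // 3 elements sampled without replacement

words.saveAsTextFile("out/words")       // hypothetical output path
```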
Evaluation in Spark
Iteration and Big Data Processing
Logistic Regression
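Logistic regression is the standard example of an iterative algorithm in Spark: each iteration of gradient descent traverses the whole dataset, which is why caching it (next section) pays off. A minimal sketch with one feature, assuming an existing SparkContext `sc`; the data and hyperparameters are made up for illustration:

```scala
// Minimal 1-feature logistic regression by gradient descent
final case class Point(x: Double, y: Double)

val points = sc.parallelize(Seq(Point(0.5, 0.0), Point(1.0, 0.0), Point(2.0, 1.0), Point(3.0, 1.0)))
  .cache() // traversed on every iteration; without caching it is recomputed each time

var w  = 0.0   // weight
var b  = 0.0   // bias
val lr = 0.1   // learning rate

for (_ <- 1 to 100) {
  // one action per iteration: gradient of the logistic loss over all points
  val (gw, gb) = points.map { p =>
    val pred = 1.0 / (1.0 + math.exp(-(w * p.x + b)))
    val err  = pred - p.y
    (err * p.x, err)
  }.reduce { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }
  w -= lr * gw
  b -= lr * gb
}
```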
Caching and Persistence
By default, RDDs are recomputed each time you run an action on them. Spark allows you to control what is cached in memory
Cache
Shorthand for persisting with the default storage level (MEMORY_ONLY)
Persist
Persistence can be customized with this method
Storage Level | Space Used | CPU time |
---|---|---|
MEMORY_ONLY (default) | High | Low |
MEMORY_ONLY_SER | Low | High |
MEMORY_AND_DISK | High | Medium |
MEMORY_AND_DISK_SER | Low | High |
DISK_ONLY | Low | High |
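A minimal sketch of cache vs. persist, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

// `sc` is assumed to be an existing SparkContext
val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

squares.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// To pick a storage level explicitly, use persist instead of cache:
// squares.persist(StorageLevel.MEMORY_AND_DISK_SER)

squares.count()  // first action: computes the RDD and fills the cache
squares.count()  // second action: reads from the cache, no recomputation
```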
Cluster Topology
The driver program runs the Spark application, which creates a SparkContext upon start-up
The SparkContext connects to a cluster manager (e.g. Mesos/YARN) which allocates resources
Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application
Next, the driver program sends your application code to the executors
Finally, the SparkContext sends tasks for the executors to run
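As a sketch, the master URL passed when the driver creates its SparkContext determines which cluster manager it connects to; the values below are placeholders for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext; the master URL tells it which cluster
// manager to connect to. The values below are placeholders.
val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")   // local mode: driver and "executors" in one JVM
  // .setMaster("yarn")    // or hand resource allocation to a YARN cluster manager

val sc = new SparkContext(conf)
// From here, the SparkContext ships application code and tasks to the executors.
```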
Reduction Operations
Pair RDDs
In Spark, distributed key-value pairs are Pair RDDs that allow you to act on each key in parallel or regroup data across the network
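A minimal Pair RDD sketch, assuming an existing SparkContext `sc`:

```scala
// `sc` is assumed to be an existing SparkContext
val purchases = sc.parallelize(Seq(("alice", 10.0), ("bob", 20.0), ("alice", 5.0)))
// RDD[(String, Double)] -- a Pair RDD, so per-key operations become available:

val totalPerUser = purchases.reduceByKey(_ + _)  // ("alice", 15.0), ("bob", 20.0)
val grouped      = purchases.groupByKey()        // ("alice", Iterable(10.0, 5.0)), ...
val users        = purchases.keys                // RDD of the keys
```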
Partitioning and Shuffling
Shuffling
Moves data from one node to another so it can be grouped with its key. This can be an enormous performance hit
Rule of thumb: a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD.
groupByKey
reduceByKey
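A sketch contrasting the two, assuming an existing SparkContext `sc`: groupByKey ships every value across the network, while reduceByKey combines values within each partition before shuffling only the partial results.

```scala
// `sc` is assumed to be an existing SparkContext
val events = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

// groupByKey shuffles every value across the network, then we sum on the reducer side:
val viaGroup = events.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first and shuffles only the partial sums:
val viaReduce = events.reduceByKey(_ + _)
```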
Partitioning
Kinds of partitioning in Spark:
Hash partitioning (groupByKey)
Range partitioning (sortByKey)
Hash partitioning
Hash partitioning attempts to spread data evenly across partitions based on the key
Range partitioning
Tuples with keys in the same range appear on the same machine
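A sketch applying both partitioners explicitly, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.{HashPartitioner, RangePartitioner}

// `sc` is assumed to be an existing SparkContext
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"), (1, "aa")))

// Hash partitioning: a key goes to partition hash(key) % numPartitions
val hashed = pairs.partitionBy(new HashPartitioner(4))

// Range partitioning: keys in the same range land in the same partition
// (the partitioner samples `pairs` to choose the range boundaries)
val ranged = pairs.partitionBy(new RangePartitioner(4, pairs))
```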
Optimization with partitioners
Partitioning can bring substantial performance gains, especially in the face of shuffles, by ensuring that data is not transmitted over the network to other machines
Optimization using range partitioning
Avoiding a Network Shuffle By Partitioning
reduceByKey running on a pre-partitioned RDD will cause the values to be computed locally
join called on two RDDs that are pre-partitioned with the same partitioner and cached on the same machine will cause the join to be computed locally
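A sketch of both cases, assuming an existing SparkContext `sc`; the re-partitioned RDDs are persisted so the partitioned layout is kept rather than recomputed:

```scala
import org.apache.spark.HashPartitioner

// `sc` is assumed to be an existing SparkContext; data is made up for illustration
val partitioner = new HashPartitioner(8)

val userData = sc.parallelize(Seq((1, "alice"), (2, "bob")))
  .partitionBy(partitioner)
  .persist()  // keep the partitioned layout in memory

val clicks = sc.parallelize(Seq((1, 3), (1, 4), (2, 1)))
  .partitionBy(partitioner)
  .persist()

// reduceByKey on a pre-partitioned RDD combines the values locally -- no shuffle:
val clicksPerUser = clicks.reduceByKey(_ + _)

// join on two RDDs sharing the same partitioner is computed locally -- no shuffle:
val joined = userData.join(clicksPerUser)
```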
Wide and Narrow Dependencies
Transformations can have two kinds of dependencies
Narrow Dependencies: Each partition of the parent RDD is used by at most one partition of the child RDD.
map, mapValues, flatMap, filter
mapPartitions, mapPartitionsWithIndex
union
join (with co-partitioned inputs)
Wide Dependencies: Each partition of the parent RDD may be depended on by multiple child partitions
cogroup
groupWith
join (with inputs not co-partitioned)
leftOuterJoin
rightOuterJoin
groupByKey
reduceByKey
combineByKey
distinct
intersection
repartition
coalesce
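One way to inspect this at runtime is an RDD's dependencies: a ShuffleDependency marks a wide dependency, while e.g. a OneToOneDependency is narrow. A small sketch, assuming an existing SparkContext `sc`:

```scala
// `sc` is assumed to be an existing SparkContext
val pairs = sc.parallelize(Seq((1, "a"), (2, "b")))

val mapped  = pairs.mapValues(_.toUpperCase)  // narrow dependency
val grouped = pairs.groupByKey()              // wide dependency

println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
println(grouped.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)
```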
Lineage and Fault Tolerance
Lineage graphs are the key to fault tolerance in Spark (recomputation)
toDebugString
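A minimal sketch, assuming an existing SparkContext `sc`:

```scala
// `sc` is assumed to be an existing SparkContext
val counts = sc.parallelize(Seq("a b", "b c"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// Prints the lineage graph used for recomputation; indentation marks shuffle boundaries.
println(counts.toDebugString)
```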
SQL, DataFrames, and Datasets
SparkSession
DataFrame
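A minimal sketch of creating a SparkSession and a DataFrame in local mode; the app name and data are placeholders, and the spark-sql dependency from the build.sbt sketch above is assumed:

```scala
import org.apache.spark.sql.SparkSession

// Local-mode SparkSession; app name and data are placeholders
val spark = SparkSession.builder()
  .appName("sql-example")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._  // enables .toDF on local collections and the $"col" syntax

val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")
people.printSchema()
people.filter($"age" > 30).show()
```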
References
Scala Version: https://www.scala-lang.org/download/all.html