sparkNotes
val someRDD = sc.parallelize(1 to 10)
someRDD.count()
In cluster mode, partitions are spread across the nodes.
In standalone/local mode the default is one partition per processor core.
val secRDD = sc.parallelize(1 to 100,5)
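A quick way to verify this is to check the partition count; a minimal sketch using the Java API (which the snippets further down also use), assuming an existing JavaSparkContext named jsc and Spark 1.6+ for getNumPartitions():

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> fiveParts =
        jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 5);        // 5 partitions requested explicitly
System.out.println(fiveParts.getNumPartitions());                                // 5
System.out.println(jsc.parallelize(Arrays.asList(1, 2, 3)).getNumPartitions());  // default parallelism of the environment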
Transformations:
The return type is always an RDD.
Transformations do not start execution; only metadata is stored, in the form of a DAG.
This supports lazy evaluation.
parallelize() is used in dev and testing work, not in production work (there, read from a source such as textFile()).
Actions:
Always return some data to the driver or write output to storage.
Execution starts immediately when an action is called.
Transformations and actions are the two kinds of operations on an RDD.
Combined, they provide around 80 different operators.
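A small sketch of the lazy-evaluation point above, assuming an existing JavaSparkContext named jsc; nothing executes until the action at the end:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<Integer> nums = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
JavaRDD<Integer> doubled = nums.map(x -> x * 2);    // transformation: only the DAG is recorded
JavaRDD<Integer> big = doubled.filter(x -> x > 4);  // transformation: still nothing executed
long howMany = big.count();                         // action: triggers the actual computation
System.out.println(howMany);                        // 3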
// We cannot mutate an RDD, so a new RDD is created with the filtered content.
// The predicate can be an anonymous function, a member function, or a lambda expression.
import org.apache.spark.{SparkConf, SparkContext}
// appConf is assumed to be a Typesafe Config loaded elsewhere (e.g. ConfigFactory.load())
val conf = new SparkConf().setAppName("Word Count").setMaster(appConf.getConfig("dev").getString("developmentMaster"))
val sc = new SparkContext(conf)
General code snippet (Java):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public static void main(String[] args) {
    SparkConf conf = new SparkConf().setMaster("local").setAppName("Test");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<String> strRDD = sc.parallelize(Arrays.asList("i am spark", "he is not spark", "dont know")); // or sc.textFile("file path")
    // Function<input element type, return type of the function>
    JavaRDD<String> filterRDD = strRDD.filter(new Function<String, Boolean>() {
        public Boolean call(String str) throws Exception {
            return str.contains("spark");
        }
    });
    System.out.println(filterRDD.collect());
}
The operators are the same in Java, Python, and Scala.
# Lambda expression (Java 8+):
JavaRDD<String> filterRDD = strRDD.filter(x -> x.contains("spark"));
# Transformations on a single RDD (see the sketch after this list)
map()
distinct()
filter()
flatMap() <- flattens the results and gathers everything into one list
union() <- oneRDD.union(otherRDD)
intersection()
subtract()
cartesian()
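A sketch of a few of these, assuming an existing JavaSparkContext named jsc and the Spark 2.x Java API (where flatMap returns an Iterator):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = jsc.parallelize(Arrays.asList("i am spark", "he is not spark", "i am spark"));
JavaRDD<String> words    = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()); // one flat list of words
JavaRDD<String> uniq     = words.distinct();                                           // duplicates removed
JavaRDD<String> sparky   = words.filter(w -> w.equals("spark"));                       // keep only "spark"
JavaRDD<Integer> lengths = words.map(String::length);                                  // each word -> its length
JavaRDD<String> both     = uniq.union(sparky);                                         // oneRDD.union(otherRDD)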
# Actions on a single RDD (see the sketch after this list)
* The output of a transformation is an RDD; the output of an action is normal data or a collection.
collect() <- brings all the data to one machine (the driver), so use it only on small results
top(n) <- returns the n largest elements according to the ordering
take(n) <- returns the first n elements
reduce() <- in the Java API takes a Function2<Integer, Integer, Integer> (for an RDD of Integers)
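A sketch of these actions, reusing the words and lengths RDDs from the sketch above:

import org.apache.spark.api.java.function.Function2;

System.out.println(words.collect());   // full contents on the driver: keep it small
System.out.println(words.take(2));     // first 2 elements
System.out.println(lengths.top(2));    // the 2 largest lengths
Integer total = lengths.reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});                                    // or simply lengths.reduce((a, b) -> a + b)
System.out.println(total);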
## Paired RDD
In the Java API, tuples (scala.Tuple2) are referenced from the Scala library jar, since Java itself has no tuple type.
Transformations on a paired RDD (see the sketch after this list)
mapToPair() <- maps each element to a key/value pair (Tuple2)
reduceByKey(func) <- on a single RDD reduce() is an action, but on a paired RDD reduceByKey() is a transformation: it returns another RDD with one reduced value per key instead of sending a single result to the driver
groupByKey()
mapValues(func)
keys()
values()
sortByKey(true, false, or a comparator as an anonymous class) <- default is ascending (true); false sorts descending
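A word-count-style sketch of these, reusing the words RDD assumed above:

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, Integer> ones   = words.mapToPair(w -> new Tuple2<>(w, 1));  // (word, 1) pairs
JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);         // transformation: still lazy
JavaPairRDD<String, Integer> scaled = counts.mapValues(c -> c * 100);            // only the values change
JavaPairRDD<String, Integer> sorted = counts.sortByKey(false);                   // descending by key
System.out.println(counts.keys().collect());
System.out.println(counts.values().collect());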
Closing the SparkContext also releases its RDDs.
# Transformations on two paired RDDs (see the sketch after this list)
subtractByKey() <- removes pairs whose key appears in the other RDD (only the key is matched)
join() <- inner join, on the common keys
rightOuterJoin()
leftOuterJoin()
cogroup() <- groups data from both RDDs sharing the same key
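A sketch of these join-style operations on two small, hypothetical paired RDDs (jsc assumed as before):

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<String, Integer> left = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2)));
JavaPairRDD<String, String> right = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", "x"), new Tuple2<>("c", "y")));

System.out.println(left.join(right).collect());           // inner join: only key "a"
System.out.println(left.leftOuterJoin(right).collect());  // "b" kept, right side wrapped in Optional
System.out.println(left.rightOuterJoin(right).collect()); // "c" kept, left side wrapped in Optional
System.out.println(left.subtractByKey(right).collect());  // only ("b", 2): key "a" is matched and removed
System.out.println(left.cogroup(right).collect());        // per key: (Iterable of left values, Iterable of right values)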
rdd.saveAsTextFile(path) <- can be used on an actual cluster. It creates a folder with the given name; the actual data lives in the part files inside it (part-00000, part-00001, ...).
All the actions on a single RDD, like collect(), also work on a paired RDD.
The actions below exist only for paired RDDs.
countByKey()
collectAsMap()
lookup()
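A sketch of these three, using the counts RDD assumed above:

System.out.println(counts.countByKey());     // Map<String, Long>: number of pairs per key
System.out.println(counts.collectAsMap());   // Map<String, Integer>: one entry per key, on the driver
System.out.println(counts.lookup("spark"));  // List<Integer>: all values stored under key "spark"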
Real-world use cases:
Number of orders per day (see the sketch below)
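A sketch of the orders-per-day count, assuming a CSV-like orders file whose second field is the order date (the path and record layout here are hypothetical):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> orderLines = jsc.textFile("hdfs:///data/orders.csv");   // e.g. "1,2013-07-25,11599,CLOSED"
JavaPairRDD<String, Integer> ordersPerDay = orderLines
        .mapToPair(line -> new Tuple2<>(line.split(",")[1], 1))         // (order date, 1)
        .reduceByKey((a, b) -> a + b);                                  // add up the 1s for each date
ordersPerDay.saveAsTextFile("hdfs:///out/orders_per_day");              // folder with part files inside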
The driver program can be launched from the master or from any of the slave nodes, same as in Hadoop.
The cluster manager runs on the Spark master node.
In standalone mode, YARN is not required.
But on a Hadoop cluster, we need YARN as the cluster manager.
Speculative execution
Partitioning
Steps at which accumulators and broadcast variables are sent to the executors (see the sketch below)
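A minimal sketch of broadcast variables and a long accumulator (Spark 2.x API; jsc and the words RDD assumed as above). The broadcast value is shipped to executors along with the tasks that use it; accumulator updates only reach the driver once an action runs:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.util.LongAccumulator;

Set<String> stopWords = new HashSet<>(Arrays.asList("am", "is", "not"));
Broadcast<Set<String>> bStop = jsc.broadcast(stopWords);              // read-only copy cached on each executor
LongAccumulator dropped = jsc.sc().longAccumulator("dropped words");  // executors only add to it

JavaRDD<String> kept = words.filter(w -> {
    if (bStop.value().contains(w)) { dropped.add(1); return false; }
    return true;
});
kept.count();                              // action: only now are the accumulator updates sent back
System.out.println(dropped.value());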