hadoop general
- schema on read vs RDBMS schema on write
- data flow
- input splits
- split size defaults to the HDFS block size, so a split never spans two blocks; a split spanning blocks could not be processed entirely on one node, defeating data locality
- data locality. same node -> same rack -> off-rack
- one map task for each split
- map output is written to local disk, not HDFS, because it is an intermediate result that is thrown away once the job completes. exception: a job with no reducers writes map output directly to HDFS
- reduce output is stored in HDFS. the first replica is written on the local node and the remaining replicas off-rack, for reliability
- map output is sorted locally and sent across the network to the nodes where the reduce tasks run; there the outputs from the different map tasks are merged and passed to the reduce function
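the reduce-side merge above can be sketched in plain Java (the data and class name are hypothetical; a TreeMap stands in for the real merge of sorted runs, not the actual MapReduce machinery):

```java
import java.util.*;

public class MergeSketch {
    // Group sorted (key, value) runs by key; a TreeMap stands in for the
    // real merge of sorted runs, yielding each key once in sorted key order.
    static TreeMap<String, List<String>> group(List<List<String[]>> runs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (List<String[]> run : runs)
            for (String[] kv : run)
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped;
    }

    public static void main(String[] args) {
        // Sorted output fetched from two hypothetical map tasks.
        List<String[]> fromMap1 = List.of(new String[]{"a", "1"}, new String[]{"c", "1"});
        List<String[]> fromMap2 = List.of(new String[]{"a", "1"}, new String[]{"b", "1"});

        // After merging, each key reaches the reduce function exactly once,
        // together with all of its values.
        group(List.of(fromMap1, fromMap2))
            .forEach((k, vs) -> System.out.println(k + " -> " + vs));
    }
}
```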
- map tasks partition their output: each map task creates one partition per reducer. a partition may contain many keys, but all records for a given key go to the same partition (customizable with a Partitioner class)
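Hadoop's default HashPartitioner effectively computes the partition like this (a minimal self-contained sketch; the class name and sample keys are illustrative):

```java
import java.util.*;

public class PartitionSketch {
    // Mirrors Hadoop's default HashPartitioner.getPartition:
    // mask off the sign bit, then mod by the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // Records with the same key always land in the same partition,
        // so one reducer sees all values for that key.
        for (String key : List.of("apple", "banana", "apple")) {
            System.out.println(key + " -> partition " + partition(key, reducers));
        }
    }
}
```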
- combiner function cuts network traffic by pre-aggregating map output on the map node. Hadoop may run it zero, one, or many times on a given output, so it must not change the final result (use commutative and associative operations such as sum or max, not mean)
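a word-count-style example of why the "zero, one, or many times" rule works: with a commutative and associative operation like sum, pre-aggregating on the map side gives the same answer while shipping fewer records. Class name and sample data are hypothetical; this sketches the idea, not the Hadoop Combiner API:

```java
import java.util.*;

public class CombinerSketch {
    // Sum of counts; usable as both combiner and reducer because
    // addition is commutative and associative.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Two map tasks emitted counts of 1 for the same word:
        List<Integer> map1 = List.of(1, 1, 1);
        List<Integer> map2 = List.of(1, 1);

        // Without a combiner: all 5 records cross the network to the reducer.
        int withoutCombiner = sum(List.of(1, 1, 1, 1, 1));

        // With a combiner: each map side pre-sums, so only 2 records travel.
        int withCombiner = sum(List.of(sum(map1), sum(map2)));

        System.out.println(withoutCombiner + " == " + withCombiner); // both 5
    }
}
```

the same trick fails for mean: mean(mean(1,1,1), mean(1,1)) is still 1 here, but in general averaging partial averages weights them incorrectly, which is why only result-preserving operations are safe as combiners.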