hadoop general
- schema on read vs RDBMS schema on write
- data flow
- input splits
- split size defaults to the HDFS block size, so a split never spans two blocks; a split spanning blocks could not be processed entirely on one node, defeating data locality
- data locality. same node -> same rack -> off-rack
- one map task for each split
- map output is written to local disk, not HDFS, because it is an intermediate result that is thrown away once the job completes. exception: a job with no reducers writes map output directly to HDFS
- reduce output is stored in HDFS. the first replica is written on the local node and the remaining replicas off-rack, for reliability
- map output is sorted locally and sent across the network to the nodes where the reduce tasks run; there the outputs from the different map tasks are merged and passed to the reduce function
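the reduce-side merge above can be sketched in plain Java (the data and class name are hypothetical; a TreeMap stands in for the real merge of sorted runs, not the actual MapReduce machinery):

```java
import java.util.*;

public class MergeSketch {
    // Group sorted (key, value) runs by key; a TreeMap stands in for the
    // real merge of sorted runs, yielding each key once in sorted key order.
    static TreeMap<String, List<String>> group(List<List<String[]>> runs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (List<String[]> run : runs)
            for (String[] kv : run)
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped;
    }

    public static void main(String[] args) {
        // Sorted output fetched from two hypothetical map tasks.
        List<String[]> fromMap1 = List.of(new String[]{"a", "1"}, new String[]{"c", "1"});
        List<String[]> fromMap2 = List.of(new String[]{"a", "1"}, new String[]{"b", "1"});

        // After merging, each key reaches the reduce function exactly once,
        // together with all of its values.
        group(List.of(fromMap1, fromMap2))
            .forEach((k, vs) -> System.out.println(k + " -> " + vs));
    }
}
```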
- map tasks partition their output: each map task creates one partition per reducer. a partition may contain many keys, but all records for a given key go to the same partition (customizable with a Partitioner class)
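Hadoop's default HashPartitioner effectively computes the partition like this (a minimal self-contained sketch; the class name and sample keys are illustrative):

```java
import java.util.*;

public class PartitionSketch {
    // Mirrors Hadoop's default HashPartitioner.getPartition:
    // mask off the sign bit, then mod by the number of reduce tasks.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // Records with the same key always land in the same partition,
        // so one reducer sees all values for that key.
        for (String key : List.of("apple", "banana", "apple")) {
            System.out.println(key + " -> partition " + partition(key, reducers));
        }
    }
}
```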
- combiner function cuts network traffic by pre-aggregating map output on the map node. Hadoop may run it zero, one, or many times on a given output, so it must not change the final result (use commutative and associative operations such as sum or max, not mean)
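a word-count-style example of why the "zero, one, or many times" rule works: with a commutative and associative operation like sum, pre-aggregating on the map side gives the same answer while shipping fewer records. Class name and sample data are hypothetical; this sketches the idea, not the Hadoop Combiner API:

```java
import java.util.*;

public class CombinerSketch {
    // Sum of counts; usable as both combiner and reducer because
    // addition is commutative and associative.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Two map tasks emitted counts of 1 for the same word:
        List<Integer> map1 = List.of(1, 1, 1);
        List<Integer> map2 = List.of(1, 1);

        // Without a combiner: all 5 records cross the network to the reducer.
        int withoutCombiner = sum(List.of(1, 1, 1, 1, 1));

        // With a combiner: each map side pre-sums, so only 2 records travel.
        int withCombiner = sum(List.of(sum(map1), sum(map2)));

        System.out.println(withoutCombiner + " == " + withCombiner); // both 5
    }
}
```

the same trick fails for mean: mean(mean(1,1,1), mean(1,1)) is still 1 here, but in general averaging partial averages weights them incorrectly, which is why only result-preserving operations are safe as combiners.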