hadoop general

- schema on read (Hadoop) vs schema on write (RDBMS): data is loaded raw and the schema is applied only when the data is read, rather than being enforced at load time (see the sketch after this list)

- data flow
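
A minimal sketch of schema-on-read, assuming tab-separated input lines with a hypothetical field layout (timestamp, userId, amount): the raw text sits untyped in HDFS, and the mapper imposes the field positions and types only at read time. Class and field names are illustrative.

```java
// Schema-on-read: the "schema" lives in this mapper, not in the storage layer.
// The field layout (timestamp \t userId \t amount) is a hypothetical example.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SchemaOnReadMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t");
    if (fields.length < 3) {
      return; // malformed record: skip at read time instead of rejecting at write time
    }
    String userId = fields[1];
    double amount = Double.parseDouble(fields[2]);
    context.write(new Text(userId), new DoubleWritable(amount));
  }
}
```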

hadoop general

- splits

  • split size defaults to the HDFS block size, so a split never spans two blocks that could sit on different nodes, which would break data locality
  • data locality preference when scheduling a map task: same node -> same rack -> off-rack
  • one map task runs for each split
  • map output is written to local disk, not to HDFS, because it is intermediate data that is thrown away once the job finishes; the exception is a map-only job (no reducers), whose output goes straight to HDFS
  • reduce output is stored in HDFS: the first replica is written on the local node and the other replicas go off-rack for reliability
  • map output is sorted locally and sent across the network to the node where the reduce task is running; there the outputs from the different map tasks are merged and passed to the reduce function (see the job sketch below)
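
A minimal word-count-style job as a sketch of this flow, with placeholder paths and illustrative class names: FileInputFormat computes one split per HDFS block by default (tunable via mapreduce.input.fileinputformat.split.minsize / split.maxsize), one TokenMapper runs per split and spills its output to local disk, and SumReducer writes the final result back to HDFS.

```java
// Sketch of the data flow in the notes above: splits -> map tasks (local output)
// -> shuffle/merge -> reduce tasks (output to HDFS). Paths come from args.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // spilled to local disk, then shuffled
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(key, new IntWritable(sum));  // written to HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountJob.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```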

hadoop general

  • map tasks partition their output: each map task creates one partition per reducer. a partition may hold many keys, but all records for a given key end up in the same partition (the default hash partitioning can be replaced with a custom Partitioner class)
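
A sketch of that customization, assuming Text keys and IntWritable values: a Partitioner that routes on the first character of the key instead of the default hash of the whole key. Records with the same key still land in the same partition, because the decision depends only on the key.

```java
// Custom Partitioner sketch: one partition per reducer; the return value
// must fall in [0, numPartitions). The prefix-based routing is illustrative.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    char first = k.isEmpty() ? '_' : k.charAt(0);
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}
```

It would be plugged in with job.setPartitionerClass(PrefixPartitioner.class); the number of partitions equals the number of reduce tasks configured on the job.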

hadoop general

  • a combiner function cuts the amount of map output sent over the network. it runs on the same node as the map task, and Hadoop may call it zero, one, or many times for a given map output, so it must not change the final result
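
A sketch of wiring in a combiner, reusing the SumReducer from the illustrative WordCountJob above: summing is commutative and associative, so running it zero, one, or many times over map output on the map-side node cannot change the final answer; it only shrinks what crosses the network.

```java
// Combiner wiring sketch; class names refer to the illustrative WordCountJob above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(CombinerWiring.class);
    job.setMapperClass(WordCountJob.TokenMapper.class);
    job.setCombinerClass(WordCountJob.SumReducer.class); // map-side "mini reduce", may run 0..n times
    job.setReducerClass(WordCountJob.SumReducer.class);  // the real reduce
    // input/output paths, output key/value classes and waitForCompletion as in the earlier sketch
  }
}
```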