From Robert Grandl
Subject clarification regarding Tez DAGs
Date Mon, 28 Nov 2016 23:44:19 GMT
Hi all,
I am trying to get a better understanding of the DAGs generated by Hive atop Tez. However,
I have some more fundamental questions about the types of tasks/edges in a Tez DAG. 

1) In the case of MapReduce:
Map - takes records and generates <Key, Value> pairs.
Reduce - takes <Key, Value> pairs and reduces the list of values for the same Key.
Question: That means the reducer does not change the Keys, right?
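
To make question 1 concrete, here is a minimal Python sketch of the model as I describe it above. It is not Hadoop or Tez code, and the names map_fn, reduce_fn and run_mapreduce are hypothetical, chosen only for illustration. It assumes a word-count style job in which the reducer emits each key unchanged, which is exactly the assumption I am asking about.

```python
from collections import defaultdict


def map_fn(record):
    # Map: take one input record and emit <Key, Value> pairs (here <word, 1>).
    for word in record.split():
        yield word, 1


def reduce_fn(key, values):
    # Reduce: collapse the list of values for one key.
    # In this sketch the key is passed through unchanged (an assumption).
    return key, sum(values)


def run_mapreduce(records):
    # Shuffle: group all mapped values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


result = run_mapreduce(["a b a", "b c"])
```

In this sketch the set of output keys equals the set of keys produced by the map step; whether a reducer must behave this way is the open question.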

In the case of Tez, things can be more complex:
2) For example, Map tasks can be in the middle of the DAG too. My understanding is that
in this case the input is a set of <Key, Value> pairs and the output can be a set of
different <KeyX, ValueX> pairs.
Is this true for any type of input edge (scatter-gather, broadcast, one-to-one)?

3) Reduce tasks can be in the middle as well. Can I assume that the reducer can also change
the keys? For example, in the case of Map -> Reduce_1 -> Reduce_2 patterns, what is the
main reason for having Reduce_2? Is it because the keys are changed by Reduce_2 while Reduce_1
preserves the ones from the Map?
4) On a related note: in the case of Map_1 -> Map_2 patterns, is it possible for Map_2 to
preserve the Keys generated by Map_1, or will there be new keys?

5) If my guess is right that both Map and Reduce stages can eventually change the keys, what
is the main difference between having Map stages and Reduce stages in the middle of the DAG
(i.e. not input stages or leaf stages)?
Thanks,
- Robert
