From Robert Grandl <rgra...@yahoo.com>
Subject Tez runtime repartitioning
Date Fri, 30 Dec 2016 20:10:22 GMT
Hi guys,
I know that Tez can automatically tune the number of downstream tasks based on statistics
about the output of the upstream tasks. In MR parlance, the upstream tasks can be map tasks
and the downstream tasks can be reduce tasks.

It seems that Tez is somehow able to repartition the key ranges to match the new number of
downstream tasks, which is computed at runtime from these statistics.

I have the following questions:

1. While the upstream tasks are executing, how is the output data partitioned? I assume each
task splits its output by key range into separate partitions, which are written to disk as
intermediate files.
2. After some upstream tasks have finished and the number of downstream tasks has been adjusted
based on the expected output, the data will basically have to be repartitioned. For example, if
my data was initially going into 10 partitions, it may now go into 2 because Tez decided that
only 2 downstream tasks are enough to fetch the data. If this is the case, how does the
repartitioning happen? Which key ranges from the initial partitions go to which key ranges in
the newly computed partitions?
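
To make the scenario in question 2 concrete, here is a minimal sketch of one mapping that
would not need to re-split any keys: each new downstream task simply takes a contiguous group
of the original partitions. This is only an illustrative guess, not a description of what Tez
actually does. `group_partitions` is a hypothetical helper, not a Tez API.

```python
def group_partitions(num_partitions, num_tasks):
    """Hypothetical helper (an assumption, not Tez code).

    Assigns contiguous ranges of original partition indices to a smaller
    number of downstream tasks, spreading any remainder over the first tasks.
    """
    base, extra = divmod(num_partitions, num_tasks)
    groups = []
    start = 0
    for task in range(num_tasks):
        # The first `extra` tasks each take one additional partition.
        size = base + (1 if task < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# The example from question 2: 10 original partitions, 2 downstream tasks.
for task, partitions in enumerate(group_partitions(10, 2)):
    print(f"downstream task {task} fetches original partitions {partitions}")
```

Under this guess, the upstream output would keep its original 10 partitions on disk and only
the assignment of partitions to downstream tasks would change.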
Thanks in advance,
Robert

