From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: How to use all yarn cluster resources for one tez job
Date Sat, 11 Apr 2015 04:58:06 GMT

> I have a hive full-scan job. With hive on MR I can fully use the
>whole cluster's 1000 CPU vcores (I use the split size to get 1200 mapper
>tasks), but with tez, tez only uses around 700 vcores. I have also set
>the same hive split size. So how do I configure tez to make it fully use
>all the cluster resources?

If you're on hive-1.0/later, the option to go wide is called
tez.grouping.split-waves.

With ORC, the regular MRv2 split generation produces empty tasks (so not
all map-tasks end up with valid ranges to process).

But to get it as wide as possible

set mapred.max.split.size=33554432
set tez.grouping.split-waves=1.7
set tez.grouping.min-size=16777216

should do the trick; split-waves multiplies the currently available queue
capacity by 1.7x, so split generation goes wider than the capacity that is
actually free.
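
As a rough illustration (hypothetical numbers, assuming one vcore per map
task): if the queue currently has ~1000 vcores free, then

  1000 available vcores * 1.7 waves = ~1700 grouped splits

so there is always another wave of pending tasks ready as containers free
up.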

In previous versions (0.13/0.14), "set" commands don't work for these, so
the options are prefixed with tez.am.* - you have to do

hive -hiveconf tez.am.grouping.split-waves=1.7 -hiveconf
tez.grouping.min-size=16777216 -hiveconf mapred.max.split.size=33554432


We hope to throw away these hacks in hive-1.2; for this, Prasanth checked
in a couple of different split strategies for ORC in hive-1.2.0
(ETL/BI/HYBRID).
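
If you do end up trying hive-1.2, the strategy should be selectable with a
normal set command - I believe the knob is hive.exec.orc.split.strategy,
e.g.

set hive.exec.orc.split.strategy=ETL;

with BI generating splits without reading the file footers, ETL reading
the footers for stripe-accurate splits, and HYBRID picking between the two
automatically.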

I will probably send out my slides about ORC (incl. new split gen) after
Hadoop Summit Europe, if you want more details.

Ideally, any tests with the latest code would help me fix anything that's
specific to your use-cases.


Cheers,
Gopal