From xqfly...@163.com
Subject Re: Re: How to use all yarn cluster resources for one tez job
Date Sat, 11 Apr 2015 06:22:24 GMT
Thanks Gopal,
I am new to Tez and not familiar with some of the terms - what is an 'Operator pipeline'?
Do we have a design wiki for the execution details?
Regards



On 2015-04-11 13:58, Gopal Vijayaraghavan wrote:


> From the code, there is a locationHint there (I am not sure when the
> location hint is set), but it looks like the task will reuse the container
> before going to ask for a new container.

Yes, and we end up being able to reuse a container, which has a lot of
benefits when running 10,000 tasks of the same vertex.

We do not deserialize the Operator pipeline again as long as we're getting
a new split of the same vertex, since the plan is identical, and we cache
other details which are the same across splits.

Reuse is what produces the JIT performance improvements, particularly when
dealing with queries which take < 10s.
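
For reference, container reuse is governed by a couple of stock Tez
settings; a minimal sketch (the values shown are the usual defaults, not
anything specific to this thread):

-- container reuse is on by default; these are the relevant knobs
set tez.am.container.reuse.enabled=true;
-- allow reusing a held container for rack-local (but not arbitrary) tasks
set tez.am.container.reuse.rack-fallback.enabled=true;
set tez.am.container.reuse.non-local-fallback.enabled=false;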

> And my network is 10GE, so I think network IO is not a bottleneck for my
> case, so whether to make the data-aware hint is not that important.
> But if I can make all tasks run at the same time, then it will be good.

The overhead of non-locality is not the network layer, but the DFS client
& IO pipeline - there is a lot of DFS audit logging, and data checksum
verifications happen for each read sequence.

On local disks, we bypass the HDFS daemon after the block open and read
data off the disks directly via a unix domain socket mechanism called
short-circuit reads - there's a highly optimized assembly crc32 checksum
implementation there that works really fast.
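
For reference, short-circuit reads are a standard HDFS client feature; a
rough sketch of the hdfs-site.xml properties involved (the socket path is
just a common example, not a requirement):

# hdfs-site.xml, on DataNodes and clients
dfs.client.read.shortcircuit = true
# unix domain socket shared by the DataNode and local readers
dfs.domain.socket.path = /var/lib/hadoop-hdfs/dn_socket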

We can actually max out a 100Gbps switch with just shuffles - not a single
host's ethernet card, but the 20 machines exchanging data while connected
to a single switch.

That said, you need to tune kernel parameters to get that level of
performance - all dirs mounted noatime,nodiratime; nscd running on all
hosts; transparent huge page defrag turned off; somaxconn increased to >
4k; wmem_max/rmem_max sizes tuned; TCP offload to the NIC enabled;
vm.dirty_ratio set to 50 and vm.dirty_background_ratio to 30 (yes, that's
what you need, sadly).
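
As a rough sketch of those knobs (the numbers are starting points under the
assumption of a dedicated Hadoop host, not a recipe):

# sysctl settings - tune to your hardware
net.core.somaxconn = 4096
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
vm.dirty_ratio = 50
vm.dirty_background_ratio = 30

# disable transparent huge page defrag (the path varies by distro/kernel)
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# plus: mount data dirs noatime,nodiratime, run nscd on every host, and
# enable TCP offload on the NIC, e.g. ethtool -K <iface> tso on gso on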

Anyway, it only takes a single maxed-out host to slow down query
performance.


That said, it is entirely possible that Tez is running your tasks in less
time than it waits for locality (~400ms).
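
If that turns out to be the case, the wait is tunable; a hedged example
(the property is a stock Tez setting, the 100ms value is only an
illustration):

-- how long the AM holds a freed container waiting for a node-local task
set tez.am.container.reuse.locality.delay-allocation-millis=100;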

I sent out the YARN swimlanes example to someone else earlier on the hive
lists:

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh


You can see your locality delays (look for blank gaps between usage, or a
staggering of the green line):
http://people.apache.org/~gopalv/query27.svg

If you want to send me something to analyze, please send me the output of
"yarn logs -applicationId <appid> | grep HISTORY", which is the raw data
for that chart.

>set tez.grouping.min-size=16777216
> [skater] What does this parameter mean? Do we have a wiki to trace the
>options?

That parameter configures the equivalent of CombineInputFormat for Tez.

Instead of combining file ranges, Tez groups existing InputSplit[] together
- this groups splits until a group reaches 16Mb, or until the group count
adds up to 1.7x the cluster capacity (with a max-size of 1Gb by default).
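
For example, the grouping bounds can be set from the Hive session; the
values below are just the defaults described above:

-- lower bound (16Mb) and upper bound (1Gb) for a grouped split
set tez.grouping.min-size=16777216;
set tez.grouping.max-size=1073741824;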

This is necessary for Hive ACID (insert, update, delete transactions) to
work correctly, since each split gets more than one file - a base file plus
a sequence of delta files.

We had some discussion around docs here recently -
http://mail-archives.apache.org/mod_mbox/tez-user/201504.mbox/%3CCAJqL3EKWiY4y9BxwLFoLBmu5yB4MSEjDW_59J98yK9joKx375w%40mail.gmail.com%3E

In hive-0.6, other than grouping, you can also set the min-held containers
to your maximum so that we don't let any containers expire between queries.

This hurts concurrency, but since you want to run 1 big query on an entire
cluster, this would help (as Bikas said).

hive -hiveconf hive.prewarm.enabled=true -hiveconf hive.prewarm.numcontainers=1000

if you want to do a land-grab on the cluster & hold onto those containers
indefinitely.
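
The min-held containers setting mentioned above is a Tez AM session
property; a hedged sketch of pairing it with prewarm (assuming tez.*
values passed via -hiveconf reach the Tez session config, which Hive
normally arranges):

hive -hiveconf hive.prewarm.enabled=true \
     -hiveconf hive.prewarm.numcontainers=1000 \
     -hiveconf tez.am.session.min.held-containers=1000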

Since Tez does some predicate pruning during split generation, make sure
you turn on 'hive.optimize.index.filter=true' and 'hive.optimize.ppd=true'
to use the ROW INDEX data within the ORC data.
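
For example, in the Hive session running the query:

-- push predicates down and use ORC row-index stats to skip row groups
set hive.optimize.ppd=true;
set hive.optimize.index.filter=true;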

For more hive specific questions, ask the hive user list.

Cheers,
Gopal

