From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: How to use all yarn cluster resources for one tez job
Date Sat, 11 Apr 2015 07:34:02 GMT


> I am new to Tez and am not familiar with some of the terms. What is an
> 'Operator pipeline'?
> Is there a design wiki covering the execution details?

The operator pipeline is the part of execution which is not provided by
Tez (the name is the same for Hive-on-MR).

Tez provides inputs, outputs and a processor for each vertex, and the
vertices are connected by edges to make a directed acyclic graph (DAG).
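
To make the vertex/edge picture concrete, here is a minimal sketch in plain Python. It does not use the Tez API; the vertex names and the `topological_order` helper are made up for illustration. It only shows that a set of edges defines an order in which producers run before their consumers.

```python
from collections import defaultdict, deque

# Hypothetical vertex names; each edge points from a producer to a consumer.
edges = [("Map 1", "Reducer 2"), ("Map 2", "Reducer 2"), ("Reducer 2", "Reducer 3")]

def topological_order(edges):
    """Return the vertices so that every producer precedes its consumers."""
    consumers = defaultdict(list)
    pending_inputs = defaultdict(int)
    vertices = []
    for src, dst in edges:
        for v in (src, dst):
            if v not in vertices:
                vertices.append(v)
        consumers[src].append(dst)
        pending_inputs[dst] += 1
    # Vertices with no incoming edge can start immediately.
    ready = deque(v for v in vertices if pending_inputs[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for c in consumers[v]:
            pending_inputs[c] -= 1
            if pending_inputs[c] == 0:
                ready.append(c)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

order = topological_order(edges)
```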

I said "operator pipeline" because that is the component of Hive which
runs inside each processor in Tez, and your question was about Hive on
Tez & reuse of containers.

Tez provides something known as Object Registry which allows us to place
objects in a cache, which is cleared either when a new vertex is seen or
when a new DAG is seen.
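
The scoping rule can be sketched as a toy cache. This is not the Tez Object Registry API; the class and method names below are hypothetical, and the assumption that a new DAG also ends the current vertex scope is mine.

```python
class ToyObjectRegistry:
    """Toy model of a per-container cache with vertex and DAG scopes."""

    def __init__(self):
        self._vertex_scope = {}
        self._dag_scope = {}
        self._current_dag = None
        self._current_vertex = None

    def start_task(self, dag_id, vertex_name):
        """Call before each task; drops entries whose scope has ended."""
        if dag_id != self._current_dag:
            # New DAG seen: nothing from the old DAG survives (assumption).
            self._dag_scope.clear()
            self._vertex_scope.clear()
        elif vertex_name != self._current_vertex:
            # Same DAG, new vertex: only vertex-scoped entries are dropped.
            self._vertex_scope.clear()
        self._current_dag = dag_id
        self._current_vertex = vertex_name

    def cache_for_vertex(self, key, value):
        self._vertex_scope[key] = value

    def cache_for_dag(self, key, value):
        self._dag_scope[key] = value

    def get(self, key):
        """Look in the vertex scope first, then the DAG scope."""
        if key in self._vertex_scope:
            return self._vertex_scope[key]
        return self._dag_scope.get(key)
```

A container would call `start_task` before every task it runs, so an object cached for the vertex survives only as long as consecutive tasks belong to the same vertex.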

When we reuse a container to run more than 1 task (so with 600 containers,
you can run 60,000 tasks - unlike MR which spins up 1 container per task),
we get to reuse some of that state which includes things like the Hive
Operator pipeline for a given vertex. This lets us run sub-second tasks,
because no time is wasted in loading classes or starting JVMs during reuse.

A container might run "Map 1 (split 0)", "Map 1 (split 1)", "Map 2 (split
0)" in it - we try to not reload the whole hive SQL operators when
transitioning from "Map 1 (split 0)" to "Map 1 (split 1)".
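
As a rough illustration of the saving, the sketch below counts how often one reused container would have to build an operator pipeline for the three-task sequence above. The function name is hypothetical, and it assumes the pipeline is kept only while consecutive tasks belong to the same vertex.

```python
def count_pipeline_builds(tasks):
    """Count pipeline builds for the (vertex, split) tasks one container runs in order."""
    builds = 0
    cached_vertex = None
    for vertex, _split in tasks:
        if vertex != cached_vertex:
            # Vertex changed (or first task): the pipeline has to be set up.
            builds += 1
            cached_vertex = vertex
    return builds

tasks = [("Map 1", 0), ("Map 1", 1), ("Map 2", 0)]
builds_with_reuse = count_pipeline_builds(tasks)
# Without reuse, every task starts from scratch.
builds_without_reuse = len(tasks)
```

The gap widens as a container runs more tasks from the same vertex, which is the common case when a few hundred containers work through tens of thousands of tasks.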

As a stress test, I have run ~2500 tasks in 50 containers to test those
scenarios (each horizontal lane is a single YARN container, each box is a
task) - http://people.apache.org/~gopalv/query10.svg

If you look at the Operator pipeline at the same time as the Tez vertex
separation, you will notice that data movement within the Hive operators
is only within a single JVM, while the Tez edges can move data between
different processes/machines.
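
A small sketch of that split, in plain Python rather than Hive or Tez code: each pipeline is a chain of in-process function calls, and the value handed from one pipeline to the next stands in for what an edge would move. All operator names and the sample rows are invented for illustration.

```python
from functools import reduce

def make_pipeline(*operators):
    """Chain operators so rows flow through them inside one process."""
    return lambda rows: reduce(lambda acc, op: op(acc), operators, rows)

# Hypothetical operators standing in for SQL operators.
def filter_positive(rows):
    return [r for r in rows if r["qty"] > 0]

def select_item(rows):
    return [r["item"] for r in rows]

def count_items(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

map_pipeline = make_pipeline(filter_positive, select_item)
reduce_pipeline = make_pipeline(count_items)

split = [{"item": "a", "qty": 1}, {"item": "b", "qty": 0}, {"item": "a", "qty": 2}]
# Inside a vertex: plain function calls, no data leaves the process.
edge_payload = map_pipeline(split)
# Between vertices: this hand-off is the part an edge would carry.
result = reduce_pipeline(edge_payload)
```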

Tez does not really bother with the interior details of an operator
pipeline, so the view Tez has is more like this -
http://people.apache.org/~gopalv/q27-dag.svg


But for the sake of illustration, I have drawn out that for TPC-DS Query
27 - http://people.apache.org/~gopalv/q27-plan.svg

The dashed boxes contain Tez vertices and everything within a box is
implemented by Hive as SQL operators.

That's a high-level picture of how a data access engine uses Tez - Tez
handles the data transfers and the actual transformations are entirely
performed by the operators (those are owned by PIG/Hive/Cascading/Flink
etc.)

You can follow the Hive on Tez design docs on the hive wiki for more
details.

Cheers,
Gopal





