Mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rohini Palaniswamy <>
Subject Re: Choosing between {FIFO, file, socket}, shared edge and one-to-n edge
Date Thu, 15 Dec 2016 23:35:18 GMT
2) Shared edge is being
implemented as part of this jira. Jeff was close to done with it, but it
did not get in. Zhiyuan Yang is now working on which allows multiple edges
between two vertices. Zhiyuan will most likely be working on TEZ-391 after
that as it will require some rework after TEZ-1190. The next release Tez
0.9 should hopefully have both done based on what I heard from Hitesh last.

3) This is a piece of cake and we do that heavily in Pig for multi query.
You can have as many edges as you want out of a vertex to multiple other
vertices. The only restriction now is that between two vertices there can
be only one edge which is being addressed in TEZ-1190.


On Thu, Dec 15, 2016 at 3:29 PM, Gopal Vijayaraghavan <>

> > same box, data could be piped between two; when they are on separate
> > boxes, data can be sent over the network.
> The data transfer modes are logically identical, whether the consumer is
> local or remote.
> The actual physical transfer follows either network (for remote) or
> reading directly off the output of the previous task (for local).
> This is not a FIFO stream record-by-record, but materialized into chunks
> (pipelined shuffle) and moved across.
> > (2) An edge property where same output of a vertex
> > could be consumed by multiple down-stream vertices.
> Do you mean vertices or tasks?
> The runtime physical explosion of vertex into tasks produces an expansion
> in the number of physical connections.
> The broadcast edge is the same data going to different tasks of the same
> vertex, like a hash-join.
> In case of wanting to move the same data to more than one logical vertex,
> that does not exist today but the "edge manager plugin" is a user plugin,
> so a Tez user can implement their own edges & outputs in user code to route
> data however they want.
> >  (3) An edge property where one node could generate
> > two different data streams; one stream goes to one sub-sequent
> Yup. See for example the way PIG uses Tez.
> latency-data-processing-with-big-data/19
> or in Hive multi-inserts
> <>
> Cheers,
> Gopal

View raw message