Mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Choosing between {FIFO, file, socket}, shared edge and one-to-n edge
Date Thu, 15 Dec 2016 23:29:22 GMT

> same box, data could be piped between two; when they are on separate
> boxes, data can be sent over the network.

The data transfer modes are logically identical, whether the consumer is local or remote.

The actual physical transfer follows either network (for remote) or reading directly off the
output of the previous task (for local).

This is not a FIFO stream record-by-record, but materialized into chunks (pipelined shuffle)
and moved across.

> (2) An edge property where same output of a vertex
> could be consumed by multiple down-stream vertices.

Do you mean vertices or tasks? 

The runtime physical explosion of vertex into tasks produces an expansion in the number of
physical connections.

The broadcast edge is the same data going to different tasks of the same vertex, like a hash-join.


In case of wanting to move the same data to more than one logical vertex, that does not exist
today but the "edge manager plugin" is a user plugin, so a Tez user can implement their own
edges & outputs in user code to route data however they want.

>  (3) An edge property where one node could generate
> two different data streams; one stream goes to one sub-sequent

Yup. See for example the way PIG uses Tez.

http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data/19

or in Hive multi-inserts

<http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/16>

Cheers,
Gopal




Mime
View raw message