From David Capwell <>
Subject Re: DataMovementType impls
Date Fri, 25 Jul 2014 20:30:42 GMT
It's more of a persisted service at the moment.  I'll take a look at defining
this the way you described.  Thanks!

On Fri, Jul 25, 2014 at 12:11 PM, Siddharth Seth <> wrote:

> Doing something like that would involve writing a new Outputs / Inputs, or
> modifying the existing ones to write to a different sink. We have
> prototyped such changes in the past - to write to HDFS as an example, and
> the changes are not very complicated.
> This involves changing how the existing Outputs write data, modifying
> DataMovementEvent payloads to contain relevant data (where to fetch from),
> and changing the Inputs to process this DataMovement payload to actually
> fetch the data.
> One thing to look at, though: if you're writing directly to your
> own service, will the data be persisted there until it's read by the
> downstream vertex (consumer and producer tasks can run independently of
> each other), or does the data effectively need to be streamed through
> (consumer and producer tasks must run at the same time)?
> On Fri, Jul 25, 2014 at 12:03 PM, David Capwell <>
> wrote:
>> I was looking into letting two vertexes that share data choose whether
>> to share it over disk or over our internal system (that is, over the
>> network).  In cases where data persistence isn't needed and the vertexes
>> can run on the same node, this system would be bypassed.
>> The use-case isn't really fleshed out at the moment.  Looking to
>> prototype to see how it would all play together.
>> On Fri, Jul 25, 2014 at 11:53 AM, Siddharth Seth <>
>> wrote:
>>> DataSourceType isn't really used at the moment. Eventually, it would
>>> serve as a scheduling and failure recovery mechanism rather than
>>> deciding how data gets persisted between stages. (This property could
>>> potentially be used by some of the Inputs/Outputs to alter the way they
>>> persist data - but that isn't currently on the cards).
>>> This primarily applies to data written on Edges - are you somehow
>>> looking to modify that, or use the data generated by an intermediate Vertex
>>> in a separate process?
>>> Getting a little more info on the use case would be helpful in figuring
>>> out how Tez can be used. Are you looking to read data from this internal
>>> service, publish to it, or something else?
>>> On Fri, Jul 25, 2014 at 11:36 AM, David Capwell <>
>>> wrote:
>>>> Sorry, copy/paste issue.  I was looking at DataSourceType and trying to
>>>> see how data gets saved and read between tasks.  The use-case is that we
>>>> have an internal service that might be helpful for us, so wanted to
>>>> prototype how feasible it would be to share data over a different mechanism.
>>>> On Fri, Jul 25, 2014 at 10:36 AM, Hitesh Shah <>
>>>> wrote:
>>>>> DataMovementEvent is a construct defined for an Input/Output pair to
>>>>> communicate with each other. The actual information being passed between
>>>>> the two is not understood by the framework, except that it is a byte
>>>>> payload to be handed off from the source to the destination. Users are
>>>>> not expected to create derived classes of this type, but to use the payload
>>>>> within the object to pass information around.
>>>>> For example, most of the currently implemented Input-Output pairs
>>>>> (for shuffle/broadcast edges) use the payload to pass the URL specifying
>>>>> the location of the data to be fetched.
>>>>> thanks
>>>>> -- Hitesh
>>>>> On Jul 25, 2014, at 10:23 AM, David Capwell <>
>>>>> wrote:
>>>>> > So I'm going through the code and am not sure where the real logic of
>>>>> > DataMovementType gets used.
>>>>> >
>>>>> > I see that DagTypeConverts can convert between DataMovementType and
>>>>> > PlanEdgeDataMovementType, but once that happens I don't really see a
>>>>> > way to implement any of these types.  Where are the implementations
>>>>> > defined? Is there any way to define my own impls?
>>>>> >
>>>>> > Thanks for your time.
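
A minimal sketch of the payload idea discussed in this thread: the framework treats the DataMovementEvent payload as opaque bytes, so a custom Output could encode "where to fetch from" and the matching Input could decode it. The code below shows only that encode/decode round trip in plain Java and uses no Tez classes. The class FetchLocationPayload, its fields, the byte layout, and the myservice:// URI are all hypothetical, chosen for illustration rather than taken from Tez.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical payload telling a consumer where to fetch a producer's output. */
final class FetchLocationPayload {
    final String location; // assumed: a URI understood by the custom service
    final int partition;   // assumed: which partition of the output to fetch

    FetchLocationPayload(String location, int partition) {
        this.location = location;
        this.partition = partition;
    }

    /** Producer (Output) side: serialize the fetch location into opaque bytes. */
    static byte[] encode(String location, int partition) {
        byte[] loc = location.getBytes(StandardCharsets.UTF_8);
        // Layout (an assumption): 4-byte partition, 4-byte length, UTF-8 location.
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + loc.length);
        buf.putInt(partition);
        buf.putInt(loc.length);
        buf.put(loc);
        return buf.array();
    }

    /** Consumer (Input) side: recover the fetch location from the bytes. */
    static FetchLocationPayload decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int partition = buf.getInt();
        byte[] loc = new byte[buf.getInt()];
        buf.get(loc);
        return new FetchLocationPayload(new String(loc, StandardCharsets.UTF_8), partition);
    }

    public static void main(String[] args) {
        byte[] payload = encode("myservice://node-7:9000/dag1/vertex1/task3", 2);
        FetchLocationPayload p = decode(payload);
        System.out.println(p.location + " partition=" + p.partition);
    }
}
```

In a real Input/Output pair, the bytes returned by encode would travel as the event payload, and the Input would call something like decode before contacting the service. Whether the service persists the data or streams it, as raised earlier in the thread, decides whether the consumer can fetch after the producer has already exited.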
