From Johannes Zillmann <jzillm...@googlemail.com>
Subject Re: Task count
Date Mon, 28 Jul 2014 15:51:53 GMT
Hey Gopal,


On 25 Jul 2014, at 19:09, Gopal V <gopalv@apache.org> wrote:

> On 7/25/14, 3:20 PM, Johannes Zillmann wrote:
> 
>> Ok, will try this, thanks!
>> Can you tell me the JIRA number so I can track progress on that? Do you know for which
>> version of Tez this is planned?
>> 
>> And no, no more use cases currently!
> 
> That JIRA is a fairly complex scale problem, so I will take a while to commit it, because
> it needs extensive testing.
> 
> But I could possibly split the vertex parallelism feature out into its own JIRA,
> to satisfy the common requirements.

That would be awesome! Should I create it, or will you?

> 
> Just to clarify that again, this WIP patch provides the vertex parallelism for the current
> vertex to each task.
> 
> So I'm paraphrasing your use-case as
> 
> "Map 1" outputs sample-count/parallelism("Map 1")
> 
> And
> 
> "Reducer 2" uses parallelism("Reducer 2") to decide details of runtime.
> 
> My patch waits until the runtime parallelism is set for "Map 1" and "Reducer 2" - input
> tasks can use "-1" at DAG build time, to have the vertex manager set it up according to cluster
> capacity/split sizes. Aggregation tasks can set it up between a min/max range, according to
> data output sizes from the preceding vertex.
> 
> That is what Bikas confirmed on.

Ok, sounds good. Tez is fairly new to me, so I'm still learning!
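
To check my understanding of the two cases described above, here is a minimal sketch in plain Java. It does not use any Tez API: the class `ParallelismSketch`, the `AUTO` constant, both method names and all parameters (`bytesPerSplit`, `availableSlots`, `bytesPerTask`) are hypothetical, and the formulas are a simplified assumption about how a vertex manager might pick the numbers, not what Tez actually does.

```java
// Hypothetical sketch, not Tez code: illustrates "-1 means decide at runtime"
// for input vertices and a min/max clamp for aggregation vertices.
final class ParallelismSketch {
    // Assumed marker for "let the vertex manager decide" at DAG build time.
    static final int AUTO = -1;

    // Input vertex: keep an explicit value, otherwise derive one from the
    // split count, capped by the (assumed) number of available slots.
    static int inputParallelism(int configured, long totalInputBytes,
                                long bytesPerSplit, int availableSlots) {
        if (configured != AUTO) {
            return configured;
        }
        long bySplits = ceilDiv(totalInputBytes, bytesPerSplit);
        return (int) Math.max(1, Math.min(bySplits, availableSlots));
    }

    // Aggregation vertex: size by the output of the preceding vertex,
    // then clamp the result into the [min, max] range.
    static int aggregationParallelism(long upstreamOutputBytes, long bytesPerTask,
                                      int min, int max) {
        long wanted = ceilDiv(upstreamOutputBytes, bytesPerTask);
        return (int) Math.max(min, Math.min(wanted, max));
    }

    // Integer division rounded up, for non-negative a and positive b.
    private static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }
}
```

Either way, a task only sees the final number once the runtime parallelism is set, which is why the patch waits for it.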

> 
> It seems to me that "scan until you find at least X records" for a sparse sampling case
> is something that can benefit a lot from a custom vertex manager plugin.
> 
> So I would like to hear more about your use case and see if the work that goes into scan
> short-circuiting/avoidance in "select * from table where x < 10 limit 10;" in hive would
> help you as well.

I'm not sure my use case is really short-circuiting. The main use is just to sample x out of y
records, while still streaming all y records through the pipeline.
Which ticket covers that Hive work?
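
A minimal sketch of that use case, in plain Java and independent of any Tez API. `TaskSampler` is a hypothetical helper: the task index and task count are simply passed in by the caller, keeping the first records up to the quota is an assumption (a random pick would work too), and rounding the per-task quota up is my choice so that the quotas add up to at least the desired sample count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Tez: taskIndex and taskCount are supplied
// by the caller from wherever the runtime exposes them.
final class TaskSampler<T> {
    private final int quota;
    private final boolean lastTask;
    private final List<T> sample = new ArrayList<>();

    TaskSampler(int desiredSampleCount, int taskIndex, int taskCount) {
        if (desiredSampleCount < 0 || taskCount <= 0
                || taskIndex < 0 || taskIndex >= taskCount) {
            throw new IllegalArgumentException(
                "invalid sample count, task index or task count");
        }
        // Assumption: round up, so the per-task quotas sum to at least
        // the desired overall sample count.
        this.quota = (desiredSampleCount + taskCount - 1) / taskCount;
        this.lastTask = taskIndex == taskCount - 1;
    }

    // Keeps the record for the sample while under quota; every record is
    // forwarded downstream regardless, so nothing is short-circuited.
    void accept(T record, Consumer<? super T> downstream) {
        if (sample.size() < quota) {
            sample.add(record);
        }
        downstream.accept(record);
    }

    int quota() { return quota; }

    // True for the task with the highest index, which can run extra logic.
    boolean isLastTask() { return lastTask; }

    List<T> sample() { return sample; }
}
```

The point is that the task count is needed only to compute the quota and to detect the last task; all y records still pass through.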

Johannes

> 
> Cheers,
> Gopal
> 
>> On 24 Jul 2014, at 20:38, Bikas Saha <bikas@hortonworks.com> wrote:
>> 
>>> The patch should work for all types of vertices because it gives each task
>>> the total number of tasks for its vertex. Do you have any other use case?
>>> 
>>> Bikas
>>> 
>>> -----Original Message-----
>>> From: Johannes Zillmann [mailto:jzillmann@googlemail.com]
>>> Sent: Thursday, July 24, 2014 6:58 AM
>>> To: Gopal V
>>> Cc: user@tez.apache.org
>>> Subject: Re: Task count
>>> 
>>> Hey Gopal,
>>> 
>>> I'm using the task count basically for 2 things (in MR, for both the map stage
>>> and the reduce stage):
>>> - each task samples its output data up to a certain number. This number is
>>> the desired sample count divided by the number of tasks
>>> - also we use the task count in some scenarios to let the last task (of a
>>> stage or a vertex) do some extra logic. That works in combination with the
>>> task index.
>>> 
>>> Looking at your patch it looks like it will do the job for kind of the
>>> map-like vertex but not for the aggregation vertex, right ?
>>> Also what jira issue is that ?
>>> 
>>> best
>>> Johannes
>>> 
>>> On 24 Jul 2014, at 07:40, Gopal V <gopalv@apache.org> wrote:
>>> 
>>>> On 7/23/14, 6:07 PM, Johannes Zillmann wrote:
>>>>> Hey Tez team,
>>>>> 
>>>>> is there some way to get the task count within a vertex from within a task ?
>>>>> Some equivalent to mapred.map.tasks and mapred.reduce.tasks for map-reduce ?
>>>> 
>>>> Could you explain the use-case for this particular requirement?
>>>> 
>>>> I intend to add the vertex parallelism to the task context as part of
>>>> one of my WIP branches.
>>>> 
>>>> I uploaded my base patch-set as is (including the TODO markers).
>>>> 
>>>> 
>>>> https://issues.apache.org/jira/secure/attachment/12657536/TEZ-broadcast-shuffle%2Bvertex-parallelism.patch
>>>> 
>>>> If you can explain what you are actually looking to do with this
>>>> information, perhaps I can roll the two feature reqs together.
>>>> 
>>>> Cheers,
>>>> Gopal
>>>> 
>>> 
>> 
>> 
>> 
> 

