From Hitesh Shah <hit...@apache.org>
Subject Re: get input size for each task
Date Wed, 09 Jul 2014 14:47:37 GMT
For root vertices ( ones that read from HDFS ), the number of tasks can be decided based on the
size of the input data, though in most cases it is now decided based on a combination of available
cluster resources and the input data. There are some special cases in Hive where a
vertex has an HDFS input as well as an edge from another vertex, and there the decision
making is non-trivial. For intermediate vertices, the number of tasks is determined either by the
user or at runtime based on the size of the inbound data.
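[To make the sizing logic above concrete, here is a minimal sketch of how a runtime could pick a task count from inbound data size and cluster capacity. The names `decideNumTasks`, `bytesPerTask`, and `availableContainers` are illustrative assumptions, not Tez's actual API.]

```java
// Simplified sketch: one task per chunk of inbound data, capped by
// available cluster resources. Illustrative only, not Tez internals.
public class ParallelismSketch {
    static int decideNumTasks(long inboundBytes, long bytesPerTask,
                              int availableContainers) {
        // One task per bytesPerTask of inbound data, rounded up.
        int bySize = (int) Math.max(1, (inboundBytes + bytesPerTask - 1) / bytesPerTask);
        // Cap parallelism at what the cluster can actually run.
        return Math.min(bySize, availableContainers);
    }

    public static void main(String[] args) {
        // 10 GB of inbound data at 256 MB per task would want 40 tasks,
        // but only 20 containers are available.
        long tenGb = 10L * 1024 * 1024 * 1024;
        long perTask = 256L * 1024 * 1024;
        System.out.println(decideNumTasks(tenGb, perTask, 20)); // prints 20
    }
}
```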

In any case, the input data is understood by the InputFormat, and that specialized class decides
how to “split” the data set. Furthermore, based on additional grouping logic, multiple splits
can be combined together, based on their data's location, to define a single unit of work
for a given task. This grouping is driven by a min/max data-size range and available
cluster resources. All in all, there is no guarantee that tasks will have equal input
sizes. Roughly equal sizes are the common case for FileInputFormat, which creates one split
per block, as long as grouping is not enabled.
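[The grouping described above can be sketched roughly as follows. This is an illustrative simplification, not Tez's actual grouping implementation: locality and the min/max range are collapsed into a single target size, and a leftover group can end up smaller, which is why equal task input sizes are not guaranteed.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of split grouping: combine per-block splits
// into larger units of work that each aim for a desired size.
public class GroupingSketch {
    static List<Long> groupSplits(List<Long> splitSizes, long desiredBytesPerGroup) {
        List<Long> groups = new ArrayList<>();
        long current = 0;
        for (long s : splitSizes) {
            current += s;
            if (current >= desiredBytesPerGroup) {
                groups.add(current); // group reached the target size
                current = 0;
            }
        }
        if (current > 0) {
            groups.add(current); // leftover splits form a smaller last group
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five 128 MB block-sized splits grouped toward ~256 MB per task:
        // two full groups plus a smaller leftover group.
        long mb = 1024 * 1024;
        List<Long> splits = List.of(128 * mb, 128 * mb, 128 * mb, 128 * mb, 128 * mb);
        System.out.println(groupSplits(splits, 256 * mb).size()); // prints 3
    }
}
```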

thanks
— Hitesh


On Jul 8, 2014, at 8:29 PM, Grandl Robert <rgrandl@yahoo.com> wrote:

> Hitesh,
> 
> With respect to the comment below: so a vertex will have a number of tasks, which is
decided strictly based on the input data the vertex has to process? Also, is it guaranteed
that every task will have the same input size (all except the last one, probably)?
> 
> Thanks,
> Robert
> 
> On Monday, July 7, 2014 10:37 AM, Hitesh Shah <hitesh@apache.org> wrote:
> 
> 
> Correct. The hierarchy is dag -> vertex -> task -> task attempt ( each relationship
being a 1:N ).
> Vertex defines a stage of common processing logic applied on a parallel data set. A task
represents processing of a subset of the data set.
> 
> thanks
> — Hitesh
> 
> On Jul 7, 2014, at 9:40 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
> 
> > Another dumb question: A vertex can have multiple tasks (not task attempts) for
different input blocks, right? So a vertex entity is kind of a stage abstraction, not a task
abstraction, right?
> > 
> 
> 

