Thanks a lot for your detailed answer.
When you say "it is now decided based on a combination of available cluster resources as well
as the input data": what do you mean by available cluster resources? Total resources in the
cluster (like number of nodes * capacity of each node), or the resources instantaneously
available given the current workload on each node?
So to be clear: even for the same job and the same total input size, the number of tasks and
the input size per task may differ across consecutive runs?
Also, it seems the number of tasks for a vertex can change at runtime (not duplicates,
but distinct tasks)?
thanks,
Robert
On Wednesday, July 9, 2014 7:48 AM, Hitesh Shah <hitesh@apache.org> wrote:
For root vertices (ones which read from HDFS), the no. of tasks can be decided based on
the size of the input data, though in most cases it is now decided based on a combination
of available cluster resources as well as the input data. There are some special cases in
Hive where a vertex could have an HDFS input as well as an edge from another vertex, where
the decision making is non-trivial. For intermediate vertices, the no. of tasks is determined
either by the user or at runtime based on the size of the inbound data.
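The interplay Hitesh describes can be caricatured in plain Java. This is only a toy sketch, not Tez's actual algorithm; `decideNumTasks`, `bytesPerTask`, and `availableContainers` are made-up names standing in for the two inputs he mentions (input data size and available cluster resources):

```java
// Toy illustration only: NOT Tez's real parallelism logic, just the idea that
// the task count can depend on both input size and available resources.
public class ParallelismSketch {
    // Hypothetical helper: pick a task count from input size and free containers.
    static int decideNumTasks(long inputBytes, long bytesPerTask, int availableContainers) {
        // Tasks needed to cover the input at the desired bytes-per-task.
        int byData = (int) Math.max(1, (inputBytes + bytesPerTask - 1) / bytesPerTask);
        // Cap by what the cluster can run concurrently (a simplification).
        return Math.min(byData, availableContainers);
    }

    public static void main(String[] args) {
        // 10 GB of input at 256 MB per task would want 40 tasks,
        // but only 16 containers are free, so we settle for 16.
        System.out.println(decideNumTasks(10L << 30, 256L << 20, 16));
    }
}
```

The point of the sketch is only that the same input size can yield a different task count when the available resources differ, which speaks to Robert's question about consecutive runs.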
In any case, the input data is understood by the InputFormat, and that specialized class decides
how to “split” the data set. Furthermore, based on additional grouping logic, multiple
splits could be combined together, based on their data's location, to define a single unit of
work for a given task. This grouping is governed by a min/max data size range and available
cluster resources. All in all, there is no guarantee that the splits will be of equal input
size. Equal-sized splits are the common case for FileInputFormat, which creates one split per
block, as long as grouping is not enabled.
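To make the min/max grouping idea concrete, here is a simplified, hypothetical sketch. Tez's real split grouper also weighs data locality and cluster resources; this version only packs block-sized splits by size:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, illustrative split grouping: NOT Tez's actual grouper, which
// also considers locality and available cluster resources.
public class SplitGroupingSketch {
    // Greedily pack splits into groups of at least minSize bytes,
    // never exceeding maxSize. Each group becomes one task's unit of work.
    static List<List<Long>> group(List<Long> splitSizes, long minSize, long maxSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long s : splitSizes) {
            // Close the current group if adding this split would blow past maxSize.
            if (!current.isEmpty() && currentBytes + s > maxSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(s);
            currentBytes += s;
            // Close the group once it reaches the minimum desired size.
            if (currentBytes >= minSize) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
        }
        if (!current.isEmpty()) groups.add(current);  // leftover partial group
        return groups;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // Four 128 MB block splits, min 256 MB per task: two tasks of two blocks each.
        System.out.println(group(List.of(128 * mb, 128 * mb, 128 * mb, 128 * mb),
                                 256 * mb, 512 * mb).size()); // prints 2
    }
}
```

Note how the last group can end up smaller than the others, which is one reason tasks are not guaranteed equal input sizes.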
thanks
— Hitesh
On Jul 8, 2014, at 8:29 PM, Grandl Robert <rgrandl@yahoo.com> wrote:
> Hitesh,
>
> With respect to the below comment: So a vertex will have a number of tasks, which
> is decided strictly based on the input data the vertex has to process? Also, is it
> guaranteed that every task will have the same input size? (all except possibly the last one).
>
> Thanks,
> Robert
>
> On Monday, July 7, 2014 10:37 AM, Hitesh Shah <hitesh@apache.org> wrote:
>
>
> Correct. The hierarchy is dag > vertex > task > task attempt (each relationship
> being 1:N).
> Vertex defines a stage of common processing logic applied on a parallel data set.
> A task represents processing of a subset of the data set.
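The 1:N hierarchy described above can be sketched as a trivial data model (illustrative names only, not Tez's API):

```java
import java.util.List;

// Minimal sketch of the 1:N containment: dag > vertex > task > task attempt.
// Names are illustrative; Tez's real classes carry far more state.
public class HierarchySketch {
    record TaskAttempt(int attemptId) {}                 // one execution try of a task
    record Task(int taskId, List<TaskAttempt> attempts) {} // one task, N attempts
    record Vertex(String name, List<Task> tasks) {}        // one vertex, N tasks
    record Dag(String name, List<Vertex> vertices) {}      // one dag, N vertices

    public static void main(String[] args) {
        Dag dag = new Dag("example", List.of(
            new Vertex("tokenizer", List.of(
                new Task(0, List.of(new TaskAttempt(0))),
                // second task needed a retry, hence two attempts
                new Task(1, List.of(new TaskAttempt(0), new TaskAttempt(1)))))));
        System.out.println(dag.vertices().get(0).tasks().size()); // tasks in the vertex
    }
}
```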
>
> thanks
> — Hitesh
>
> On Jul 7, 2014, at 9:40 AM, Grandl Robert <rgrandl@yahoo.com> wrote:
>
> > Another dumb question: A vertex can have multiple tasks (not task attempts) for
> > different input blocks, right? So a vertex entity is kind of a stage abstraction,
> > not a task abstraction, right?
> >
>
>
