Mailing list archives

From Hitesh Shah
Subject Re: streamed splitting
Date Thu, 12 Mar 2015 14:44:38 GMT
Hello Johannes, 

This is something we have discussed quite often but have not gotten around to implementing.
There might be an open JIRA related to “pipelining” of splits. If you cannot find one, please
go ahead and create one.

The general issues with this are:
   - how to handle dynamic creation of tasks as splits get created
   - how to decide how many splits, and which splits, a single task should handle
   - this involves some form of grouping to allocate newly created splits optimally based
on available containers. Group sizes could differ, e.g. a single group could consist
of either 5 data-local splits, 2 rack-local splits, or 1 off-rack split when assigned dynamically
to a given container.
   - the single-task limit also plays into how you handle fault tolerance and recovery
   - given that split creation is now dynamic, if the AM crashes when not all splits had been
created but some had already been processed, the next attempt needs to recover in such a way
that correctness of data processing is ensured.
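
To make the grouping point a bit more concrete, here is a rough sketch in plain Java of how a group could be picked for a container that just became available. None of this is Tez API: SplitGrouper, Split, nextGroup and maxSplits are hypothetical names, and the limits 5 / 2 / 1 are just the example numbers from above. It also assumes a group never mixes locality levels, which a real implementation might well relax.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not part of Tez: choose the next group of pending
// splits for one container, preferring the closest locality level.
final class SplitGrouper {
    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    static final class Split {
        final String id;
        final String host;
        final String rack;

        Split(String id, String host, String rack) {
            this.id = id;
            this.host = host;
            this.rack = rack;
        }
    }

    // Example limits from the discussion: 5 data-local, 2 rack-local, 1 off-rack.
    static int maxSplits(Locality level) {
        switch (level) {
            case DATA_LOCAL: return 5;
            case RACK_LOCAL: return 2;
            default:         return 1;
        }
    }

    // Locality of a split relative to the container's host and rack.
    static Locality locality(Split s, String host, String rack) {
        if (s.host.equals(host)) return Locality.DATA_LOCAL;
        if (s.rack.equals(rack)) return Locality.RACK_LOCAL;
        return Locality.OFF_RACK;
    }

    // Removes and returns the next group from the mutable pending list.
    // Tries data-local first, then rack-local, then off-rack; returns an
    // empty list if nothing is pending.
    static List<Split> nextGroup(List<Split> pending, String host, String rack) {
        for (Locality level : Locality.values()) {
            List<Split> group = new ArrayList<>();
            Iterator<Split> it = pending.iterator();
            while (it.hasNext() && group.size() < maxSplits(level)) {
                Split s = it.next();
                if (locality(s, host, rack) == level) {
                    group.add(s);
                    it.remove();
                }
            }
            if (!group.isEmpty()) {
                return group;
            }
        }
        return new ArrayList<>();
    }
}
```

With dynamic split creation, something like nextGroup would be called each time a container frees up, against whatever splits have been generated so far. The sketch ignores the harder parts listed above, such as when to wait for more splits instead of handing out a small group, and how to record group assignments so a recovering AM can tell which splits were already processed.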

— Hitesh

On Mar 12, 2015, at 2:38 AM, Johannes Zillmann wrote:

> Hey guys,
> dumb question: with Tez, can I have an input initializer that doesn’t need to create
> every split before processing of the already-created splits starts?
> That is, if I have a lot of splits and my splitting process takes a long time, can the workers
> start working while the splitting is still in progress?
> Johannes
