Mailing list archives

From Hitesh Shah <hit...@apache.org>
Subject Re: Tez Job fails - waiting for AM container to be allocated
Date Sat, 18 Jun 2016 17:38:01 GMT
Hi Ananda,

Yes, it looks like the RM assigned a container for the Tez AM. The next step would be to search for
“container_e54_1466115469995_0142_01_000001” in the NodeManager logs on host usw2stdpwo12.glassdoor.local.
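
To make the reading of the RM log concrete, here is a minimal Python sketch that pulls the container ID, host, and requested resources out of the SchedulerNode line quoted further down in this thread. The regular expression and the function name parse_assignment are illustrative assumptions written against that single log line; other Hadoop versions may format the message differently.

```python
import re

# Log line copied from the RM log quoted in this thread, joined onto one line.
LINE = (
    "2016-06-17 19:04:50,406 INFO  scheduler.SchedulerNode "
    "(SchedulerNode.java:allocateContainer(154)) - Assigned container "
    "container_e54_1466115469995_0142_01_000001 of capacity <memory:5120, vCores:1> "
    "on host usw2stdpwo12.glassdoor.local:45454, which has 1 containers, "
    "<memory:5120, vCores:1> used and <memory:22528, vCores:6> available after allocation"
)

# Assumption: this pattern only targets the message shape shown above.
ASSIGNED = re.compile(
    r"Assigned container (?P<container>container_\S+) of capacity "
    r"<memory:(?P<mem>\d+), vCores:(?P<vcores>\d+)> "
    r"on host (?P<host>[^:,\s]+):(?P<port>\d+)"
)


def parse_assignment(line):
    """Return the assignment details as a dict, or None if the line does not match."""
    match = ASSIGNED.search(line)
    if match is None:
        return None
    return {
        "container": match.group("container"),
        "host": match.group("host"),
        "port": int(match.group("port")),
        "memory_mb": int(match.group("mem")),
        "vcores": int(match.group("vcores")),
    }


if __name__ == "__main__":
    print(parse_assignment(LINE))
```

The host and container ID returned here are the two values needed for the NodeManager log search.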


Also, did the app logs of application_1466115469995_0142 shed any light (obtained via
bin/yarn logs -applicationId application_1466115469995_0142)?
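
If the NodeManager logs or the saved output of the yarn logs command are large, a small script can do the searching. The following is a rough Python sketch, not part of any Hadoop tooling; the function names, the "*.log*" file pattern, and whatever log directory is passed in are assumptions, since log locations vary by installation.

```python
from pathlib import Path


def find_mentions(log_path, needle):
    """Yield (line_number, line) for each line of log_path that contains needle."""
    with open(log_path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line:
                yield number, line.rstrip("\n")


def scan_logs(log_dir, needle, pattern="*.log*"):
    """Search every plain-text file matching pattern directly under log_dir.

    Returns a list of (file_name, line_number, line) tuples.
    Compressed rotated logs are not decompressed by this sketch.
    """
    hits = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if path.is_file():
            for number, line in find_mentions(path, needle):
                hits.append((path.name, number, line))
    return hits
```

Example usage, with a hypothetical log directory: scan_logs("/path/to/nodemanager/logs", "container_e54_1466115469995_0142_01_000001"). An empty result on the assigned host would suggest the NodeManager never received the launch request.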

— Hitesh

> On Jun 17, 2016, at 11:31 PM, Anandha L Ranganathan <analog.sony@gmail.com> wrote:
> 
> Hitesh,
> 
> This is the information I see in the RM logs. There are enough resources available on that NM.
> 
> 
> 2016-06-17 19:04:50,406 INFO  scheduler.SchedulerNode (SchedulerNode.java:allocateContainer(154)) - Assigned container container_e54_1466115469995_0142_01_000001 of capacity <memory:5120, vCores:1> on host usw2stdpwo12.glassdoor.local:45454, which has 1 containers, <memory:5120, vCores:1> used and <memory:22528, vCores:6> available after allocation
> 2016-06-17 19:04:50,406 INFO  capacity.LeafQueue (LeafQueue.java:assignContainer(1633)) - assignedContainer application attempt=appattempt_1466115469995_0142_000001 container=Container: [ContainerId: container_e54_1466115469995_0142_01_000001, NodeId: usw2stdpwo12.glassdoor.local:45454, NodeHttpAddress: usw2stdpwo12.glassdoor.local:8042, Resource: <memory:5120, vCores:1>, Priority: 0, Token: null, ] queue=default: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:10240, vCores:2>, usedCapacity=0.61731374, absoluteUsedCapacity=0.12345679, numApps=3, numContainers=2 clusterResource=<memory:82944, vCores:21> type=OFF_SWITCH
> 2016-06-17 19:04:50,407 INFO  security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : usw2stdpwo12.glassdoor.local:45454 for container
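
The queue figures in the LeafQueue line above can be cross-checked with a little arithmetic. The Python sketch below recomputes usedCapacity and absoluteUsedCapacity from usedResources, clusterResource, and absoluteCapacity. The function name is hypothetical, and truncating the queue share to whole megabytes is an inference from the logged numbers, not something stated in the thread.

```python
def queue_usage(used_mb, cluster_mb, absolute_capacity):
    """Recompute queue usage ratios from the values printed in the LeafQueue line."""
    # Assumption: the queue's share is truncated to whole MB (0.2 * 82944 -> 16588).
    queue_mb = int(absolute_capacity * cluster_mb)
    return {
        "queue_mb": queue_mb,
        "used_capacity": used_mb / queue_mb,
        "absolute_used_capacity": used_mb / cluster_mb,
        # Headroom against the configured capacity only; a queue may be allowed
        # to grow beyond it, which this sketch ignores.
        "queue_headroom_mb": queue_mb - used_mb,
    }


if __name__ == "__main__":
    # Values taken from the log: usedResources=<memory:10240>,
    # clusterResource=<memory:82944>, absoluteCapacity=0.2
    print(queue_usage(10240, 82944, 0.2))
```

With these inputs the ratios agree with the logged usedCapacity=0.61731374 and absoluteUsedCapacity=0.12345679 to about seven digits, which suggests the default queue was using roughly 62% of its configured share once this container was assigned.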
> 
> On Fri, Jun 17, 2016 at 6:38 PM, Hitesh Shah <hitesh@apache.org> wrote:
> -dev@tez for now.
> 
> Hello Anandha,
> 
> The usual cause of this is a lack of resources, e.g. no cluster capacity to launch the AM, queue configs not allowing another AM to launch, or the memory size configured for the AM being so large that it cannot be scheduled on any existing node.
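
The three causes listed above can be expressed as simple checks. This Python sketch is a deliberately simplified model, not the scheduler's actual logic; every parameter name is hypothetical, and the per-queue AM limit stands in for whatever the queue configuration allows (in the CapacityScheduler this is usually influenced by yarn.scheduler.capacity.maximum-am-resource-percent).

```python
def am_launch_blockers(am_mb, node_free_mb, cluster_free_mb,
                       queue_am_limit_mb, queue_am_used_mb):
    """Return the reasons this simplified model says an AM cannot be launched.

    am_mb:             memory requested for the AM container
    node_free_mb:      list of free memory per node
    cluster_free_mb:   total free memory in the cluster
    queue_am_limit_mb: memory the queue allows AMs to use in total
    queue_am_used_mb:  memory already held by AMs in that queue
    """
    blockers = []
    if am_mb > cluster_free_mb:
        blockers.append("no cluster capacity")
    if queue_am_used_mb + am_mb > queue_am_limit_mb:
        blockers.append("queue AM limit reached")
    if not any(free >= am_mb for free in node_free_mb):
        blockers.append("AM too large for any node")
    return blockers


if __name__ == "__main__":
    # Illustrative numbers only; 5120 MB is the AM size seen in this thread.
    print(am_launch_blockers(5120, [4096, 2048], 6144, 8192, 0))
```

An empty list means none of the three conditions applies under this model, in which case the RM and NodeManager logs are the next place to look.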
> 
> Can you search for the string “1466115469995_0142” within the ResourceManager logs? That should shed some more light on what is going on.
> 
> thanks
> — Hitesh
> 
> 
> > On Jun 17, 2016, at 6:30 PM, Anandha L Ranganathan <analog.sony@gmail.com> wrote:
> >
> > Yes, sufficient resources are available for that job. No other jobs are running; this is the only one.
> >
> >
> >
> > On Fri, Jun 17, 2016 at 5:16 PM, Jeff Zhang <zjffdu@gmail.com> wrote:
> > Please check in the RM UI whether you have sufficient resources for your app.
> >
> >
> > On Sat, Jun 18, 2016 at 7:35 AM, Anandha L Ranganathan <analog.sony@gmail.com> wrote:
> > I am upgrading one of our clusters from HDP 2.2 to HDP 2.4.0.
> >
> >
> >
> > The status I see in the Application monitoring URL is
> >
> > YARN Application Status: ACCEPTED: waiting for AM container to be
> > allocated, launched and register with RM.  But when we submit an MR job,
> > it runs fine.
> >
> > It waits in that state for some time (300 seconds), then dies, and the service
> > check fails.  All nodes are live and in Active status.
> >
> >
> >
> > We tried to run the job manually, and the job stops at this point.
> >
> > hadoop --config /usr/hdp/2.4.0.0-169/hadoop/conf jar
> > /usr/hdp/current/tez-client/tez-examples*.jar orderedwordcount
> > /tmp/tezsmokeinput/sample-tez-test /tmp/tezsmokeoutput1/
> > WARNING: Use "yarn jar" to launch YARN applications.
> > 16/06/17 19:04:47 INFO client.TezClient: Tez Client Version: [
> > component=tez-api, version=0.7.0.2.4.0.0-169,
> > revision=3c1431f45faaca982ecc8dad13a107787b834696,
> > SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git,
> > buildTime=20160210-0711 ]
> > 16/06/17 19:04:47 INFO impl.TimelineClientImpl: Timeline service
> > address: http://usw2stdpma03.glassdoor.local:8188/ws/v1/timeline/
> > 16/06/17 19:04:48 INFO client.RMProxy: Connecting to ResourceManager at
> > usw2stdpma03.glassdoor.local/172.17.212.107:8050
> > 16/06/17 19:04:48 INFO client.TezClient: Using
> > org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to
> > manage Timeline ACLs
> > 16/06/17 19:04:48 INFO impl.TimelineClientImpl: Timeline service
> > address: http://usw2stdpma03.glassdoor.local:8188/ws/v1/timeline/
> > 16/06/17 19:04:49 INFO examples.OrderedWordCount: Running OrderedWordCount
> > 16/06/17 19:04:49 INFO client.TezClient: Submitting DAG application
> > with id: application_1466115469995_0142
> > 16/06/17 19:04:49 INFO client.TezClientUtils: Using tez.lib.uris value
> > from configuration: /hdp/apps/2.4.0.0-169/tez/tez.tar.gz
> > 16/06/17 19:04:49 INFO client.TezClient: Stage directory
> > /tmp/root/staging doesn't exist and is created
> > 16/06/17 19:04:49 INFO client.TezClient: Tez system stage directory
> > hdfs://dfs-nameservices/tmp/root/staging/.tez/application_1466115469995_0142
> > doesn't exist and is created
> > 16/06/17 19:04:49 INFO acls.ATSHistoryACLPolicyManager: Created
> > Timeline Domain for History ACLs,
> > domainId=Tez_ATS_application_1466115469995_0142
> > 16/06/17 19:04:50 INFO client.TezClient: Submitting DAG to YARN,
> > applicationId=application_1466115469995_0142,
> > dagName=OrderedWordCount, callerContext={ context=TezExamples,
> > callerType=null, callerId=null }
> > 16/06/17 19:04:50 INFO impl.YarnClientImpl: Submitted application
> > application_1466115469995_0142
> > 16/06/17 19:04:50 INFO client.TezClient: The url to track the Tez AM:
> > http://usw2stdpma03.glassdoor.local:8088/proxy/application_1466115469995_0142/
> > 16/06/17 19:04:50 INFO impl.TimelineClientImpl: Timeline service address:
> > http://usw2stdpma03.glassdoor.local:8188/ws/v1/timeline/
> > 16/06/17 19:04:50 INFO client.RMProxy: Connecting to ResourceManager at
> > usw2stdpma03.glassdoor.local/172.17.212.107:8050
> > 16/06/17 19:04:51 INFO client.DAGClientImpl: Waiting for DAG to start running
> >
> >
> >
> > How do I fix this problem?
> >
> > Thanks
> > Anand
> >
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> >
> 
> 

