From Sungwoo Park <glap...@gmail.com>
Subject Re: Question on Tez 0.6 and Tez 0.7
Date Mon, 13 Jun 2016 03:50:01 GMT
Yes, we also think that is what is happening. When there is 'Vertex
re-running', we see exceptions like:

2016-05-13 09:45:28,280 WARN [ResponseProcessor for block BP-358076261-10.1.91.4-1462195174702:blk_1075135562_1396252] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-358076261-10.1.91.4-1462195174702:blk_1075135562_1396252
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.1.91.11:37025 remote=/10.1.91.11:50010]
  at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
  at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
  at java.io.FilterInputStream.read(FilterInputStream.java:83)
  at java.io.FilterInputStream.read(FilterInputStream.java:83)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2201)
  at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:867)
2016-05-13 09:45:28,283 INFO [TezChild] task.TezTaskRunner: Encounted an error while executing task: attempt_1462251281999_0859_1_01_000063_0
java.io.IOException: All datanodes 10.1.91.11:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)
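
If the write-pipeline timeouts themselves are part of the trigger, one thing we
can try is raising the HDFS socket timeouts in hdfs-site.xml. The values below
are only illustrative and unverified on our side; the 65000 ms in the log
appears to be the default dfs.client.socket-timeout of 60000 ms plus the
per-datanode read-timeout extension:

dfs.client.socket-timeout = 180000
dfs.datanode.socket.write.timeout = 960000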

However, this still does not explain why Tez 0.7 runs much slower than Tez
0.6.2 even when no such exception occurs. As an example, see the summary of
results below. Tez 0.7 takes more than 1100 seconds to sort 40GB per node
(with no Vertex re-running and no exception other than at the end of the run),
whereas Tez 0.6.2 finishes in just under 800 seconds. These results were
obtained on the same cluster, using the default settings of HDP 2.4 (installed
with Ambari), with very few differences in configuration between the two runs.

I suspect that we are missing some important settings for Tez 0.7 (we did not
install Hadoop and HDFS manually). Let me try the same experiment on other
clusters and see what happens.
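
Another isolation test I plan to run is switching Tez 0.7 back to the
non-pipelined sorter, to see how much of the gap comes from the sorter change
alone. If I understand the knob correctly, LEGACY is the other accepted value
of tez.runtime.sorter.class; this is only a guess at the relevant setting, not
something we have measured yet:

tez.runtime.sorter.class = LEGACY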

Thanks for the response,

--- Sungwoo

tez-0.6.2-40GB_10nodes
id   time   num_containers   mem          core     diag
0    761    239              525320249    82608
1    767    239              527612323    83271
2    736    239              520229980    82317

tez-0.7.0-40GB_10nodes
id   time   num_containers   mem          core     diag
0    1405   339              953706595    136624   Vertex re-running, vertexName=initialmap, vertexId=vertex_1462251281999_0859_1_00
1    1157   239              828765079    118293
2    1219   239              833052604    118151



On Mon, Jun 13, 2016 at 6:21 AM, Bikas Saha <bikas@apache.org> wrote:

> From vertex re-running messages it seems like you may be seeing errors in
> fetching shuffle data to reducers from mappers. This results in re-running
> map vertex tasks to produce only the missing data – with the corresponding
> vertex re-running message.
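>
> To confirm, you could grep the aggregated YARN logs for fetch failures, e.g.
> something like the following (assuming log aggregation is enabled; the
> application id is inferred from the task attempt id in your log, and the grep
> pattern is only illustrative):
>
> yarn logs -applicationId application_1462251281999_0859 | grep -i fetch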
>
>
>
> Bikas
>
>
>
> *From:* Rajesh Balamohan [mailto:rajesh.balamohan@gmail.com]
> *Sent:* Saturday, June 11, 2016 5:14 AM
> *To:* user@tez.apache.org
> *Subject:* Re: Question on Tez 0.6 and Tez 0.7
>
>
>
> Can you share the exceptions that happened in the Tez 0.7 runs?
>
> -Rajesh.B
>
> On 11-Jun-2016 16:28, "Sungwoo Park" <glapark@gmail.com> wrote:
>
> Hello,
>
>
>
> I have a question about the performance difference between Tez 0.6.2 and
> Tez 0.7.0.
>
>
>
> This is what we did:
>
>
>
> 1. Installed HDP 2.4 on a 10-node cluster with default settings. No other
> particular changes were made to the default settings recommended by HDP 2.4.
>
>
>
> 2. Ran TeraSort using Tez 0.6.2 and Tez 0.7.0, and compared the running
> time.
>
>
>
> Each experiment specifies the amount of input data per node. For example,
> 10GB_per_node means a total of 100GB input because there are 10 data nodes
> in the cluster.
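>
> For concreteness, TeraGen rows are 100 bytes, so (taking 1GB as 10^9 bytes) a
> 10GB_per_node run corresponds to roughly 1 billion generated rows. The command
> lines below are only an illustration; the jar location and HDFS paths will
> differ per installation:
>
> hadoop jar hadoop-mapreduce-examples.jar teragen 1000000000 /benchmarks/tera100G-in
> hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/tera100G-in /benchmarks/tera100G-out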
>
>
>
> We've found that Tez 0.7.0 runs consistently slower than Tez 0.6.2, producing
> 'Vertex re-running' errors quite often when the size of input data per node is
> over 40GB. Even when there is no 'Vertex re-running', Tez 0.7.0 takes much
> longer than Tez 0.6.2.
>
>
>
> We know that Tez 0.7.0 is not inherently slower than Tez 0.6.2, because on a
> cluster of 44 nodes (with only 24GB memory per node), Tez 0.7.0 finished
> TeraSort almost as fast as Tez 0.6.2. We are trying to figure out what we
> missed in the experiments on the 11-node cluster.
>
>
>
> Any help here would be appreciated. Thanks a lot.
>
>
>
> Sungwoo Park
>
>
>
> ----- Configuration
>
>
>
> HDP 2.4
>
> 11 nodes, 10 data nodes, each with 96GB memory, 6 x 500GB HDDs
>
> same HDFS, Yarn, MR
>
>
>
> Each mapper container uses 5GB.
>
> Each reducer container uses 10GB.
>
>
>
> Configurations specific to tez-0.6.2
>
> tez.runtime.sort.threads = 2
>
>
>
> Configurations specific to tez-0.7.0
>
> tez.grouping.max-size = 1073741824
>
> tez.runtime.sorter.class = PIPELINED
>
> tez.runtime.pipelined.sorter.sort.threads = 2
>
>
>
> ----- TEZ-0.6.2
>
>
>
> 10GB_per_node
> id   time   num_containers   mem           core     diag
> 0    212    239              144695261     21873
> 1    204    239              139582665     20945
> 2    211    239              143477178     21700
>
> 20GB_per_node
> id   time   num_containers   mem           core     diag
> 0    392    239              272528515     42367
> 1    402    239              273085026     42469
> 2    410    239              270118502     42111
>
> 40GB_per_node
> id   time   num_containers   mem           core     diag
> 0    761    239              525320249     82608
> 1    767    239              527612323     83271
> 2    736    239              520229980     82317
>
> 80GB_per_node
> id   time   num_containers   mem           core     diag
> 0    1564   239              1123903845    173915
> 1    1666   239              1161079968    178656
> 2    1628   239              1146656912    175998
>
> 160GB_per_node
> id   time   num_containers   mem           core     diag
> 0    3689   239              2523160230    377563
> 1    3796   240              2610411363    388928
> 2    3624   239              2546652697    381400
>
>
>
> ----- TEZ-0.7.0
>
>
>
> 10GB_per_node
> id   time   num_containers   mem           core     diag
> 0    262    239              179373935     26223
> 1    259    239              179375665     25767
> 2    271    239              186946086     26516
>
> 20GB_per_node
> id   time   num_containers   mem           core     diag
> 0    572    239              380034060     55515
> 1    533    239              364082337     53555
> 2    515    239              356570788     52762
>
> 40GB_per_node
> id   time   num_containers   mem           core     diag
> 0    1405   339              953706595     136624   Vertex re-running
> 1    1157   239              828765079     118293
> 2    1219   239              833052604     118151
>
> 80GB_per_node
> id   time   num_containers   mem           core     diag
> 0    3046   361              1999047193    279635   Vertex re-running
> 1    2967   337              2079807505    290171   Vertex re-running
> 2    3138   355              2030176406    282875   Vertex re-running
>
> 160GB_per_node
> id   time   num_containers   mem           core     diag
> 0    6832   436              4524472859    634518   Vertex re-running
> 1    6233   365              4123693672    573259   Vertex re-running
> 2    6133   379              4121812899    579044   Vertex re-running
>
>
