Hi all,

I'm encountering a memory issue with my Flink job on AWS EMR (current Flink version 1.6.2) and would like to find the root cause. I'm trying JITWatch, which works on my local standalone cluster, but I cannot use it on EMR. After adding the following config to flink-conf.yaml:

env.java.opts: "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"

I got this error:

2020-05-07 16:24:53,368 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    ... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.

How can I fix this issue to enable JITWatch, or which tool would be the proper way to analyze a Flink memory dump on EMR?

Thanks,
Jacky Du
Hi Jacky,

Did you try it without -XX:LogFile=${FLINK_LOG_PREFIX}.jit? Probably Flink can't write to this location. You can also try the other tools described at https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/application_profiling.html

Regards,
Roman

On Mon, May 11, 2020 at 5:02 PM Jacky D <[hidden email]> wrote:
---------- Forwarded message ---------
From: Jacky D <[hidden email]>
Date: Mon, May 11, 2020 at 3:12 PM
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>

Hi Roman,

Thanks for the quick response. I tried without the LogFile option but it failed with the same error. I'm currently using Flink 1.6 (https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html), so I can only use JITWatch or JMC. I guess those tools are only available on a standalone cluster, as the document mentions: "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts"?

Thanks,
Jacky
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment." Have you tried retrieving the YARN application logs? Do the YARN UI / resource manager logs reveal anything about why the deployment failed?

Best,
Robert

On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:
Hi Robert,

Yes, I tried to retrieve more log info from the YARN UI; the full log is below. This happens when I try to create a Flink YARN session on EMR with the JITWatch configuration set up.

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
    at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    ... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with exitCode: 1
Failing this attempt. Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32          use a 32-bit data model if available
    -d64          use a 64-bit data model if available
    -server       to select the "server" VM
                  The default VM is server, because you are running on a server-class machine.
    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives, and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image
See http://www.oracle.com/technetwork/java/javase/documentation/index.html for more details.

Thanks,
Jacky

Robert Metzger <[hidden email]> wrote on Mon, May 11, 2020 at 3:42 PM:
Thanks a lot for posting the full output. It seems that Flink is passing an invalid list of arguments to the JVM.

Can you
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.

On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi Robert,

Thanks so much for the quick reply. I changed the log level to DEBUG and attached the log file.

Thanks,
Jacky

Robert Metzger <[hidden email]> wrote on Mon, May 11, 2020 at 4:14 PM:
[Attachment: memErrorLog.log (17K)]
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post that line plus a few lines before and after it? It is not included in the attached clip of the log file.

Thank you~

Xintong Song

On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
Hi Xintong,

Thanks for the reply. I attached the lines around the Application Master start command below:

2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint 1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> wrote on Mon, May 11, 2020 at 10:11 PM:
Hi Jacky, I suspect that the quotes are the actual issue. Could you try to remove them? See also [1]. On Tue, May 12, 2020 at 4:03 PM Jacky D <[hidden email]> wrote:
--
Arvid Heise | Senior Java Developer

Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng
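Arvid's suspicion about the quotes can be illustrated with a small sketch (nothing Flink-specific; `shlex` here just mimics how a POSIX shell tokenizes the Application Master start command). Because env.java.opts is embedded in the command line inside double quotes, the JVM receives all of the flags as one argument containing spaces, which is not a valid option, so `java` prints its usage text and exits with code 1 -- exactly what the YARN diagnostics showed. The command string below is a shortened stand-in for the one in the debug log:

```python
import shlex

# A shortened stand-in for the AM start command from the debug log,
# with env.java.opts embedded in double quotes:
cmd = ('java -Xmx424m '
       '"-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading '
       '-XX:+LogCompilation -XX:+PrintAssembly" '
       'org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint')

argv = shlex.split(cmd)  # tokenize the way a POSIX shell would
print(argv[2])           # all four -XX flags fused into ONE argument
print(" " in argv[2])    # True -- but no single valid JVM option contains a space
```

Without the quotes, each `-XX:` flag becomes its own argv entry and the JVM accepts them.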
Hi Arvid,

Thanks for the advice. I removed the quotes and it did create a YARN session on EMR, but I didn't find any JIT log file generated. The config with quotes works on a standalone cluster. I also tried passing the property dynamically with the yarn session command:

flink-yarn-session -n 1 -d -nm testSession -yD env.java.opts="-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"

but got the same result: the session is created, but I cannot find any JIT log file under the container logs.

Thanks,
Jacky

Arvid Heise <[hidden email]> wrote on Tue, May 12, 2020 at 12:57 PM:
Hi Jacky,

I don't think ${FLINK_LOG_PREFIX} is available for Flink YARN deployments. My guess is that the actual file name becomes ".jit". You can verify that by looking for the hidden file. If that is indeed the problem, try replacing "${FLINK_LOG_PREFIX}" with "<LOG_DIR>/your-file-name.jit". The token "<LOG_DIR>" should be replaced with the proper log directory path by YARN automatically.

I noticed that the usage of ${FLINK_LOG_PREFIX} is recommended by Flink's documentation [1]. This is IMO a bit misleading. I'll try to file an issue to improve the docs.

Thank you~

Xintong Song

On Wed, May 13, 2020 at 2:45 AM Jacky D <[hidden email]> wrote:
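Xintong's guess can be sketched as follows. In a POSIX shell, an unset variable such as `${FLINK_LOG_PREFIX}` expands to the empty string, so `-XX:LogFile=${FLINK_LOG_PREFIX}.jit` would point HotSpot at a hidden file named ".jit". The `sh_expand` helper below is a toy stand-in for that shell behavior (note that Python's own `os.path.expandvars` leaves unset variables untouched, so it cannot demonstrate this directly):

```python
def sh_expand(s: str, env: dict) -> str:
    # Toy stand-in for POSIX "${VAR}" expansion: an unset variable
    # expands to the empty string.
    for name in ("FLINK_LOG_PREFIX",):
        s = s.replace("${%s}" % name, env.get(name, ""))
    return s

# On a standalone cluster the daemon scripts export FLINK_LOG_PREFIX
# (example value below is hypothetical):
print(sh_expand("${FLINK_LOG_PREFIX}.jit",
                {"FLINK_LOG_PREFIX": "/var/log/flink/flink-taskexecutor-0"}))
# -> /var/log/flink/flink-taskexecutor-0.jit

# In a YARN container the variable is not set, so the file becomes
# a hidden ".jit" in the working directory:
print(sh_expand("${FLINK_LOG_PREFIX}.jit", {}))
# -> .jit
```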
Hi Xintong,

Thanks for pointing that out. After I set up the log path, it's working now.

So, in conclusion: on EMR, to set up JITWatch in flink-conf.yaml, we should not include quotes and should give an explicit path for the JIT log file output. This is different from setting it up on a standalone cluster. Example:

env.java.opts: -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=/tmp/flinkmemdump.jit -XX:+PrintAssembly

Thanks everyone involved in this discussion!
Jacky

Xintong Song <[hidden email]> wrote on Tue, May 12, 2020 at 10:41 PM:
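The thread's two pitfalls (surrounding quotes, and a log path that may not expand in a YARN container) could be checked mechanically before deploying. The sketch below is a hypothetical helper, not part of Flink; it only encodes the two rules that were discovered in this discussion:

```python
def check_java_opts(opts: str) -> bool:
    """Sanity-check an env.java.opts value for a YARN deployment."""
    # Rule 1 (from this thread): surrounding quotes end up inside a
    # single JVM argument on YARN and make the JVM print its usage text.
    if opts.startswith('"') or opts.endswith('"'):
        return False
    # Rule 2 (from this thread): -XX:LogFile should be an explicit
    # absolute path, since ${FLINK_LOG_PREFIX} may be unset in the container.
    for tok in opts.split():
        if tok.startswith("-XX:LogFile="):
            path = tok.split("=", 1)[1]
            if not path.startswith("/"):
                return False
    return True

# The working config from the conclusion above passes:
opts = ("-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading "
        "-XX:+LogCompilation -XX:LogFile=/tmp/flinkmemdump.jit "
        "-XX:+PrintAssembly")
print(check_java_opts(opts))                        # True
print(check_java_opts('"-XX:+PrintAssembly"'))      # False: quoted
print(check_java_opts("-XX:LogFile=.jit"))          # False: not absolute
```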