Flink Memory analyze on AWS EMR

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink Memory analyze on AWS EMR

Jacky Du
hi, All 

I'm encounter a memory issue with my flink job on AWS EMR(current flink version 1.6.2) , I would like to find the root cause so I'm trying JITWatch on my local standalone cluster but I can not use it on EMR . after I add following config on flink-conf.yaml :

env.java.opts: "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"

I got error 

2020-05-07 16:24:53,368 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 

How can I fix this issue to enable JITWatch or which tool will be a proper way to analyze flink mem dump on EMR  ? 

Thanks 
Jacky Du
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

r_khachatryan
Hi Jacky,

Did you try it without  -XX:LogFile=${FLINK_LOG_PREFIX}.jit ?
Probably, Flink can't write to this location.


Regards,
Roman


On Mon, May 11, 2020 at 5:02 PM Jacky D <[hidden email]> wrote:
hi, All 

I'm encounter a memory issue with my flink job on AWS EMR(current flink version 1.6.2) , I would like to find the root cause so I'm trying JITWatch on my local standalone cluster but I can not use it on EMR . after I add following config on flink-conf.yaml :

env.java.opts: "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"

I got error 

2020-05-07 16:24:53,368 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment. 

How can I fix this issue to enable JITWatch or which tool will be a proper way to analyze flink mem dump on EMR  ? 

Thanks 
Jacky Du
Reply | Threaded
Open this post in threaded view
|

Fwd: Flink Memory analyze on AWS EMR

Jacky Du


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

rmetzger0
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Jacky Du
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

rmetzger0
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Jacky Du
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 

memErrorLog.log (17K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Xintong Song
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Jacky Du
hi, Xintong 

Thanks for reply , I attached those lines below for application master start command : 


2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint  1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client                                  - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> 于2020年5月11日周一 下午10:11写道:
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Arvid Heise-3
Hi Jacky,

I suspect that the quotes are the actual issue. Could you try to remove them? See also [1].


On Tue, May 12, 2020 at 4:03 PM Jacky D <[hidden email]> wrote:
hi, Xintong 

Thanks for reply , I attached those lines below for application master start command : 


2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint  1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client                                  - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> 于2020年5月11日周一 下午10:11写道:
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Jacky Du
hi, Arvid 

thanks for the advice  ,  I removed the quotes and it do created a yarn session on EMR , but I didn't find any jit log file generated .  

The config with quotes is working on standalone cluster . I also tried to dynamic pass the property within the yarn session command : 

flink-yarn-session -n 1 -d -nm testSession -yD env.java.opts="-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"


but get same result , session created , but can not find any jit log file under container log . 


Thanks 

Jacky 


Arvid Heise <[hidden email]> 于2020年5月12日周二 下午12:57写道:
Hi Jacky,

I suspect that the quotes are the actual issue. Could you try to remove them? See also [1].


On Tue, May 12, 2020 at 4:03 PM Jacky D <[hidden email]> wrote:
hi, Xintong 

Thanks for reply , I attached those lines below for application master start command : 


2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint  1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client                                  - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> 于2020年5月11日周一 下午10:11写道:
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Xintong Song
Hi Jacky,

I don't think ${FLINK_LOG_PREFIX} is available for Flink Yarn deployment. This is just my guess, that the actual file name becomes ".jit". You can try to verify that by looking for the hidden file.

If it is indeed this problem, you can try to replace "${FLINK_LOG_PREFIX}" with "<LOG_DIR>/your-file-name.jit". The token "<LOG_DIR>" should be replaced with proper log directory path by Yarn automatically.

I noticed that the usage of ${FLINK_LOG_PREFIX} is recommended by Flink's documentation [1]. This is IMO a bit misleading. I'll try to file an issue to improve the docs.

Thank you~

Xintong Song



On Wed, May 13, 2020 at 2:45 AM Jacky D <[hidden email]> wrote:
hi, Arvid 

thanks for the advice  ,  I removed the quotes and it do created a yarn session on EMR , but I didn't find any jit log file generated .  

The config with quotes is working on standalone cluster . I also tried to dynamic pass the property within the yarn session command : 

flink-yarn-session -n 1 -d -nm testSession -yD env.java.opts="-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"


but get same result , session created , but can not find any jit log file under container log . 


Thanks 

Jacky 


Arvid Heise <[hidden email]> 于2020年5月12日周二 下午12:57写道:
Hi Jacky,

I suspect that the quotes are the actual issue. Could you try to remove them? See also [1].


On Tue, May 12, 2020 at 4:03 PM Jacky D <[hidden email]> wrote:
hi, Xintong 

Thanks for reply , I attached those lines below for application master start command : 


2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint  1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client                                  - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> 于2020年5月11日周一 下午10:11写道:
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng   
Reply | Threaded
Open this post in threaded view
|

Re: Flink Memory analyze on AWS EMR

Jacky Du
Hi, Xintong 

Thanks for point it out, after I set up the log path it's working now . 
so , for conclusion . 

on emr , to set up jitwatch in flink-conf.yaml, we should not include quotes and give a path to output the jit log file . this is different from setting it on standalone cluster .
example : 
env.java.opts: -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=/tmp/flinkmemdump.jit -XX:+PrintAssembly

Thanks everyone involved in this discussion!

Jacky

Xintong Song <[hidden email]> 于2020年5月12日周二 下午10:41写道:
Hi Jacky,

I don't think ${FLINK_LOG_PREFIX} is available for Flink Yarn deployment. This is just my guess, that the actual file name becomes ".jit". You can try to verify that by looking for the hidden file.

If it is indeed this problem, you can try to replace "${FLINK_LOG_PREFIX}" with "<LOG_DIR>/your-file-name.jit". The token "<LOG_DIR>" should be replaced with proper log directory path by Yarn automatically.

I noticed that the usage of ${FLINK_LOG_PREFIX} is recommended by Flink's documentation [1]. This is IMO a bit misleading. I'll try to file an issue to improve the docs.

Thank you~

Xintong Song



On Wed, May 13, 2020 at 2:45 AM Jacky D <[hidden email]> wrote:
hi, Arvid 

thanks for the advice  ,  I removed the quotes and it do created a yarn session on EMR , but I didn't find any jit log file generated .  

The config with quotes is working on standalone cluster . I also tried to dynamic pass the property within the yarn session command : 

flink-yarn-session -n 1 -d -nm testSession -yD env.java.opts="-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly"


but get same result , session created , but can not find any jit log file under container log . 


Thanks 

Jacky 


Arvid Heise <[hidden email]> 于2020年5月12日周二 下午12:57写道:
Hi Jacky,

I suspect that the quotes are the actual issue. Could you try to remove them? See also [1].


On Tue, May 12, 2020 at 4:03 PM Jacky D <[hidden email]> wrote:
hi, Xintong 

Thanks for reply , I attached those lines below for application master start command : 


2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Crypto codec org.apache.hadoop.crypto.OpensslAesCtrCryptoCodec is not available.
2020-05-11 21:16:16,635 DEBUG org.apache.hadoop.util.PerformanceAdvisory                    - Using crypto codec org.apache.hadoop.crypto.JceAesCtrCryptoCodec.
2020-05-11 21:16:16,636 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 0 offsetInBlock: 0 lastPacketInBlock: false lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 0 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,637 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DataStreamer block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315 sending packet packet seqno: 1 offsetInBlock: 1697 lastPacketInBlock: true lastByteOffsetInBlock: 1697
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - DFSClient seqno: 1 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
2020-05-11 21:16:16,638 DEBUG org.apache.hadoop.hdfs.DataStreamer                           - Closing old block BP-1519523618-98.94.65.144-1581106168138:blk_1073745139_4315
2020-05-11 21:16:16,641 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #70 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #70
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: complete took 2ms
2020-05-11 21:16:16,643 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #71 org.apache.hadoop.hdfs.protocol.ClientProtocol.setTimes
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #71
2020-05-11 21:16:16,645 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setTimes took 2ms
2020-05-11 21:16:16,647 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop sending #72 org.apache.hadoop.hdfs.protocol.ClientProtocol.setPermission
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.Client                                  - IPC Client (1954985045) connection to ip-98-94-65-144.ec2.internal/98.94.65.144:8020 from hadoop got value #72
2020-05-11 21:16:16,648 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine                       - Call: setPermission took 2ms
2020-05-11 21:16:16,654 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor           - Application Master start command: $JAVA_HOME/bin/java -Xmx424m "-XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading -XX:+LogCompilation -XX:LogFile=${FLINK_LOG_PREFIX}.jit -XX:+PrintAssembly" -Dlog.file="<LOG_DIR>/jobmanager.log" -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint  1> <LOG_DIR>/jobmanager.out 2> <LOG_DIR>/jobmanager.err
2020-05-11 21:16:16,654 DEBUG org.apache.hadoop.ipc.Client                                  - stopping client from cache: org.apache.hadoop.ipc.Client@28194a50
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setApplicationTags.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setAttemptFailuresValidityInterval.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setKeepContainersAcrossApplicationAttempts.
2020-05-11 21:16:16,656 DEBUG org.apache.flink.yarn.AbstractYarnClusterDescriptor$ApplicationSubmissionContextReflector  - org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext supports method setNodeLabelExpression.

Xintong Song <[hidden email]> 于2020年5月11日周一 下午10:11写道:
Hi Jacky,

Could you search for "Application Master start command:" in the debug log and post the result and a few lines before & after that? This is not included in the clip of attached log file.

Thank you~

Xintong Song



On Tue, May 12, 2020 at 5:33 AM Jacky D <[hidden email]> wrote:
hi, Robert

Thanks so much for quick reply  , I changed the log level to debug  and attach the log file .

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午4:14写道:
Thanks a lot for posting the full output.

It seems that Flink is passing an invalid list of arguments to the JVM. 
Can you 
- set the root log level in conf/log4j-yarn-session.properties to DEBUG
- then launch the YARN session
- share the log file of the yarn session on the mailing list?

I'm particularly interested in the line printed here, as it shows the JVM invocation.


On Mon, May 11, 2020 at 9:56 PM Jacky D <[hidden email]> wrote:
Hi,Robert 

Yes , I tried to retrieve more log info from yarn UI , the full logs showing below , this happens when I try to create a flink yarn session on emr when set up jitwatch configuration .

2020-05-11 19:06:09,552 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli                 - Error while running the Flink Yarn session.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1862)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:813)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:429)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:610)
at org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$2(FlinkYarnSessionCli.java:813)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
... 2 more
Caused by: org.apache.flink.yarn.AbstractYarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
Diagnostics from YARN: Application application_1584459865196_0165 failed 1 times (global limit =2; local limit is =1) due to AM Container for appattempt_1584459865196_0165_000001 exited with  exitCode: 1
Failing this attempt.Diagnostics: Exception from container-launch.
Container id: container_1584459865196_0165_01_000001
Exit code: 1
Exception message: Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
where options include:
    -d32   use a 32-bit data model if available
    -d64   use a 64-bit data model if available
    -server   to select the "server" VM
                  The default VM is server,
                  because you are running on a server-class machine.


    -cp <class search path of directories and zip/jar files>
    -classpath <class search path of directories and zip/jar files>
                  A : separated list of directories, JAR archives,
                  and ZIP archives to search for class files.
    -D<name>=<value>
                  set a system property
    -verbose:[class|gc|jni]
                  enable verbose output
    -version      print product version and exit
    -version:<value>
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  require the specified version to run
    -showversion  print product version and continue
    -jre-restrict-search | -no-jre-restrict-search
                  Warning: this feature is deprecated and will be removed
                  in a future release.
                  include/exclude user private JREs in the version search
    -? -help      print this help message
    -X            print help on non-standard options
    -ea[:<packagename>...|:<classname>]
    -enableassertions[:<packagename>...|:<classname>]
                  enable assertions with specified granularity
    -da[:<packagename>...|:<classname>]
    -disableassertions[:<packagename>...|:<classname>]
                  disable assertions with specified granularity
    -esa | -enablesystemassertions
                  enable system assertions
    -dsa | -disablesystemassertions
                  disable system assertions
    -agentlib:<libname>[=<options>]
                  load native agent library <libname>, e.g. -agentlib:hprof
                  see also, -agentlib:jdwp=help and -agentlib:hprof=help
    -agentpath:<pathname>[=<options>]
                  load native agent library by full pathname
    -javaagent:<jarpath>[=<options>]
                  load Java programming language agent, see java.lang.instrument
    -splash:<imagepath>
                  show splash screen with specified image

Thanks 
Jacky

Robert Metzger <[hidden email]> 于2020年5月11日周一 下午3:42写道:
Hey Jacky,

The error says "The YARN application unexpectedly switched to state FAILED during deployment.".
Have you tried retrieving the YARN application logs?
Does the YARN UI / resource manager logs reveal anything on the reason for the deployment to fail?

Best,
Robert


On Mon, May 11, 2020 at 9:34 PM Jacky D <[hidden email]> wrote:


---------- Forwarded message ---------
发件人: Jacky D <[hidden email]>
Date: 2020年5月11日周一 下午3:12
Subject: Re: Flink Memory analyze on AWS EMR
To: Khachatryan Roman <[hidden email]>


Hi, Roman 

Thanks for quick response , I tried without logFIle option but failed with same error , I'm currently using flink 1.6 https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/application_profiling.html, so I can only use Jitwatch or JMC .  I guess those tools only available on Standalone cluster ? as document mentioned "Each standalone JobManager, TaskManager, HistoryServer, and ZooKeeper daemon redirects stdout and stderr to a file with a .out filename suffix and writes internal logging to a file with a .log suffix. Java options configured by the user in env.java.opts" ? 

Thanks 
Jacky 


--

Arvid Heise | Senior Java Developer


Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng