Restart strategy defined in flink-conf.yaml is ignored

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Restart strategy defined in flink-conf.yaml is ignored

Alexander Smirnov
Hello,

I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
But looks like this setting is disregarded.

When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
    Max. number of execution retries          Restart with fixed delay (10000 ms). #2147483647 restart attempts.

Do you think it is a bug?

Alex
Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Piotr Nowojski
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex

Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Alexander Smirnov
Hi Piotr,

I'm using Flink 1.4.2

it's a standard flink distribution downloaded and unpacked.

added the following lines to conf/flink-conf.yaml:
restart-strategy: none
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-metadata
state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb


here's the code:

public class FailedJob
{
    static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);

    public static void main( String[] args ) throws Exception
    {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        
        env.enableCheckpointing(5000,
                CheckpointingMode.EXACTLY_ONCE);
        
        DataStream<String> stream = env.fromCollection(Arrays.asList("test"));

        stream.map(new MapFunction<String, String>(){
            @Override
            public String map(String obj) {
                throw new NullPointerException("NPE");
            } 
        });

        env.execute("Failed job");
    }
}

attaching screenshots, please let me know if more info is needed

Alex


 

On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank">2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex


jobmanager.png (172K) Download Attachment
execution_config.png (136K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Alexander Smirnov
jobmanager.log:

2018-04-05 22:37:28,348 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: restart-strategy, none
2018-04-05 22:37:28,353 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-04-05 22:37:28,506 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager without high-availability
2018-04-05 22:37:28,510 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on localhost:6123 with execution mode CLUSTER
2018-04-05 22:37:28,517 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,546 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,591 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to start actor system at localhost:6123
2018-04-05 22:37:28,981 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-04-05 22:37:29,027 INFO  akka.remote.Remoting                                          - Starting remoting
2018-04-05 22:37:29,129 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123]
2018-04-05 22:37:29,135 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Actor system started at akka.tcp://flink@localhost:6123
2018-04-05 22:37:29,148 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-04-05 22:37:29,152 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager web frontend
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager log file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager stdout file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37 for the web interface files
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Created directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913 for web frontend JAR file uploads.
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at localhost:8081
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
2018-04-05 22:37:29,452 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
2018-04-05 22:37:29,453 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:60697 - max concurrent requests: 50 - max backlog: 1000
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
2018-04-05 22:37:29,544 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2018-04-05 22:37:29,545 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Trying to associate with JobManager leader akka.tcp://flink@localhost:6123/user/jobmanager
2018-04-05 22:37:29,552 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-853250886] - leader session 00000000-0000-0000-0000-000000000000
2018-04-05 22:37:30,495 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager f0b0370186ab3c865db63fe60ca68e08 has started.
2018-04-05 22:37:30,497 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at 192.168.0.26 (akka.tcp://[hidden email]:60696/user/taskmanager) as 2972a72a7223e63bb5a4fedd159c0b78. Current number of registered hosts is 1. Current number of alive task slots is 1.
2018-04-05 22:38:29,355 INFO  org.apache.flink.runtime.client.JobClient                     - Checking and uploading JAR files
2018-04-05 22:38:29,639 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
2018-04-05 22:38:29,643 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=10000) for 43ecfe9cb258b7f624aad9868d306edb.
2018-04-05 22:38:29,656 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers via failover strategy: full graph restart



On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <[hidden email]> wrote:
Hi Piotr,

I'm using Flink 1.4.2

it's a standard flink distribution downloaded and unpacked.

added the following lines to conf/flink-conf.yaml:
restart-strategy: none
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-metadata
state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb


here's the code:

public class FailedJob
{
    static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);

    public static void main( String[] args ) throws Exception
    {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        
        env.enableCheckpointing(5000,
                CheckpointingMode.EXACTLY_ONCE);
        
        DataStream<String> stream = env.fromCollection(Arrays.asList("test"));

        stream.map(new MapFunction<String, String>(){
            @Override
            public String map(String obj) {
                throw new NullPointerException("NPE");
            } 
        });

        env.execute("Failed job");
    }
}

attaching screenshots, please let me know if more info is needed

Alex


 

On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank">2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex

Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Piotr Nowojski
Hi,

Thanks for the details! I can confirm this behaviour. flink-conf.yaml restart-strategy value is being completely ignored (regardless of it’s value) when user enables checkpointing:

env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

I suspect this is a bug, but I have to confirm it.

Thanks, Piotrek

On 5 Apr 2018, at 12:40, Alexander Smirnov <[hidden email]> wrote:

jobmanager.log:

2018-04-05 22:37:28,348 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: restart-strategy, none
2018-04-05 22:37:28,353 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-04-05 22:37:28,506 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager without high-availability
2018-04-05 22:37:28,510 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on localhost:6123 with execution mode CLUSTER
2018-04-05 22:37:28,517 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,546 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,591 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to start actor system at localhost:6123
2018-04-05 22:37:28,981 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-04-05 22:37:29,027 INFO  akka.remote.Remoting                                          - Starting remoting
2018-04-05 22:37:29,129 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[<a href="akka.tcp://flink@localhost:6123" class="">akka.tcp://flink@localhost:6123]
2018-04-05 22:37:29,135 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Actor system started at <a href="akka.tcp://flink@localhost:6123" class="">akka.tcp://flink@localhost:6123
2018-04-05 22:37:29,148 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-04-05 22:37:29,152 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager web frontend
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager log file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager stdout file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37 for the web interface files
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Created directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913 for web frontend JAR file uploads.
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at localhost:8081
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
2018-04-05 22:37:29,452 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
2018-04-05 22:37:29,453 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:60697 - max concurrent requests: 50 - max backlog: 1000
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist <a href="akka://flink/user/archive" class="">akka://flink/user/archive
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at <a href="akka.tcp://flink@localhost:6123/user/jobmanager" class="">akka.tcp://flink@localhost:6123/user/jobmanager.
2018-04-05 22:37:29,544 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager <a href="akka.tcp://flink@localhost:6123/user/jobmanager" class="">akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2018-04-05 22:37:29,545 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Trying to associate with JobManager leader <a href="akka.tcp://flink@localhost:6123/user/jobmanager" class="">akka.tcp://flink@localhost:6123/user/jobmanager
2018-04-05 22:37:29,552 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Resource Manager associating with leading JobManager Actor[<a href="akka://flink/user/jobmanager#-853250886" class="">akka://flink/user/jobmanager#-853250886] - leader session 00000000-0000-0000-0000-000000000000
2018-04-05 22:37:30,495 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager f0b0370186ab3c865db63fe60ca68e08 has started.
2018-04-05 22:37:30,497 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at 192.168.0.26 (<a href="akka.tcp://flink@mb-sr-asmirnov.local:60696/user/taskmanager" class="">akka.tcp://flink@...:60696/user/taskmanager) as 2972a72a7223e63bb5a4fedd159c0b78. Current number of registered hosts is 1. Current number of alive task slots is 1.
2018-04-05 22:38:29,355 INFO  org.apache.flink.runtime.client.JobClient                     - Checking and uploading JAR files
2018-04-05 22:38:29,639 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
2018-04-05 22:38:29,643 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=10000) for 43ecfe9cb258b7f624aad9868d306edb.
2018-04-05 22:38:29,656 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers via failover strategy: full graph restart



On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <[hidden email]> wrote:
Hi Piotr,

I'm using Flink 1.4.2

it's a standard flink distribution downloaded and unpacked.

added the following lines to conf/flink-conf.yaml:
restart-strategy: none
state.backend: rocksdb
state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb


here's the code:

public class FailedJob
{
    static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);

    public static void main( String[] args ) throws Exception
    {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        
        env.enableCheckpointing(5000,
                CheckpointingMode.EXACTLY_ONCE);
        
        DataStream<String> stream = env.fromCollection(Arrays.asList("test"));

        stream.map(new MapFunction<String, String>(){
            @Override
            public String map(String obj) {
                throw new NullPointerException("NPE");
            } 
        });

        env.execute("Failed job");
    }
}

attaching screenshots, please let me know if more info is needed

Alex


 

On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank" class="">2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex


Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Alexander Smirnov
Thanks Piotr,

I've created a JIRA issue to track it: https://issues.apache.org/jira/browse/FLINK-9143

Alex


On Thu, Apr 5, 2018 at 11:28 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Thanks for the details! I can confirm this behaviour. flink-conf.yaml restart-strategy value is being completely ignored (regardless of it’s value) when user enables checkpointing:

env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

I suspect this is a bug, but I have to confirm it.

Thanks, Piotrek

On 5 Apr 2018, at 12:40, Alexander Smirnov <[hidden email]> wrote:

jobmanager.log:

2018-04-05 22:37:28,348 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: restart-strategy, none
2018-04-05 22:37:28,353 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-04-05 22:37:28,506 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager without high-availability
2018-04-05 22:37:28,510 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on localhost:6123 with execution mode CLUSTER
2018-04-05 22:37:28,517 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,546 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,591 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to start actor system at localhost:6123
2018-04-05 22:37:28,981 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-04-05 22:37:29,027 INFO  akka.remote.Remoting                                          - Starting remoting
2018-04-05 22:37:29,129 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123]
2018-04-05 22:37:29,135 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Actor system started at akka.tcp://flink@localhost:6123
2018-04-05 22:37:29,148 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-04-05 22:37:29,152 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager web frontend
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager log file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager stdout file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37 for the web interface files
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Created directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913 for web frontend JAR file uploads.
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at localhost:8081
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
2018-04-05 22:37:29,452 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
2018-04-05 22:37:29,453 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:60697 - max concurrent requests: 50 - max backlog: 1000
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
2018-04-05 22:37:29,544 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2018-04-05 22:37:29,545 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Trying to associate with JobManager leader akka.tcp://flink@localhost:6123/user/jobmanager
2018-04-05 22:37:29,552 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-853250886] - leader session 00000000-0000-0000-0000-000000000000
2018-04-05 22:37:30,495 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager f0b0370186ab3c865db63fe60ca68e08 has started.
2018-04-05 22:37:30,497 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at 192.168.0.26 (akka.tcp://[hidden email]:60696/user/taskmanager) as 2972a72a7223e63bb5a4fedd159c0b78. Current number of registered hosts is 1. Current number of alive task slots is 1.
2018-04-05 22:38:29,355 INFO  org.apache.flink.runtime.client.JobClient                     - Checking and uploading JAR files
2018-04-05 22:38:29,639 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
2018-04-05 22:38:29,643 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank">2147483647, delayBetweenRestartAttempts=10000) for 43ecfe9cb258b7f624aad9868d306edb.
2018-04-05 22:38:29,656 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers via failover strategy: full graph restart



On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <[hidden email]> wrote:
Hi Piotr,

I'm using Flink 1.4.2

it's a standard flink distribution downloaded and unpacked.

added the following lines to conf/flink-conf.yaml:
restart-strategy: none
state.backend: rocksdb
state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb


here's the code:

public class FailedJob
{
    static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);

    public static void main( String[] args ) throws Exception
    {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        
        env.enableCheckpointing(5000,
                CheckpointingMode.EXACTLY_ONCE);
        
        DataStream<String> stream = env.fromCollection(Arrays.asList("test"));

        stream.map(new MapFunction<String, String>(){
            @Override
            public String map(String obj) {
                throw new NullPointerException("NPE");
            } 
        });

        env.execute("Failed job");
    }
}

attaching screenshots, please let me know if more info is needed

Alex


 

On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank">2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex


Reply | Threaded
Open this post in threaded view
|

Re: Restart strategy defined in flink-conf.yaml is ignored

Piotr Nowojski
Thanks!

On 6 Apr 2018, at 00:30, Alexander Smirnov <[hidden email]> wrote:

Thanks Piotr,

I've created a JIRA issue to track it: https://issues.apache.org/jira/browse/FLINK-9143

Alex


On Thu, Apr 5, 2018 at 11:28 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Thanks for the details! I can confirm this behaviour. flink-conf.yaml restart-strategy value is being completely ignored (regardless of it’s value) when user enables checkpointing:

env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

I suspect this is a bug, but I have to confirm it.

Thanks, Piotrek

On 5 Apr 2018, at 12:40, Alexander Smirnov <[hidden email]> wrote:

jobmanager.log:

2018-04-05 22:37:28,348 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: restart-strategy, none
2018-04-05 22:37:28,353 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-04-05 22:37:28,506 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager without high-availability
2018-04-05 22:37:28,510 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on localhost:6123 with execution mode CLUSTER
2018-04-05 22:37:28,517 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,546 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-04-05 22:37:28,591 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to start actor system at localhost:6123
2018-04-05 22:37:28,981 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-04-05 22:37:29,027 INFO  akka.remote.Remoting                                          - Starting remoting
2018-04-05 22:37:29,129 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@localhost:6123]
2018-04-05 22:37:29,135 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Actor system started at akka.tcp://flink@localhost:6123
2018-04-05 22:37:29,148 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-04-05 22:37:29,152 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager web frontend
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager log file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.log
2018-04-05 22:37:29,161 INFO  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined location of JobManager stdout file: /Users/asmirnov/flink-1.4.2/log/flink-jobmanager-0.out
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-901a3fb7-d366-4f90-b75c-1e1f8038ed37 for the web interface files
2018-04-05 22:37:29,162 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Created directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/flink-web-21e5d8a8-7967-40f0-97d7-a803d9bd5913 for web frontend JAR file uploads.
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at localhost:8081
2018-04-05 22:37:29,447 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
2018-04-05 22:37:29,452 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /var/folders/5s/yj6g5wd90h158whcb_483hhhq7t4sw/T/blobStore-6777e862-0c2c-4679-a42f-b1921baa5236
2018-04-05 22:37:29,453 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:60697 - max concurrent requests: 50 - max backlog: 1000
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
2018-04-05 22:37:29,533 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
2018-04-05 22:37:29,544 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2018-04-05 22:37:29,545 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Trying to associate with JobManager leader akka.tcp://flink@localhost:6123/user/jobmanager
2018-04-05 22:37:29,552 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-853250886] - leader session 00000000-0000-0000-0000-000000000000
2018-04-05 22:37:30,495 INFO  org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager  - TaskManager f0b0370186ab3c865db63fe60ca68e08 has started.
2018-04-05 22:37:30,497 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at 192.168.0.26 (akka.tcp://[hidden email]:60696/user/taskmanager) as 2972a72a7223e63bb5a4fedd159c0b78. Current number of registered hosts is 1. Current number of alive task slots is 1.
2018-04-05 22:38:29,355 INFO  org.apache.flink.runtime.client.JobClient                     - Checking and uploading JAR files
2018-04-05 22:38:29,639 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 43ecfe9cb258b7f624aad9868d306edb (Failed job).
2018-04-05 22:38:29,643 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank" class="">2147483647, delayBetweenRestartAttempts=10000) for 43ecfe9cb258b7f624aad9868d306edb.
2018-04-05 22:38:29,656 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers via failover strategy: full graph restart



On Thu, Apr 5, 2018 at 10:35 PM Alexander Smirnov <[hidden email]> wrote:
Hi Piotr,

I'm using Flink 1.4.2

it's a standard flink distribution downloaded and unpacked.

added the following lines to conf/flink-conf.yaml:
restart-strategy: none
state.backend: rocksdb
state.backend.rocksdb.checkpointdir: file:///tmp/nfsrecovery/flink-checkpoints-rocksdb


here's the code:

public class FailedJob
{
    static final Logger LOGGER = LoggerFactory.getLogger(FailedJob.class);

    public static void main( String[] args ) throws Exception
    {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        
        env.enableCheckpointing(5000,
                CheckpointingMode.EXACTLY_ONCE);
        
        DataStream<String> stream = env.fromCollection(Arrays.asList("test"));

        stream.map(new MapFunction<String, String>(){
            @Override
            public String map(String obj) {
                throw new NullPointerException("NPE");
            } 
        });

        env.execute("Failed job");
    }
}

attaching screenshots, please let me know if more info is needed

Alex


 

On Thu, Apr 5, 2018 at 5:35 PM Piotr Nowojski <[hidden email]> wrote:
Hi,

Can you provide more details, like post your configuration/log files/screen shots from web UI and Flink version being used?

Piotrek

> On 5 Apr 2018, at 06:07, Alexander Smirnov <[hidden email]> wrote:
>
> Hello,
>
> I've defined restart strategy in flink-conf.yaml as none. WebUI / Job Manager section confirms that.
> But looks like this setting is disregarded.
>
> When I go into job's configuration in the WebUI, in the Execution Configuration section I can see:
>     Max. number of execution retries          Restart with fixed delay (10000 ms). #<a href="tel:(214)%20748-3647" value="+12147483647" target="_blank" class="">2147483647 restart attempts.
>
> Do you think it is a bug?
>
> Alex