Hi All,
We've been using Flink 1.3.2 for a while now, but recently failed to deploy our fat jar to the cluster. The deployment only works when we remove two arbitrary operators, which gives us the impression that our job is too large. However, compared to a working version of our jar, we only changed some case classes and serializers (to support Avro). I'll provide some context below.

*Streaming operators used:* (same list as when the deploy worked)
- 9 incoming streams from Kafka (all parsed from JSON -> case classes)
- 6 stateful joins (extend CoProcessFunction)
- 4 stateful processors (extend ProcessFunction)
- 5 maps
- 2 filters
- 1 union of 3 streams
- 1 sink to Kafka (case class -> JSON)

*Changes made:*
- add an extended TypeSerializer for Avro support
- add companion objects to case classes for translation to Avro GenericRecords
- alter the stateful functions to use the above changes

*What does work:*
- remove 2 arbitrary operators and deploy the fat jar
- run the full program locally using sbt run

Could it be that the complexity of the job somehow causes the deployment of the jar to fail? We simply get a timeout from Flink's CLI when trying to deploy, even after extending the timeout to several minutes.

Any help would be very much appreciated!

Thanks,
Niels
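P.S. In case it helps, this is roughly what the companion-object change looks like. A simplified sketch with made-up field names, not our actual code (our real schemas are larger):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

case class Click(userId: String, timestamp: Long)

object Click {
  // Parsed once per JVM; the real schema has many more fields.
  val schema: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "Click", "fields": [
      |  {"name": "userId", "type": "string"},
      |  {"name": "timestamp", "type": "long"}
      |]}""".stripMargin)

  // Translate the case class into an Avro GenericRecord.
  def toGenericRecord(c: Click): GenericRecord = {
    val record = new GenericData.Record(schema)
    record.put("userId", c.userId)
    record.put("timestamp", c.timestamp)
    record
  }
}
```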
Hi Niels,

There should be no size constraint on the complexity of an application or the size of a JAR file. Apparently, the system has to spend more time on planning / submitting the application than before.

Have you tried to increase the akka.client.timeout parameter? If that does not help, it would be good to learn what the JobManager is doing after the application was submitted. Either it just takes longer than before, such that the client timeout is exceeded, or it might even get stuck in some kind of deadlock (which would be bad). In that case it might help to take a few stack traces of the JM process after the application was submitted, to check whether the threads are making progress.

I'll also include Till, who is more familiar with the submission process and JM planning and coordination.

Best, Fabian
Hi Niels,

The size of the jar does not play a role for Flink. What could be a problem is that the serialized `JobGraph` (user code with closures) is larger than 10 MB and thus exceeds the default maximum framesize of Akka. In that case it cannot be sent to the `JobMaster`. You can control the framesize via `akka.framesize`.

In order to debug the problem properly, I would need access to the client log and the JobManager logs, if possible.

Cheers, Till
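P.S. One common reason for the serialized `JobGraph` to grow unexpectedly is a large data structure captured in an operator's closure. A minimal sketch of the pattern (the `loadHugeTable` helper is hypothetical, standing in for whatever builds the structure):

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

object ClosureExample {
  // Hypothetical loader; imagine it returns tens of MB of data.
  def loadHugeTable(): Map[String, String] = Map("a" -> "1")

  // Anti-pattern: a locally built table captured by a lambda's closure
  // is serialized into the JobGraph on every submission:
  //   val hugeLookup = loadHugeTable()
  //   stream.map(key => hugeLookup.getOrElse(key, "unknown"))

  // Better: build the structure inside open(), so only the small
  // function object itself is serialized into the JobGraph.
  class LookupMap extends RichMapFunction[String, String] {
    @transient private var lookup: Map[String, String] = _

    override def open(parameters: Configuration): Unit = {
      lookup = loadHugeTable()
    }

    override def map(key: String): String =
      lookup.getOrElse(key, "unknown")
  }
}
```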
Hi Till,
I've just tried to set, on the *client*:

akka.client.timeout: 300s

On the *cluster*:

akka.ask.timeout: 30s
akka.lookup.timeout: 30s
akka.client.timeout: 300s
akka.framesize: 104857600b  # 10x the original 10 MB
akka.log.lifecycle.events: true

Still gives me the same issue; the fat jar isn't deployed. See the attached files for the logs of the jobmanager and the deployer. Let me know if I can provide you with any additional info.

Thanks for your help!

Cheers,
Niels

Flink_deploy_log.txt <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1147/Flink_deploy_log.txt>
flink_jobmanager_log.txt <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1147/flink_jobmanager_log.txt>
In case it's useful, I've found how to enable a bit more debug logging on the jobmanager:

flink_jobmanager_log.txt <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1147/flink_jobmanager_log.txt>
Any updates on this one? I'm seeing similar issues with 1.3.3 and the batch API. The main difference is that I have even more operators, ~850, mostly maps and filters with one cogroup. I don't really want to set akka.client.timeout to anything more than 10 minutes, seeing that it still fails with that amount. The akka.framesize is already 500 MB...

akka.framesize: 524288000b
akka.ask.timeout: 10min
akka.client.timeout: 10min
akka.lookup.timeout: 10min

Thanks,
Regina
Short answer: it could be that your job is simply too big to be serialised, distributed, and deserialised in the given time, and you would have to increase the timeouts even more.
Long answer: Do you have the same problem when you try to submit a smaller job? Does your cluster work for simpler jobs? Try cutting down/simplifying your job up to the point where it works. Maybe you will be able to pin down one single operator that's causing the problem (one that has, for example, a huge static data structure). If so, you might be able to optimise your operators in some way. Maybe some operator is doing something weird and causing problems.

You could also try to approach this problem from the other direction (as previously suggested by Fabian). Try to profile/find out what the cluster is doing and where the problem is. Job Manager? One Task Manager? All of the Task Managers? Is there high CPU usage somewhere? Maybe one thread somewhere is overloaded? High network usage? After identifying the potentially problematic JVMs, you could attach a code profiler or print stack traces to further pin down the problem.

Piotrek
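P.S. If the plan consists of hundreds of tiny maps/filters (as in Regina's ~850-operator case), one thing you could try is fusing them into a single operator, so the serialized plan stays small. A rough, untested sketch with made-up transformations:

```scala
import org.apache.flink.api.scala._

object FusedPipeline {
  // Hypothetical stand-in for many generated per-step transformations.
  val steps: Seq[Int => Int] = (1 to 850).map(i => (x: Int) => x + i)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(1, 2, 3)

    // Naive: one map operator per step -> ~850 vertices in the plan,
    // which inflates the serialized job graph and planning time:
    //   val naive = steps.foldLeft(data)((ds, f) => ds.map(f))

    // Fused: apply all steps inside a single map operator, so the plan
    // contains one vertex regardless of the number of steps.
    val fused = data.map(x => steps.foldLeft(x)((acc, f) => f(acc)))

    fused.print()
  }
}
```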