Hi All,
We are currently on a very old version of Flink, 1.4.0, and it has worked pretty well. But lately we have been facing checkpoint timeout issues. We would like to minimize any changes to the current pipelines and go ahead with the migration. With that said, our first pick was to migrate to 1.5.6 and later migrate to a newer version.

Do you guys think a more recent version like 1.6 or 1.7 might work? We did try 1.8, but it requires some changes in the pipelines.

When we tried 1.5.6 with Docker Compose, we were unable to get the TaskManager attached to the JobManager. Are there some specific configurations required for newer versions?

Logs:
8-28 07:36:30.834 [main] INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2020-08-28 07:36:30.853 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address jobmanager/172.21.0.8:6123.
2020-08-28 07:36:31.279 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/172.21.0.8:6123
2020-08-28 07:36:31.280 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.281 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.281 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.282 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:31.283 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.284 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:31.684 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/172.21.0.8:6123
2020-08-28 07:36:31.686 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.687 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.688 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.688 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:31.689 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:31.690 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:32.490 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/172.21.0.8:6123
2020-08-28 07:36:32.491 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:32.493 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:32.494 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:32.495 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:32.496 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.21.0.9': Connection refused (Connection refused)
2020-08-28 07:36:32.497 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2020-08-28 07:36:34.099 [main] INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/172.21.0.8:6123
2020-08-28 07:36:34.100 [main] INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'e6f9104cdc61' (172.21.0.9) for communication.

Flink Conf:
jobmanager.rpc.address: jobmanager
rest.address: jobmanager

Thanks
Hi Navneeth
First of all, I suggest upgrading Flink to the latest version.
You can refer to [1] for savepoint compatibility when upgrading Flink.
For the connection problem, you could log in to your pod and run 'nslookup jobmanager' to see whether the host can be resolved.
You can also check whether the 'jobmanager' service works as expected via 'kubectl get svc'.
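The name-resolution check above can also be scripted and extended with a TCP connection attempt. The following is only a minimal sketch: the helper name check_endpoint is hypothetical, and the host "jobmanager" and port 6123 are taken from the posted log and configuration. Since the log already shows the name resolving to 172.21.0.8 followed by "Connection refused", the TCP step is probably the more informative of the two.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve `host`, then try a TCP connect to `port`.

    Returns (resolved_ip, reachable, detail). `check_endpoint` is a
    hypothetical helper for illustration, not part of Flink.
    """
    try:
        # Step 1: name resolution, roughly what 'nslookup <host>' verifies.
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return None, False, f"DNS lookup failed: {exc}"
    try:
        # Step 2: TCP connect; "Connection refused" here means the name
        # resolved but nothing is listening on that port.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True, "TCP connect succeeded"
    except OSError as exc:
        return ip, False, f"TCP connect failed: {exc}"
```

Running check_endpoint("jobmanager", 6123) from inside the TaskManager container separates the two failure modes: a result with no IP points to name resolution, while a resolved IP with a failed connect suggests the JobManager is not (yet) listening on its RPC port.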
Best
Yun Tang
From: Navneeth Krishnan <[hidden email]>
Sent: Friday, August 28, 2020 17:00
To: user <[hidden email]>
Subject: Flink Migration
Hi Navneeth,

if everything worked before and you are only now running into issues, it would be interesting to know whether your state size grew over time. An application usually needs gradually more resources over time. If the user base of your company grows, so does the number of messages (be it click stream, page impressions, or any kind of transactions). Oftentimes, the operator state grows as well. Sometimes, it's just that the events themselves become more complex and thus you need more overall bandwidth. This means that from time to time, you need to increase the memory of Flink (for state) or the number of compute nodes (to handle more events). In the same way, you need to make sure that your sink scales as well.

If you fail to keep up with the demand, the application gradually becomes more unstable (for example by running out of memory repeatedly). I suspect that this may be happening in your case.

First, it's important to understand what the bottleneck is. The Web UI should help to narrow it down quickly. You can also share your insights and we can discuss further strategies.

If nothing works out, I also recommend an upgrade. Your best migration path would be to use Flink 1.7, which should allow a smoother transition for state [1]. I'd guess that afterwards, you should be able to migrate to 1.11 with almost no code changes.

On Fri, Aug 28, 2020 at 1:43 PM Yun Tang <[hidden email]> wrote:
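One way to check whether state size grew over time is to compare completed checkpoints from the job's checkpoint history. The sketch below is an illustration under assumptions: the function names are hypothetical, and the field names (id, status, state_size, end_to_end_duration) are assumed to match the "history" entries returned by the monitoring REST API at /jobs/<jobid>/checkpoints, which should be verified against the Flink version in use. The same numbers are also visible in the checkpoint view of the Web UI.

```python
def completed_checkpoints(history):
    """Return (id, state_size, end_to_end_duration) for completed
    checkpoints, oldest first.

    `history` is assumed to be the "history" list from the checkpoint
    statistics JSON; the field names are assumptions to verify.
    """
    rows = [
        (c["id"], c["state_size"], c["end_to_end_duration"])
        for c in history
        if c.get("status") == "COMPLETED"
    ]
    return sorted(rows)  # sorts by checkpoint id, i.e. oldest first


def growth_factor(rows):
    """Ratio of newest to oldest state size, or None if not computable."""
    if len(rows) < 2 or rows[0][1] == 0:
        return None
    return rows[-1][1] / rows[0][1]


# Made-up sample data, for illustration only.
sample_history = [
    {"id": 3, "status": "COMPLETED", "state_size": 400, "end_to_end_duration": 90},
    {"id": 1, "status": "COMPLETED", "state_size": 100, "end_to_end_duration": 20},
    {"id": 2, "status": "FAILED", "state_size": 0, "end_to_end_duration": 600},
]
rows = completed_checkpoints(sample_history)
factor = growth_factor(rows)
```

A growth factor well above 1 together with rising durations would be consistent with the suspicion that checkpoints time out because state has outgrown the current resources.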
--
Arvid Heise | Senior Java Developer
Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng