I'm interested in instrumenting an Apache Flink application so that we can monitor exceptions. What are the best practices here? Is there a good way to observe all exceptions inside a Flink application, including those thrown by Flink internals?
We are currently thinking of using Bugsnag, which documents steps for integrating with Java applications: https://docs.bugsnag.com/platforms/java/other/. This works fine for uncaught exceptions in the job manager / pipeline driver context, but it doesn't catch anything outside of that.
We're also interested in reporting exceptions that occur in the job execution context, e.g. in task managers.
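To illustrate the gap, here is a minimal sketch of what a driver-side hook amounts to, assuming the integration relies on the JVM's default uncaught-exception handler. `Reporter` and `InMemoryReporter` are hypothetical stand-ins for a real reporting client, not Bugsnag's API. A handler like this is registered per JVM and only sees exceptions that nothing else catches, which would explain why failures inside task managers (separate JVMs) never reach it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class DriverExceptionHook {
    // Hypothetical stand-in for an error-reporting client (not a real Bugsnag type).
    public interface Reporter {
        void report(Throwable error);
    }

    // Test-friendly reporter that just remembers what it was given.
    public static final class InMemoryReporter implements Reporter {
        public final List<Throwable> seen = new CopyOnWriteArrayList<>();

        @Override
        public void report(Throwable error) {
            seen.add(error);
        }
    }

    // Registers a default handler for uncaught exceptions in THIS JVM only.
    // Other JVMs (for example task manager processes) are unaffected.
    public static void install(Reporter reporter) {
        Thread.setDefaultUncaughtExceptionHandler(
                (thread, error) -> reporter.report(error));
    }

    public static void main(String[] args) throws InterruptedException {
        InMemoryReporter reporter = new InMemoryReporter();
        install(reporter);

        // Simulate an uncaught failure on a thread in the driver JVM.
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated driver failure");
        });
        worker.start();
        worker.join();

        System.out.println("reported exceptions: " + reporter.seen.size());
    }
}
```

In a real setup, `install` would be called once at the start of the job's `main` method, with the reporter delegating to the actual client.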
Any tips/suggestions? I'd love to learn more about exception tracking and handling in Flink :)