sanity check in production


burgesschen
Hello everyone,

Our team has run into an issue: testing a new deployment of a Flink job is difficult, as explained below.


Goal:
When we deploy a new version of a Flink job in production, we want the job to process some test messages and then verify the output to make sure that the job is running correctly (a sanity check).

Problem:
The test messages interfere with the watermark of the Flink job, potentially causing it to drop real messages.
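
To make the problem concrete, here is a minimal sketch in plain Python (not Flink code). It assumes a simplified model: the watermark is the largest timestamp seen so far minus a fixed out-of-orderness bound, and any message at or behind the watermark is dropped as late. The names `Event`, `late_events`, and the timestamps are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: int          # event time in milliseconds (illustrative values)
    is_test: bool = False   # hypothetical flag marking sanity-check messages


def late_events(events, max_out_of_orderness=1000):
    """Return the events this simplified model would drop as late."""
    max_ts = float("-inf")
    dropped = []
    for e in events:
        # Watermark trails the largest timestamp seen so far.
        watermark = max_ts - max_out_of_orderness
        if e.timestamp <= watermark:
            dropped.append(e)
            continue
        max_ts = max(max_ts, e.timestamp)
    return dropped


real_only = [Event(10_000), Event(11_000), Event(12_000)]
# Same stream, but a test message with a far-ahead timestamp is injected.
with_test = [Event(10_000), Event(60_000, is_test=True), Event(11_000), Event(12_000)]

dropped_real_only = late_events(real_only)
dropped_with_test = late_events(with_test)
```

In this model nothing is dropped from `real_only`, while in `with_test` the test message pushes the watermark ahead and both real messages that follow it are dropped.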

Possible solutions:
1. have a separate watermark for the test messages
  (does not appear to be supported by the current framework)

2. run a separate Flink job (same code) in production for sanity check before actual deployment
  (high operational costs)

3. cancel the running production job with a savepoint, start a new job from that savepoint, run the sanity check (which messes up the watermark of the new job), kill the new job, then do the actual deployment from the same savepoint.
  (high operational costs)
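
As a rough illustration of what option 1 is meant to achieve, the sketch below (plain Python again, not Flink code) uses the same simplified watermark model but lets only real messages advance the watermark; test messages are passed through and never count towards it. The function name, the `(timestamp, is_test)` tuple layout, and the values are assumptions for illustration only, and whether this can be expressed in the framework is exactly the open question.

```python
def late_events_ignoring_tests(events, max_out_of_orderness=1000):
    """Return the real events dropped as late when test events do not
    contribute to the watermark. `events` is a list of (timestamp, is_test)."""
    max_real_ts = float("-inf")
    dropped = []
    for ts, is_test in events:
        if is_test:
            # Test messages neither advance the watermark nor get dropped.
            continue
        watermark = max_real_ts - max_out_of_orderness
        if ts <= watermark:
            dropped.append((ts, is_test))
            continue
        max_real_ts = max(max_real_ts, ts)
    return dropped


stream = [(10_000, False), (60_000, True), (11_000, False), (12_000, False)]
dropped_with_separate_watermark = late_events_ignoring_tests(stream)

# A genuinely late real message is still dropped as before.
dropped_late = late_events_ignoring_tests(stream + [(5_000, False)])
```

Here the injected test message at 60,000 no longer causes the real messages at 11,000 and 12,000 to be dropped, while a real message that is actually late (5,000) is still discarded.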

Any idea is appreciated, thanks!