(DEPRECATED) Apache Flink User Mailing List archive.

Flink Job and Watermarking

Classic

List

Threaded

2 messages Options

Kaustubh Rudrawar

Flink Job and Watermarking

Hi,

I'm writing a job that wants to make an HTTP request once a watermark has reached all tasks of an operator. It would be great if this could be determined from outside the Flink job, but I don't think it's possible to access watermark information for the job as a whole. Below is a workaround I've come up with:

Read messages from Kafka using the provided KafkaSource. Event time will be defined as a timestamp within the message.
Key the stream based on an id from the message.
DedupOperator that dedupes messages. This operator will run with a parallelism of N.
An operator that persists the messages to S3. It doesn't need to output anything - it should ideally be a Sink (if it were a sink we could use the StreamingFileSink).
Implement an operator that will make an HTTP request once processWatermark is called for time T. A parallelism of 1 will be used for this operator as it will do very little work. Because it has a parallelism of 1, the operator in step 4 cannot send anything to it as it could become a throughput bottleneck.

Does this implementation seem like a valid workaround? Any other alternatives I should consider?

Thanks for your help,

Kaustubh

Chesnay Schepler

Re: Flink Job and Watermarking

Have you considered using the metric system to access the current watermarks for each operator? (see https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#io)

On 08.02.2019 03:19, Kaustubh Rudrawar wrote:

Hi,

I'm writing a job that wants to make an HTTP request once a watermark has reached all tasks of an operator. It would be great if this could be determined from outside the Flink job, but I don't think it's possible to access watermark information for the job as a whole. Below is a workaround I've come up with:

Read messages from Kafka using the provided KafkaSource. Event time will be defined as a timestamp within the message.

Key the stream based on an id from the message.

DedupOperator that dedupes messages. This operator will run with a parallelism of N.

An operator that persists the messages to S3. It doesn't need to output anything - it should ideally be a Sink (if it were a sink we could use the StreamingFileSink).

Implement an operator that will make an HTTP request once processWatermark is called for time T. A parallelism of 1 will be used for this operator as it will do very little work. Because it has a parallelism of 1, the operator in step 4 cannot send anything to it as it could become a throughput bottleneck.

Does this implementation seem like a valid workaround? Any other alternatives I should consider?

Thanks for your help,

Kaustubh