
Re: Cannot see all events in window apply() for big input

Posted by Till Rohrmann on Nov 08, 2016; 2:46pm
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Cannot-see-all-events-in-window-apply-for-big-input-tp9945p9986.html

Hi Sendoh,

Flink should never lose data unless an element is so late that it arrives after the allowed lateness has expired. This holds regardless of the total data size.

The watermarks are indeed global and not bound to a specific input element or a group. For example, suppose you create the watermarks from the timestamp information of your events and you have the following input event sequence: (eventA, 01-01), (eventB, 02-01), (eventC, 01-02). You would then generate the watermark W(02-01) after the second event. The third event has a timestamp behind that watermark, so it is a late element, and if it exceeds the allowed lateness it will be discarded.
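The example above can be traced with a small simulation. This is only a sketch in plain Python, not Flink code: the function name `classify`, the status labels, and the year 2016 are assumptions made for illustration, and the dates are read as month-day. It also applies the lateness check per element, whereas Flink applies it per window, so treat it as a simplified model of the rule described above.

```python
from datetime import date, timedelta

def classify(events, allowed_lateness=timedelta(0)):
    """Simplified model: one global watermark that follows the largest
    timestamp seen so far, shared by all event types."""
    watermark = date.min
    result = []
    for name, ts in events:
        if ts >= watermark:
            # In order: the element advances the global watermark.
            watermark = ts
            status = "on_time"
        elif ts + allowed_lateness >= watermark:
            # Behind the watermark, but still within the allowed lateness.
            status = "late"
        else:
            # Behind the watermark by more than the allowed lateness.
            status = "discarded"
        result.append((name, status))
    return result

# The sequence from the example, with a hypothetical year added.
events = [
    ("eventA", date(2016, 1, 1)),
    ("eventB", date(2016, 2, 1)),
    ("eventC", date(2016, 1, 2)),
]

for name, status in classify(events):
    print(name, status)
```

With no allowed lateness, eventC is discarded because the watermark has already moved to 02-01. Passing a larger `allowed_lateness` (for instance `timedelta(days=31)`) keeps it as a late element instead.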

If you generate the watermarks from a timestamp field of the events, you have to make sure that the events in your queue have monotonically increasing timestamps.
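One way to check this on a sample of the queue is to measure how far any element falls behind the largest timestamp seen before it. The sketch below is plain Python, not a Flink API; `max_out_of_orderness` is a hypothetical helper, and the sample dates are made up in the shape of the data from the question (with an assumed year of 2016).

```python
from datetime import date, timedelta

def max_out_of_orderness(timestamps):
    """Return the largest gap between a timestamp and the maximum
    timestamp seen before it. A zero gap means the input never goes
    backwards in time."""
    running_max = None
    worst = timedelta(0)
    for ts in timestamps:
        if running_max is None or ts >= running_max:
            running_max = ts
        else:
            # This element arrived after a newer one: record the lag.
            worst = max(worst, running_max - ts)
    return worst

# Hypothetical sample: eventB dates first, then older eventA dates.
sample = [
    date(2016, 10, 29),
    date(2016, 11, 2),
    date(2016, 11, 3),
    date(2016, 11, 4),
    date(2016, 10, 29),
    date(2016, 10, 30),
    date(2016, 11, 4),
]

lag = max_out_of_orderness(sample)
print(lag.days)
```

A non-zero result means that watermarks taken directly from the event timestamps will mark the older elements as late; the measured lag gives a rough idea of how much lateness the job would have to tolerate for that sample.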

Cheers,
Till

On Tue, Nov 8, 2016 at 3:37 PM, Sendoh <[hidden email]> wrote:
Hi,

Could the issue be that the events are too far out of order and the watermark is
global?

We want to count events per event type per day, and the data looks like:

eventA, 10-29-XX
eventB,, 11-02-XX
eventB,, 11-02-XX
eventB,, 11-03-XX
eventB,, 11-04-XX
....
....
eventA, 10-29-XX
eventA, 10-30-XX
eventA, 10-30-XX
.
.
.
eventA, 11-04-XX


eventA is much, much larger than eventB,
and it looks like we lost the counts of eventA for 10-29 and 10-30, while we
have the count of eventA for 11-04-XX.
Could the problem be that the watermark is global rather than per event?

Best,

Sendoh



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Cannot-see-all-events-in-window-apply-for-big-input-tp9945p9985.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.