Flink the right tool for the job ? Huge Data window lateness


Flink the right tool for the job ? Huge Data window lateness

Patrick Brunmayr
Hello

I've taken my first steps with Flink and I am very impressed by its capabilities. Thank you for that :) I want to use it for a project we are currently working on, but after reading some documentation
I am not sure if it's the right tool for the job. We have an IoT application in which we monitor machines in production plants. The machines have sensors attached and currently send
their data to a broker (Kafka, Azure IoT Hub) on a per-minute basis.

The following requirements must be fulfilled:

  • Lateness

    We have to allow a lateness of 7 days because machines can have downtime due to network issues, maintenance or other reasons. In that case the data is buffered locally on the machine, and once it
    is online again all data is sent to the broker. This can result in some really heavy bursts.


  • Out of order

    Events arrive out of order because of these lateness issues.


  • Last write wins

    Machines are not stateful and cannot guarantee exactly-once delivery of their data. It can happen that events are sent twice. In that case the last event wins and should override the previous one.
    Events are uniquely identified by a sensor_id and a timestamp.

  • Computations per minute

    We cannot wait until the window ends and have to do computations on a per-minute basis, for example aggregating data per sensor and writing it to a db.

My biggest concern is the huge lateness. Keeping data for 7 days would result in 10,080 data points for just one sensor! Multiplying that by 10,000 sensors results in 100,800,000 data points which Flink
would have to hold in its state. The number of sensors is constantly growing, and so will the number of data points.

So my questions are:

  • Is Flink the right tool for the job?

  • Is that lateness an issue?

  • How can I implement the last-write-wins behaviour?

  • How can I tune Flink to handle the growing load of sensors and data points?

  • What are the hardware requirements, storage and memory size?


I don't want to maintain two code bases for batch and streaming because the operations are all the same; the only difference is the time range! That's the reason I wanted to do all this with Flink Streaming.

Hope you can guide me in the right direction.

Thx








Re: Flink the right tool for the job ? Huge Data window lateness

Tzu-Li (Gordon) Tai
Hi Patrick,

Thanks a lot for the feedback on your use case! At first glance, I would say that Flink can definitely solve the issues you are evaluating.

I’ll try to address each point, and point you to some docs / articles that explain things in more detail:

- Lateness

The 7-day lateness shouldn’t be a problem. We definitely recommend
using RocksDB as the state backend for such a use case since, as you
correctly mentioned, the state would be kept for a long time.
The heavy burst when your locally buffered machine data is sent to
Kafka once the machines come back online shouldn’t be a problem either;
since Flink is a pure data streaming engine, it handles backpressure
naturally without any additional mechanisms.
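
For example, a minimal sketch of enabling the RocksDB state backend with the Scala API (the checkpoint URI is made up, and the flink-statebackend-rocksdb dependency is required):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// window/operator state then lives in embedded RocksDB instances on local disk,
// so it is not limited by the JVM heap
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))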

- Out of Order

That’s exactly what event time processing is for :-) As long as an event
comes in before the allowed lateness for its window has passed, it will still fall
into its corresponding event time window. So, even with the heavy burst of
your late machine data, the events will still be aggregated in the correct windows.
You can look into event time in Flink in more detail in the event time docs.
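
A rough sketch of what the event time setup could look like with the Scala DataStream API (the MachineData fields, the sample source and the 5-minute out-of-orderness bound are just illustrative assumptions):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class MachineData(sensorId: String, eventTime: Long, value: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// in practice this would be a Kafka source; fromElements just keeps the sketch self-contained
val sensorStream: DataStream[MachineData] =
  env.fromElements(MachineData("sensor-1", System.currentTimeMillis(), 42.0))

// watermarks are derived from the embedded event time, tolerating some out-of-orderness
val withTimestamps = sensorStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[MachineData](Time.minutes(5)) {
    override def extractTimestamp(d: MachineData): Long = d.eventTime
  })

// events arriving up to 7 days after the watermark still update their window
val hourly = withTimestamps
  .keyBy(_.sensorId)
  .window(TumblingEventTimeWindows.of(Time.hours(1)))
  .allowedLateness(Time.days(7))
  .reduce((a, b) => MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))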

- Last write wins

Your operators that do the aggregations simply need to be able to reprocess
results if they see an event with the same id come in. If results are sent out
of Flink and stored in an external db, and you can design the db writes to be idempotent,
then it’ll effectively be a “last write wins”. It depends mostly on your pipeline and
use case.
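
As a rough sketch of the idempotent-write idea, assuming the DataStax Java driver and a Cassandra table whose primary key is (sensor_id, window_start) (host, keyspace and schema are made up): re-writing a row for the same key is then just a harmless upsert.

import com.datastax.driver.core.{Cluster, Session}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

// receives (sensorId, windowStart, aggregatedValue) tuples from the window operator
class IdempotentCassandraSink extends RichSinkFunction[(String, Long, Double)] {
  @transient private var session: Session = _

  override def open(parameters: Configuration): Unit = {
    session = Cluster.builder().addContactPoint("cassandra-host").build().connect("monitoring")
  }

  override def invoke(value: (String, Long, Double)): Unit = {
    // same (sensor_id, window_start) => same row, so duplicate writes simply overwrite each other
    session.execute(
      "INSERT INTO sensor_aggregates (sensor_id, window_start, agg_value) VALUES (?, ?, ?)",
      value._1, Long.box(value._2), Double.box(value._3))
  }

  override def close(): Unit = if (session != null) session.close()
}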

- Computations per minute

I think you can simply do this by having two separate window operators: one that works
on your longer window, and another on a per-minute basis.
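
A rough sketch, reusing the withTimestamps stream and MachineData type from the event time example above (the aggregation logic is only illustrative):

// fast feedback: per-minute aggregates per sensor
val perMinute = withTimestamps
  .keyBy(_.sensorId)
  .window(TumblingEventTimeWindows.of(Time.minutes(1)))
  .allowedLateness(Time.days(7))
  .reduce((a, b) => MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))

// long-range view: the same stream aggregated over a full day
val perDay = withTimestamps
  .keyBy(_.sensorId)
  .window(TumblingEventTimeWindows.of(Time.days(1)))
  .allowedLateness(Time.days(7))
  .reduce((a, b) => MachineData(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))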

Hope this helps!

- Gordon



Re: Flink the right tool for the job ? Huge Data window lateness

rmetzger0
Hi,
sounds like a cool project.

What's the size of one data point?
If one data point is 2 KB, you'll have 100,800,000 * 2048 bytes ≈ 206 gigabytes of state. That's something one or two machines (depending on the disk throughput) should be able to handle.

If possible, I would recommend doing an experiment with a prototype to see how many machines you need for your workload.
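
Just to make the back-of-the-envelope calculation explicit (the 2 KB per point is an assumption):

// per-point size * points per sensor (one per minute for 7 days) * number of sensors
val bytesPerPoint = 2 * 1024L
val pointsPerSensor = 7 * 24 * 60L        // 10,080
val sensors = 10000L
val stateBytes = bytesPerPoint * pointsPerSensor * sensors
println(s"~${stateBytes / 1e9} GB of state")   // ≈ 206 GB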


Re: Flink the right tool for the job ? Huge Data window lateness

Patrick Brunmayr
Hi

Yes it is, and it would be nice to handle this with Flink :)

- Size of data point

A data point is basically just a simple case class with two fields on a 64-bit OS:

case class MachineData(sensorId: String, eventTime: Long)

- Last write wins

We have Cassandra as our data warehouse, but I was hoping I could solve that issue at the state level rather than at the db level. The reason being that someone could send me the same events
over and over again, which would cause the state to blow up until we run out of memory. Secondly, when doing aggregations per sensor the results will be wrong due to multiple events with the same
timestamp.

thx






Re: Flink the right tool for the job ? Huge Data window lateness

Timur Shenkao
Hi,

100 million rows is a small load, especially for one week.
I suspect that your load would be quite evenly distributed over the day, since it comes from plants, not humans.
If you are looking for reliability, run at least 2 Kafka servers where each topic has 6 partitions, and a separate Hadoop cluster for Flink.

As for duplicate messages, that's not a problem of Flink or Cassandra. It's a logical problem, i.e. it's up to you how to achieve exactly-once semantics.
I advise you to use some persistent storage anyway for reliability and failover.

Sincerely yours, Timur Shenkao


Re: Flink the right tool for the job ? Huge Data window lateness

Aljoscha Krettek
Hi,
just to throw in my 2 cents: if your window operations don't require that all individual elements are kept, you can greatly reduce your state size by using a ReduceFunction on your window. With this, the state size essentially becomes <per-item-size> * <num keys> * <num windows>.
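
A minimal sketch of that, assuming a (sensorId, eventTime, value) reading type with timestamps and watermarks already assigned (the sum aggregate is only illustrative):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(sensorId: String, eventTime: Long, value: Double)

def aggregate(readings: DataStream[Reading]): DataStream[Reading] =
  readings
    .keyBy(_.sensorId)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .allowedLateness(Time.days(7))
    // the window state holds a single pre-aggregated Reading per sensor and window,
    // not the full list of raw events
    .reduce((a, b) => Reading(a.sensorId, math.max(a.eventTime, b.eventTime), a.value + b.value))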

Best,
Aljoscha


Re: Flink the right tool for the job ? Huge Data window lateness

Stephan Ewen
Also FYI: Current work includes incremental checkpointing so that large state checkpoints require less bandwidth and storage.

