using updating shared data


using updating shared data

avilevi
Hi,
I have a list (a couple of thousand text lines) that I need to use in my map function. I read this article about broadcast variables and the distributed cache; however, I need to update this list from time to time, and if I understood correctly that is not possible with broadcast or the cache without restarting the job. Is there an idiomatic way to achieve this? A DB seems like overkill for this, and I want to keep IO/network calls as cheap as possible.

Cheers 
Avi


Re: using updating shared data

miki haiat
I'm trying to understand your use case.
What is the source of the data? FS, Kafka, or something else?




Re: using updating shared data

Till Rohrmann
Hi Avi,

you could use Flink's broadcast state pattern [1]. You would need to use the DataStream API, but it lets you have two streams (an input stream and a control stream), where the control stream is broadcast to all subtasks. By ingesting messages into the control stream you can send model updates to all subtasks.
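
A minimal sketch of how this could look (class, method, and state names here are just illustrative, and the element types are assumed to be Strings):

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class SharedListJob {

    // Broadcast state holding the shared list entries (used as a set here).
    static final MapStateDescriptor<String, Boolean> LIST_STATE =
            new MapStateDescriptor<>("shared-list",
                    BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.BOOLEAN_TYPE_INFO);

    public static DataStream<String> filterAgainstList(DataStream<String> input,
                                                       DataStream<String> controlUpdates) {
        // The control stream is broadcast to all subtasks of the downstream operator.
        BroadcastStream<String> control = controlUpdates.broadcast(LIST_STATE);

        return input
                .connect(control)
                .process(new BroadcastProcessFunction<String, String, String>() {

                    @Override
                    public void processElement(String value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Read-only lookup against the broadcast state on the data side.
                        if (ctx.getBroadcastState(LIST_STATE).contains(value)) {
                            out.collect(value);
                        }
                    }

                    @Override
                    public void processBroadcastElement(String update, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Every subtask sees the control message and updates its copy of the list.
                        ctx.getBroadcastState(LIST_STATE).put(update, true);
                    }
                });
    }
}

Deletions could be handled the same way, e.g. by calling remove() on the broadcast state for the corresponding key when a delete message arrives on the control stream.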


Cheers,
Till



Re: using updating shared data

avilevi
Thanks Till, I will definitely check it out. Just to make sure I got you correctly: you are suggesting that the list I want to share will be broadcast via the control stream and then kept in the relevant operator state, correct? And updates (CRUD) on that list will be performed via the control stream, correct?
BR
Avi



Re: using updating shared data

Till Rohrmann
Yes exactly Avi.

Cheers,
Till



Re: using updating shared data

Elias Levy
One thing you must be careful of, if you are using event time processing and the control stream only receives messages sporadically, is that event time will stop moving forward in the operator joining the streams while the control stream is idle. You can get around this by using a periodic watermark extractor on the control stream that bounds the event-time delay to processing time, or by defining your own low-level operator that ignores watermarks from the control stream.
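
For the first option, a sketch of such a periodic extractor could look like the following (the 5 second bound is just an assumed value):

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Sketch: a periodic assigner for the control stream whose watermark tracks
// processing time, so an idle control stream keeps advancing event time.
public class BoundedByProcessingTimeAssigner<T> implements AssignerWithPeriodicWatermarks<T> {

    private static final long MAX_DELAY_MS = 5_000; // assumed bound, tune to your needs

    @Override
    public long extractTimestamp(T element, long previousElementTimestamp) {
        // Control messages get a processing-time timestamp here.
        return System.currentTimeMillis();
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Called at the auto watermark interval, even when no control messages
        // arrive, so the watermark never stalls.
        return new Watermark(System.currentTimeMillis() - MAX_DELAY_MS);
    }
}

It would be applied to the control stream with controlStream.assignTimestampsAndWatermarks(new BoundedByProcessingTimeAssigner<>()).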



Re: using updating shared data

avilevi
Thanks for the tip Elias! 



Re: using updating shared data

David Anderson
Another solution to the watermarking issue is to write an AssignerWithPeriodicWatermarks for the control stream that always returns Watermark.MAX_WATERMARK as the current watermark. This produces watermarks for the control stream that will effectively be ignored. 
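
A sketch of what that could look like (the class name is made up):

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Sketch: the control stream's watermark is always MAX_WATERMARK, so when the
// joining operator takes the minimum watermark across its inputs, only the
// main stream's watermark counts.
public class IgnoreControlStreamWatermarks<T> implements AssignerWithPeriodicWatermarks<T> {

    @Override
    public long extractTimestamp(T element, long previousElementTimestamp) {
        return previousElementTimestamp; // keep whatever timestamp the record already carries
    }

    @Override
    public Watermark getCurrentWatermark() {
        return Watermark.MAX_WATERMARK;
    }
}

// controlStream.assignTimestampsAndWatermarks(new IgnoreControlStreamWatermarks<>());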



Re: using updating shared data

avilevi
Sounds like a good idea, because in the control stream the time doesn't really matter. Thanks!!!



Re: using updating shared data

Elias Levy
That is not fully correct. While in practice it may not matter, ignoring the timestamps of control messages may result in non-deterministic behavior, because after a restart the control messages may be processed in a different order relative to the other stream. So the output of multiple runs may differ depending on the relative order in which the messages in the streams are processed.
