Sharing Java Collections within Flink Cluster


Chakravarthy varaga
Hi Team,

    I'm working on a Flink streaming application. The data is ingested through Kafka connectors; the payload volume is roughly 100K events/sec, and each event payload is a string. Let's call this DataStream1.
This application also uses a second DataStream, call it DataStream2, which consumes events off another Kafka topic. The elements of DataStream2 go through a transformation that ultimately updates a HashMap (a java.util collection). The Flink application needs to share this HashMap across the cluster so that the DataStream1 pipeline can check the state of the values in this collection. Is there a way to do this in Flink?

    I don't see any shared collection available within the cluster.

Best Regards
CVP

Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
Hi Team,

     Can someone help me here? I'd appreciate any response!

Best Regards
Varaga


Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
Hi,

Flink does not provide shared state.
However, you can broadcast a stream into a CoFlatMapFunction, so that each parallel operator instance has its own local copy of the state.

If that does not work for you because the state is too large and if it is possible to partition the state (and both streams), you can also use keyBy instead of broadcast.

Finally, you can use an external system such as a key-value store or an in-memory store like Apache Ignite to hold your distributed collection.
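
[Editor's note: a minimal sketch of the broadcast variant described above — an illustration under assumptions, not code from this thread. It assumes the update stream has already been mapped to key-value pairs, and the match/update logic is purely hypothetical.]

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

DataStream<String> ds1 = ...;                  // the high-volume event stream
DataStream<Tuple2<String, String>> ds2 = ...;  // the update stream, as key-value pairs

ds1.connect(ds2.broadcast())
   .flatMap(new CoFlatMapFunction<String, Tuple2<String, String>, String>() {

       // every parallel instance keeps a full local copy of the collection
       private final Map<String, String> localCopy = new HashMap<>();

       @Override
       public void flatMap1(String event, Collector<String> out) throws Exception {
           // check DS1 elements against the local copy
           if (localCopy.containsKey(event)) {
               out.collect(event);
           }
       }

       @Override
       public void flatMap2(Tuple2<String, String> update, Collector<String> out) throws Exception {
           // apply the broadcasted DS2 update to the local copy
           localCopy.put(update.f0, update.f1);
       }
   });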

Best, Fabian


Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
Hi Fabian,

    Thanks for your response. These DataStreams (Job1/DataStream1 and Job2/DataStream2) belong to different Flink applications running within the same cluster.
    DataStream2 (from Job2) applies transformations and updates a 'cache' that Job1 needs to work with.
    Our intention is to avoid an external key/value store, as we are trying to keep the cache local to the cluster.
    Is there a way?

Best Regards
CVP


Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
Is writing DataStream2 to a Kafka topic and reading it from the other job an option?


Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
Certainly, that's what I thought as well...
The output of DataStream2 could be in the thousands, and there are state updates...
Reading this topic from the other job, Job1, is fine.
However, assuming that we maintain this state in a collection and update it (by reading from the topic), will this collection be replicated across the cluster within Job1?




Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
Operator state is always local in Flink. However, with key-value state you can have something that behaves somewhat like a distributed hashmap, because each operator holds a different shard/partition of the hashtable.

If you have to do only a single key lookup for each element of DS1, you should think about partitioning both streams (keyBy) and writing the state into Flink's key-value state [1].

This will have several benefits:
1) State does not need to be replicated
2) Depending on the backend (RocksDB) [2], parts of the state can reside on disk. You are not bound to the memory of the JVM.
3) Flink takes care of the look-up. No need to have your own hashmap.
4) Rescaling jobs will only be possible for key-value state (this feature is currently under development).

If using the key-value state is possible, I'd go for that.
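
[Editor's note: a minimal sketch of the key-value state idea — illustrative only; the Tuple2<String, Long> element type and the summing logic are assumptions. Each parallel instance transparently sees only the state of the keys in its own shard.]

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    // one value per key; Flink shards this state across the parallel instances
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) throws Exception {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "sum", TypeInformation.of(Long.class), 0L));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long updated = sum.value() + in.f1;   // reads the state of the current key only
        sum.update(updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}

// usage: stream.keyBy(0).flatMap(new RunningSum());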

Best, Fabian


Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
I'm understanding this better with your explanation.
In this use case, each element in DS1 has to look up against a 'bunch of keys' from DS2, and DS2 could shrink/expand in terms of the number of keys... will the key-value shard work in this case?


Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
That depends.
1) Growing/shrinking: This should work. New entries can always be inserted. In order to remove entries from the k-v state, you have to set the value to null; note that you need an explicit delete-value record to trigger the eviction.
2) Multiple lookups: This only works if all lookups are independent of each other. You can partition DS1 on only a single key, and the other keys might be located on different shards. A workaround might be to duplicate DS1 events, one copy per key they need to look up (see the sketch below). However, you might then need to collect the results for the same DS1 event after the join. If that does not work for you, the only thing that comes to my mind is to broadcast the state and keep a full local copy in each operator.
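
[Editor's note: a hedged sketch of the fan-out workaround in 2); the extractLookupKeys() helper is hypothetical, standing in for whatever parses the lookup keys out of an event.]

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

DataStream<Tuple2<String, String>> fannedOut = ds1
    .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
        @Override
        public void flatMap(String event, Collector<Tuple2<String, String>> out) throws Exception {
            // emit one copy of the event per key it has to look up
            for (String key : extractLookupKeys(event)) {
                out.collect(Tuple2.of(key, event));
            }
        }
    });
// each copy can now be keyBy'd to the shard that holds its key:
// fannedOut.keyBy(0)...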

Let me add one more thing regarding the upcoming rescaling feature. If this is interesting for you, rescaling will also work (maybe not in the first version) for broadcasted state, i.e. state which is the same on all parallel operator instances.


Re: Sharing Java Collections within Flink Cluster

pushpendra.jaiswal
In reply to this post by Fabian Hueske-2
Hi Fabian,
I am also looking for this solution. Could you help me with two things:

1. How is this different from queryable state?

2. How can I query this key-value state from DS2, even if it's running in the same application?

e.g.

val keyedStream = stream.keyBy(_.key)
val otherStream = somekafka.createStream

The final goal is to have something like:

otherStream.foreach(kafkamessage => keyedStream.lookup(kafkamessage.key))

~Pushpendra Jaiswal


Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
Hi Pushpendra,

1. Queryable state is an upcoming feature that is not part of an official release yet. With queryable state, you can query operator state from outside the application.

2. Have you had a look at the CoFlatMap operator? It "connects" two streams and lets you keep state that is accessible from both streams.

Best, Fabian


Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
In reply to this post by Fabian Hueske-2
Hi Fabian,
   
     First of all, thanks for all your prompt responses. With regards to 2) multiple lookups, I have to clarify what I mean by that...

     DataStream<String> elementKeyStream = stream1.map(...); // maps each streaming element to a string value
     DataStream<T> ds2 = stream2.xxx(); // stream2 is a Kafka source stream, as you proposed; xxx() is my function that splits the string and generates key1:<value1>, key2:<value2>, key3:<value3>, ..., keyN:<valueN>

     Now,
        I wish to match elementKeyStream via lookups within (key1, key2, ..., keyN), where key1, key2, ..., keyN and their respective values should be available across the cluster...

Thanks a million !
CVP





Re: Sharing Java Collections within Flink Cluster

Fabian Hueske-2
Not sure if I got your requirements right, but would this work?

KeyedStream<String, String> ks1 = ds1.keyBy("*");
KeyedStream<Tuple2<String, V>, String> ks2 = ds2.flatMap(split T into k-v pairs).keyBy(0);

ks1.connect(ks2).flatMap(X)

X is a CoFlatMapFunction that inserts and removes elements from ks2 into a key-value state member. Elements from ks1 are matched against that state.
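
[Editor's note: to flesh out X, a hedged, self-contained sketch — assuming V = String and treating a null value from ks2 as a removal; the class and state names are illustrative, not from the thread.]

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class LookupJoin extends RichCoFlatMapFunction<String, Tuple2<String, String>, String> {

    // key-value state; each parallel instance holds only its shard of the keys
    private transient ValueState<String> kvState;

    @Override
    public void open(final Configuration parameters) throws Exception {
        kvState = getRuntimeContext().getState(new ValueStateDescriptor<>(
                "ds2-values", TypeInformation.of(String.class), null));
    }

    @Override
    public void flatMap1(final String element, final Collector<String> out) throws Exception {
        // ks1 side: match the element against the state stored for its key
        final String value = kvState.value();
        if (value != null) {
            out.collect(element + " matched " + value);
        }
    }

    @Override
    public void flatMap2(final Tuple2<String, String> kv, final Collector<String> out) throws Exception {
        // ks2 side: insert/update, or evict by writing null (see point 1 above)
        kvState.update(kv.f1);
    }
}

Because both inputs are keyed on the same String key, kvState.value() in flatMap1 automatically resolves to whatever ks2 last wrote for that key.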

Best, Fabian


Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
Hi Fabian,

    I'm coding to check whether your proposal works, and I've hit an issue with a ClassCastException.
   

    // Here is my value that holds the state information... an implementation of
    // ValueState, where the key is a Double value, on the connected stream ks2

    public class BlockedEventState implements ValueState<BlockedRoadInfo> {

        public BlockedRoadInfo blockedRoad;

        @Override
        public void clear() {
            blockedRoad = null;
        }

        @Override
        public BlockedRoadInfo value() throws IOException {
            return blockedRoad;
        }

        @Override
        public void update(final BlockedRoadInfo value) throws IOException {
            blockedRoad = value;
        }
    }

    // the BlockedRoadInfo class...
    public class BlockedRoadInfo {
        long maxLink;
        long minLink;
        double blockedEventId;
        // ... setters and getters
    }

    // inside a new RichCoFlatMapFunction() {

    private transient BlockedEventState blockedRoads;
    // ............

    @Override
    public void open(final org.apache.flink.configuration.Configuration parameters) throws Exception {
        final ValueStateDescriptor<BlockedRoadInfo> blockedStateDesc = new ValueStateDescriptor<BlockedRoadInfo>(
                "BlockedEventStates", TypeInformation.of(BlockedRoadInfo.class), null);
        blockedRoads = (BlockedEventState) getRuntimeContext().getState(blockedStateDesc);  // FAILS HERE WITH CLASSCAST
    }


Caused by: java.lang.ClassCastException: org.apache.flink.runtime.state.filesystem.FsValueState cannot be cast to com.ericsson.components.aia.iot.volvo.state.BlockedEventState

I have tried setting the state backend to both the memory state backend and the FsStateBackend...

streamEnv.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));




Re: Sharing Java Collections within Flink Cluster

Aljoscha Krettek
Hi,
you don't need the BlockedEventState class: getRuntimeContext().getState() returns the state backend's own ValueState implementation (the FsValueState in your stack trace), which is why the cast fails. You should be able to just do this:

    private transient ValueState<BlockedRoadInfo> blockedRoads;
    // ............

    @Override
    public void open(final org.apache.flink.configuration.Configuration parameters) throws Exception {
        final ValueStateDescriptor<BlockedRoadInfo> blockedStateDesc = new ValueStateDescriptor<BlockedRoadInfo>(
                "BlockedEventStates", TypeInformation.of(BlockedRoadInfo.class), null);
        blockedRoads = getRuntimeContext().getState(blockedStateDesc);
    }

Cheers,
Aljoscha



Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
Hi Aljoscha & Fabian,

    Finally, I got this working. Thanks for your help. In terms of persisting the state (for DS2), I tried checkpointing every 10 seconds using a FsStateBackend... What I notice is that the checkpoint duration is almost 2 minutes in many cases, while in other cases it frequently varies between 100 ms and 1.5 minutes.
 
    The pseudocode is as below:

     KeyedStream<String, String> ks1 = ds1.keyBy("*");
     KeyedStream<Tuple2<String, V>, String> ks2 = ds2.flatMap(split T into k-v pairs).keyBy(0);

     ks1.connect(ks2).flatMap(X);
     //X is a CoFlatMapFunction that inserts and removes elements from ks2 into a key-value state member. Elements from ks1 are matched against that state.

     //ks1 streams about 100K events/sec from a Kafka topic
     //ks2 streams about 1 event every 10 minutes... precisely when the 1st event is consumed from this stream, the checkpoint takes 2 minutes straight away.

    The Flink version is 1.1.2.
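
[Editor's note: for reference, a sketch of the checkpoint setup described above, assuming a streamEnv variable and the checkpoint path mentioned earlier in this thread.]

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamEnv.enableCheckpointing(10000);  // checkpoint every 10 seconds
streamEnv.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));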

 Best Regards
CVP

On Tue, Sep 13, 2016 at 7:29 AM, Aljoscha Krettek <[hidden email]> wrote:
Hi,
you don't need the BlockedEventState class, you should be able to just do this:

private transient ValueState<BlockedRoadInfo> blockedRoads;
         ............
      @Override
    public void open(final org.apache.flink.configuration.Configuration parameters) throws Exception {
        final ValueStateDescriptor<BlockedRoadInfo> blockedStateDesc = new ValueStateDescriptor<BlockedRoadInfo>("BlockedEventStates",
                TypeInformation.of(BlockedRoadInfo.class), null);
        blockedRoads = getRuntimeContext().getState(blockedStateDesc);

    };

  }

Cheers,
Aljoscha


On Mon, 12 Sep 2016 at 16:24 Chakravarthy varaga <[hidden email]> wrote:
Hi Fabian,

    I'm coding to check if your proposal works and hit with an issue with ClassCastException.
   

    // Here is my Value that has state information.....an implementation of my value state... where the key is a Double value... on connected stream ks2

    public class BlockedEventState implements ValueState<BlockedRoadInfo> {

    public BlockedRoadInfo blockedRoad;

    @Override
    public void clear() {
        blockedRoad = null;

    }

    @Override
    public BlockedRoadInfo value() throws IOException {
        return blockedRoad;
    }

    @Override
    public void update(final BlockedRoadInfo value) throws IOException {
        blockedRoad = value;
    }


       //BlockedRoadInfo class...
        public class BlockedRoadInfo {
            long maxLink;
            long minLink;
            double blockedEventId;
    ....setters & ... getters
}

// new RichCoFlatMapFunction() {

private transient BlockedEventState blockedRoads;

// ...

@Override
public void open(final org.apache.flink.configuration.Configuration parameters) throws Exception {
    final ValueStateDescriptor<BlockedRoadInfo> blockedStateDesc =
            new ValueStateDescriptor<BlockedRoadInfo>("BlockedEventStates",
                    TypeInformation.of(BlockedRoadInfo.class), null);
    blockedRoads = (BlockedEventState) getRuntimeContext().getState(blockedStateDesc);  // FAILS HERE WITH CLASSCAST
}


Caused by: java.lang.ClassCastException: org.apache.flink.runtime.state.filesystem.FsValueState cannot be cast to com.ericsson.components.aia.iot.volvo.state.BlockedEventState

I have tried setting the state backend to both the memory backend (MemState) and the filesystem backend (FsState)...

streamEnv.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));
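
For reference: the cast fails because getRuntimeContext().getState(...) returns the state backend's own ValueState implementation (FsValueState for the filesystem backend), so switching backends cannot help; the field should be declared with the ValueState interface type, as in Aljoscha's reply above. Setting the backend itself works either way; a sketch, with an illustrative path:

final StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
// Heap-based state, checkpointed to the JobManager's memory:
streamEnv.setStateBackend(new MemoryStateBackend());
// ...or state checkpointed to a filesystem:
streamEnv.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));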



On Thu, Sep 8, 2016 at 10:10 PM, Fabian Hueske <[hidden email]> wrote:
Not sure if I got your requirements right, but would this work?

KeyedStream<String, String> ks1 = ds1.keyBy("*");
KeyedStream<Tuple2<String, V>, String> ks2 = ds2.flatMap(split T into k-v pairs).keyBy(0);

ks1.connect(ks2).flatMap(X);

X is a CoFlatMapFunction that inserts and removes elements from ks2 into a key-value state member. Elements from ks1 are matched against that state.
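
A minimal sketch of what X could look like, assuming String keys and String values purely for illustration (the state name and the null-means-delete convention are not from the thread):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class X extends RichCoFlatMapFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> current;

    @Override
    public void open(final Configuration parameters) throws Exception {
        current = getRuntimeContext().getState(new ValueStateDescriptor<String>(
                "ks2-state", TypeInformation.of(String.class), null));
    }

    // ks1 side: match each element against the state stored for its key.
    @Override
    public void flatMap1(final String element, final Collector<Tuple2<String, String>> out) throws Exception {
        final String value = current.value();
        if (value != null) {
            out.collect(Tuple2.of(element, value));
        }
    }

    // ks2 side: insert, overwrite, or remove the state entry for this key.
    @Override
    public void flatMap2(final Tuple2<String, String> update, final Collector<Tuple2<String, String>> out) throws Exception {
        if (update.f1 == null) {
            current.clear();            // explicit delete-value record removes the entry
        } else {
            current.update(update.f1);  // insert or overwrite
        }
    }
}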

Best, Fabian

2016-09-08 20:24 GMT+02:00 Chakravarthy varaga <[hidden email]>:
Hi Fabian,
   
     First of all, thanks for all your prompt responses. With regard to 2) multiple lookups, I have to clarify what I mean by that...

     DS1<String> elementKeyStream = stream1.map(...); // maps each streaming element to a string value
     DS2<T> = stream2.xxx(); // stream2 is a Kafka source stream, as you proposed; xxx() should be my function that splits the string and generates key1:<value1>, key2:<value2>, key3:<value3>, ..., keyN:<valueN>

     Now, I wish to map elementKeyStream with lookups within (key1, key2, ..., keyN), where key1, key2, ..., keyN and their respective values should be available across the cluster...
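
A sketch of such a splitting function, assuming, purely for illustration, that each DS2 record is a comma-separated list of key:value pairs:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hypothetical record format: "key1:value1,key2:value2,...,keyN:valueN"
public class SplitIntoKeyValues implements FlatMapFunction<String, Tuple2<String, String>> {
    @Override
    public void flatMap(final String record, final Collector<Tuple2<String, String>> out) {
        for (final String pair : record.split(",")) {
            final String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                out.collect(Tuple2.of(kv[0], kv[1]));
            }
        }
    }
}

ds2.flatMap(new SplitIntoKeyValues()).keyBy(0) then yields the keyed stream that the earlier pseudocode calls ks2.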

Thanks a million!
CVP




On Wed, Sep 7, 2016 at 9:15 PM, Fabian Hueske <[hidden email]> wrote:
That depends.
1) Growing/shrinking: this should work. New entries can always be inserted. In order to remove entries from the k-v state you have to set the value to null; note that you need an explicit delete-value record to trigger the eviction (see the sketch after this list).
2) Multiple lookups: this only works if all lookups are independent of each other. You can partition DS1 on only a single key, and the other keys might be located on different shards. A workaround might be to duplicate S1 events for each key they need to look up. However, you might then need to collect the results of the same S1 event after the join. If that does not work for you, the only thing that comes to mind is to broadcast the state and keep a full local copy in each operator.
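
A sketch of such an explicit delete-value record, assuming (illustratively) that a null value marks deletion:

// DS2 side: emit an explicit delete-value record for a key that should be evicted.
out.collect(Tuple2.of(key, (String) null));

// CoFlatMapFunction, ks2 side: a null value evicts this key's entry from the k-v state.
if (update.f1 == null) {
    current.update(null);  // or, equivalently, current.clear()
} else {
    current.update(update.f1);
}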

Let me add one more thing regarding the upcoming rescaling feature. If this is interesting for you, rescaling will also work (maybe not in the first version) for broadcasted state, i.e. state which is the same on all parallel operator instances.

2016-09-07 21:45 GMT+02:00 Chakravarthy varaga <[hidden email]>:
Your explanation helps me understand this better.
With this use case, each element in DS1 has to look up against a 'bunch of keys' from DS2, and DS2 could shrink/expand in terms of the number of keys. Will the key-value sharding work in this case?

On Wed, Sep 7, 2016 at 7:44 PM, Fabian Hueske <[hidden email]> wrote:
Operator state is always local in Flink. However, with key-value state you can have something that behaves much like a distributed hashmap, because each operator holds a different shard/partition of the hashtable.

If you have to do only a single key lookup for each element of DS1, you should think about partitioning both streams (keyBy) and writing the state into Flink's key-value state [1].

This will have several benefits:
1) State does not need to be replicated
2) Depending on the backend (RocksDB) [2], parts of the state can reside on disk. You are not bound to the memory of the JVM.
3) Flink takes care of the look-up. No need to have your own hashmap.
4) It will only be possible to rescale jobs with key-value state (this feature is currently under development).

If using the key-value state is possible, I'd go for that.
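
On point 2, a sketch of switching to the RocksDB backend (module flink-statebackend-rocksdb; the checkpoint URI is illustrative):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keyed state is held in RocksDB on local disk, so it is not bounded by the JVM heap;
// checkpoints are written to the URI below.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));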

Best, Fabian

2016-09-07 19:55 GMT+02:00 Chakravarthy varaga <[hidden email]>:
Certainly, that's what I thought as well...
The output of DataStream2 could be in the thousands, and there are state updates...
Reading this topic from the other job, Job1, is fine.
However, assuming that we maintain this state in a collection, and update the state in this collection (by reading from the topic), will it be replicated across the cluster within this Job1?



On Wed, Sep 7, 2016 at 6:33 PM, Fabian Hueske <[hidden email]> wrote:
Is writing DataStream2 to a Kafka topic and reading it from the other job an option?
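
If that route is taken, a sketch with the 0.9 Kafka connector; the topic name, addresses, and schema are illustrative, and ds2/env stand for the stream and environment from the earlier snippets:

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

final Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "job1-state-updates");

// In Job 2: publish DataStream2's derived updates to a topic.
ds2.addSink(new FlinkKafkaProducer09<String>("state-updates", new SimpleStringSchema(), props));

// In Job 1: consume that topic and feed it into the keyed-state pattern above.
final DataStream<String> updates =
        env.addSource(new FlinkKafkaConsumer09<String>("state-updates", new SimpleStringSchema(), props));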

2016-09-07 19:07 GMT+02:00 Chakravarthy varaga <[hidden email]>:
Hi Fabian,

    Thanks for your response. These DataStreams (Job1-DataStream1 & Job2-DataStream2) are from different Flink applications running within the same cluster.
    DataStream2 (from Job2) applies transformations and updates a 'cache' on which Job1 needs to work.
    Our intention is to avoid an external key/value store, as we are trying to keep the cache local to the cluster.
    Is there a way?

Best Regards
CVP


Reply | Threaded
Open this post in threaded view
|

Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga
PFA, Flink_checkpoint_time.png in relation to this issue.

Attachment: flink_checkpoint_time.png (69K)
Reply | Threaded
Open this post in threaded view
|

Re: Sharing Java Collections within Flink Cluster

Chakravarthy varaga

Hi Team,

      Will you be able to guide me on this? Is this a known issue with checkpointing?

CVP

