Flink on Ignite - Collocation?


Matt
Hi all,

I've been playing around with Apache Ignite and I want to run Flink on top of it but there's something I'm not getting.

Ignite has its own support for clustering, and data is distributed across the nodes by a partitioning key. We can then run a closure that does some computation on the nodes that own the data (collocation of computation [1]), which saves time and bandwidth. It all looks good, but I'm not sure how it would play with Flink's own clustering capability.
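
To make the collocation idea concrete, here is a minimal sketch in plain Java. It is a toy model under stated assumptions, not the Ignite API: the class `AffinitySketch`, its `Node` type, the modulo-based `ownerOf` mapping and the `affinityCall` helper are all hypothetical names invented for illustration. Ignite's real affinity function and compute API are described in [1].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Toy model of key-affinity collocation; NOT the Ignite API. */
class AffinitySketch {
    /** One in-memory "node" holding only the entries whose keys map to it. */
    static final class Node {
        final int id;
        final Map<Integer, String> localCache = new HashMap<>();

        Node(int id) {
            this.id = id;
        }
    }

    final List<Node> nodes = new ArrayList<>();

    AffinitySketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new Node(i));
        }
    }

    /** Hypothetical affinity function: maps a key to its owning node (simple modulo). */
    Node ownerOf(int key) {
        return nodes.get(Math.floorMod(key, nodes.size()));
    }

    /** Stores the entry on the node that owns the key. */
    void put(int key, String value) {
        ownerOf(key).localCache.put(key, value);
    }

    /** Runs the closure against the owning node, so it only touches that node's local data. */
    <R> R affinityCall(int key, Function<Node, R> closure) {
        return closure.apply(ownerOf(key));
    }
}
```

With three nodes, `put(7, "seven")` stores the entry on the node returned by `ownerOf(7)`, and `affinityCall(7, node -> node.localCache.get(7))` reads it back without consulting any other node's cache. The point is only that the computation is routed by the same function that places the data.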

My initial idea, which I haven't tried yet, is to use collocation to run a closure where the data resides and have that closure execute a Flink pipeline locally on that node (using a local environment). With a custom data source I should then be able to feed the data from the local Ignite cache into the Flink pipeline and write the results back into a cache using an Ignite sink.
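
The shape of that per-node job can be sketched in plain Java. This is only an illustration of the data flow, assuming the Flink local environment and the custom Ignite source and sink are stood in for by ordinary Java collections; `LocalPipelineSketch` and `runLocally` are hypothetical names, and no Flink or Ignite classes are used.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy model of "local source -> pipeline -> local sink" on a single node. */
class LocalPipelineSketch {
    /**
     * Reads every entry of the node-local input cache (the "source"),
     * applies the transformation (the "pipeline"), and writes each result
     * into the node-local output cache (the "sink").
     * Returns the number of records processed.
     */
    static int runLocally(Map<String, Integer> localInput,
                          Function<Integer, Integer> pipeline,
                          Map<String, Integer> localOutput) {
        // Transform each value while keeping its key, entirely in local memory.
        Map<String, Integer> results = localInput.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                                      e -> pipeline.apply(e.getValue())));
        // The "sink": results go into a cache on the same node.
        localOutput.putAll(results);
        return results.size();
    }
}
```

Each collocated closure would call something like `runLocally` with its own node's maps, so no record has to leave the node between source and sink. What this sketch leaves out is everything Flink's distributed runtime normally handles across nodes.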

I'm not sure it's a good idea to disable Flink's distribution and run it in a local environment just so the data is not transferred to another node. I think Kafka has the same problem: if it partitions the data across different nodes, how do you guarantee that Flink jobs are executed where the data resides? If there's no way to guarantee that short of using a local environment, what do you think of that approach in terms of performance?

Any additional insight regarding stream processing on Ignite or any other distributed storage is very welcome!

Best regards,
Matt


Re: Flink on Ignite - Collocation?

Ufuk Celebi
Hey Matt,

In general, Flink doesn't put much work into co-locating sources
(it doesn't happen for Kafka, etc. either). I think the only local
assignments happen in the DataSet API for files in HDFS.
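
As a rough illustration of what locality-unaware placement means, here is a small Java sketch. It is a toy model, not how Flink assigns sources: the class `LocalitySketch`, the method `localReads` and both placement rules are assumptions chosen so that reader placement is independent of data placement.

```java
/** Counts how many partition reads stay on-node when readers ignore data location. */
class LocalitySketch {
    /**
     * Assumptions of this toy model:
     *   - partition p is stored on node (p % nodes);
     *   - the reader for partition p runs on node ((p / nodes) % nodes),
     *     a rule that does not look at where the data lives.
     * Returns the number of partitions whose reader and data share a node.
     */
    static int localReads(int partitions, int nodes) {
        int local = 0;
        for (int p = 0; p < partitions; p++) {
            int dataNode = p % nodes;
            int readerNode = (p / nodes) % nodes;
            if (dataNode == readerNode) {
                local++;
            }
        }
        return local;
    }
}
```

With 16 partitions on 4 nodes, every (data node, reader node) pair occurs exactly once in this model, so `localReads(16, 4)` counts 4 local reads out of 16. In other words, when placement ignores locality, only about one read in N is local on an N-node cluster, and the rest go over the network.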

Often this is of limited help anyway. Your approach sounds like it
could work, but I would generally not recommend such custom solutions
if you don't really need them. Have you tried running your program with
remote reads? What's the bottleneck for you there?

– Ufuk

On Mon, Apr 24, 2017 at 11:47 AM, Matt <[hidden email]> wrote:

> [1] https://apacheignite.readme.io/docs/collocate-compute-and-data

Re: Flink on Ignite - Collocation?

Matt
It seems to me the network will be the bottleneck if I don't collocate Flink jobs. After all, Ignite caches are held in memory rather than on disk, so local reads are much faster than reads over the network.

Achieving collocation would be more difficult in Kafka, but it should be relatively easy in Ignite thanks to its out-of-the-box collocated computation system. Running a collocated Ignite closure and executing a Flink job in a local environment should be enough.

Why do you recommend against custom collocation? I may be missing something.

Matt
