Can Connected Components run on a streaming dataset using iterate delta?


kant kodali
Hi All,

I am wondering whether connected components can run on streaming data, or, say, an incremental batch?

I see that with delta iteration not all vertices need to participate in every iteration, which is great, but in my case the graph is evolving over time; in other words, new edges are being added over time. If so, does the algorithm need to run on the entire graph, or can it simply run on the new batch of edges?

Finally, what is the best way to compute connected components on graphs evolving over time? Should I use streaming, batch, or some custom incremental approach? Also, the documented algorithm takes maxIterations as an input. How can I come up with a good number for maxIterations, and once I do, is the algorithm guaranteed to converge?


Thanks,
Kant

Re: Can Connected Components run on a streaming dataset using iterate delta?

Yun Gao
Hi Kant,

As far as I know, the current example connected components implementation based on the DataSet API cannot be extended to streaming data or incremental batch directly.

From the algorithm's perspective, if the graph only ever adds edges and never removes them, the connected components should be updatable incrementally as the graph changes: when some edges are added, a new search is started from the endpoints of the added edges to propagate their component IDs. This triggers a new pass of updates to the downstream vertices, and the updates continue until no vertex's component ID changes. However, if edges can also be removed, the incremental computation is not easily achieved.
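The add-only case can be sketched, independently of any Flink API, with a union-find structure (illustrative Python; the class and method names here are made up for the sketch):

```python
class UnionFind:
    """Incremental connected components for an add-only edge stream."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        # Path-compressing find; unseen vertices become their own component.
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def add_edge(self, a, b):
        # Only the two endpoints' components are touched; no full recompute.
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


uf = UnionFind()
for edge in [(1, 2), (3, 4), (2, 3)]:
    uf.add_edge(*edge)
print(uf.find(4))  # vertices 1..4 have collapsed into one component: 1
```

Removing an edge has no such cheap update, which is why the delete case is hard.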
      
To implement the above logic on Flink, there are currently two possible methods:
1) Use the DataSet API and DataSet iteration: maintain the graph structure and the latest computation result in external storage and, whenever there are enough changes to the graph, submit a new DataSet job to recompute the result. The job should load, for each vertex, its edges, its latest component ID, and whether it is an endpoint of a newly added edge, and then run the incremental computation described above.
2) Flink also provides a DataStream iteration API [1] that enables iterating on unbounded data. In this case the graph modifications would be modeled as a DataStream, and operators inside the iteration would maintain the graph structure and the current component IDs; whenever there are enough changes, a new pass of computation starts.
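The recompute step in method 1 could look roughly like this in plain Python (the real job would express the while-loop as a DataSet delta iteration; the function and variable names are made up for illustration):

```python
def incremental_components(adjacency, component, new_edges):
    """adjacency: vertex -> set of neighbors; component: vertex -> id,
    the fixpoint stored from the previous run. Updates both in place."""
    frontier = set()
    for a, b in new_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
        component.setdefault(a, a)
        component.setdefault(b, b)
        frontier.update((a, b))
    # Propagate the smaller component id outwards from the new endpoints;
    # terminates once no vertex's id changes (the convergence condition).
    while frontier:
        nxt = set()
        for v in frontier:
            for n in adjacency[v]:
                if component[v] < component[n]:
                    component[n] = component[v]
                    nxt.add(n)
        frontier = nxt
    return component


adj = {1: {2}, 2: {1}, 3: {4}, 4: {3}}
comp = {1: 1, 2: 1, 3: 3, 4: 3}          # previous run's result
incremental_components(adj, comp, [(2, 3)])
print(comp)  # {1: 1, 2: 1, 3: 1, 4: 1}
```

Because the stored ids are already a fixpoint for the old graph, only vertices reachable from the new edges are ever revisited.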

Best,
Yun




Re: Can Connected Components run on a streaming dataset using iterate delta?

kant kodali
Hi,

Thanks for that, but it looks like this is already available for streaming: https://github.com/vasia/gelly-streaming. I wonder why it is not part of Flink? There are no releases either.

Thanks!

Re: Can Connected Components run on a streaming dataset using iterate delta?

Arvid Heise-3
Hi Kant,

there has not been high demand yet, and it's always a matter of priority for the scarce manpower.
In your stead, I'd take inspiration from Gelly and implement it on the DataStream API.
