Parallelizing DataStream operations on Array elements

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Parallelizing DataStream operations on Array elements

danielsuo
Hello!

I have a data source that emits Arrays that I collect into windows via countWindow. Rather than parallelize my subsequent operations by groups of these arrays, I'd like to parallelize my operations across the elements of the array (rows rather than columns, if you will) within each window.

Some context: I'm attempting a time series analysis across some number of voxels. Each time step, I receive an Array of voxel data, but I'd like to analyze the voxels across time.

It sounds like this approach mixes DataStream and DataSet concepts (where each window is a DataSet), which I know are not supported. Perhaps there is some other way to accomplish this task?

Thanks!
Daniel
Reply | Threaded
Open this post in threaded view
|

Re: Parallelizing DataStream operations on Array elements

Till Rohrmann
Hi Daniel,

I'm not sure whether I grasp the whole problem, but can't you split the vector up into the different rows, group by the row index and then apply some kind of continuous aggregation or window function?

Maybe it helps if you can share some of your code with the community to discuss the implementation.

Cheers,
Till

On Fri, Nov 4, 2016 at 5:45 PM, Daniel Suo <[hidden email]> wrote:
Hello!

I have a data source that emits Arrays that I collect into windows via countWindow. Rather than parallelize my subsequent operations by groups of these arrays, I'd like to parallelize my operations across the elements of the array (rows rather than columns, if you will) within each window.

Some context: I'm attempting a time series analysis across some number of voxels. Each time step, I receive an Array of voxel data, but I'd like to analyze the voxels across time.

It sounds like this approach mixes DataStream and DataSet concepts (where each window is a DataSet), which I know are not supported. Perhaps there is some other way to accomplish this task?

Thanks!
Daniel

Reply | Threaded
Open this post in threaded view
|

Re: Parallelizing DataStream operations on Array elements

danielsuo
This post was updated on .
Till Rohrmann wrote
I'm not sure whether I grasp the whole problem, but can't you split the vector up into the different rows, group by the row index and then apply some kind of continuous aggregation or window function?
So I could flatMap my incoming Arrays into (rowId, arrayElement) and gather them appropriately in a window operation. Here is brief code to describe the problem:

// Source emits Array[Double]
val input: DataStream[Array[Double]] = env.addSource(new MyArraySource())

// Collect windowSize Array[Double]
input.countWindowAll(windowSize, slideLength)

Now I have a windowSize (representing time) by arrayLength (representing voxels) matrix. Flink lets me parallelize by time easily, but I'd like to parallelize by voxel.
Reply | Threaded
Open this post in threaded view
|

Re: Parallelizing DataStream operations on Array elements

danielsuo
I was able to resolve my issue by collecting all the 'column' Arrays via countWindowAll and using flatMap to emit 'row' Arrays.
Reply | Threaded
Open this post in threaded view
|

Re: Parallelizing DataStream operations on Array elements

Till Rohrmann

In order to parallelize by voxel you have to do a keyBy(rowId) given that rowId is the same as voxel id.

Glad to hear that you’ve resolved the problem :-)

Cheers,
Till


On Sat, Nov 5, 2016 at 2:47 AM, danielsuo <[hidden email]> wrote:
I was able to resolve my issue by collecting all the 'column' Arrays via
countWindowAll and using flatMap to emit 'row' Arrays.



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Parallelizing-DataStream-operations-on-Array-elements-tp9911p9917.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.