Sequential/ordered map

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Sequential/ordered map

Sebastian Neef
Hello,

I'd like to implement an algorithm which doesn't really look
parallelizable to me, but maybe there's a way around it:

In general the algorithm looks like this:

1. Take a list of texts T_1 ... T_n
2. For every text T_i (i > 1) do
2.1: Split text into a list of words W_1 ... W_m
2.2: For every word W_j do:
2.2.1.: Check if word already existed in a prior text T_k ( i > k )
2.2.2.: If so, mark word W_j with k
2.2.3.: Else mark word W_j with i
3. Do something with texts based on marked words...

I have a DataSet<Text> with all texts T_1...T_n.

As far as I understand, I cannot simply .map(...) the DataSet, because
T_i's calculation is based on previous results (i.e. T_(i-1)).

My current solution would be to set the  parallelism to 1.

- Is there an elegant way to parallelize this algorithm?
- Does setting parallelism=1 guarantee a specific order of the DataSet?
- Is there a way to check if an element exists in a DataSet? E.g.
DataSet<>.contains(elem)?

Best regards,
Sebastian
Reply | Threaded
Open this post in threaded view
|

Re: Sequential/ordered map

Kostas Kloudas
Hi Sebastian,

If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm.

If this is not a restriction, then you could;
1) split the text in words and also attach the id of the text they appear in,
2) do a groupBy that will send all the same words to the same node,
3) keep a “per-word” state with the word and the index of the text,
4) when a new word arrives you should check if the word already exists in the state.

Regards,
Kostas

> On Jan 5, 2017, at 11:51 AM, Sebastian Neef <[hidden email]> wrote:
>
> Hello,
>
> I'd like to implement an algorithm which doesn't really look
> parallelizable to me, but maybe there's a way around it:
>
> In general the algorithm looks like this:
>
> 1. Take a list of texts T_1 ... T_n
> 2. For every text T_i (i > 1) do
> 2.1: Split text into a list of words W_1 ... W_m
> 2.2: For every word W_j do:
> 2.2.1.: Check if word already existed in a prior text T_k ( i > k )
> 2.2.2.: If so, mark word W_j with k
> 2.2.3.: Else mark word W_j with i
> 3. Do something with texts based on marked words...
>
> I have a DataSet<Text> with all texts T_1...T_n.
>
> As far as I understand, I cannot simply .map(...) the DataSet, because
> T_i's calculation is based on previous results (i.e. T_(i-1)).
>
> My current solution would be to set the  parallelism to 1.
>
> - Is there an elegant way to parallelize this algorithm?
> - Does setting parallelism=1 guarantee a specific order of the DataSet?
> - Is there a way to check if an element exists in a DataSet? E.g.
> DataSet<>.contains(elem)?
>
> Best regards,
> Sebastian

Reply | Threaded
Open this post in threaded view
|

Re: Sequential/ordered map

Sebastian Neef
Hi Kostas,

thanks for the quick reply.

> If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm.

What would be the best way to process it anyway?

DataSet.collect() -> loop over List -> env.fromCollection(...) ?
Or with a parallelism of 1 and a .map(...) ?

However, this approach would collect all data at one node and wouldn't
scale, correct?

Regards,
Sebastian
Reply | Threaded
Open this post in threaded view
|

Re: Sequential/ordered map

Chesnay Schepler
In reply to this post by Kostas Kloudas
So given an ordered list of texts, for each word find the earliest text
it appears in?

As Kostas said, when splitting the text into words wrap them in a Tuple2
containing the word
and text index and group them by the word.

As far as i can tell the next step would be a simple reduce that finds
the smallest
index; for this there is a convenience minBy() transformation.

Regards,
Chesnay

On 05.01.2017 12:25, Kostas Kloudas wrote:

> Hi Sebastian,
>
> If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm.
>
> If this is not a restriction, then you could;
> 1) split the text in words and also attach the id of the text they appear in,
> 2) do a groupBy that will send all the same words to the same node,
> 3) keep a “per-word” state with the word and the index of the text,
> 4) when a new word arrives you should check if the word already exists in the state.
>
> Regards,
> Kostas
>
>> On Jan 5, 2017, at 11:51 AM, Sebastian Neef <[hidden email]> wrote:
>>
>> Hello,
>>
>> I'd like to implement an algorithm which doesn't really look
>> parallelizable to me, but maybe there's a way around it:
>>
>> In general the algorithm looks like this:
>>
>> 1. Take a list of texts T_1 ... T_n
>> 2. For every text T_i (i > 1) do
>> 2.1: Split text into a list of words W_1 ... W_m
>> 2.2: For every word W_j do:
>> 2.2.1.: Check if word already existed in a prior text T_k ( i > k )
>> 2.2.2.: If so, mark word W_j with k
>> 2.2.3.: Else mark word W_j with i
>> 3. Do something with texts based on marked words...
>>
>> I have a DataSet<Text> with all texts T_1...T_n.
>>
>> As far as I understand, I cannot simply .map(...) the DataSet, because
>> T_i's calculation is based on previous results (i.e. T_(i-1)).
>>
>> My current solution would be to set the  parallelism to 1.
>>
>> - Is there an elegant way to parallelize this algorithm?
>> - Does setting parallelism=1 guarantee a specific order of the DataSet?
>> - Is there a way to check if an element exists in a DataSet? E.g.
>> DataSet<>.contains(elem)?
>>
>> Best regards,
>> Sebastian
>

Reply | Threaded
Open this post in threaded view
|

Re: Sequential/ordered map

Sebastian Neef
Hi Chesnay,

thanks for the input. Finding a word's first occurrence is part of the
algorithm.

To be exact I'm trying to implement Adler's Text authorship tracking in
flink (http://www2007.org/papers/paper692.pdf, page 266).

Thanks,
Sebastian
Reply | Threaded
Open this post in threaded view
|

Re: Sequential/ordered map

Fabian Hueske-2
Please avoid collecting the data to the client using collect(). This operation looks convenient but is only meant for super small data and would be a lot slower and less robust even if it would work for large data sets.
Rather set the parallelism of the operator to 1.

Fabian

2017-01-05 13:18 GMT+01:00 Sebastian Neef <[hidden email]>:
Hi Chesnay,

thanks for the input. Finding a word's first occurrence is part of the
algorithm.

To be exact I'm trying to implement Adler's Text authorship tracking in
flink (http://www2007.org/papers/paper692.pdf, page 266).

Thanks,
Sebastian