(DEPRECATED) Apache Flink User Mailing List archive.

Mapping two datasets

Saliya Ekanayake

Mapping two datasets

Hi,

I have two data sets:

DataSet<T> a = ...

DataSet<T> b = ...

They have the same type and the same decomposition. I want to apply a map operator that needs both a and b. For example,

a.map( i -> OP)

Within this OP I also need the corresponding (i-th) element of b. Is there a way to do this?

Thank you,

Saliya

Saliya Ekanayake

Ph.D. Candidate | Research Assistant

School of Informatics and Computing | Digital Science Center

Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org

Márton Balassi

Re: Mapping two datasets

Hey Saliya,

I would add a unique ID to both DataSets, the variable you referred to as 'i'. Then you can join the two DataSets on the field containing 'i' and do the mapping on the joined result.
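A minimal sketch of this suggestion, written as plain Java over lists rather than against the Flink API so that it runs on its own. The class IndexJoinSketch, the Indexed wrapper, and the method names are made up for illustration; in Flink the same shape would be a join on the ID field followed by a map over the joined pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java model of "add an ID, join on it, then map". Not Flink API code.
final class IndexJoinSketch {

    // Hypothetical wrapper: an element tagged with its unique ID 'i'.
    static final class Indexed<T> {
        final long id;
        final T value;

        Indexed(long id, T value) {
            this.id = id;
            this.value = value;
        }
    }

    // Tag every element with a unique ID (assumption: its list position).
    static <T> List<Indexed<T>> withIds(List<T> data) {
        List<Indexed<T>> out = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            out.add(new Indexed<>(i, data.get(i)));
        }
        return out;
    }

    // Join a and b on the ID, then apply OP to each matched pair.
    static <T, R> List<R> joinAndMap(List<Indexed<T>> a,
                                     List<Indexed<T>> b,
                                     BiFunction<T, T, R> op) {
        Map<Long, T> bById = new HashMap<>();
        for (Indexed<T> e : b) {
            bById.put(e.id, e.value);
        }
        List<R> out = new ArrayList<>();
        for (Indexed<T> e : a) {
            // Inner-join semantics: elements of a without a partner in b are skipped.
            if (bById.containsKey(e.id)) {
                out.add(op.apply(e.value, bById.get(e.id)));
            }
        }
        return out;
    }
}
```

Calling joinAndMap(withIds(a), withIds(b), op) applies op to the i-th element of a together with the i-th element of b. The join is what pairs the elements, so it works only if both sides were given the same ID for corresponding elements.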

Hope this helps,

Marton

On Thu, Feb 25, 2016 at 5:38 PM, Saliya Ekanayake <[hidden email]> wrote:

Hi,

I have two data sets:

DataSet<T> a = ...
DataSet<T> b = ...

They have the same type and the same decomposition. I want to apply a map operator that needs both a and b. For example,

a.map( i -> OP)

Within this OP I also need the corresponding (i-th) element of b. Is there a way to do this?

Thank you,
Saliya

Saliya Ekanayake

Re: Mapping two datasets

Thank you, Marton. That seems doable.

However, is there a way to create a dummy indexed data set, that is, to partition only the index range across parallel tasks without any data? For example, if I had something like

DataSet<IndexedSet> ds = ...

then I could implement a custom method that loads the required data for a split within a map operation, which would be less expensive than a join in my case.
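To make the idea concrete, here is a small plain-Java sketch of partitioning an index range across parallel tasks without carrying any data. It is not Flink code: IndexRangeSketch, Split, and partition are hypothetical names, and the contiguous, near-equal split is only one possible policy. Within Flink, ExecutionEnvironment.generateSequence(from, to) might be a starting point for an index-only DataSet, though whether it fits this use case would need checking.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of splitting the index range [0, n) across parallel tasks.
final class IndexRangeSketch {

    // Hypothetical description of one task's share: the half-open range [start, end).
    static final class Split {
        final long start;
        final long end;

        Split(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Divide [0, n) into 'parallelism' contiguous splits whose sizes differ by at most one.
    static List<Split> partition(long n, int parallelism) {
        List<Split> splits = new ArrayList<>();
        long base = n / parallelism;   // minimum size of every split
        long extra = n % parallelism;  // the first 'extra' splits get one more index
        long start = 0;
        for (int t = 0; t < parallelism; t++) {
            long size = base + (t < extra ? 1 : 0);
            splits.add(new Split(start, start + size));
            start += size;
        }
        return splits;
    }
}
```

Each task would then load the elements of a and b for its own [start, end) range inside the map operation, which avoids shuffling either data set for a join.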

Thank you,
Saliya

On Thu, Feb 25, 2016 at 11:45 AM, Márton Balassi <[hidden email]> wrote:

Hey Saliya,

I would add a unique ID to both DataSets, the variable you referred to as 'i'. Then you can join the two DataSets on the field containing 'i' and do the mapping on the joined result.

Hope this helps,

Marton

Márton Balassi

Re: Mapping two datasets

Hey Saliya,

I recommend using DataSetUtils.zipWithIndex for this task. [1] It comes with flink-java.

[1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L77
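For orientation, DataSetUtils.zipWithIndex(ds) turns a DataSet<T> into a DataSet<Tuple2<Long, T>> with consecutive indices. The plain-Java sketch below models how such indices can be assigned across partitions: count the elements of each partition, turn the counts into start offsets, then number elements locally from each offset. This reflects my understanding of the approach rather than the actual Flink source, and ZipWithIndexSketch and Indexed are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of assigning consecutive indices across partitions. Not Flink API code.
final class ZipWithIndexSketch {

    // Hypothetical stand-in for Tuple2<Long, T>.
    static final class Indexed<T> {
        final long index;
        final T value;

        Indexed(long index, T value) {
            this.index = index;
            this.value = value;
        }
    }

    static <T> List<Indexed<T>> zipWithIndex(List<List<T>> partitions) {
        // Pass 1: the start offset of a partition is the total size of all earlier partitions.
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }
        // Pass 2: number the elements of each partition starting from its offset.
        List<Indexed<T>> out = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            long next = offsets[p];
            for (T value : partitions.get(p)) {
                out.add(new Indexed<>(next++, value));
            }
        }
        return out;
    }
}
```

Because the index depends on how elements are distributed and ordered within partitions, indexing a and b separately pairs up "corresponding" elements only if both data sets are partitioned and ordered the same way.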

On Thu, Feb 25, 2016 at 5:52 PM, Saliya Ekanayake <[hidden email]> wrote:

Thank you, Marton. That seems doable.

However, is there a way to create a dummy indexed data set, that is, to partition only the index range across parallel tasks without any data? For example, if I had something like

DataSet<IndexedSet> ds = ...

then I could implement a custom method that loads the required data for a split within a map operation, which would be less expensive than a join in my case.

Thank you,
Saliya

Saliya Ekanayake

Re: Mapping two datasets

Thank you. Any thoughts on the ParallelIteratorInputFormat in Flink?

On Thu, Feb 25, 2016 at 12:07 PM, Márton Balassi <[hidden email]> wrote:

Hey Saliya,

I recommend using DataSetUtils.zipWithIndex for this task. [1] It comes with flink-java.

[1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L77
