Wordindex conversation.

Kürşat Kurt

Hi,

I have a MainDataset of (Label, WordList) records:

(0,List(a, b, c, d, e, f, g))
(1,List(b, c, f, a, g))

...and a wordIndex dataset (created with .zipWithIndex):

wordIndex> (0,a)
wordIndex> (1,b)
wordIndex> (2,c)
wordIndex> (3,d)
wordIndex> (4,e)
wordIndex> (5,f)
wordIndex> (6,g)

How can I convert MainDataset to an indexed wordList dataset like this:

(0,List(0,1,2,3,4,5,6))
(1,List(1,2,5,0,6))

Re: Wordindex conversation.

Fabian Hueske-2
Hi,

you can do it like this:

1) Split each record of the main dataset into separate (label, word) records:

(0,List(a, b, c, d, e, f, g)) -> (0, a), (0, b), (0, c), ..., (0, g)
(1,List(b, c, f, a, g)) -> (1, b), (1, c), ..., (1, g)

2) Join the word index dataset with the split main dataset on the word field (position 1 in both tuples):

DataSet<Tuple2<Integer, String>> splittedMain = ...
DataSet<Tuple2<Long, String>> wordIdx = ...

DataSet<Tuple2<Integer, Long>> joined = splittedMain.join(wordIdx).where(1).equalTo(1).with(...)

3) Group by Label:

DataSet<Tuple2<Integer, Long[]>> labelsWithIdx = joined.groupBy(0).reduceGroup(...) // collect all indexes in list / array
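
The three steps above can be sketched with plain in-memory collections to show what each stage produces. This is only an illustration of the split / join / group logic, not Flink code: the class WordIndexSketch and the method toIndexed are hypothetical names, and ordinary Java maps and lists stand in for the DataSet operators. It assumes a 0-based index (a maps to 0), as in the wordIndex listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the Flink API.
// Uses Map.entry, so it assumes Java 9 or newer.
final class WordIndexSketch {

    // Replaces every word in each labelled list with its index from wordIndex.
    static Map<Integer, List<Long>> toIndexed(Map<Integer, List<String>> mainDataset,
                                              Map<String, Long> wordIndex) {
        // Step 1: split each (label, wordList) record into (label, word) pairs.
        List<Map.Entry<Integer, String>> split = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> record : mainDataset.entrySet()) {
            for (String word : record.getValue()) {
                split.add(Map.entry(record.getKey(), word));
            }
        }

        // Step 2: join on the word to obtain (label, index) pairs.
        // Words missing from wordIndex are dropped, as in an inner join.
        List<Map.Entry<Integer, Long>> joined = new ArrayList<>();
        for (Map.Entry<Integer, String> pair : split) {
            Long idx = wordIndex.get(pair.getValue());
            if (idx != null) {
                joined.add(Map.entry(pair.getKey(), idx));
            }
        }

        // Step 3: group by label and collect the indexes into a list.
        Map<Integer, List<Long>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> pair : joined) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> mainDataset = new LinkedHashMap<>();
        mainDataset.put(0, Arrays.asList("a", "b", "c", "d", "e", "f", "g"));
        mainDataset.put(1, Arrays.asList("b", "c", "f", "a", "g"));

        // Build the word index the way the listing shows it: a=0, b=1, ..., g=6.
        Map<String, Long> wordIndex = new LinkedHashMap<>();
        String[] words = {"a", "b", "c", "d", "e", "f", "g"};
        for (int i = 0; i < words.length; i++) {
            wordIndex.put(words[i], (long) i);
        }

        // prints {0=[0, 1, 2, 3, 4, 5, 6], 1=[1, 2, 5, 0, 6]}
        System.out.println(toIndexed(mainDataset, wordIndex));
    }
}
```

One caveat: this local version keeps the original word order only because it processes the pairs sequentially. In a distributed job the order of records inside a group is generally not guaranteed after a join, so if the order of the words matters you would probably need to carry each word's position along in step 1 and sort by it in step 3.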

Best, Fabian



2016-10-10 23:49 GMT+02:00 Kürşat Kurt <[hidden email]>:


RE: Wordindex conversation.

Kürşat Kurt

Ok, thanks Fabian.

 

From: Fabian Hueske [mailto:[hidden email]]
Sent: Tuesday, October 11, 2016 1:12 AM
To: [hidden email]
Subject: Re: Wordindex conversation.

 
