Read corpus documents sentence by sentence instead of linewise

Felix Schüler
Hi!

We have implemented a transformer that computes a cooccurrence matrix
for words within a given window.
This matrix will then be used for unsupervised learning of vector
representations for words (we basically implement this:
http://nlp.stanford.edu/projects/glove/)

Right now, we have implemented the computation of the cooccurrence
matrix as a sliding window over the lines that we get from
env.readTextFile(...). Instead, it would be nice if we could slide the
window over sentences. So far, we have not figured out how to get
sentences that (in the worst case) span multiple lines.
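
For illustration, the windowed counting described above might look roughly like the following for a single text unit (a line today, a sentence ideally). This is a minimal plain-Java sketch, not the actual transformer: the class name CooccurrenceSketch, the method count, whitespace tokenization, and counting only right-hand neighbours are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink or of the original implementation.
class CooccurrenceSketch {

    // Counts ordered word pairs "left right" where `right` appears at most
    // `window` positions after `left` inside one text unit.
    // Assumption: tokens are separated by whitespace and lower-cased.
    static Map<String, Integer> count(String unit, int window) {
        String[] tokens = unit.trim().toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            // Slide over the neighbours to the right of position i.
            for (int j = i + 1; j <= i + window && j < tokens.length; j++) {
                counts.merge(tokens[i] + " " + tokens[j], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the pair counts for one example unit with a window of 2.
        System.out.println(count("the cat sat on the mat", 2));
    }
}
```

Because the window never crosses the boundary of the unit passed in, a sentence that is split over two lines loses the pairs that straddle the line break, which is why sentence-level input matters here.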

Is this somehow possible, or would we have to define our own input format
for this? The idea is to read a corpus and allow some kind of
user-defined parsing of the text documents (something like a
CorpusInputFormat, maybe?).

Thanks!
Felix

Re: Read corpus documents sentence by sentence instead of linewise

Stephan Ewen
If you want the inputs to be chunked by sentence, you can try splitting sentences at the period character.
You can do this with the DelimitedInputFormat by setting the delimiter.

readTextFile actually uses a special case of the delimited input format that splits at line breaks.
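
To make the effect of a period delimiter concrete, here is a small plain-Java sketch of the same idea. It is not Flink code: it uses java.util.Scanner as a stand-in for a delimited input format, and the class name PeriodDelimitedReader and the method readSentences are made up for the example. Records end at "." instead of at line breaks, so a sentence that spans several lines comes back as one record.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical stand-in for a delimited input format with "." as delimiter.
class PeriodDelimitedReader {

    static List<String> readSentences(Reader in) {
        List<String> sentences = new ArrayList<>();
        try (Scanner scanner = new Scanner(in)) {
            // Split records at the period character rather than at newlines.
            scanner.useDelimiter("\\.");
            while (scanner.hasNext()) {
                // Line breaks inside a record become ordinary spaces.
                String sentence = scanner.next().replaceAll("\\s+", " ").trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String corpus = "First sentence\nspans two lines. Second one.\n";
        for (String s : readSentences(new StringReader(corpus))) {
            System.out.println(s);
        }
    }
}
```

Note that a bare period also splits abbreviations and decimal numbers ("e.g.", "3.14"), so a corpus with many of those would probably need a smarter, user-defined splitter.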

Greetings,
Stephan



On Wed, May 20, 2015 at 2:57 PM, Felix Schüler wrote:
[original message quoted in full; see above]