Hi!
We have implemented a transformer that computes a cooccurrence matrix
for words within a given window.
This matrix will then be used for unsupervised learning of vector
representations for words (we are basically implementing GloVe:
http://nlp.stanford.edu/projects/glove/).
Right now, we compute the cooccurrence matrix with a sliding window
over the lines that we get from env.readTextFile(...).
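To make the per-line behaviour concrete, here is a rough sketch in plain
Python (not the actual Flink code). The function name, the whitespace
tokenization, and the symmetric counting are assumptions made only for
illustration.

```python
from collections import Counter


def cooccurrences_per_line(lines, window=2):
    """Count (word, context) pairs within `window` tokens, one line at a time."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenization
        for i, word in enumerate(tokens):
            # Look at the next `window` tokens and count both directions,
            # so the resulting matrix is symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


lines = ["the cat sat on", "the mat . the dog"]
counts = cooccurrences_per_line(lines, window=2)
```

Note that the window never crosses a line break here: "on" at the end of
the first line and "the" at the start of the second are not counted as
neighbours, even though they belong to the same sentence.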
Instead, it would be nice if we could do a sliding window over
sentences. So far, we have not figured out how to get sentences that
(in the worst case) span multiple lines.
Is this somehow possible, or would we have to define our own input format
for this? The idea is to read a corpus and allow some kind of
user-defined parsing of the text documents (something like a
CorpusInputFormat maybe...?).
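To illustrate what we mean by sentences spanning multiple lines, here is
a rough sketch in plain Python. It is not a Flink input format: it assumes
a single reader that sees the lines in order, and the sentence-boundary
rule ('.', '!' or '?' followed by whitespace) and the function name are
placeholders for the user-defined parsing.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_lines(lines):
    """Yield sentences from an ordered stream of lines, joining across line breaks."""
    buffer = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        buffer = f"{buffer} {line}" if buffer else line
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence;
        # the last part may continue on the next line.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer:
        yield buffer  # trailing text without closing punctuation


corpus = ["The cat sat on", "the mat. The dog", "barked loudly. Then", "it slept"]
sentences = list(sentences_from_lines(corpus))
```

The sliding window would then run over each yielded sentence instead of
each line. What the sketch does not cover is the distributed case, where
a file is divided among parallel readers and a sentence may straddle the
boundary between two of them.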
Thanks!
Felix