Input dimensions

Maximilian Alber
Hi everybody,

I am currently trying to implement a machine learning algorithm on Stratosphere for the ML group at TU Berlin, and I have run into some issues. Here is the first one.

The input data I get has an unknown dimension, i.e. I have a list of vectors represented as CSV input, with each row representing one vector. Currently I've solved the problem with this code snippet:

def getInputSource(XFile: String) = {
  // todo: make nicer
  dimensions match {
    case 1 => DataSource(XFile, CsvInputFormat[Float](" "))
    case 2 => DataSource(XFile, CsvInputFormat[(Float, Float)](" "))
    case 3 => DataSource(XFile, CsvInputFormat[(Float, Float, Float)](" "))
    case 4 => DataSource(XFile, CsvInputFormat[(Float, Float, Float, Float)](" "))
    case 5 => DataSource(XFile, CsvInputFormat[(Float, Float, Float, Float, Float)](" "))
    ....

Unfortunately, there are data sets with more dimensions than a Scala tuple can hold (the limit is 22), for example 350. (The code style is also a problem.)

Is there a better way to solve this problem?

Cheers,
Max

Re: Input dimensions

Stephan Ewen
Hi!

One way could be to read each line as text, call split(delimiter) on the string, and return the result.

One important thing about the current API is that it needs to know the concrete type (class) of the data elements. In your case, it might not be able to statically determine that type, because it depends on a parameter.

I would suggest using an array as the data type in such cases.
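
As a rough illustration of the split-and-array idea, here is a minimal sketch in plain Scala. It covers only the per-line parsing step. The names VectorParsing and parseVector are hypothetical and not part of the Stratosphere API, and the step of reading the file as lines of text is deliberately left out because it depends on the API version in use.

```scala
import java.util.regex.Pattern

// Hypothetical helper for illustration; not part of the Stratosphere API.
object VectorParsing {

  /** Parses one line of delimiter-separated numbers into an Array[Float].
    *
    * The array length is whatever the line contains, so the dimension
    * does not need to be known at compile time (unlike a tuple type).
    */
  def parseVector(line: String, delimiter: String = " "): Array[Float] =
    line.trim
      .split(Pattern.quote(delimiter)) // quote: split() expects a regex
      .filter(_.nonEmpty)              // skip empty fields from repeated delimiters
      .map(_.toFloat)                  // throws NumberFormatException on bad input
}
```

Assuming the input is available as a collection of text lines, mapping VectorParsing.parseVector over those lines would yield one Array[Float] per row, whether the rows have 5 or 350 columns. The element type is then always Array[Float], which is a single concrete type regardless of the dimension.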

Stephan