Strategies for reading structured file formats as POJO DataSets

Strategies for reading structured file formats as POJO DataSets

Elliot West
Hello,

As a new Flink user, I wondered whether there are any existing approaches or practices for reading file formats such as CSV or TSV as DataSets of POJOs. My current approach can be illustrated with a contrived example:

// Simulating a TSV file DataSet
DataSet<String> tsvRatings = env.fromElements("category-1\t10");

// Mapping to a POJO
DataSet<Rating> ratings = tsvRatings.map(line -> {
  String[] elements = line.split("\t");
  return new Rating(elements[0], Integer.parseInt(elements[1]));
});
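
For reference, Rating is assumed here to be a simple POJO along these lines (the field names are invented for illustration; note that Flink's POJO rules require a public no-argument constructor and public fields or getters/setters):

public class Rating {
  public String category;
  public int score;

  // Flink's POJO type extraction requires a public no-arg constructor
  public Rating() {}

  public Rating(String category, int score) {
    this.category = category;
    this.score = score;
  }
}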

While such a mapping could be implemented in a more general form, I'm keen to avoid reinventing the wheel and therefore wonder whether there are already good ways of doing this.

Thanks - Elliot.

Re: Strategies for reading structured file formats as POJO DataSets

Robert Metzger
Hi Elliot,

Right now there is no tooling support for reading CSV/TSV data into a POJO, but there is an open pull request where a user has contributed such a feature: https://github.com/apache/flink/pull/426
So it's probably only a matter of days until it is available in master.

Your suggested approach of using a mapper is perfectly fine.
You can make it a bit easier by using env.readCsvFile(), which will parse the fields into typed Tuples for you.
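
For reference, a minimal sketch of what that could look like for the two-column example above (the file path is a placeholder; Tuple2 is org.apache.flink.api.java.tuple.Tuple2, and fieldDelimiter() takes a char rather than a String in some Flink versions):

// Parse a two-column TSV into typed Tuples; "ratings.tsv" is a placeholder path
DataSet<Tuple2<String, Integer>> tsvRatings = env
  .readCsvFile("ratings.tsv")
  .fieldDelimiter("\t")  // the default field delimiter is ','
  .types(String.class, Integer.class);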

Sorry that the feature is not already available for you.

Please let us know if you have more questions regarding Flink.


Best,
Robert

Re: Strategies for reading structured file formats as POJO DataSets

Fabian Hueske
Hi Elliot,

Right now, I see the following options for reading CSV/TSV files:

- Read CSV files (ExecutionEnvironment.readCsvFile()) into Tuples (at most 25 fields in Java, 22 in Scala) and map the Tuples to POJOs in a subsequent Map function (if necessary); a sketch follows this list. I would recommend this approach if the field limitation is not a problem for you. The CsvReader can be configured in several ways; for example, the record and field delimiters (',', '\t', ...) can be adapted.

- Read the CSV file as a text file (ExecutionEnvironment.readTextFile()), which gives you each line of the file as a String. You can then parse that line and create a POJO from it in a subsequent Map function (just as you did in your example). This is more generic but leaves the parsing of the line up to you.
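
A quick sketch of the first option, reusing the Rating POJO from the original example (the file path and field types are assumptions based on the sample line):

// Read the TSV into typed Tuples, then map each Tuple to the Rating POJO
DataSet<Tuple2<String, Integer>> tuples = env
  .readCsvFile("ratings.tsv")
  .fieldDelimiter("\t")
  .types(String.class, Integer.class);

DataSet<Rating> ratings = tuples.map(t -> new Rating(t.f0, t.f1));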

See the DataSource documentation for details.

Best, Fabian
