(DEPRECATED) Apache Flink User Mailing List archive.

reading csv file from null value

Philip Lee

reading csv file from null value

Hi,

I am trying to load a dataset in which some fields are null (empty) by using readCsvFile().

// e.g  _date|_click|_sales|_item|_web_page|_user

case class WebClick(_click_date: Long, _click_time: Long, _sales: Int, _item: Int,_page: Int, _user: Int)

private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {

  env.readCsvFile[WebClick](
    webClickPath,
    fieldDelimiter = "|",
    includedFields = Array(0, 1, 2, 3, 4, 5)
    // lenient = true
  )
}

I know there is an option to ignore malformed values, but I have to read the dataset even though it contains null values.

A row of the dataset (the third column is null) looks like this:

37794|24669||16705|23|54810

I have to read the null value as well, because I need to use it in a filter or where function ( _sales == null ).

Are there any concrete suggestions for how to do this?

Thanks,

Philip

==========================================================

Hae Joon Lee

Now, in Germany,

M.S. Candidate, Interested in Distributed System, Iterative Processing

Dept. of Computer Science, Informatik in German, TUB

Technical University of Berlin

In Korea,

M.S. Candidate, Computer Architecture Laboratory

Dept. of Computer Science, KAIST

Rm# 4414 CS Dept. KAIST

373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)

Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea

==========================================================

Maximilian Michels

Re: reading csv file from null value

Hi Philip,

How about making the empty field of type String? Then you can read the CSV into a DataSet and treat the empty string as a null value. Not very nice but a workaround. As of now, Flink deliberately doesn't support null values.
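
A minimal sketch of this workaround in plain Scala, without any Flink calls: the sales column is kept as a String and an empty string is interpreted as a missing value. The names RawClick, EmptyAsNull, salesOf and missingSales are hypothetical and used only for illustration; whether readCsvFile() itself accepts the empty field is a separate question.

```scala
// Hypothetical record: the nullable column is kept as a raw String.
case class RawClick(clickDate: Long, clickTime: Long, sales: String,
                    item: Int, page: Int, user: Int)

object EmptyAsNull {
  // Interpret an empty (or null) string as "no value"; otherwise parse it as an Int.
  def salesOf(c: RawClick): Option[Int] =
    Option(c.sales).map(_.trim).filter(_.nonEmpty).map(_.toInt)

  // Keep only the rows whose sales field is missing (the "_sales == null" filter).
  def missingSales(rows: Seq[RawClick]): Seq[RawClick] =
    rows.filter(r => salesOf(r).isEmpty)
}
```

The same predicate could presumably be passed to filter on a DataSet[RawClick]; here it is shown on an ordinary Seq so that it can be run on its own.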

Regards,

Max

In reply to the post by Philip Lee on Thu, Oct 22, 2015 at 4:30 PM.

Shiti Saxena

Re: reading csv file from null value

For a similar problem where we wanted to preserve and track null entries, we load the CSV as a DataSet[Array[Object]] and then transform it into a DataSet[Row] using a custom RowSerializer (https://gist.github.com/Shiti/d0572c089cc08654019c) which handles null.

The Table API (which supports null) can then be used on the resulting DataSet[Row].

In reply to the post by Maximilian Michels on Fri, Oct 23, 2015 at 7:38 PM.

Philip Lee

Re: reading csv file from null value

Maximilian suggested that reading the null field as a String would be an acceptable workaround.

In fact, readCsvFile() still cannot handle the null value; it fails with a "Row too short" error.

case class WebClick(click_date: String, click_time: String, user: String, item: String)
private def getWebClickDataSet(env: ExecutionEnvironment): DataSet[WebClick] = {
  env.readCsvFile[WebClick](
    webClickPath,
    fieldDelimiter = "|",
    includedFields = Array(0, 1, 3, 5)
    //lenient = true
    )
}

// e.g. 36890|26789|0|3725|20|85457

Caused by: org.apache.flink.api.common.io.ParseException: Row too short: 36890|4749||13183|29|

at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:383)

at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.readRecord(ScalaCsvInputFormat.java:214)

at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)

at org.apache.flink.api.scala.operators.ScalaCsvInputFormat.nextRecord(ScalaCsvInputFormat.java:182)

at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)

at java.lang.Thread.run(Thread.java:745)

Are there any suggestions?

In reply to the post by Shiti Saxena on Fri, Oct 23, 2015 at 7:18 PM.

Philip Lee

Re: reading csv file from null value

Also, following Shiti's suggestion, we could use the RowSerializer to handle the null value, right?

I tried it in several ways, but it still did not work.

Could you give an example of it, based on the previous email?

In reply to the post by Philip Lee on Sat, Oct 24, 2015 at 11:19 PM.

Maximilian Michels

Re: reading csv file from null value

In reply to this post by Shiti Saxena

As far as I know, null support was removed from the Table API because it was not consistently supported across all operations. See https://issues.apache.org/jira/browse/FLINK-2236

Philip Lee

Re: reading csv file from null value

Thanks for your reply.

What if I do not use the Table API?

The error happens when using just env.readCsvFile().

I heard that using the RowSerializer would handle the null value, but a TypeInformation error occurs during the conversion.

In reply to the post by Maximilian Michels on Mon, Oct 26, 2015 at 10:26 AM.

Fabian Hueske-2

Re: reading csv file from null value

Hi Philip,

The CsvInputFormat does not support reading empty fields.

I see two ways to achieve this functionality:

- Use a TextInputFormat that returns each line as a String and do the parsing in a subsequent MapFunction

- Extend the CsvInputFormat to support empty fields
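
The first option can be sketched as a plain Scala parsing function that a MapFunction could call for each line. This is only an illustration under stated assumptions: ParsedClick and ClickLineParser are hypothetical names, the six-column layout is taken from the example rows in this thread, and an empty field is mapped to None rather than null.

```scala
// Hypothetical record type: each column is an Option so that an empty field becomes None.
case class ParsedClick(clickDate: Option[Long], clickTime: Option[Long], sales: Option[Int],
                       item: Option[Int], page: Option[Int], user: Option[Int])

object ClickLineParser {
  private def optLong(s: String): Option[Long] = if (s.isEmpty) None else Some(s.toLong)
  private def optInt(s: String): Option[Int] = if (s.isEmpty) None else Some(s.toInt)

  // Parse one '|'-delimited line into a ParsedClick.
  def parse(line: String): ParsedClick = {
    // The limit -1 keeps trailing empty fields, so a line ending in '|' still yields six columns.
    val f = line.split("\\|", -1)
    require(f.length == 6, s"expected 6 fields but found ${f.length}: $line")
    ParsedClick(optLong(f(0)), optLong(f(1)), optInt(f(2)),
                optInt(f(3)), optInt(f(4)), optInt(f(5)))
  }
}
```

In a Flink job this could be wired up roughly as env.readTextFile(webClickPath).map(ClickLineParser.parse _), with rows lacking a sales value selected by .filter(_.sales.isEmpty). That wiring is not tested here and assumes Flink can derive type information for the Option fields.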

Cheers,
Fabian

In reply to the post by Philip Lee on 2015-10-26 10:43 GMT+01:00.