Hi Deep,

You can try to set the line delimiter and field delimiter of the RowCsvInputFormat to a non-printing character (assuming there are no non-printing characters in the csv files). It will read all the content of a csv file into one Row, e.g.:

final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
String path = "test";
TypeInformation[] fieldTypes = new TypeInformation[]{
    BasicTypeInfo.STRING_TYPE_INFO};
RowCsvInputFormat csvFormat =
    new RowCsvInputFormat(new Path(path), fieldTypes);
csvFormat.setNestedFileEnumeration(true);
csvFormat.setDelimiter((char) 0);
csvFormat.setFieldDelimiter(String.valueOf((char) 0));
DataStream<Row> lines = env.readFile(csvFormat, path,
    FileProcessingMode.PROCESS_ONCE, -1);
lines.map(value -> value).print();
env.execute();

Then you can convert the content of the csv files to json manually.

Best,
Wei
Hi Deep,

Could you use the TextInputFormat which reads a file line by line? That way you can do the JSON parsing as part of a mapper which consumes the file lines.

Cheers,
Till

On Mon, Dec 7, 2020 at 1:05 PM Wei Zhong <[hidden email]> wrote:
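A minimal sketch of the per-line mapper logic Till describes, in plain Java outside of Flink (the class and method names are mine; in a Flink job this body would sit inside a `MapFunction<String, String>` applied to the TextInputFormat stream):

```java
public class LineToJson {

    // Turns one csv data line like "A,1" into a small json object {"A":"1"}.
    // Splitting with a limit of 2 keeps any commas in the value intact.
    static String toJson(String csvLine) {
        String[] parts = csvLine.split(",", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed csv line: " + csvLine);
        }
        return "{\"" + parts[0] + "\":\"" + parts[1] + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJson("A,1")); // prints {"A":"1"}
    }
}
```

Note that this emits one json object per line; assembling all lines of one file into a single object (as in Deep's desired output) needs the whole file in hand, which is what the one-Row trick below the thread provides.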
Hi Deep,
Could you show your current code snippet? I have tried the csv file data on my local machine and it works fine, so I guess the problem might be elsewhere.

Best,
Wei

> On Dec 8, 2020, at 03:20, DEEP NARAYAN Singh <[hidden email]> wrote:
>
> Hi Wei and Till,
> Thanks for the quick reply.
>
> @Wei, I tried the code you suggested and it is working fine, but there is one case where it fails. Below is the csv input data format:
>
> Csv file data format:
> -------------------------------
> field_id,data,
> A,1
> B,3
> C,4
> D,9
> E,0,0,0,0
>
> Because the last row contains more than two values, it throws org.apache.flink.api.common.io.ParseException: Row too short: field_id,data,
>
> How can the above corner case be handled? Could you please suggest some way to handle this?
>
> @Till, could you please elaborate on what you are suggesting? In my use case I am dealing with multiple csv files under the given folder, and reading line by line using TextInputFormat and transforming with a map operator will not work. Correct me if I'm wrong.
>
> Thanks & Regards,
> -Deep
>
>
> On Mon, Dec 7, 2020 at 6:38 PM Till Rohrmann <[hidden email]> wrote:
> Hi Deep,
>
> Could you use the TextInputFormat which reads a file line by line? That way
> you can do the JSON parsing as part of a mapper which consumes the file
> lines.
>
> Cheers,
> Till
>
> On Mon, Dec 7, 2020 at 1:05 PM Wei Zhong <[hidden email]> wrote:
>
> > Hi Deep,
> >
> > (redirecting this to user mailing list as this is not a dev question)
> >
> > You can try to set the line delimiter and field delimiter of the
> > RowCsvInputFormat to a non-printing character (assuming there are no
> > non-printing characters in the csv files). It will read all the content
> > of a csv file into one Row, e.g.:
> >
> > final StreamExecutionEnvironment env =
> >     StreamExecutionEnvironment.getExecutionEnvironment();
> > String path = "test";
> > TypeInformation[] fieldTypes = new TypeInformation[]{
> >     BasicTypeInfo.STRING_TYPE_INFO};
> > RowCsvInputFormat csvFormat =
> >     new RowCsvInputFormat(new Path(path), fieldTypes);
> > csvFormat.setNestedFileEnumeration(true);
> > csvFormat.setDelimiter((char) 0);
> > csvFormat.setFieldDelimiter(String.valueOf((char) 0));
> > DataStream<Row> lines = env.readFile(csvFormat, path,
> >     FileProcessingMode.PROCESS_ONCE, -1);
> > lines.map(value -> value).print();
> > env.execute();
> >
> > Then you can convert the content of the csv files to json manually.
> >
> > Best,
> > Wei
> >
> > On Dec 7, 2020, at 19:10, DEEP NARAYAN Singh <[hidden email]> wrote:
> >
> > Hi Guys,
> >
> > Below is my code snippet, which reads all csv files under the given folder
> > row by row, but my requirement is to read one csv file at a time and convert
> > it to json, which will look like:
> > {"A":"1","B":"3","C":"4","D":9}
> >
> > Csv file data format:
> > -------------------------------
> > field_id,data,
> > A,1
> > B,3
> > C,4
> > D,9
> >
> > Code snippet:
> > --------------------------
> >
> > final StreamExecutionEnvironment env =
> >     StreamExecutionEnvironment.getExecutionEnvironment();
> > String path = "s3://messages/data/test/dev/2020-12-07/67241306/";
> > TypeInformation[] fieldTypes = new TypeInformation[]{
> >     BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO};
> > RowCsvInputFormat csvFormat = new RowCsvInputFormat(
> >     new Path(path), fieldTypes);
> > csvFormat.setSkipFirstLineAsHeader(true);
> > csvFormat.setNestedFileEnumeration(true);
> > DataStream<Row> lines = env.readFile(csvFormat, path,
> >     FileProcessingMode.PROCESS_ONCE, -1);
> > lines.map(value -> value).print();
> >
> > Any help is highly appreciated.
> >
> > Thanks,
> > -Deep
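Since the one-Row trick above hands over the entire file content as a single string, the manual csv-to-json conversion can be sketched in plain Java as below. This is an illustration, not Flink API: the class and method names are mine, values are emitted as quoted strings (unlike the unquoted 9 in Deep's example output), and rows with extra fields such as "E,0,0,0,0" are assumed to keep everything after the first comma as their value rather than failing:

```java
import java.util.StringJoiner;

public class CsvFileToJson {

    // Converts the full content of one csv file (header row followed by
    // "key,value" rows) into a single json object like {"A":"1","B":"3"}.
    // Splitting each row with a limit of 2 means rows with more than two
    // fields no longer cause a parse failure: the surplus commas stay in
    // the value.
    static String toJson(String fileContent) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        String[] rows = fileContent.split("\\R"); // \R matches any line break
        for (int i = 1; i < rows.length; i++) {   // i = 1 skips the header row
            String row = rows[i].trim();
            if (row.isEmpty()) {
                continue;
            }
            String[] kv = row.split(",", 2);
            json.add("\"" + kv[0] + "\":\"" + (kv.length > 1 ? kv[1] : "") + "\"");
        }
        return json.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            toJson("field_id,data,\nA,1\nB,3\nC,4\nD,9\nE,0,0,0,0"));
        // prints {"A":"1","B":"3","C":"4","D":"9","E":"0,0,0,0"}
    }
}
```

In a Flink job this would be the body of the map applied to the one-column Row stream, replacing the `value -> value` placeholder.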
Hi Deep,
It seems that the TypeInformation array in your code has 2 elements, but we only need one here. This approach treats the entire csv file as a Row with only one column, so there should be only one `BasicTypeInfo.STRING_TYPE_INFO` in the array. Using the TextInputFormat instead of the RowCsvInputFormat would also solve this problem.

If you have created your own InputFormat by extending the RowCsvInputFormat, you can get the current file path via `this.currentSplit.getPath()` in your class. Note that if you choose to fill the file path into the second column of the Row, you do not need to make the above changes, because in that case we really do need the TypeInformation array to contain two StringTypeInfo elements.

Best,
Wei
Hi Deep,
You can try to change `FileProcessingMode.PROCESS_ONCE` to `FileProcessingMode.PROCESS_CONTINUOUSLY`.

Best,
Wei