WriteAsText bug or bad name?

Flavio Pompermaier
Hi all,
while running the example at http://flink.incubator.apache.org/docs/0.7-incubating/local_execution.html I expected writeAsText on a local path to create a text file on my local filesystem. Instead it creates something similar to a sequence file (within a folder).
I find this misleading: either the API name is wrong or this is a bug (IMHO).
By the way, how can I modify the following program to write the results to a single text file on my local filesystem?

public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
    DataSet<String> data = env.readTextFile("file:///tmp/res.txt");
    data.filter(new FilterFunction<String>() {
        public boolean filter(String value) {
            return value.startsWith("http://");
        }
    }).writeAsText("file:///tmp/res.txt");
    env.execute();
}

Best,
Flavio

Re: WriteAsText bug or bad name?

Márton Balassi
Dear Flavio,

Yes, the writeAsText() method really does create a folder containing one file per execution thread, so that your threads do not block each other and the execution can use multiple cores on your machine. You can see similar results if you run it with env.execute() from an IDE.

Some filesystems, HDFS being the most prominent, can transparently treat such a folder structure as a single file, in which case it behaves as you expect. I hope this answers your question.

Best,

Marton
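
To illustrate the behaviour Márton describes, here is a minimal sketch in plain Java that does not use Flink at all. The class name ParallelTextSink, the round-robin split of the records, and the part-file names 1, 2, 3, ... are assumptions made for this illustration only, not Flink's actual implementation. With parallelism 1 the target is a single file; with a higher parallelism the target becomes a directory holding one file per writer, so the writers never contend for the same file.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a parallel text sink; not part of Flink.
class ParallelTextSink {

    // Writes the lines to target: a single file for parallelism 1,
    // otherwise a directory with one part file per writer thread.
    static void write(List<String> lines, Path target, int parallelism) throws Exception {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be at least 1");
        }
        if (parallelism == 1) {
            Files.write(target, lines, StandardCharsets.UTF_8);
            return;
        }
        Files.createDirectories(target);
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Void>> pending = new ArrayList<>();
            for (int i = 0; i < parallelism; i++) {
                final int index = i;
                Callable<Void> task = () -> {
                    // Each writer takes every parallelism-th record (round robin).
                    List<String> part = new ArrayList<>();
                    for (int j = index; j < lines.size(); j += parallelism) {
                        part.add(lines.get(j));
                    }
                    // Each writer owns its own file, so no locking is needed.
                    Path partFile = target.resolve(Integer.toString(index + 1));
                    Files.write(partFile, part, StandardCharsets.UTF_8);
                    return null;
                };
                pending.add(pool.submit(task));
            }
            // Wait for all writers and surface any I/O failure.
            for (Future<Void> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("sink-demo");
        List<String> lines = Arrays.asList("http://a", "http://b", "http://c", "http://d");
        write(lines, dir.resolve("parallel-out"), 4);
        write(lines, dir.resolve("single-out.txt"), 1);
        System.out.println("parallel-out is a directory: " + Files.isDirectory(dir.resolve("parallel-out")));
        System.out.println("single-out.txt is a file: " + Files.isRegularFile(dir.resolve("single-out.txt")));
    }
}
```

The part files are ordinary text files; only the layout on disk (folder versus single file) depends on the parallelism.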

In reply to Flavio Pompermaier's message of Wed, Oct 29, 2014 at 8:31 PM.

Re: WriteAsText bug or bad name?

Flavio Pompermaier

Would it be that difficult to change the behaviour for file:/// and create a single file? Or is there a way to do that?

In reply to Márton Balassi's message of Oct 29, 2014 9:52 PM.

Re: WriteAsText bug or bad name?

Fabian Hueske
You can set the degree of parallelism (DOP) of the data sink to 1 [1].
There is also a config parameter that controls whether a directory is created when DOP=1. If I remember correctly, the default is NOT to create a folder for DOP=1.


Best, Fabian

In reply to Flavio Pompermaier's message of 2014-10-29 22:22 GMT+01:00.

Re: WriteAsText bug or bad name?

Robert Waury
In reply to this post by Flavio Pompermaier
Just use setParallelism(). This specifies how many threads are used for the operator.

writeAsText("file:///tmp/res.txt").setParallelism(1);

This will give you a single output file.

Cheers,
Robert

Re: WriteAsText bug or bad name?

Fabian Hueske
In reply to this post by Fabian Hueske
Regarding the text vs. sequence-file output:
writeAsText() emits each record using its toString() method, which should be the String itself in your case.

So if it writes binary data, something is wrong...
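
The toString() behaviour Fabian describes can be sketched in plain Java without Flink. The class ToStringTextWriter and its writeAsText helper are hypothetical names used only for this illustration. They render each record with toString() and write one record per line, so a list of String records ends up in the file exactly as typed, as readable text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not Flink's TextOutputFormat.
class ToStringTextWriter {

    // Renders each record with toString() and writes one record per line.
    static <T> void writeAsText(List<T> records, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (T record : records) {
            lines.add(record.toString());
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".txt");
        writeAsText(Arrays.asList("http://flink.apache.org", "http://example.org"), file);
        // For String records, toString() returns the string itself,
        // so the file contains the two URLs as plain text lines.
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Under this model the output can only look binary if a record's toString() produces unreadable text, which is not the case for String.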


In reply to Fabian Hueske's message of 2014-10-29 22:34 GMT+01:00.

Re: WriteAsText bug or bad name?

Fabian Hueske
Hi Flavio,

any updates on this bug?

Thanks, Fabian

In reply to Fabian Hueske's message of 2014-10-29 22:36 GMT+01:00.

Re: WriteAsText bug or bad name?

Flavio Pompermaier
Nope. To me this is actually a bug, but I don't know what the Flink community or committee thinks.

In reply to Fabian Hueske's message of Mon, Nov 3, 2014 at 11:52 AM.

Re: WriteAsText bug or bad name?

Fabian Hueske
OK, I assume the problem of creating multiple files (+ output directory) is fixed by setting the DOP of the OutputFormat to 1, right?

But you still get binary output with a TextOutputFormat that writes a DataSet<String>?

In reply to Flavio Pompermaier's message of 2014-11-03 11:58 GMT+01:00.

Re: WriteAsText bug or bad name?

Stephan Ewen
Hey!

Parallel outputs require multiple output files.

The only way to make this a single file by default is to set the default parallelism of file outputs to 1. That would cause many surprises on cluster execution, actually.

It may be a fair compromise to set the default parallelism of sinks to 1 if the execution environment is the local environment.

Stephan


In reply to Fabian Hueske's message of Mon, Nov 3, 2014 at 12:06 PM.

Re: WriteAsText bug or bad name?

Flavio Pompermaier
That is not a big problem; it should just be well documented :)

In reply to Stephan Ewen's message of Mon, Nov 3, 2014 at 12:09 PM.