Get file metadata

Ronny Bräunlich
Hello,

I want to read a directory containing text files with Flink.
As I already found out, I can simply point the environment to the directory and it will read all the files.
What I couldn’t find out is whether it’s possible to keep the file metadata somehow.
Concretely, I need the timestamp, the filename and the file content. Is there a way to do this with the ExecutionEnvironment?

Cheers,
Ronny

Re: Get file metadata

rmetzger0
Hi Ronny,

It is a similar use case ... I guess you can get the metadata from the input split as well.
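For illustration only, here is a minimal plain-Java sketch of the record being asked for (filename, timestamp, content). It deliberately uses only `java.nio.file` and not the Flink API, so it says nothing about how the input split exposes this information; the names `FileRecords`, `FileRecord`, `readDirectory` and `sampleDirectory` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration, NOT Flink code: all names here are hypothetical.
final class FileRecords {

    /** Holder for the three values from the question: name, timestamp, content. */
    static final class FileRecord {
        final String name;
        final long modifiedMillis;
        final String content;

        FileRecord(String name, long modifiedMillis, String content) {
            this.name = name;
            this.modifiedMillis = modifiedMillis;
            this.content = content;
        }
    }

    /** Reads every regular file directly inside dir (no recursion), assuming UTF-8 text. */
    static List<FileRecord> readDirectory(Path dir) {
        List<FileRecord> records = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    continue; // skip subdirectories and other non-file entries
                }
                records.add(new FileRecord(
                        entry.getFileName().toString(),
                        Files.getLastModifiedTime(entry).toMillis(),
                        new String(Files.readAllBytes(entry), StandardCharsets.UTF_8)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Creates a throwaway directory with one small file, for demonstration only. */
    static Path sampleDirectory() {
        try {
            Path dir = Files.createTempDirectory("file-records");
            Files.write(dir.resolve("a.json"), "{\"id\": 1}".getBytes(StandardCharsets.UTF_8));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (FileRecord r : readDirectory(sampleDirectory())) {
            System.out.println(r.name + " | " + r.modifiedMillis + " | " + r.content);
        }
    }
}
```

In a Flink job the same three fields would presumably be assembled inside a custom input format from the split it is given, but that part is not shown here.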


Re: Get file metadata

Ronny Bräunlich
Hi Robert,

thank you for your quick answer.
Just one additional question:
When I use the ExecutionEnvironment like this: DataSource<String> files = env.readTextFile("file:///Users/me/path/to/file/dir");
Shouldn’t it read all the files in dir? I have three .json files there, but when I print the result, nothing is shown.

Cheers,
Ronny



Re: Get file metadata

Ronny Bräunlich
Hi Robert,

just ignore my previous question.
My file names started with an underscore, and I just found out that FileInputFormat filters out such files in acceptFile().

Cheers,
Ronny


Re: Get file metadata

rmetzger0
Okay. We filter out files starting with an underscore because that matches Hadoop's behavior.
Hadoop always creates some underscore files, so without the filter Flink would also read them when reading the results of a MapReduce job.
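As a rough sketch of the rule described here, the following plain-Java snippet applies an underscore filter to a list of file names. It is not the Flink source: `UnderscoreFilter` and `accepted` are hypothetical names, and the additional check for names starting with a dot is an assumption about what such a filter typically also rejects. `_SUCCESS` is used as an example of a marker file that Hadoop writes next to job output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration of the filtering rule; not copied from Flink.
final class UnderscoreFilter {

    /**
     * Accepts a file name unless it starts with "_" (the rule discussed above).
     * Rejecting names that start with "." as well is an assumption.
     */
    static boolean acceptFile(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    /** Returns only the names that pass acceptFile, in their original order. */
    static List<String> accepted(List<String> fileNames) {
        List<String> result = new ArrayList<>();
        for (String name : fileNames) {
            if (acceptFile(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("_SUCCESS", "_a.json", "part-00000", "b.json");
        System.out.println(accepted(names)); // prints [part-00000, b.json]
    }
}
```

This also shows why a directory whose files all start with an underscore produces an empty result: every name is rejected before any content is read.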
