Ingestion of data into HDFS


Ingestion of data into HDFS

Flavio Pompermaier
Hi to all,

In my use case I have bursts of data to store into HDFS and, once a burst has finished, I want to compact the data into a single directory (as Parquet). From what I know, the usual approach is to use Flume, which automatically ingests data and compacts it based on some configurable policy.
However, I'd like to avoid adding Flume to my architecture because these bursts are not long-lived processes. I just want to write each batch of rows as a single file in some directory and, once the process finishes, read all of them and compact them into a single output directory as Parquet.
It's similar to a streaming process, but (for the moment) I'd like to avoid having a long-lived Flink process listening for incoming data.

Do you have any suggestion for such a process, or is there any example in the Flink code?


Best,
Flavio

Re: Ingestion of data into HDFS

Fabian Hueske-2
I'm not sure if I got your question right.

Do you want to know if it is possible to implement a Flink program that reads several files and writes their data into a Parquet format?
Or are you asking how such a job could be scheduled for execution based on some external event (such as a file appearing)?

Both should be possible.

The job would be a simple pipeline, with or without some transformations depending on the required logic, ending in a Parquet data sink.
Its execution can be triggered from outside of Flink, for example by a monitoring process or a cron job that calls the CLI client with the right parameters.
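
Just to illustrate, here is a rough sketch of such a job, using Flink's Hadoop OutputFormat wrapper together with parquet-avro. The Avro schema, the paths, and the CSV staging format are only placeholders, and it assumes parquet-avro 1.8+ and the Hadoop compatibility classes are on the classpath:

-----

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class CompactToParquet {

    // Placeholder schema; use whatever matches your rows.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
        + "{\"name\":\"key\",\"type\":\"string\"},{\"name\":\"value\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the staged files (CSV here, just for simplicity) and drop duplicates.
        DataSet<Tuple2<String, String>> rows = env
            .readCsvFile("hdfs:///staging/bursts")
            .types(String.class, String.class)
            .distinct();

        // Configure Parquet's Hadoop OutputFormat and wrap it so Flink can use it as a sink.
        Job job = Job.getInstance();
        AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(SCHEMA_JSON));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///warehouse/compacted"));
        HadoopOutputFormat<Void, GenericRecord> parquetSink =
            new HadoopOutputFormat<Void, GenericRecord>(new AvroParquetOutputFormat<GenericRecord>(), job);

        // Convert tuples to Avro records and write them as Parquet.
        rows.map(new MapFunction<Tuple2<String, String>, Tuple2<Void, GenericRecord>>() {
                @Override
                public Tuple2<Void, GenericRecord> map(Tuple2<String, String> t) {
                    // For real jobs, parse the schema once in a RichMapFunction's open().
                    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
                    GenericRecord r = new GenericData.Record(schema);
                    r.put("key", t.f0);
                    r.put("value", t.f1);
                    return new Tuple2<Void, GenericRecord>(null, r);
                }
            })
            .output(parquetSink);

        env.execute("Compact staged bursts to Parquet");
    }
}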

Best, Fabian




Re: Ingestion of data into HDFS

Stephan Ewen
If you simply want to trigger two Flink jobs one after the other, you can do that in a single program.

Since the "env.execute()" call blocks, the second job starts only after the first one has finished.

-----

ExecutionEnvironment env1 = ExecutionEnvironment.getExecutionEnvironment();
// program 1
env1.execute();

ExecutionEnvironment env2 = ExecutionEnvironment.getExecutionEnvironment();
// program 2
env2.execute();


Re: Ingestion of data into HDFS

Flavio Pompermaier
Hi Stephan and Fabian,
I'll try to make it clearer. :)
In my use case I have a webapp that receives a series of row sets one after the other (I also have a start and an end event that determine when such a process starts and ends).
At every request, the batch of rows is translated into another batch of rows (i.e., what in Flink would be a flatMap) that I want to store in HDFS or a local fs. This is a non-Flink part, but I need to decide which format to use. I imagine Tuples could be ok, and thus I could store those rows with TypeSerializerInputFormat/TypeSerializerOutputFormat, one file per batch.
Those files will contain duplicated tuples, so once this first process finishes I need to read all those files and save them into a Parquet directory.
What I'd like to know is how I can, for example, generate a new file for each batch (I was thinking of using UUIDs), or whether there is something in Flink to manage such an append-like mechanism and the compaction that follows.
Also, the possibility to run distinct on tuples with null values would be a very nice improvement to Flink...
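
Something along these lines is what I have in mind for the staging part, just to make it concrete. The paths and the tuple type are placeholders for my real data, and here I fire a short local Flink batch job per burst from the webapp:

-----

import java.util.List;
import java.util.UUID;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TypeSerializerOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;

public class StageBatch {

    // Write one translated batch into its own UUID-named staging directory.
    public static void stage(List<Tuple2<String, String>> translatedRows) throws Exception {
        // Runs in-process; no long-lived Flink cluster needed.
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();

        // Assumes the batch is non-empty so the element type can be extracted.
        DataSet<Tuple2<String, String>> batch = env.fromCollection(translatedRows);

        // A fresh directory per batch, so batches never overwrite each other.
        TypeSerializerOutputFormat<Tuple2<String, String>> format =
            new TypeSerializerOutputFormat<>();
        format.setOutputFilePath(new Path("hdfs:///staging/bursts/" + UUID.randomUUID()));

        batch.output(format);
        env.execute("stage one batch");
    }
}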

Best,
Flavio


Re: Ingestion of data into HDFS

Fabian Hueske-2
So, the question is how you can control the names of the output files such that they do not overwrite other files, right?

There are two ways to do that:

1) The easiest way is to write the output of each job to its own nested directory. Flink (and I believe also Hadoop) can read files in nested directories. For Flink, you have to set the boolean configuration parameter "recursive.file.enumeration" of the FileInputFormat to true (see the sketch below).

2) If nested directories are not an option, you can extend the FileOutputFormat and override the method getDirectoryFileName(int i) to specify the name of the output file of the i-th output task (also sketched below).
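
Here is a rough sketch of both options; the paths, class names, and the batch id scheme are just placeholders:

-----

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.Path;

public class FileNamingSketch {

    // Option 1: enable recursive enumeration so one read picks up all nested
    // per-batch directories under the staging path.
    public static DataSet<String> readAllBatches(ExecutionEnvironment env) {
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);
        return env.readTextFile("hdfs:///staging/bursts")
                  .withParameters(parameters);
    }

    // Option 2: an output format whose file names include a per-run id,
    // so files written by different runs cannot collide.
    public static class BatchFileOutputFormat extends TextOutputFormat<String> {

        private final String batchId;   // e.g. a UUID chosen per run

        public BatchFileOutputFormat(Path outputPath, String batchId) {
            super(outputPath);
            this.batchId = batchId;
        }

        @Override
        protected String getDirectoryFileName(int taskNumber) {
            // the default is just the task index; prefix it with the batch id
            return batchId + "-" + (taskNumber + 1);
        }
    }
}

You would use the second one as the sink of your job, e.g. result.output(new FileNamingSketch.BatchFileOutputFormat(new Path("hdfs:///staging/bursts"), UUID.randomUUID().toString())), with java.util.UUID providing the per-run id.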

Best, Fabian
