Hadoop compatibility and HBase bulk loading

Hadoop compatibility and HBase bulk loading

Flavio Pompermaier
Hi guys,

I have a question about Hadoop compatibility.
In https://flink.apache.org/news/2014/11/18/hadoop-compatibility.html you say that existing MapReduce programs can be reused.
Would it also be possible to handle complex MapReduce programs, such as the HBase bulk import, that use for example a custom partitioner (org.apache.hadoop.mapreduce.Partitioner)?

In the bulk-import examples, the call to HFileOutputFormat2.configureIncrementalLoadMap sets a series of job parameters (partitioner, mapper, reducer, etc.): http://pastebin.com/8VXjYAEf.
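
For reference, here is a minimal sketch of the kind of driver behind that call, written against the closely related HFileOutputFormat2.configureIncrementalLoad variant (HBase 1.x-era signatures assumed; MyImportMapper, the column family, and the table name are made-up placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class BulkLoadDriver {

    // Placeholder mapper: parses "rowkey,value" lines into Put mutations.
    public static class MyImportMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(MyImportMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // This one call is what makes the job "complex": it installs HBase's
        // TotalOrderPartitioner (keyed on region boundaries), the cell-sorting
        // reducer, and HFileOutputFormat2 as the output format -- exactly the
        // job-level settings a Flink translation would have to reproduce.
        TableName tableName = TableName.valueOf("my_table");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}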

Do you think there's any chance to make it run in Flink?

Best,
Flavio
Re: Hadoop compatibility and HBase bulk loading

Fabian Hueske
We had an effort to execute any Hadoop MR program by simply specifying its JobConf and executing it (even embedded in regular Flink programs).
We got quite far but did not complete it (counters and custom grouping/sorting functions for combiners are missing, if I remember correctly).
I don't think that anybody is working on that right now, but it would definitely be a cool feature.
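
What did get merged is the function-level part of that effort: unmodified Hadoop mapred functions can already run as operators inside a Flink DataSet program. A minimal sketch (wrapper class names from the flink-hadoop-compatibility module; the stock Hadoop TokenCountMapper/LongSumReducer and the input path are just for illustration):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction;
import org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class HadoopWordCountOnFlink {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the input with an unmodified Hadoop InputFormat.
        JobConf jobConf = new JobConf();
        FileInputFormat.addInputPath(jobConf, new Path(args[0]));
        DataSet<Tuple2<LongWritable, Text>> input = env.createInput(
                new HadoopInputFormat<LongWritable, Text>(
                        new TextInputFormat(), LongWritable.class, Text.class, jobConf));

        // Run an unmodified Hadoop Mapper and Reducer (the Reducer doubles as
        // combiner). This is the level the current support works at:
        // per-function wrapping, not whole-JobConf translation.
        DataSet<Tuple2<Text, LongWritable>> counts = input
                .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
                        new TokenCountMapper<LongWritable>()))
                .groupBy(0)
                .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
                        new LongSumReducer<Text>(), new LongSumReducer<Text>()));

        counts.print();
    }
}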

Re: Hadoop compatibility and HBase bulk loading

Flavio Pompermaier
I think I could also take care of it, if somebody can help me and guide me a little bit.
How long do you think it would take to complete such a task?

Re: Hadoop compatibility and HBase bulk loading

Fabian Hueske
Hmm, that's a tricky question ;-) I would need to have a closer look. Getting custom comparators for sorting and grouping into the combiner is not trivial, because it touches API, optimizer, and runtime code. However, I did that before for the reducer, and with the recent addition of groupCombine the reducer changes might carry over to the combiner.

I'll be gone next week, but if you want to, we can have a closer look at the problem after that.
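
To make the groupCombine idea concrete, here is a minimal sketch of the DataSet-API hook a wrapped Hadoop combiner would map onto (toy word-count data; wiring custom grouping/sorting comparators into this path is exactly the missing piece):

import org.apache.flink.api.common.functions.GroupCombineFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class GroupCombineSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> words = env.fromElements(
                new Tuple2<>("a", 1), new Tuple2<>("b", 1), new Tuple2<>("a", 1));

        // Pre-aggregation that runs before the shuffle, like a Hadoop Combiner.
        DataSet<Tuple2<String, Integer>> preAggregated = words
                .groupBy(0)
                .combineGroup(new GroupCombineFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public void combine(Iterable<Tuple2<String, Integer>> values,
                                        Collector<Tuple2<String, Integer>> out) {
                        String key = null;
                        int sum = 0;
                        for (Tuple2<String, Integer> v : values) {
                            key = v.f0;
                            sum += v.f1;
                        }
                        out.collect(new Tuple2<>(key, sum));
                    }
                });

        preAggregated.print();
    }
}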

Re: Hadoop compatibility and HBase bulk loading

Flavio Pompermaier
Great! That will be awesome.
Thank you, Fabian

Re: Hadoop compatibility and HBase bulk loading

Flavio Pompermaier
Any progress on this, Fabian? HBase bulk loading is a common task for us, and it's very inconvenient to have to run a separate YARN job to accomplish it...
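
To be precise about what that separate job does: only the HFile generation needs a YARN job; the final handoff of the written HFiles is a plain client-side call. A minimal sketch of that handoff step (HBase 1.x class names assumed; the staging directory and table name are placeholder arguments):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf(args[1]);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            // Move the HFiles staged under args[0] into the table's regions.
            new LoadIncrementalHFiles(conf).doBulkLoad(
                    new Path(args[0]), admin, table, locator);
        }
    }
}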

Re: Hadoop compatibility and HBase bulk loading

Fabian Hueske
No, I'm not aware of anybody working on extending the Hadoop compatibility support.
I also won't have time to work on this any time soon :-(

Re: Hadoop compatibility and HBase bulk loading

Flavio Pompermaier
Do you think it is that complex to support? We could try to implement it ourselves if someone could give us some guidance (at least the big picture).

--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 041809
Re: Hadoop compatibility and HBase bulk loading

Fabian Hueske
Looking at my previous mail, which mentions changes to the API, optimizer, and runtime code of the DataSet API: this would be a major, non-trivial effort, and it would also require that a committer spend a good amount of time on it.

