Integration with Apache Airflow

9 messages

Integration with Apache Airflow

Flavio Pompermaier
Hello everybody,
is there any suggested way/pointer to schedule Flink jobs using Apache Airflow?
What I'd like to achieve is the submission (using Airflow's REST API) of two jobs, where the second one can be executed only if the first one succeeds.

Thanks in advance
Flavio

Re: Integration with Apache Airflow

Flavio Pompermaier
Any advice here?

On Wed, Jan 27, 2021 at 9:49 PM Flavio Pompermaier <[hidden email]> wrote:
[...]

Re: Integration with Apache Airflow

姜鑫
Hi Flavio,

Could you explain what your exact question is? In my opinion, it is possible to define two Airflow operators to submit dependent Flink jobs, as long as the first one runs to completion.

Regards,
Xin

On Feb 1, 2021, at 6:43 PM, Flavio Pompermaier <[hidden email]> wrote:

[...]


Re: Integration with Apache Airflow

Flavio Pompermaier
Hi Xin,
let me state first that I have never used Airflow, so I'm probably missing some background here.
I just want to delegate job scheduling to an established framework, and from what I can see Apache Airflow is probably what I need.
However, I can't find any good blog post or documentation about how to integrate these two technologies using the REST APIs of both services.
I saw that Flink AI decided to use a customized/enhanced version of Airflow [1], but I didn't look into the code to understand how they use it.
In my use case I just want to schedule two Flink batch jobs using Airflow's REST API, where the second one is fired after the first.


Best,
Flavio

On Tue, Feb 2, 2021 at 2:43 AM 姜鑫 <[hidden email]> wrote:
[...]


Re: Integration with Apache Airflow

姜鑫
Hi Flavio,

I think I understand what you need. Apache Airflow is a scheduling framework in which you define your own dependent operators, so you can define a BashOperator that submits a Flink job to your local Flink cluster. For example:
```
t1 = BashOperator(
    task_id='flink-wordcount',
    bash_command='./bin/flink run flink/build-target/examples/batch/WordCount.jar',
    ...
)
```
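For the two dependent jobs you describe, a minimal DAG along the same lines might look like this (a sketch assuming Airflow 2.x; the dag_id, start_date, and jar paths below are placeholders):
```
# Sketch of a two-job DAG, assuming Airflow 2.x and a local Flink install;
# dag_id, start_date, and jar paths are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="flink_two_jobs",
    start_date=datetime(2021, 2, 1),
    schedule_interval=None,  # no schedule; trigger manually or via REST
    catchup=False,
) as dag:
    job1 = BashOperator(
        task_id="flink-job-1",
        bash_command="./bin/flink run /path/to/job1.jar",  # placeholder jar
    )
    job2 = BashOperator(
        task_id="flink-job-2",
        bash_command="./bin/flink run /path/to/job2.jar",  # placeholder jar
    )

    # With the default trigger rule (all_success), job2 starts only after
    # job1 has finished successfully.
    job1 >> job2
```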
Also, Airflow supports submitting jobs to Kubernetes, and you can even implement your own operator if a bash command doesn't meet your needs.

Indeed, Flink AI (flink-ai-extended?) needs an enhanced version of Airflow, but that is mainly for streaming scenarios, where the job never stops. In your case, where both jobs are batch jobs, it doesn't help much. Hope this helps.

Regards,
Xin


On Feb 2, 2021, at 4:30 PM, Flavio Pompermaier <[hidden email]> wrote:

[...]



Re: Integration with Apache Airflow

Arvid Heise
Hi Flavio,

If you know a bit of Python, it's also trivial to add a new Flink operator that uses the REST API.
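As a sketch of what such an operator could look like (the class name and parameters are hypothetical; the /jars/:jarid/run endpoint is part of Flink's REST API and assumes the jar was already uploaded via /jars/upload):
```
# Sketch of a custom Airflow operator that submits a pre-uploaded jar
# through Flink's REST API; class name and parameters are hypothetical.
import requests
from airflow.models.baseoperator import BaseOperator


class FlinkSubmitOperator(BaseOperator):
    def __init__(self, rest_url: str, jar_id: str, **kwargs):
        super().__init__(**kwargs)
        self.rest_url = rest_url  # e.g. "http://localhost:8081"
        self.jar_id = jar_id      # id returned earlier by POST /jars/upload

    def execute(self, context):
        # POST /jars/:jarid/run starts the job and returns its job id.
        resp = requests.post(f"{self.rest_url}/jars/{self.jar_id}/run")
        resp.raise_for_status()
        return resp.json()["jobid"]
```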

In general, I'd consider Airflow to be the best choice for your problem, especially if it gets more complicated in the future (e.g., doing something else if the first job fails, as sketched below).
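For that failure branch, one option is Airflow's trigger rules; a sketch assuming Airflow 2.x (ids and commands are placeholders):
```
# Sketch of a failure branch using trigger rules, assuming Airflow 2.x;
# task ids and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="flink_with_failure_handling",
    start_date=datetime(2021, 2, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    job1 = BashOperator(
        task_id="flink-job-1",
        bash_command="./bin/flink run /path/to/job1.jar",  # placeholder jar
    )
    on_failure = BashOperator(
        task_id="handle-job1-failure",
        bash_command="echo 'job1 failed'",  # placeholder action
        trigger_rule="one_failed",  # runs only if an upstream task failed
    )
    job1 >> on_failure
```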

If you have specific questions, feel free to ask.

Best,

Arvid

On Tue, Feb 2, 2021 at 10:08 AM 姜鑫 <[hidden email]> wrote:
[...]



Re: Integration with Apache Airflow

Flavio Pompermaier
Thank you all for the hints. However, looking at the REST API [1] of Airflow 2.0, I can't find how to set up my DAG (if that is even the right concept).
Do I need to first create a Connection? A DAG? A TaskInstance? How do I specify the two BashOperators?
I was thinking of connecting to Airflow via Java, so I can't use the Python API.
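For reference: in Airflow the DAG itself is defined as a Python file in the scheduler's DAG folder; the stable REST API of Airflow 2.0 cannot create DAGs, but it can trigger and monitor runs of an existing one, and being plain HTTP it works from Java as well. A minimal sketch of the trigger call, shown in Python for brevity (base URL, dag_id, and credentials are placeholders, and the basic-auth backend is assumed to be enabled):
```
# Sketch of triggering an existing DAG through Airflow 2.0's stable REST
# API (POST /api/v1/dags/{dag_id}/dagRuns); base URL, dag_id, and
# credentials are placeholders.
import requests

BASE_URL = "http://localhost:8080/api/v1"

resp = requests.post(
    f"{BASE_URL}/dags/flink_two_jobs/dagRuns",
    json={"conf": {}},        # optional parameters passed to the run
    auth=("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()
print(resp.json()["state"])   # e.g. "queued"
```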


On Tue, Feb 2, 2021 at 10:53 AM Arvid Heise <[hidden email]> wrote:
[...]



Re: Integration with Apache Airflow

Chesnay Schepler
I'm sorry, but aren't these questions better suited for the Airflow mailing lists?

On 2/2/2021 12:35 PM, Flavio Pompermaier wrote:
[...]



Re: Integration with Apache Airflow

Flavio Pompermaier
You're probably right, Chesnay. I asked on this mailing list just to find out whether there are any pointers or blog posts on this topic from a Flink perspective; then the conversation went in the wrong direction.

Best,
Flavio

On Tue, Feb 2, 2021 at 12:48 PM Chesnay Schepler <[hidden email]> wrote:
[...]