Hello everybody,
is there any suggested way/pointer to schedule Flink jobs using Apache Airflow? What I'd like to achieve is the submission (using the REST API of Airflow) of 2 jobs, where the second one can be executed only if the first one succeeds. Thanks in advance, Flavio
Any advice here? On Wed, Jan 27, 2021 at 9:49 PM Flavio Pompermaier <[hidden email]> wrote:
Hi Flavio, Could you explain what your direct question is? In my opinion, it is possible to define two Airflow operators to submit dependent Flink jobs, as long as the first one runs to completion. Regards, Xin
Hi Xin, let me state first that I have never used Airflow, so I may be missing some background here. I just want to externalize the job scheduling to some consolidated framework, and from what I see Apache Airflow is probably what I need. However, I can't find any good blog post or documentation about how to integrate these 2 technologies using the REST APIs of both services. I saw that Flink AI decided to use a customized/enhanced version of Airflow [1] but I didn't look into the code to understand how they use it. In my use case I just want to schedule 2 Flink batch jobs using the REST API of Airflow, where the second one is fired after the first. Best, Flavio On Tue, Feb 2, 2021 at 2:43 AM 姜鑫 <[hidden email]> wrote:
Hi Flavio, I probably understand what you need. Apache Airflow is a scheduling framework in which you can define your own dependent operators, so you can define a BashOperator to submit a Flink job to your local Flink cluster. For example:

```
t1 = BashOperator(
    task_id='flink-wordcount',
    bash_command='./bin/flink run flink/build-target/examples/batch/WordCount.jar',
    ...
)
```

Also, Airflow supports submitting jobs to Kubernetes, and you can even implement your own operator if a bash command doesn't meet your demands. Indeed, Flink AI (flink-ai-extended?) needs an enhanced version of Airflow, but it is mainly for streaming scenarios, which means the job won't stop. In your case, where all the jobs are batch jobs, it doesn't help much. Hope this helps. Regards, Xin
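To make the dependency concrete: below is a minimal, untested sketch of a two-task DAG in which the second Flink job runs only if the first succeeds. The DAG id, jar paths, and schedule are placeholder assumptions, not anything from the thread.

```
# Minimal sketch (untested): two dependent Flink batch jobs as Airflow tasks.
# Assumes Airflow 2.x and a local Flink installation; jar paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="flink_batch_pipeline",
    start_date=datetime(2021, 2, 1),
    schedule_interval=None,  # no schedule; trigger manually (e.g. via the REST API)
    catchup=False,
) as dag:
    job1 = BashOperator(
        task_id="flink-job-1",
        bash_command="./bin/flink run /path/to/job1.jar",
    )
    job2 = BashOperator(
        task_id="flink-job-2",
        bash_command="./bin/flink run /path/to/job2.jar",
    )

    # job2 runs only if job1 succeeds (the default all_success trigger rule)
    job1 >> job2
```

The `job1 >> job2` line is what encodes the "second job only if the first succeeds" requirement: by default, an Airflow task runs only once all of its upstream tasks have succeeded.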
Hi Flavio, If you know a bit of Python, it's also trivial to add a new Flink operator that uses the REST API. In general, I'd consider Airflow to be the best choice for your problem, especially if it gets more complicated in the future (e.g. do something else if the first job fails). If you have specific questions, feel free to ask. Best, Arvid On Tue, Feb 2, 2021 at 10:08 AM 姜鑫 <[hidden email]> wrote:
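As a rough illustration of this suggestion, here is an untested sketch of a custom operator that starts an already-uploaded jar through Flink's REST API (`POST /jars/:jarid/run`). The operator name, JobManager URL, and jar id are all hypothetical placeholders.

```
# Hedged sketch (untested): custom Airflow operator submitting a job via
# Flink's REST API. Assumes the jar was already uploaded to the cluster
# (POST /jars/upload) and that jobmanager_url points at the Flink
# JobManager, e.g. http://localhost:8081.
import requests
from airflow.models import BaseOperator


class FlinkRestSubmitOperator(BaseOperator):  # hypothetical operator name
    def __init__(self, jobmanager_url, jar_id, **kwargs):
        super().__init__(**kwargs)
        self.jobmanager_url = jobmanager_url
        self.jar_id = jar_id

    def execute(self, context):
        # POST /jars/:jarid/run starts the job and returns its job id
        resp = requests.post(f"{self.jobmanager_url}/jars/{self.jar_id}/run")
        resp.raise_for_status()
        return resp.json()["jobid"]
```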
Thank you all for the hints. However, looking at the REST API [1] of Airflow 2.0 I can't find how to set up my DAG (if this is the right concept). Do I need to first create a Connection? A DAG? A TaskInstance? How do I specify the 2 BashOperators? I was thinking of connecting to Airflow via Java, so I can't use the Python API. On Tue, Feb 2, 2021 at 10:53 AM Arvid Heise <[hidden email]> wrote:
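One point of clarification that may help here: the Airflow 2.0 stable REST API does not create DAGs. A DAG (including its BashOperators) is a Python file placed in the scheduler's DAGs folder; the REST API is then used to trigger runs of that existing DAG, so there is no Connection or TaskInstance to create up front. Assuming the DAG sketched earlier and HTTP basic auth enabled on the webserver, the trigger request looks roughly like this; host, credentials, and dag_id are assumptions, and any HTTP client (including a Java one) can issue the same call.

```
# Hedged sketch: trigger a run of an existing DAG via the Airflow 2.0
# stable REST API. Host, credentials, and dag_id are placeholder assumptions;
# basic auth must be enabled in the Airflow configuration.
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/dags/flink_batch_pipeline/dagRuns",
    auth=("admin", "admin"),  # basic-auth credentials
    json={"conf": {}},        # optional per-run configuration
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```

From Java, the same POST with basic auth against `/api/v1/dags/{dag_id}/dagRuns` achieves the identical effect.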
I'm sorry, but aren't these questions better suited for the Airflow mailing lists? On 2/2/2021 12:35 PM, Flavio Pompermaier wrote:
You're probably right Chesnay, I just asked on this mailing list to see if there are any pointers or blog posts about this topic from a Flink perspective. Then the conversation went in the wrong direction. Best, Flavio On Tue, Feb 2, 2021 at 12:48 PM Chesnay Schepler <[hidden email]> wrote: