Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?


LINZ, Arnaud

Hi,

 

I use Flink 1.0.0. I have a persistent YARN container set (a persistent Flink job manager) that I use for streaming jobs, and I use the “yarn-cluster” mode to launch my batches.

 

I’ve just switched “HA” mode on for my persistent streaming job manager and it seems to work; however, my batches no longer work, because they now execute inside the persistent container (and fail because it lacks slots) instead of in a separate standalone job manager.

 

My batch launch options:

 

CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s"

JVM_ARGS="${JVM_ARGS} -Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch"

 

$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION

 

My persistent cluster launch options:

 

export FLINK_HA_OPTIONS="-Dyarn.application-attempts=10 -Drecovery.mode=zookeeper -Drecovery.zookeeper.quorum=${FLINK_HA_ZOOKEEPER_SERVERS} -Drecovery.zookeeper.path.root=${FLINK_HA_ZOOKEEPER_PATH}  -Dstate.backend=filesystem -Dstate.backend.fs.checkpointdir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/checkpoints -Drecovery.zookeeper.storageDir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/recovery/"

 

$FLINK_DIR/yarn-session.sh -Dyarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO $FLINK_HA_OPTIONS -st -d -n $FLINK_NBCONTAINERS -s $FLINK_NBSLOTS -tm $FLINK_MEMORY -qu $FLINK_QUEUE  -nm ${GANESH_TYPE_PF}_KuberaFlink

 

I’ve switched back to the FLINK_HA_OPTIONS="" way of launching the container for now, but that leaves me without HA.

 

Is this an (un)known bug, or am I missing a magic option?

 

Best regards,

Arnaud

 





Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Ufuk Celebi
Hey Arnaud,

The cause of this is probably that both jobs use the same ZooKeeper
root path, in which case all task managers connect to the same leading
job manager.

I think you forgot to add the y in the -Drecovery.mode=standalone
for the batch jobs, e.g.

-yDrecovery.mode=standalone

Can you try this?
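
For reference, a rough, untested sketch of what the batch submission from your first mail would look like with that property added via -yD (reusing your variables; this is only an illustration, not a verified command):

CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s -yD recovery.mode=standalone"

$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION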

– Ufuk


RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

LINZ, Arnaud
Hi,

The ZooKeeper path is only used for my persistent container, and I use a different one for each of my persistent containers.

The -Drecovery.mode=standalone was passed inside the JVM_ARGS ("${JVM_ARGS} -Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch")

I've tried using -yD recovery.mode=standalone on the flink command line too, but it does not solve the problem; it still uses the pre-existing container.

Complete line =
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s -yD recovery.mode=standalone --class com.bouygtel.kubera.main.segstage.MainGeoSegStage /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar  -j /usr/users/datcrypt/alinz/KBR/GOS/log -c /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg

JVM_ARGS =
-Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch


Arnaud



Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Ufuk Celebi
Thanks for the clarification. I think it might be related to the YARN
properties file, which is still being used for the batch jobs. Can you
try to delete it between submissions as a temporary workaround to
check whether it's related?
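
Something along these lines, perhaps (a rough, untested sketch; the default per-user properties file location is assumed here, adjust the path to your user, and the launch variables are reused from the first mail):

mv /tmp/.yarn-properties-$USER /tmp/.yarn-properties-$USER.bak   # move the session's properties file aside
$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION
mv /tmp/.yarn-properties-$USER.bak /tmp/.yarn-properties-$USER   # put it back for streaming submissions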

– Ufuk


RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

LINZ, Arnaud
I've deleted the '/tmp/.yarn-properties-user' file created for the persistent container, and the batches do go into their own container. However, that's not a workable workaround, as I'm no longer able to submit streaming apps to the persistent container that way :)
So it's really a problem of Flink finding the right properties file.

I've added -yD yarn.properties-file.location=/tmp/flink/batch to the batch command line (it is also configured in the JVM_ARGS var), with no change in behaviour. Note that a standalone yarn container does get created, but the job is submitted to the other one.

 Thanks,
Arnaud


RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

LINZ, Arnaud
In reply to this post by Ufuk Celebi
Hi,

I haven't had time to investigate the bad configuration file path issue yet (if you have any idea why yarn.properties-file.location is ignored, you're welcome to share it), but I'm facing another HA problem.

I'm trying to make my custom streaming sources HA compliant by implementing snapshotState() & restoreState(). I would like to test that mechanism in my JUnit tests, because it can be complex, but I was unable to simulate a "recover" on a local flink environment: snapshotState() is never triggered, and throwing an exception inside the execution chain does not lead to recovery but ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call.

Is there a way to locally test this mechanism (other than poorly simulating it by explicitly calling snapshot & restore in an overridden source)?

Thanks,
Arnaud


RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

LINZ, Arnaud
In reply to this post by Ufuk Celebi
Ooopsss....
My mistake, snapshot/restore do work in a local env; I had a weird configuration issue!

But I still have the properties file path issue :)


Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Ufuk Celebi
I've created an issue here: https://issues.apache.org/jira/browse/FLINK-4079

Hopefully it will be fixed in 1.1 and we can provide a bugfix for 1.0.4.


Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Maximilian Michels
In reply to this post by LINZ, Arnaud
Hi Arnaud,

One issue per thread please. That makes things a lot easier for us :)

Something positive first: We are reworking the resuming of existing
Flink Yarn applications. It'll be much easier to resume a cluster
using simply the Yarn ID, or by re-discovering the Yarn session using
the properties file.

The dynamic properties are a shortcut for modifying the Flink
configuration of the cluster _only_ upon startup. Afterwards, they are
already set on the containers. We might change this for the 1.1.0
release. It should work if you put "yarn.properties-file.location:
/custom/location" in your flink-conf.yaml before you execute
"./bin/flink".

Cheers,
Max


Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Maximilian Michels
Just had a quick chat with Ufuk. The issue is that in 1.x the Yarn
properties file is loaded regardless of whether "-m yarn-cluster" is
specified on the command line. This loads the dynamic properties from
the Yarn properties file and applies all of the running (session)
cluster's configuration to the to-be-created cluster.

Will be fixed in 1.1 and probably backported to 1.0.4.

Reply | Threaded
Open this post in threaded view
|

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

LINZ, Arnaud
In reply to this post by Maximilian Michels
Okay, is there a way to specify the flink-conf.yaml to use on the ./bin/flink command-line? I see no such option. I guess I have to set FLINK_CONF_DIR before the call?

-----Original Message-----
From: Maximilian Michels [mailto:[hidden email]]
Sent: Wednesday, June 15, 2016 18:06
To: [hidden email]
Subject: Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Hi Arnaud,

One issue per thread please. That makes things a lot easier for us :)

Something positive first: We are reworking the resuming of existing Flink Yarn applications. It'll be much easier to resume a cluster by simply using the Yarn ID or by re-discovering the Yarn session using the properties file.

The dynamic properties are a shortcut to modifying the Flink configuration of the cluster _only_ upon startup. Afterwards, they are already set on the containers. We might change this for the 1.1.0 release. It should work if you put "yarn.properties-file.location: /custom/location" in your flink-conf.yaml before you execute "./bin/flink".

Cheers,
Max
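
A minimal sketch of the flink-conf.yaml approach Max describes, assuming the client-side configuration lives under /usr/lib/flink/conf and reusing the /tmp/flink/batch location from this thread (both paths are examples, not requirements):

  # Put the properties-file location in the flink-conf.yaml read by the submitting
  # client (as a plain config entry, not as a -yD dynamic property):
  echo "yarn.properties-file.location: /tmp/flink/batch" >> /usr/lib/flink/conf/flink-conf.yaml

  # Later submissions should then look for the YARN session properties file under
  # /tmp/flink/batch instead of the default /tmp/.yarn-properties-<user> of the HA session.
  /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 ...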

On Wed, Jun 15, 2016 at 3:14 PM, LINZ, Arnaud <[hidden email]> wrote:

> Ooopsss....
> My mistake, snapshot/restore does work in a local env; I had a weird configuration issue!
>
> But I still have the property file path issue :)
>
> -----Original Message-----
> From: LINZ, Arnaud
> Sent: Wednesday, June 15, 2016 14:35
> To: '[hidden email]' <[hidden email]>
> Subject: RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?
>
> Hi,
>
> I haven't had the time to investigate the bad configuration file path issue yet (if you have any idea why yarn.properties-file.location is ignored, I'd welcome it), but I'm facing another HA problem.
>
> I'm trying to make my custom streaming sources HA compliant by implementing snapshotState() & restoreState(). I would like to test that mechanism in my JUnit tests, because it can be complex, but I was unable to simulate a "recover" on a local flink environment: snapshotState() is never triggered, and throwing an exception inside the execution chain does not lead to recovery but ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call.
>
> Is there a way to locally test this mechanism (other than poorly simulating it by explicitly calling snapshot & restore in an overridden source)?
>
> Thanks,
> Arnaud
>
> -----Original Message-----
> From: LINZ, Arnaud
> Sent: Monday, June 6, 2016 17:53
> To: [hidden email]
> Subject: RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?
>
> I've deleted the '/tmp/.yarn-properties-user' file created for the persistent container, and the batches do go into their own container. However, that's not a workable workaround, as I'm no longer able to submit streaming apps in the persistent container that way :) So it's really a problem of flink finding the right property file.
>
> I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the batch command line (also configured in the JVM_ARGS var), with no change of behaviour. Note that I do have a standalone yarn container created, but the job is submitted in the other one.
>
>  Thanks,
> Arnaud
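
Spelled out, that temporary check (deleting the session properties file between submissions) amounts to roughly the following; the file name is the default one quoted above, so adjust it for the actual user:

  # Remove the properties file written by the persistent YARN session...
  rm /tmp/.yarn-properties-user
  # ...then resubmit the batch; without the file, the CLI no longer attaches to the HA session.
  /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 ...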
>
> -----Original Message-----
> From: Ufuk Celebi [mailto:[hidden email]]
> Sent: Monday, June 6, 2016 16:01
> To: [hidden email]
> Subject: Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?
>
> Thanks for clarification. I think it might be related to the YARN properties file, which is still being used for the batch jobs. Can you try to delete it between submissions as a temporary workaround to check whether it's related?
>
> – Ufuk
>
> On Mon, Jun 6, 2016 at 3:18 PM, LINZ, Arnaud <[hidden email]> wrote:
>> Hi,
>>
>> The zookeeper path is only for my persistent container, and I do use a different one for all my persistent containers.
>>
>> The -Drecovery.mode=standalone was passed inside the JVM_ARGS ("${JVM_ARGS} -Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch")
>>
>> I've tried using -yD recovery.mode=standalone on the flink command line too, but it does not solve the problem; it still uses the pre-existing container.
>>
>> Complete line =
>> /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 \
>>   -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s -yD recovery.mode=standalone \
>>   --class com.bouygtel.kubera.main.segstage.MainGeoSegStage \
>>   /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar \
>>   -j /usr/users/datcrypt/alinz/KBR/GOS/log -c /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg
>>
>> JVM_ARGS =
>> -Drecovery.mode=standalone
>> -Dyarn.properties-file.location=/tmp/flink/batch
>>
>>
>> Arnaud
Reply | Threaded
Open this post in threaded view
|

Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Till Rohrmann
Hi Arnaud,

At the moment the environment variable is the only way to specify a different config directory for the CLIFrontend. But it totally makes sense to introduce a --configDir parameter for the flink shell script. I'll open an issue for this.

Cheers,
Till
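
A sketch of the FLINK_CONF_DIR route for the batch submissions, assuming a dedicated copy of the configuration directory (the conf-batch path below is only an example):

  # Dedicated client-side configuration for batch jobs, separate from the HA session's
  cp -r /usr/lib/flink/conf /usr/lib/flink/conf-batch
  # In conf-batch/flink-conf.yaml, e.g.:
  #   recovery.mode: standalone
  #   yarn.properties-file.location: /tmp/flink/batch
  export FLINK_CONF_DIR=/usr/lib/flink/conf-batch
  /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 ...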

Reply | Threaded
Open this post in threaded view
|

Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Maximilian Michels
+1 for a CLI parameter for loading the config from a custom location

Reply | Threaded
Open this post in threaded view
|

Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

Ufuk Celebi
I've created an issue for this here:
https://issues.apache.org/jira/browse/FLINK-4095
