Hi,

I use Flink 1.0.0. I have a persistent YARN container set (a persistent Flink job manager) that I use for streaming jobs, and I use the "yarn-cluster" mode to launch my batches.

I've just switched "HA" mode on for my streaming persistent job manager and it seems to work; however, my batches are no longer working, because they now run inside the persistent container (and fail because it lacks slots) instead of in a separate standalone job manager.

My batch launch options:

CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s"
JVM_ARGS="${JVM_ARGS} -Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch"

$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION

My persistent cluster launch options:

export FLINK_HA_OPTIONS="-Dyarn.application-attempts=10 -Drecovery.mode=zookeeper -Drecovery.zookeeper.quorum=${FLINK_HA_ZOOKEEPER_SERVERS} -Drecovery.zookeeper.path.root=${FLINK_HA_ZOOKEEPER_PATH} -Dstate.backend=filesystem -Dstate.backend.fs.checkpointdir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/checkpoints -Drecovery.zookeeper.storageDir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/recovery/"

$FLINK_DIR/yarn-session.sh -Dyarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO $FLINK_HA_OPTIONS -st -d -n $FLINK_NBCONTAINERS -s $FLINK_NBSLOTS -tm $FLINK_MEMORY -qu $FLINK_QUEUE -nm ${GANESH_TYPE_PF}_KuberaFlink

I've switched back to the FLINK_HA_OPTIONS="" way of launching the container for now, but I lack HA.

Is it a (un)known bug or am I missing a magic option?

Best regards,
Arnaud
Hey Arnaud,
The cause of this is probably that both jobs use the same ZooKeeper root path, in which case all task managers connect to the same leading job manager.

I think you forgot to add the "y" in -Drecovery.mode=standalone for the batch jobs, e.g.

-yDrecovery.mode=standalone

Can you try this?

– Ufuk
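For concreteness, a sketch of the batch launch with the recovery mode passed as a YARN dynamic property (option values reused from the script quoted above; this is Ufuk's suggestion applied to it, not a verified fix):

    # Override recovery.mode for the per-job YARN cluster via a -yD dynamic
    # property instead of a client-side -D JVM option.
    $FLINK_DIR/flink run -m yarn-cluster \
      -yn $FLINK_NBCONTAINERS -ytm $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS \
      -yD yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s \
      -yD recovery.mode=standalone \
      --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION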
Hi,
The ZooKeeper path is only for my persistent container, and I do use a different one for each of my persistent containers.

The -Drecovery.mode=standalone was passed inside JVM_ARGS ("${JVM_ARGS} -Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch").

I've tried using -yD recovery.mode=standalone on the flink command line too, but it does not solve the problem; it still uses the pre-existing container.

Complete command line:

/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s -yD recovery.mode=standalone --class com.bouygtel.kubera.main.segstage.MainGeoSegStage /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar -j /usr/users/datcrypt/alinz/KBR/GOS/log -c /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg

JVM_ARGS:

-Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch

Arnaud
Thanks for the clarification. I think it might be related to the YARN properties file, which is still being used for the batch jobs. Can you try to delete it between submissions, as a temporary workaround, to check whether it's related?

– Ufuk
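A sketch of that temporary workaround, assuming the properties file lives at the default location mentioned later in this thread (/tmp/.yarn-properties-<user>); the path may differ per installation:

    # Remove the YARN session properties file so the CLI cannot attach to the
    # running session cluster, then submit the batch as a fresh per-job cluster.
    rm -f /tmp/.yarn-properties-$USER
    $FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION

Note that this breaks submissions against the persistent session until its properties file is recreated.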
I've deleted the '/tmp/.yarn-properties-user' file created for the persistent container, and the batches do go into their own container. However, that's not a workable workaround, as I'm no longer able to submit streaming apps to the persistent container that way :) So it's really a problem of Flink finding the right properties file.

I've added -yD yarn.properties-file.location=/tmp/flink/batch to the batch command line (it is also configured in the JVM_ARGS variable), with no change of behaviour. Note that a standalone YARN container does get created, but the job is submitted to the other one.

Thanks,
Arnaud
In reply to this post by Ufuk Celebi
Hi,
I haven't had the time to investigate the wrong configuration file path issue yet (if you have any idea why yarn.properties-file.location is ignored, you are welcome to share it), but I'm facing another HA problem.

I'm trying to make my custom streaming sources HA-compliant by implementing snapshotState() & restoreState(). I would like to test that mechanism in my JUnit tests, because it can be complex, but I was unable to simulate a recovery in a local Flink environment: snapshotState() is never triggered, and throwing an exception inside the execution chain does not lead to recovery but ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call.

Is there a way to test this mechanism locally (other than poorly simulating it by explicitly calling snapshot & restore in an overridden source)?

Thanks,
Arnaud
In reply to this post by Ufuk Celebi
Ooopsss....
My mistake, snapshot/restore do work in a local environment; I had a weird configuration issue!

But I still have the properties file path issue :)
I've created an issue here: https://issues.apache.org/jira/browse/FLINK-4079
Hopefully it will be fixed in 1.1 and we can provide a bugfix for 1.0.4.
In reply to this post by LINZ, Arnaud
Hi Arnaud,
One issue per thread, please. That makes things a lot easier for us :)

Something positive first: we are reworking the resuming of existing Flink YARN applications. It'll be much easier to resume a cluster by simply using the YARN ID, or by re-discovering the YARN session via the properties file.

The dynamic properties are a shortcut for modifying the Flink configuration of the cluster _only_ upon startup; afterwards, they are already set on the containers. We might change this for the 1.1.0 release. It should work if you put "yarn.properties-file.location: /custom/location" in your flink-conf.yaml before you execute "./bin/flink".

Cheers,
Max
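For illustration, a minimal sketch of that approach using a batch-specific copy of the configuration, so the persistent session's setup is left untouched (the directory names and the /usr/lib/flink install path are assumptions based on earlier messages in this thread):

    # Create a configuration directory used only for batch submissions and point
    # yarn.properties-file.location away from the file written by the HA session.
    mkdir -p /tmp/flink/batch-conf
    cp /usr/lib/flink/conf/flink-conf.yaml /tmp/flink/batch-conf/
    echo "yarn.properties-file.location: /tmp/flink/batch" >> /tmp/flink/batch-conf/flink-conf.yaml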
Just had a quick chat with Ufuk. The issue is that in 1.x the YARN properties file is loaded regardless of whether "-m yarn-cluster" is specified on the command line. This loads the dynamic properties from the YARN properties file and applies all configuration of the running (session) cluster to the to-be-created cluster. It will be fixed in 1.1 and probably backported to 1.0.4.
In reply to this post by Maximilian Michels
Okay, is there a way to specify the flink-conf.yaml to use on the ./bin/flink command-line? I see no such option. I guess I have to set FLINK_CONF_DIR before the call ?
-----Message d'origine----- De : Maximilian Michels [mailto:[hidden email]] Envoyé : mercredi 15 juin 2016 18:06 À : [hidden email] Objet : Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ? Hi Arnaud, One issue per thread please. That makes things a lot easier for us :) Something positive first: We are reworking the resuming of existing Flink Yarn applications. It'll be much easier to resume a cluster using simply the Yarn ID or re-discoering the Yarn session using the properties file. The dynamic properties are a shortcut to modifying the Flink configuration of the cluster _only_ upon startup. Afterwards, they are already set at the containers. We might change this for the 1.1.0 release. It should work if you put "yarn.properties-file.location: /custom/location" in your flink-conf.yaml before you execute "./bin/flink". Cheers, Max On Wed, Jun 15, 2016 at 3:14 PM, LINZ, Arnaud <[hidden email]> wrote: > Ooopsss.... > My mistake, snapshot/restore do works in a local env, I've had a weird configuration issue! > > But I still have the property file path issue :) > > -----Message d'origine----- > De : LINZ, Arnaud > Envoyé : mercredi 15 juin 2016 14:35 > À : '[hidden email]' <[hidden email]> Objet : RE: Yarn > batch not working with standalone yarn job manager once a persistent, HA job manager is launched ? > > Hi, > > I haven't had the time to investigate the bad configuration file path issue yet (if you have any idea why yarn.properties-file.location is ignored you are welcome) , but I'm facing another HA-problem. > > I'm trying to make my custom streaming sources HA compliant by implementing snapshotState() & restoreState(). I would like to test that mechanism in my junit tests, because it can be complex, but I was unable to simulate a "recover" on a local flink environment: snapshotState() is never triggered and launching an exception inside the execution chain does not lead to recovery but ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call. > > Is there a way to locally test this mechanism (other than poorly simulating it by explicitly calling snapshot & restore in a overridden source) ? > > Thanks, > Arnaud > > -----Message d'origine----- > De : LINZ, Arnaud > Envoyé : lundi 6 juin 2016 17:53 > À : [hidden email] > Objet : RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ? > > I've deleted the '/tmp/.yarn-properties-user' file created for the persistent containter, and the batches do go into their own right container. However, that's not a workable workaround as I'm no longer able to submit streaming apps in the persistant container that way :) So it's really a problem of flink finding the right property file. > > I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the batch command line (also configured in the JVM_ARGS var), with no change of behaviour. Note that I do have a standalone yarn container created, but the job is submitted in the other other one. > > Thanks, > Arnaud > > -----Message d'origine----- > De : Ufuk Celebi [mailto:[hidden email]] Envoyé : lundi 6 juin 2016 16:01 À : [hidden email] Objet : Re: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ? > > Thanks for clarification. I think it might be related to the YARN properties file, which is still being used for the batch jobs. 
Hi Arnaud,

at the moment the environment variable is the only way to specify a different config directory for the CLIFrontend. But it totally makes sense to introduce a --configDir parameter for the flink shell script. I'll open an issue for this.

Cheers,
Till

On Thu, Jun 16, 2016 at 5:36 PM, LINZ, Arnaud <[hidden email]> wrote:
> Okay, is there a way to specify the flink-conf.yaml to use on the ./bin/flink command line? I see no such option. I guess I have to set FLINK_CONF_DIR before the call?
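A minimal sketch of the environment-variable workaround Till refers to (the /tmp/flink/batch-conf directory is the same placeholder as above; the run command reuses the variables from the batch launch script earlier in the thread):

# Sketch only: make the CLI frontend read the batch-specific configuration
# directory before submitting, instead of the directory used by the HA session.
export FLINK_CONF_DIR=/tmp/flink/batch-conf
$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP $listArgs $ACTION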
+1 for a CLI parameter for loading the config from a custom location
I've created an issue for this here:
https://issues.apache.org/jira/browse/FLINK-4095