Flink distribution housekeeping for YARN sessions

Flink distribution housekeeping for YARN sessions

Theo
Hi there,

Today I realized that we have accumulated a lot of Flink distribution jar files that are never cleaned up, and I would like to know what to do about this, i.e. how to housekeep them properly.

In the HDFS home directory of the submitting user, I find a subdirectory called `.flink` with hundreds of subfolders like `application_1573731655031_0420`, each having the following structure:

-rw-r--r--   3 dev dev        861 2020-01-27 21:17 /user/dev/.flink/application_1580155950981_0010/4797ff6e-853b-460c-81b3-34078814c5c9-taskmanager-conf.yaml
-rw-r--r--   3 dev dev        691 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/application_1580155950981_0010-flink-conf.yaml2755466919863419496.tmp
-rw-r--r--   3 dev dev        861 2020-01-27 21:17 /user/dev/.flink/application_1580155950981_0010/fdb5ef57-c140-4f6d-9791-c226eb1438ce-taskmanager-conf.yaml
-rw-r--r--   3 dev dev     92.2 M 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/flink-dist_2.11-1.9.1.jar
drwxr-xr-x   - dev dev          0 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/lib
-rw-r--r--   3 dev dev      2.6 K 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/log4j.properties
-rw-r--r--   3 dev dev      2.3 K 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/logback.xml
drwxr-xr-x   - dev dev          0 2020-01-27 21:16 /user/dev/.flink/application_1580155950981_0010/plugins

With tons of those folders (one for each Flink session we launched and killed in our CI/CD pipeline), they sum up to several terabytes of used space in our HDFS.
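
A quick way to verify the accumulated size (a sketch, using the staging path from the listing above):

# total size of everything under the staging directory
hdfs dfs -du -s -h /user/dev/.flink
# size per leftover application directory
hdfs dfs -du -h /user/dev/.flink
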
I suppose I kill our Flink sessions wrongly. We start and stop sessions and jobs separately like so:

Start:
${OS_ROOT}/flink/bin/yarn-session.sh -jm 4g -tm 32g --name "${FLINK_SESSION_NAME}" -d -Denv.java.opts="-XX:+HeapDumpOnOutOfMemoryError"
${OS_ROOT}/flink/bin/flink run -m ${FLINK_HOST} [..savepoint/checkpoint options...] -d -n "${JOB_JAR}" $*
Stop:
${OS_ROOT}/flink/bin/flink stop -p ${SAVEPOINT_BASEDIR}/${FLINK_JOB_NAME} -m ${FLINK_HOST} ${ID}
yarn application -kill "${ID}"

`yarn application -kill` was the best I could find, as the Flink documentation only states that the session process should be closed ("Stop the YARN session by stopping the unix process (using CTRL+C) or by entering ‘stop’ into the client.").

Now my question: Is there a more elegant way to kill a YARN session (remotely, from some host in the cluster, not necessarily the one that started the detached session) which also takes care of the housekeeping? Or should I do the housekeeping manually myself? (It would be pretty easy to script; a rough sketch follows below.) Should I expect any other side effects when killing the session with `yarn application -kill`?
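
For illustration, a minimal cleanup sketch of what I have in mind (untested; it assumes the staging layout shown above and that `yarn application -status` prints a "State" line for applications YARN still knows about):

#!/usr/bin/env bash
# Sketch: delete .flink staging directories whose YARN application is no
# longer alive. Path and parsing are assumptions, not tested advice.
STAGING_DIR="/user/dev/.flink"

for dir in $(hdfs dfs -ls "${STAGING_DIR}" | awk '{print $NF}' | grep 'application_'); do
    app_id=$(basename "${dir}")
    # "State : RUNNING", "State : FINISHED", ...; empty if the ID is unknown
    state=$(yarn application -status "${app_id}" 2>/dev/null | awk '$1 == "State" {print $3}')
    case "${state}" in
        NEW|NEW_SAVING|SUBMITTED|ACCEPTED|RUNNING)
            ;;  # still alive, keep its staging directory
        *)
            hdfs dfs -rm -r -skipTrash "${dir}"
            ;;
    esac
done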

Best regards
Theo

--
SCOOP Software GmbH - Gut Maarhausen - Eiler Straße 3 P - D-51107 Köln
Theo Diefenthal

T +49 221 801916-196 - F +49 221 801916-17 - M +49 160 90506575
[hidden email] - www.scoop-software.de
Registered office: Köln, Commercial register: Köln,
Commercial register number: HRB 36625
Managing directors: Dr. Oleg Balovnev, Frank Heinen,
Martin Müller-Rohde, Dr. Wolfgang Reddig, Roland Scheel

Re: Flink distribution housekeeping for YARN sessions

Till Rohrmann
Hi Theo,

Your assumption is correct: Flink won't clean up its files when you use `yarn application -kill ID`. The same should hold true for other temporary files generated by Flink's blob service, shuffle service and I/O manager. Those files are usually stored under /tmp, though, and should eventually be cleaned up.

I think a better approach is to reconnect to the Flink YARN session cluster and issue the "stop" command there. You can either attach interactively via `bin/yarn-session.sh -id APP_ID` and then type "stop", or run `echo "stop" | bin/yarn-session.sh -id APP_ID`.
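
Applied to your stop step, the kill could then be replaced by something like this (a sketch reusing your variables, with ${ID} here being the YARN application ID):

# Instead of `yarn application -kill "${ID}"`: reattach to the session and
# stop it cleanly, which should also remove the staging directory under .flink.
# This works from any host that has the Hadoop/YARN client configuration.
echo "stop" | ${OS_ROOT}/flink/bin/yarn-session.sh -id "${ID}"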

I think we should also update the logging statements of yarn-session.sh, which currently say that you should use `yarn application -kill` to stop the process.

Cheers,
Till

Re: Flink distribution housekeeping for YARN sessions

Till Rohrmann
Here is the corresponding JIRA ticket: https://issues.apache.org/jira/browse/FLINK-15806
