Hi Steve,
In the past we had some problems with cleaning up old checkpoints, but that was in 1.1.x. These problems should be fixed by now.
Could you try upgrading to Flink 1.2.1 to see whether the problem persists? If it does, it would be great if you could share the JobManager logs at DEBUG log level with us.
How long is your checkpoint interval? Deleting files from HDFS/S3 can take some time, and if the checkpoint interval is shorter than the time a deletion takes, the system won't be able to delete old checkpoints quickly enough.
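To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in plain Java. It is not Flink code: the class name, the method name, and the numbers are made up for illustration, and it assumes one checkpoint becomes obsolete per interval and that deletions run one at a time at a constant duration.

```java
public class CheckpointBacklog {
    // Hypothetical model, not Flink's actual cleanup logic:
    // one checkpoint becomes obsolete every intervalMs, and old checkpoints
    // are deleted sequentially, each deletion taking deleteMs.
    static long pendingDeletions(long intervalMs, long deleteMs, long elapsedMs) {
        long completed = elapsedMs / intervalMs;                  // checkpoints produced so far
        long deleted = Math.min(completed, elapsedMs / deleteMs); // cannot delete more than exist
        return completed - deleted;                               // checkpoints still waiting for cleanup
    }

    public static void main(String[] args) {
        long oneHourMs = 60 * 60 * 1000L;
        // Interval (5 s) shorter than deletion time (8 s): the backlog keeps growing.
        System.out.println(pendingDeletions(5_000, 8_000, oneHourMs));  // prints 270
        // Interval (10 s) longer than deletion time (8 s): cleanup keeps up.
        System.out.println(pendingDeletions(10_000, 8_000, oneHourMs)); // prints 0
    }
}
```

With these assumed numbers the backlog after one hour is 270 checkpoints in the first case and none in the second, so a longer interval would be one thing to try.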
Cheers,
Till