Hi Rob,
Yes, this behavior is expected. Flink does not automatically scale down a job when a failure occurs.
You have to ensure that you have enough resources available to continue processing.
In Flink's cluster mode, the common practice is to keep standby TMs available; the same is true for JMs if you need an HA setup.
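To get a rough idea of how many standby TMs that means, here is a minimal sketch of the arithmetic. It assumes the default slot sharing, where a job needs as many slots as its highest operator parallelism, and that every TM offers the same number of slots. The function and parameter names are made up for illustration and are not part of any Flink API.

```python
def standby_tms_needed(job_parallelism, slots_per_tm, running_tms, tolerated_failures=1):
    """Estimate how many extra TMs are needed so the job still fits after failures.

    Assumption: with default slot sharing, the job needs `job_parallelism` slots.
    """
    # TMs required to host the job at full parallelism (ceiling division).
    required_tms = -(-job_parallelism // slots_per_tm)
    # TMs still alive after the tolerated number of failures.
    surviving_tms = running_tms - tolerated_failures
    # Any shortfall has to be covered by standby TMs.
    return max(0, required_tms - surviving_tms)


# Parallelism 8 with 2 slots per TM needs 4 TMs; with 4 running and one lost,
# one standby TM is required to restart the job at the same parallelism.
print(standby_tms_needed(job_parallelism=8, slots_per_tm=2, running_tms=4))
```

If the result is greater than zero, the job cannot be restarted at its configured parallelism after the failure until that many TMs are added.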
Best, Fabian