Hey Joe! This sounds odd... are there any failures (JobManager or
TaskManager) or leader elections being reported? You should see such
events in the JobManager/TaskManager logs.
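
If it helps to line the failures up with those log events, here is a minimal sketch of a probe loop that timestamps every failed query and counts the outcomes. It is not Flink code: the class QueryProbe, the probe method, and the Callable placeholder are hypothetical names for illustration, and you would put your own getKvState call (blocking on its result) inside the Callable.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

public class QueryProbe {

    /**
     * Runs the query `attempts` times, pausing `intervalMillis` between calls.
     * Returns how often each outcome occurred, keyed by "OK" or by the
     * simple class name of the exception that was thrown.
     */
    public static Map<String, Integer> probe(
            Callable<byte[]> query, int attempts, long intervalMillis)
            throws InterruptedException {
        Map<String, Integer> outcomes = new TreeMap<>();
        for (int i = 0; i < attempts; i++) {
            String outcome;
            try {
                query.call();
                outcome = "OK";
            } catch (Exception e) {
                outcome = e.getClass().getSimpleName();
                // Timestamp each failure so it can be matched against
                // JobManager/TaskManager log lines from the same moment.
                System.out.println(Instant.now() + " attempt " + i + " failed: " + e);
            }
            outcomes.merge(outcome, 1, Integer::sum);
            if (i + 1 < attempts) {
                Thread.sleep(intervalMillis);
            }
        }
        return outcomes;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder query (assumption): replace the body with the real
        // queryable state call and wait for its result here.
        Callable<byte[]> query = () -> new byte[0];
        System.out.println(probe(query, 5, 1000L));
    }
}
```

If the failure timestamps cluster around a leader election or a task restart in the logs, that would point at the location lookup being temporarily stale rather than at your job.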
On Tue, May 16, 2017 at 2:28 PM, Joe Olson <[hidden email]> wrote:
> When running Flink in high availability mode, I've been seeing a high number
> of UnknownKvStateKeyGroupLocation errors being returned when using queryable
> state calls.
>
>
> If I put a simple getKvState call into a loop that executes once per second,
> sometimes I get the expected results and sometimes
> UnknownKvStateKeyGroupLocation is thrown. This is not associated with
> a query timeout (network issue).
>
>
> From looking at the Flink source code, this problem stems from
> lookup.getKvStateServerAddress returning null. I know all the task managers
> are registering state with the job manager, because I see the "Key value
> state registered for job xx under name yy" messages in the job server log.
>
>
> Anything else I should be looking for? I have several jobs I am querying
> state on, and this seems isolated to only one. I've gone over the differences
> between the jobs very closely, but they are all built from the same template.
>
>
> What would cause a lookup.getKvStateServerAddress to sometimes succeed, and
> sometimes to fail?