Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging

Philip Doctor
Dear Flink Users,
I'm trying to upgrade to flink 1.5.0, so far everything works except for the Queryable state client.  Now here's where it gets weird.  I have the client sitting behind a web API so the rest of our non-java ecosystem can consume it.  I've got 2 tests, one calls my route directly as a java method call, the other calls the deployed server via HTTP (the difference in the test intending to flex if the server is properly started, etc).

The local call succeeds, the remote call fails, but the failure isn't what I'd expect (around the server layer) it's compalining about contacting the oracle for state location:

<very very long stack trace>
 Caused by: org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not contact the state location oracle to retrieve the state location.\n\tat org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:228)\n
<etc>

The client call is pretty straightforward:
 
        return client.getKvState(
            flinkJobID,
            stateDescriptor.queryableStateName,
            key,
            keyTypeInformation,
            stateDescriptor
        )

I've confirmed via logs that I have the exact same key, flink job ID, and queryable state name. 

So I'm going bonkers on what difference might exist, I'm wondering if I'm packing my jar wrong and there's some resources I need to look out for? (back on flink 1.3.x I had to handle the reference.conf file for AKKA when you were depending on that for Queryable State, is there something like that? etc).  Is there /more logging/ somewhere on the server side that might give me a hint?  like "Tried to query state for X but couldn't find the BananaLever" ?  I'm pretty stuck right now and ready to try any random ideas to move forward.  


The only changes I've made aside from the jar version bump from 1.4.2 to 1.5.0 is handling queryableStateName (which is now nullable), no other code changes were required, and to confirm, this all works just fine with 1.4.2.


Thank you.
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging

Dawid Wysakowicz-2
Hi Philip,
Could you attach the full stack trace? Are you querying the same job/cluster in both tests?
I am also looping in Kostas, who might know more about changes in Queryable state between 1.4.2 and 1.5.0.
Best,
Dawid

On Thu, 19 Jul 2018 at 22:33, Philip Doctor <[hidden email]> wrote:
Dear Flink Users,
I'm trying to upgrade to flink 1.5.0, so far everything works except for the Queryable state client.  Now here's where it gets weird.  I have the client sitting behind a web API so the rest of our non-java ecosystem can consume it.  I've got 2 tests, one calls my route directly as a java method call, the other calls the deployed server via HTTP (the difference in the test intending to flex if the server is properly started, etc).

The local call succeeds, the remote call fails, but the failure isn't what I'd expect (around the server layer) it's compalining about contacting the oracle for state location:

<very very long stack trace>
 Caused by: org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not contact the state location oracle to retrieve the state location.\n\tat org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:228)\n
<etc>

The client call is pretty straightforward:
 
        return client.getKvState(
            flinkJobID,
            stateDescriptor.queryableStateName,
            key,
            keyTypeInformation,
            stateDescriptor
        )

I've confirmed via logs that I have the exact same key, flink job ID, and queryable state name. 

So I'm going bonkers on what difference might exist, I'm wondering if I'm packing my jar wrong and there's some resources I need to look out for? (back on flink 1.3.x I had to handle the reference.conf file for AKKA when you were depending on that for Queryable State, is there something like that? etc).  Is there /more logging/ somewhere on the server side that might give me a hint?  like "Tried to query state for X but couldn't find the BananaLever" ?  I'm pretty stuck right now and ready to try any random ideas to move forward.  


The only changes I've made aside from the jar version bump from 1.4.2 to 1.5.0 is handling queryableStateName (which is now nullable), no other code changes were required, and to confirm, this all works just fine with 1.4.2.


Thank you.
Reply | Threaded
Open this post in threaded view
|

Re: Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging

Philip Doctor

Yeah I just went to reproduce this on a fresh environment, I blew away all the Zookeeper data and the error went away.  I'm running HA JobManager (1 active, 2 standby) and 3 TMs.  I'm not sure how to fully account for this behavior yet, it looks like I can make this run from a totally fresh environment, so until I start practicing how to save point + restore a 1.4.2 -> 1.5.0 job, I look to be good for the moment.  Sorry to have bothered you all.


Thanks.


From: Dawid Wysakowicz <[hidden email]>
Sent: Friday, July 20, 2018 3:09:46 AM
To: Philip Doctor
Cc: user; Kostas Kloudas
Subject: Re: Flink 1.4.2 -> 1.5.0 upgrade Queryable State debugging
 
Hi Philip,
Could you attach the full stack trace? Are you querying the same job/cluster in both tests?
I am also looping in Kostas, who might know more about changes in Queryable state between 1.4.2 and 1.5.0.
Best,
Dawid

On Thu, 19 Jul 2018 at 22:33, Philip Doctor <[hidden email]> wrote:
Dear Flink Users,
I'm trying to upgrade to flink 1.5.0, so far everything works except for the Queryable state client.  Now here's where it gets weird.  I have the client sitting behind a web API so the rest of our non-java ecosystem can consume it.  I've got 2 tests, one calls my route directly as a java method call, the other calls the deployed server via HTTP (the difference in the test intending to flex if the server is properly started, etc).

The local call succeeds, the remote call fails, but the failure isn't what I'd expect (around the server layer) it's compalining about contacting the oracle for state location:

<very very long stack trace>
 Caused by: org.apache.flink.queryablestate.exceptions.UnknownLocationException: Could not contact the state location oracle to retrieve the state location.\n\tat org.apache.flink.queryablestate.client.proxy.KvStateClientProxyHandler.getKvStateLookupInfo(KvStateClientProxyHandler.java:228)\n
<etc>

The client call is pretty straightforward:
 
        return client.getKvState(
            flinkJobID,
            stateDescriptor.queryableStateName,
            key,
            keyTypeInformation,
            stateDescriptor
        )

I've confirmed via logs that I have the exact same key, flink job ID, and queryable state name. 

So I'm going bonkers on what difference might exist, I'm wondering if I'm packing my jar wrong and there's some resources I need to look out for? (back on flink 1.3.x I had to handle the reference.conf file for AKKA when you were depending on that for Queryable State, is there something like that? etc).  Is there /more logging/ somewhere on the server side that might give me a hint?  like "Tried to query state for X but couldn't find the BananaLever" ?  I'm pretty stuck right now and ready to try any random ideas to move forward.  


The only changes I've made aside from the jar version bump from 1.4.2 to 1.5.0 is handling queryableStateName (which is now nullable), no other code changes were required, and to confirm, this all works just fine with 1.4.2.


Thank you.