Task manager number mismatch container number on mesos


Renjie Liu
Hi, all:
We are using Flink 1.2.0 on Mesos. Occasionally, the number of task managers does not match the number of containers: a Mesos container still exists, but the corresponding task manager does not appear on the Flink master's monitoring web page. This doesn't happen frequently and is hard to reproduce.
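For anyone who wants to detect this condition automatically, here is a minimal sketch. It assumes you have already fetched and parsed two JSON documents: a flat list of Mesos task records, each with a "state" field, and the Flink monitoring API's task manager overview with a "taskmanagers" list. The function names and the exact shape of the input are assumptions for illustration, not something taken from our setup.

```python
from typing import Any, Dict, List


def count_running_mesos_tasks(tasks: List[Dict[str, Any]]) -> int:
    """Count Mesos task records whose state is TASK_RUNNING.

    `tasks` is assumed to be a flat list of task dicts that was already
    extracted from the Mesos master's state output.
    """
    return sum(1 for task in tasks if task.get("state") == "TASK_RUNNING")


def count_registered_taskmanagers(overview: Dict[str, Any]) -> int:
    """Count task managers listed in the (assumed) Flink overview JSON."""
    return len(overview.get("taskmanagers", []))


def container_mismatch(tasks: List[Dict[str, Any]], overview: Dict[str, Any]) -> int:
    """Return running containers minus registered task managers.

    A positive result means some containers are running without a
    matching task manager registered at the Flink master.
    """
    return count_running_mesos_tasks(tasks) - count_registered_taskmanagers(overview)
```

A result greater than zero that persists for several minutes would correspond to the situation described above.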
--
Liu, Renjie
Software Engineer, MVAD

Re: Task manager number mismatch container number on mesos

Ufuk Celebi
When it happens, is it temporary or permanent?

Looping in Till and Eron who worked on the Mesos runner.

– Ufuk


Re: Task manager number mismatch container number on mesos

Renjie Liu
Permanent. I've waited for several minutes and the task manager is still lost.


Re: Task manager number mismatch container number on mesos

rmetzger0
Could you provide the logs of the task manager that is still running as a container but doesn't show up as a TaskManager?


Re: Task manager number mismatch container number on mesos

Renjie Liu
I'm not sure how to reproduce this bug. I'll post the logs the next time it happens.


Re: Task manager number mismatch container number on mesos

Renjie Liu
This happened again.
I've checked the job manager's log, and it reports the loss of the task manager as expected.
However, there's nothing valuable in the task manager's log. I've checked the output of jstack, and what's interesting is that several threads are blocked while allocating memory, even though the JVM heap usage is low and no GC is happening.
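To make the jstack check repeatable, here is a minimal sketch that counts threads per Java thread state in a saved dump. It relies on the standard `java.lang.Thread.State:` lines that jstack prints; the function name is hypothetical. Note that it only sees Java-level states, so threads stuck inside native allocation code may still be counted as RUNNABLE and need the mixed-mode dump to identify.

```python
import re
from collections import Counter
from typing import Dict

# Matches lines such as "   java.lang.Thread.State: BLOCKED (on object monitor)"
# and captures only the state name.
_STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def summarize_thread_states(jstack_output: str) -> Dict[str, int]:
    """Return a mapping from Java thread state to the number of threads in it."""
    return dict(Counter(_STATE_RE.findall(jstack_output)))
```

Comparing the summaries of two dumps taken a few seconds apart can show whether the same threads stay stuck.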

Re: Task manager number mismatch container number on mesos

Renjie Liu
I'm using the Mesos 1.0.1 client, but our cluster runs Mesos 0.26.0. Could this be the cause?


Re: Task manager number mismatch container number on mesos

Renjie Liu
Attached are the task manager's log, the jstack output, the mixed-mode jstack output, and the heap usage.
It seems that the active threads are blocked on allocating memory, but no GC is triggered and memory usage is low.


Attachments:
flink-taskmanager.log (4M)
heap (1K)
stack (52K)
stack-mixed (58K)