Hello,

Now I'm at the stage where my job seems to completely hang. Source code is attached (it won't compile, but I think it gives a very good idea of what happens). Unfortunately, I can't provide the datasets. Most of them are about 100-500MM records; I try to process them on an EMR cluster with 40 tasks and 6GB of memory each.

It was working for smaller input sizes. Any idea on what I can do differently is appreciated.

Thanks,
Timur

FaithResolution.scala (12K)
Could you share the logs with us, Timur? That would be very helpful.

Cheers,
Till

On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <[hidden email]> wrote:
Hey Timur,

is it possible to connect to the VMs and get stack traces of the Flink processes as well?

We can first have a look at the logs, but the stack traces will be helpful if we can't figure out what the issue is.

– Ufuk

On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <[hidden email]> wrote:
I will do it by tomorrow. Logs don't show anything unusual. Are there any logs besides what's in flink/log and the YARN container logs?

On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <[hidden email]> wrote:
No.

If you run on YARN, the YARN logs are the relevant ones for the JobManager and TaskManager. The client log submitting the job should be found in /log.

– Ufuk

On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov <[hidden email]> wrote:
Hi Timur,

thank you for sharing the source code of your job. That is helpful! It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is much more IO-heavy with the larger input data because all the joins start spilling? Our monitoring, in particular for batch jobs, is really not very advanced. If we had some monitoring showing the spill status, we would maybe see that the job is still running.

How long did you wait until you declared the job hanging?

Regards,
Robert

On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <[hidden email]> wrote:
Hello Robert,

I observed progress for 2 hours (meaning the numbers changed on the dashboard), and then I waited for 2 hours more. I'm sure it had to spill at some point, but I figured 2h is enough time.

Thanks,
Timur

On Apr 26, 2016 1:35 AM, "Robert Metzger" <[hidden email]> wrote:
Can you please also provide the execution plan via env.getExecutionPlan()?

On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov <[hidden email]> wrote:
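For reference, a minimal sketch of how the plan can be dumped with the Scala DataSet API. The pipeline below is only a placeholder (the real dataflow lives in FaithResolution.scala); the point is the getExecutionPlan() call.

    import org.apache.flink.api.scala._
    import org.apache.flink.api.java.io.DiscardingOutputFormat

    object PlanDump {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Placeholder pipeline -- in practice this would be the dataflow from FaithResolution.scala.
        val data = env.fromElements(("a", 1), ("b", 2))
        data
          .map(kv => (kv._1, kv._2 * 2))
          .output(new DiscardingOutputFormat[(String, Int)])

        // Prints the optimizer plan as JSON; it can be rendered with Flink's plan visualizer.
        // Depending on the Flink version, call this instead of env.execute() in the same run.
        println(env.getExecutionPlan())
      }
    }

The JSON this prints is the kind of execution plan requested above.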
Robert, Ufuk,

Logs, the execution plan, and a screenshot of the console are in the archive: https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0

Note that when I looked in the backpressure view, I saw back pressure 'high' on the following paths:
Input->code_line:123,124->map->join
Input->code_line:134,135->map->join
Input->code_line:121->map->join

Unfortunately, I was not able to take thread dumps or heap dumps (neither kill -3, jstack nor jmap worked; some Amazon AMI problem, I assume).

Hope that helps. Please let me know if I can assist you in any way; otherwise, I probably won't be actively looking at this problem.

Thanks,
Timur

On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <[hidden email]> wrote:
Hi Timur,

I've previously seen large batch jobs hang because of join deadlocks. We should have fixed those problems, but we might have missed some corner case.

Did you check whether there was any CPU activity when the job hangs? Can you try running htop on the TaskManager machines and see if they're idle?

Cheers,
-Vasia.

On 27 April 2016 at 02:48, Timur Fayruzov <[hidden email]> wrote:
Hi Timur,

I had a look at the plan you shared. I could not find any flow that branches and merges again, a pattern which is prone to cause deadlocks.

This might help with complex jobs such as yours.

2016-04-27 10:57 GMT+02:00 Vasiliki Kalavri <[hidden email]>:
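For illustration only: a flow "branches and merges again" when two operators consume the same intermediate DataSet and their results are later combined, e.g. by a join. A purely hypothetical Scala sketch of that shape (unrelated to the actual job):

    import org.apache.flink.api.scala._
    import org.apache.flink.api.java.io.DiscardingOutputFormat

    object BranchAndMergeSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        val source = env.fromElements((1, "a"), (2, "b"), (3, "c"))

        // Branch: two different transformations read the same input...
        val upper = source.map(t => (t._1, t._2.toUpperCase))
        val odd   = source.filter(_._1 % 2 == 1)

        // ...merge: the two branches are joined back together (a "diamond" shape).
        val merged = upper.join(odd).where(0).equalTo(0)

        merged.output(new DiscardingOutputFormat[((Int, String), (Int, String))])
        env.execute("branch-and-merge sketch")
      }
    }

Roughly speaking, such diamond-shaped flows can deadlock in a pipelined engine when one branch blocks waiting for data that can only arrive after the other branch makes progress, which is why the execution plan was checked for this shape.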