FlinkML ALS is taking too long to run

mmziyad
Dear all

I'm trying to run FlinkML ALS against the Yahoo-R2 data set [1] on HDFS. The program runs without showing any errors, but it does not finish. The operators that run indefinitely are:

CoGroup (CoGroup at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:606))(11/240)

Join(Join at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:576))(15/240)


I used the following parameters:

import org.apache.flink.ml.recommendation.ALS

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(100)

I didn't set the HDFS temporary path. Can someone tell me which parameters to set to run ALS on such large data sets? Why are these operators running indefinitely?


Best
Ziyad

Re: FlinkML ALS is taking too long to run

Andrea Spina
Dear Ziyad,
could you kindly share some additional info about your environment (local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long has the job been hanging?

Hope to help you, then.

Cheers,

Andrea

Re: FlinkML ALS is taking too long to run

mmziyad
Dear Andrea

Thank you for your reply.
The job was stuck at the two operators I mentioned for more than 17 hours. See the attached screenshot.

I was able to solve the problem by:
1. Reducing the number of task slots in the cluster, from one slot per core to half the number of cores.
2. Tuning the 'blocks' hyperparameter. I set it to twice the job parallelism.
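
The two adjustments above can be written down as a small calculation. This is only a sketch: `AlsTuning` and `AlsTuningSketch.forCluster` are hypothetical names, not part of Flink, and the code assumes the job parallelism equals the total number of task slots in the cluster, which the thread does not state.

```scala
// Hypothetical container for the derived settings (not a Flink class).
final case class AlsTuning(slotsPerTaskManager: Int, parallelism: Int, blocks: Int)

object AlsTuningSketch {
  // Derives the settings described above:
  //  - task slots per TaskManager = half the number of cores
  //  - blocks = twice the job parallelism
  // Assumption: the job runs with parallelism equal to the total slot count.
  def forCluster(taskManagers: Int, coresPerTaskManager: Int): AlsTuning = {
    require(taskManagers > 0, "taskManagers must be positive")
    require(coresPerTaskManager > 0, "coresPerTaskManager must be positive")
    val slots = math.max(1, coresPerTaskManager / 2)
    val parallelism = taskManagers * slots
    AlsTuning(slots, parallelism, 2 * parallelism)
  }
}
```

The slot count would go into `taskmanager.numberOfTaskSlots` in flink-conf.yaml, and the block count would be passed to `ALS().setBlocks(...)`. Whether these exact ratios are the best choice for other clusters or data sets is not something this thread establishes.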

Best
Ziyad

On Tue, Jul 11, 2017 at 5:53 PM, Andrea Spina <[hidden email]> wrote:
Dear Ziyad,
could you kindly share some additional info about your environment
(local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long has the job been
hanging?

Hope to help you, then.

Cheers,

Andrea



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14186.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.


Attachment: ALS.png (736K)

Re: FlinkML ALS is taking too long to run

Andrea Spina
Dear Ziyad,

Yes, I ran into the same very long runtimes with ALS at the time, and I saw improvements by increasing the number of blocks and decreasing the number of task slots per TaskManager (#TSs/TM), as you described.

Cheers,

Andrea



Re: FlinkML ALS is taking too long to run

Sebastian Schelter
I don't think you need to employ a distributed system for working with this dataset. An SGD implementation on a single machine should easily handle the job.
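
For readers unfamiliar with the approach, here is a minimal single-machine sketch of SGD matrix factorization in plain Scala. It is not taken from this thread or from any library: the names `Rating` and `SgdMf`, and all hyperparameter defaults (factors, epochs, learning rate, regularization, seed), are illustrative assumptions, and loading the Yahoo-R2 data is not shown.

```scala
import scala.util.Random

// One observed rating: (user index, item index, rating value).
final case class Rating(user: Int, item: Int, value: Double)

// Plain SGD matrix factorization: rating ~ dot(userFactors(u), itemFactors(i)).
// All default hyperparameters are illustrative assumptions.
object SgdMf {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var k = 0
    while (k < a.length) { s += a(k) * b(k); k += 1 }
    s
  }

  // Returns (user factor matrix, item factor matrix).
  def train(
      ratings: IndexedSeq[Rating],
      numUsers: Int,
      numItems: Int,
      numFactors: Int = 10,
      epochs: Int = 200,
      learningRate: Double = 0.02,
      lambda: Double = 0.01,
      seed: Long = 42L): (Array[Array[Double]], Array[Array[Double]]) = {
    val rnd = new Random(seed)
    // Small random initial factors.
    val p = Array.fill(numUsers, numFactors)(rnd.nextDouble() * 0.1)
    val q = Array.fill(numItems, numFactors)(rnd.nextDouble() * 0.1)
    // Each epoch visits every rating once, in a freshly shuffled order.
    for (_ <- 0 until epochs; r <- rnd.shuffle(ratings)) {
      val pu = p(r.user)
      val qi = q(r.item)
      val err = r.value - dot(pu, qi)
      var k = 0
      while (k < numFactors) {
        val puk = pu(k) // keep the old value for the item update
        pu(k) += learningRate * (err * qi(k) - lambda * puk)
        qi(k) += learningRate * (err * puk - lambda * qi(k))
        k += 1
      }
    }
    (p, q)
  }

  // Root-mean-square error of the factor model on the given ratings.
  def rmse(ratings: Seq[Rating], p: Array[Array[Double]], q: Array[Array[Double]]): Double =
    math.sqrt(ratings.map { r =>
      val e = r.value - dot(p(r.user), q(r.item))
      e * e
    }.sum / ratings.size)
}
```

Calling `SgdMf.train(ratings, numUsers, numItems)` returns the two factor matrices, and `SgdMf.rmse` gives a quick check that the training error goes down. A real run would need held-out data and tuned hyperparameters.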

Best,
Sebastian

2017-07-12 9:26 GMT+02:00 Andrea Spina <[hidden email]>:
Dear Ziyad,

Yes, I ran into the same very long runtimes with ALS at the time, and I saw
improvements by increasing the number of blocks and decreasing the number of
task slots per TaskManager (#TSs/TM), as you described.

Cheers,

Andrea





