(DEPRECATED) Apache Flink User Mailing List archive.

FlinkML ALS is taking too long to run

mmziyad

FlinkML ALS is taking too long to run

Dear all

I'm trying to run FlinkML ALS against the Yahoo-R2 data set [1] on HDFS. The program runs without showing any errors, but it never finishes. The operators that keep running indefinitely are:

CoGroup (CoGroup at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:606))(11/240)

Join(Join at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:576))(15/240)

I was using the below parameters to run:

import org.apache.flink.ml.recommendation.ALS

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(100)

I did not set the HDFS temporary path. Can someone tell me which parameters to set to run ALS on such a large data set? Why do these operators run indefinitely?
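For reference, a minimal sketch of the same configuration with a temporary path set. It assumes FlinkML's `ALS.setTemporaryPath` and an already loaded `DataSet[(Int, Int, Double)]` of (user, item, rating) triples; the object name, the `train` helper, and the HDFS directory are hypothetical. Whether a temporary path helps with the stalled operators is not established in this thread.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

object AlsWithTemporaryPath {
  // Hypothetical directory; replace with a location the job can write to.
  val tempPath = "hdfs:///tmp/flink-als"

  // Assumes ratings are (user, item, rating) triples that are already loaded.
  def train(ratings: DataSet[(Int, Int, Double)]): ALS = {
    val als = ALS()
      .setIterations(10)
      .setNumFactors(10)
      .setBlocks(100)
      .setTemporaryPath(tempPath) // intermediate results are stored under this path

    als.fit(ratings)
    als
  }
}
```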

[1] https://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Best

Ziyad

Andrea Spina

Re: FlinkML ALS is taking too long to run

Dear Ziyad,
could you kindly share some additional info about your environment (local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long has the job been hanging?

I hope I can help.

Cheers,

Andrea

mmziyad

Re: FlinkML ALS is taking too long to run

Dear Andrea

Thank you for your reply.

The job was stuck at the two operators I mentioned for more than 17 hours. See the attached screenshot.

I could solve the problem by:

1. Reducing the number of task slots in the cluster from the number of cores to half the number of cores.

2. Tuning the 'blocks' hyperparameter. I set it to twice the job parallelism.
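
The two changes above can be sketched as a small sizing helper. This is only the heuristic reported in this message, not an official Flink recommendation; `AlsSizing`, `Sizing`, and `suggest` are hypothetical names, and the helper assumes that the job parallelism equals the total number of task slots.

```scala
object AlsSizing {
  // Result of the heuristic: slots per TaskManager, job parallelism, ALS blocks.
  final case class Sizing(taskSlotsPerTaskManager: Int, parallelism: Int, blocks: Int)

  // Heuristic from this thread: slots = cores / 2, blocks = 2 * parallelism.
  // Assumption: parallelism = slots per TaskManager * number of TaskManagers.
  def suggest(coresPerTaskManager: Int, taskManagers: Int): Sizing = {
    require(coresPerTaskManager > 0, "coresPerTaskManager must be positive")
    require(taskManagers > 0, "taskManagers must be positive")

    val slots = math.max(1, coresPerTaskManager / 2) // at least one slot
    val parallelism = slots * taskManagers
    Sizing(slots, parallelism, 2 * parallelism)
  }
}
```

For example, `AlsSizing.suggest(coresPerTaskManager = 8, taskManagers = 4)` gives 4 slots per TaskManager, a parallelism of 16, and 32 blocks. The slot count would go into `taskmanager.numberOfTaskSlots` in flink-conf.yaml and the block count into `setBlocks`.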

Best

Ziyad

On Tue, Jul 11, 2017 at 5:53 PM, Andrea Spina <[hidden email]> wrote:

Dear Ziyad,
could you kindly share some additional info about your environment
(local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long has the job been
hanging?

I hope I can help.

Cheers,

Andrea

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14186.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.

Attachment: ALS.png (736K)

Andrea Spina

Re: FlinkML ALS is taking too long to run

Dear Ziyad,

Yes, I ran into the same very long runtimes with ALS at the time, and I saw improvements by increasing the number of blocks and decreasing the number of task slots per TaskManager (#TSs/TM), as you described.

Cheers,

Andrea

Sebastian Schelter

Re: FlinkML ALS is taking too long to run

I don't think you need to employ a distributed system for working with this dataset. An SGD implementation on a single machine should easily handle the job.
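
As an illustration of that suggestion, here is a minimal single-machine sketch of matrix factorization trained with stochastic gradient descent. It is not code from this thread or from FlinkML; all names (`SgdMatrixFactorization`, `Rating`, `Model`, `train`, `rmse`) and the default hyperparameters are assumptions, and users and items are assumed to be dense zero-based integer ids.

```scala
import scala.util.Random

object SgdMatrixFactorization {
  final case class Rating(user: Int, item: Int, value: Double)

  final case class Model(userFactors: Array[Array[Double]],
                         itemFactors: Array[Array[Double]]) {
    // Predicted rating is the dot product of the user and item factor vectors.
    def predict(user: Int, item: Int): Double = {
      val u = userFactors(user)
      val v = itemFactors(item)
      var dot = 0.0
      var k = 0
      while (k < u.length) { dot += u(k) * v(k); k += 1 }
      dot
    }
  }

  def train(ratings: IndexedSeq[Rating], numUsers: Int, numItems: Int,
            numFactors: Int = 10, epochs: Int = 10,
            learningRate: Double = 0.01, lambda: Double = 0.05,
            seed: Long = 42L): Model = {
    val rnd = new Random(seed)
    // Small random initial factors.
    def init(n: Int): Array[Array[Double]] =
      Array.fill(n, numFactors)(rnd.nextDouble() * 0.1)
    val model = Model(init(numUsers), init(numItems))

    // One pass per epoch over the ratings in random order.
    for (_ <- 0 until epochs; r <- rnd.shuffle(ratings)) {
      val u = model.userFactors(r.user)
      val v = model.itemFactors(r.item)
      val err = r.value - model.predict(r.user, r.item)
      var k = 0
      while (k < numFactors) {
        val uk = u(k)
        val vk = v(k)
        // Gradient step on the squared error with L2 regularization.
        u(k) = uk + learningRate * (err * vk - lambda * uk)
        v(k) = vk + learningRate * (err * uk - lambda * vk)
        k += 1
      }
    }
    model
  }

  // Root mean squared error of the model on the given ratings.
  def rmse(model: Model, ratings: Seq[Rating]): Double =
    math.sqrt(ratings.map { r =>
      val e = r.value - model.predict(r.user, r.item)
      e * e
    }.sum / ratings.size)
}
```

The sketch keeps all factors in memory as plain arrays, which is the point of the single-machine approach; whether it fits depends on the number of users, items, and factors.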

Best,

Sebastian

2017-07-12 9:26 GMT+02:00 Andrea Spina <[hidden email]>:

Dear Ziyad,

Yes, I ran into the same very long runtimes with ALS at the time, and I saw
improvements by increasing the number of blocks and decreasing the number of
task slots per TaskManager (#TSs/TM), as you described.

Cheers,

Andrea
