Distribute crawling of a URL list using Flink

Distribute crawling of a URL list using Flink

Eranga Heshan
Hi all,

I am fairly new to Flink. I have a project where I have a list of URLs (on one node) that need to be crawled in a distributed manner. Then, for each URL, I need the serialized crawl result to be written to a single text file.

I would like to know if there are similar projects I can look into, or to get an idea of how to implement this.

Thanks & Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]
Re: Distribute crawling of a URL list using Flink

Kien Truong

Hi,

While this task is quite trivial to do with the Flink DataSet API, using readTextFile to read the input and a flatMap function to perform the downloading, it might not be a good idea. The download process is I/O bound and will block the synchronous flatMap function, so the throughput will not be very good.
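A minimal sketch of that synchronous DataSet approach, just to make the blocking issue concrete (the input/output paths are placeholders, and the download logic is deliberately simplistic):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.stream.Collectors;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class NaiveCrawler {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One URL per line; Flink distributes the lines across task slots.
        DataSet<String> urls = env.readTextFile("hdfs:///path/to/urls.txt");

        DataSet<String> pages = urls.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String url, Collector<String> out) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream()))) {
                    // Blocks the task thread until the whole download finishes:
                    // this is exactly the throughput problem described above.
                    out.collect(in.lines().collect(Collectors.joining("\n")));
                } catch (Exception e) {
                    // Skip URLs that fail to download.
                }
            }
        });

        // Parallelism 1 on the sink forces a single output file.
        pages.writeAsText("hdfs:///path/to/output").setParallelism(1);
        env.execute("naive-crawler");
    }
}
```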


Until Flink supports asynchronous functions, I suggest you look elsewhere.

An example of a master-workers architecture using Akka can be found here:

https://github.com/typesafehub/activator-akka-distributed-workers


Regards,

Kien



Re: Distribute crawling of a URL list using Flink

Nico Kruber
Hi Eranga and Kien,
Flink has supported asynchronous I/O since version 1.2; see [1] for details.

You basically pack your URL download into the asynchronous part and collect
the resulting string for further processing in your pipeline.



Nico


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
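A rough sketch of that idea against the Flink 1.3 async I/O API from [1] (the paths, timeout, and capacity are illustrative, and the blocking download helper stands in for a real asynchronous HTTP client):

```java
import java.util.Collections;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class AsyncCrawler {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> urls = env.readTextFile("hdfs:///path/to/urls.txt");

        // The download is handed off to another thread pool, so the operator
        // thread stays free to accept more URLs while requests are in flight.
        AsyncFunction<String, String> downloader = new AsyncFunction<String, String>() {
            @Override
            public void asyncInvoke(String url, AsyncCollector<String> collector) {
                CompletableFuture
                        .supplyAsync(() -> download(url))
                        .thenAccept(body -> collector.collect(Collections.singletonList(body)));
            }
        };

        // At most 100 concurrent downloads per subtask, 30 s timeout each.
        DataStream<String> pages =
                AsyncDataStream.unorderedWait(urls, downloader, 30, TimeUnit.SECONDS, 100);

        // Parallelism 1 on the sink yields a single output file.
        pages.writeAsText("hdfs:///path/to/output").setParallelism(1);
        env.execute("async-crawler");
    }

    // Simplistic blocking fetch; a production job would use a non-blocking client.
    private static String download(String url) {
        try (Scanner s = new Scanner(new java.net.URL(url).openStream()).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        } catch (Exception e) {
            return "";
        }
    }
}
```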

Re: Distribute crawling of a URL list using Flink

Eranga Heshan
Thanks for your quick replies, Nico and Kien. Since I am using Flink 1.3.0, I will try Nico's approach. I might bug you again with future problems. 😊

Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]


Re: Distribute crawling of a URL list using Flink

Kien Truong
Hi,

Admittedly, I did not suggest this because I thought it was not available for the batch API.

Regards,
Kien

Re: Distribute crawling of a URL list using Flink

Aljoscha Krettek
Hi,

It is not available for the Batch API; you would have to use the DataStream API.

Best,
Aljoscha



Re: Distribute crawling of a URL list using Flink

Eranga Heshan
Thank you, Aljoscha :-) I actually need it for a Kafka stream, so I will use the DataStream API anyway.

Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]
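For a Kafka-sourced URL stream, a rough sketch of the wiring (the topic name, broker address, group id, and Kafka connector version are all assumptions; the thread does not specify them):

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaUrlCrawler {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "url-crawler");             // assumed group id

        // Each Kafka record is assumed to carry one URL as a plain string.
        DataStream<String> urls = env.addSource(
                new FlinkKafkaConsumer010<>("urls", new SimpleStringSchema(), props));

        // From here, hand each URL to an AsyncFunction via
        // AsyncDataStream.unorderedWait(...) as described in the async I/O docs [1],
        // and write the results with a parallelism-1 sink for a single output file.

        env.execute("kafka-url-crawler");
    }
}
```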
