Distribute crawling of a URL list using Flink

Distribute crawling of a URL list using Flink

Eranga Heshan
Hi all,

I am fairly new to Flink. I have a project where I have a list of URLs (on one node) that need to be crawled in a distributed manner. Then, for each URL, I need the serialized crawl result to be written to a single text file.

I would like to know if there are similar projects I can look into, or to get an idea of how to implement this.

Thanks & Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]
Re: Distribute crawling of a URL list using Flink

Kien Truong

Hi,

While this task is quite trivial to do with the Flink DataSet API, using readTextFile to read the input and a flatMap function to perform the downloading, it might not be a good idea. The download process is I/O bound and will block the synchronous flatMap function, so the throughput will not be very good.
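A minimal sketch of that synchronous DataSet approach, just to make the blocking issue concrete (the input/output paths are placeholders, and the download logic is deliberately simplistic):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.stream.Collectors;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class NaiveCrawler {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // One URL per line; Flink distributes the lines across task slots.
        DataSet<String> urls = env.readTextFile("hdfs:///path/to/urls.txt");

        DataSet<String> pages = urls.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String url, Collector<String> out) {
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(new URL(url).openStream()))) {
                    // Blocks the task thread until the whole download finishes:
                    // this is exactly the throughput problem described above.
                    out.collect(in.lines().collect(Collectors.joining("\n")));
                } catch (Exception e) {
                    // Skip URLs that fail to download.
                }
            }
        });

        // Parallelism 1 on the sink forces a single output file.
        pages.writeAsText("hdfs:///path/to/output").setParallelism(1);
        env.execute("naive-crawler");
    }
}
```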


Until Flink supports asynchronous functions, I suggest you look elsewhere.

An example of a master-workers architecture using Akka can be found here:

https://github.com/typesafehub/activator-akka-distributed-workers


Regards,

Kien



Re: Distribute crawling of a URL list using Flink

Nico Kruber
Hi Eranga and Kien,
Flink has supported asynchronous I/O since version 1.2; see [1] for details.

You basically pack your URL download into the asynchronous part and collect
the resulting string for further processing in your pipeline.



Nico


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
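A rough sketch of that idea against the Flink 1.3 async I/O API from [1] (the paths, timeout, and capacity are illustrative, and the blocking download helper stands in for a real asynchronous HTTP client):

```java
import java.util.Collections;
import java.util.Scanner;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

public class AsyncCrawler {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> urls = env.readTextFile("hdfs:///path/to/urls.txt");

        // The download is handed off to another thread pool, so the operator
        // thread stays free to accept more URLs while requests are in flight.
        AsyncFunction<String, String> downloader = new AsyncFunction<String, String>() {
            @Override
            public void asyncInvoke(String url, AsyncCollector<String> collector) {
                CompletableFuture
                        .supplyAsync(() -> download(url))
                        .thenAccept(body -> collector.collect(Collections.singletonList(body)));
            }
        };

        // At most 100 concurrent downloads per subtask, 30 s timeout each.
        DataStream<String> pages =
                AsyncDataStream.unorderedWait(urls, downloader, 30, TimeUnit.SECONDS, 100);

        // Parallelism 1 on the sink yields a single output file.
        pages.writeAsText("hdfs:///path/to/output").setParallelism(1);
        env.execute("async-crawler");
    }

    // Simplistic blocking fetch; a production job would use a non-blocking client.
    private static String download(String url) {
        try (Scanner s = new Scanner(new java.net.URL(url).openStream()).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        } catch (Exception e) {
            return "";
        }
    }
}
```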

Re: Distribute crawling of a URL list using Flink

Eranga Heshan
Thanks for your quick replies, Nico and Kien. Since I am using Flink 1.3.0, I will try Nico's approach. I might bug you again with future problems. 😊

Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]


Re: Distribute crawling of a URL list using Flink

Kien Truong
Hi,

Admittedly, I did not suggest this because I thought it was not available for the batch API.

Regards,
Kien

Re: Distribute crawling of a URL list using Flink

Aljoscha Krettek
Hi,

It is not available for the Batch API; you would have to use the DataStream API.

Best,
Aljoscha



Re: Distribute crawling of a URL list using Flink

Eranga Heshan
Thank you, Aljoscha :-) I actually need it for a Kafka stream, so I will use the DataStream API anyway.

Regards,



Eranga Heshan
Undergraduate
Computer Science & Engineering
University of Moratuwa
Mobile: +94 71 138 2686
Email: [hidden email]
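For a Kafka-sourced URL stream, a rough sketch of the wiring (the topic name, broker address, group id, and Kafka connector version are all assumptions; the thread does not specify them):

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaUrlCrawler {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "url-crawler");             // assumed group id

        // Each Kafka record is assumed to carry one URL as a plain string.
        DataStream<String> urls = env.addSource(
                new FlinkKafkaConsumer010<>("urls", new SimpleStringSchema(), props));

        // From here, hand each URL to an AsyncFunction via
        // AsyncDataStream.unorderedWait(...) as described in the async I/O docs [1],
        // and write the results with a parallelism-1 sink for a single output file.

        env.execute("kafka-url-crawler");
    }
}
```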
