Hi all,

I am fairly new to Flink. I have this project where I have a list of URLs (on one node) which need to be crawled in a distributed fashion. Then, for each URL, I need the serialized crawl result to be written to a single text file.

I want to know if there are similar projects I can look into, or an idea of how to implement this.

Thanks & Regards,
Eranga
 
Hi,

While this task is quite trivial to do with the Flink DataSet API, using readTextFile to read the input and a flatMap function to perform the downloading, it might not be a good idea. The download process is I/O bound and will block the synchronous flatMap function, so the throughput will not be very good.

Until Flink supports asynchronous functions, I suggest you look elsewhere. An example of a master-workers architecture using Akka can be found here: https://github.com/typesafehub/activator-akka-distributed-workers

Regards,
Kien
 
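For illustration, here is a minimal sketch of the synchronous DataSet approach Kien describes, assuming Flink 1.3's batch API; the file paths and the blocking download() helper are placeholders, and the flatMap is exactly where each subtask stalls on network I/O:

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.util.Collector;

    public class SyncCrawler {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            env.readTextFile("file:///tmp/urls.txt")   // one URL per line; path is an assumption
               .flatMap(new FlatMapFunction<String, String>() {
                   @Override
                   public void flatMap(String url, Collector<String> out) throws Exception {
                       // Blocking fetch: the whole subtask waits for each HTTP
                       // round trip, which is the bottleneck described above.
                       out.collect(download(url));
                   }
               })
               // Parallelism 1 on the sink yields the single output file Eranga asked for.
               .writeAsText("file:///tmp/results.txt", FileSystem.WriteMode.OVERWRITE)
               .setParallelism(1);

            env.execute("Synchronous crawler");
        }

        // Hypothetical blocking download, Java 8 style.
        static String download(String url) throws Exception {
            try (InputStream in = new URL(url).openStream();
                 Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                return s.hasNext() ? s.next() : "";
            }
        }
    }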
 
Hi Eranga and Kien,

Flink supports asynchronous I/O since version 1.2, see [1] for details. You basically pack your URL download into the asynchronous part and collect the resulting string for further processing in your pipeline.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/asyncio.html
 
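To make Nico's suggestion concrete, here is a minimal sketch against the async I/O API as it looked in Flink 1.2/1.3 (an AsyncFunction fed an AsyncCollector; later releases renamed the collector). The pool size, timeout, capacity, and paths are assumptions, and download() is the hypothetical blocking fetch from the sketch above:

    import java.util.Collections;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
    import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

    public class AsyncCrawler {

        // Runs the blocking download on a private pool so that asyncInvoke
        // returns immediately and many requests can be in flight at once.
        public static class DownloadFunction extends RichAsyncFunction<String, String> {
            private transient ExecutorService pool;

            @Override
            public void open(Configuration parameters) {
                pool = Executors.newFixedThreadPool(20);   // pool size is an assumption
            }

            @Override
            public void asyncInvoke(String url, AsyncCollector<String> collector) {
                pool.submit(() -> {
                    try {
                        collector.collect(Collections.singleton(SyncCrawler.download(url)));
                    } catch (Exception e) {
                        collector.collect(e);   // fails this record instead of blocking
                    }
                });
            }

            @Override
            public void close() {
                pool.shutdown();
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> urls = env.readTextFile("file:///tmp/urls.txt");

            // At most 100 pending requests per subtask, 10 s timeout each;
            // results may arrive out of order, which is fine for a crawler.
            AsyncDataStream
                .unorderedWait(urls, new DownloadFunction(), 10, TimeUnit.SECONDS, 100)
                .writeAsText("file:///tmp/results.txt")
                .setParallelism(1);

            env.execute("Async crawler");
        }
    }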
	
Thanks for your quick replies, Nico and Kien. Since I am using Flink 1.3.0, I will try Nico's idea. I might bug you again with my future problems. 😊

Regards,
Eranga
Hi,

Admittedly, I did not suggest this because I thought it was not available for the batch API.

Regards,
Kien
Hi,

It is not available for the Batch API; you would have to use the DataStream API.

Best,
Aljoscha
Thank you, Aljoscha. :-) I actually need it for a Kafka stream, so I will use the DataStream API anyway.

Regards,
Eranga
 
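Since the source is Kafka, the same async stage can be wired to a Kafka consumer. Here is a sketch assuming the Flink 1.3 Kafka 0.10 connector; the broker address, group id, and "urls" topic are placeholders, and DownloadFunction is the hypothetical one from the async sketch above:

    import java.util.Properties;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaAsyncCrawler {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");   // placeholder
            props.setProperty("group.id", "crawler");                   // placeholder

            // One URL per Kafka record on the (hypothetical) "urls" topic.
            DataStream<String> urls = env.addSource(
                    new FlinkKafkaConsumer010<>("urls", new SimpleStringSchema(), props));

            // Same async download stage as in the previous sketch.
            AsyncDataStream
                .unorderedWait(urls, new AsyncCrawler.DownloadFunction(),
                               10, TimeUnit.SECONDS, 100)
                .writeAsText("file:///tmp/results.txt")
                .setParallelism(1);

            env.execute("Kafka async crawler");
        }
    }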