Parallel read text

Parallel read text

David Olsen
After searching the internet (with keywords like 'apache flink parallel read text') I still have not found the answer I am looking for, so I am asking here before jumping into writing code.

My problem is that I want to read a text file, or a set of split text files, from the local file system. I would like to read those files in parallel and process them accordingly.

From what I have discovered so far:
- Use ExecutionEnvironment.readTextFile, but this seems to use only 1 thread(?) (meaning it reads the file(s) from beginning to end)
- Use the streaming env with addSource[1], but it seems I would need to implement my own source with RichParallelSourceFunction.

Are there any classes or implementations that can already read text in parallel?

Thanks

Re: Parallel read text

Chesnay Schepler
ExecutionEnvironment.readTextFile will read the file in parallel.
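
For readers wondering how a single file can be read by several workers at once, here is a minimal plain-Java sketch of split-based reading. It is not Flink's code; it only illustrates the general idea, as far as I understand it, that a file is divided into byte ranges (input splits) and each worker handles the lines that start inside its range. The class name SplitReadSketch and both method names are made up for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of Flink.
class SplitReadSketch {

    /** Counts the lines whose first byte lies in the byte range [start, end). */
    static long countLinesInSplit(Path file, long start, long end) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (start > 0) {
                // Step back one byte and consume up to the next line terminator.
                // This skips a line that began in the previous split, but keeps
                // a line that begins exactly at 'start'.
                raf.seek(start - 1);
                raf.readLine();
            }
            long count = 0;
            // A line belongs to this split if it starts before 'end',
            // even if its last bytes reach into the next split.
            while (raf.getFilePointer() < end && raf.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Divides the file into roughly equal byte ranges and counts lines with one task per range. */
    static long countLinesInParallel(Path file, int parallelism) {
        long length;
        try {
            length = Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        long splitSize = Math.max(1, (length + parallelism - 1) / parallelism);
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (long start = 0; start < length; start += splitSize) {
                final long from = start;
                final long to = Math.min(start + splitSize, length);
                Callable<Long> task = () -> countLinesInSplit(file, from, to);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Usage: java SplitReadSketch <path-to-text-file>
        System.out.println(countLinesInParallel(Paths.get(args[0]), 4));
    }
}
```

The point of the boundary handling is that no worker needs to know where lines start in advance: each one skips a partial first line and finishes its last line, so every line is counted exactly once. RandomAccessFile.readLine reads byte by byte, which is slow but keeps the sketch short.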

On 28.05.2016 09:59, David Olsen wrote:
Are there any classes or implementations that can already read text in parallel?
Thanks


Re: Parallel read text

David Olsen
Thank you for the advice! 

Now I have a new question. I read in the source[1] that the streaming env uses FileSourceFunction, which inherits from RichParallelSourceFunction, to create input splits[2]. I know I can set the parallelism in the streaming env, but is there any way to verify at runtime that the file, or the split files, are actually read in parallel?

Thank you again for your help. 





On 28 May 2016 at 17:52, Chesnay Schepler <[hidden email]> wrote:
ExecutionEnvironment.readTextFile will read the file in parallel.



Re: Parallel read text

rmetzger0
Hi David,

I guess you can verify it by adding custom log statements to the Flink code (which means you need to recompile Flink).
A debugger may also be sufficient (if you are running Flink locally).
We are currently reworking the reading of static files for the streaming environment. Maybe it's interesting to check out the new implementation [1]
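
To make the logging idea concrete without touching Flink itself, here is a small plain-Java sketch, assuming the input is a set of split text files. Each file is read in its own task, and the task records which worker thread read it. The class ReadLogSketch and the method readAndRecordThreads are made up for this example. Inside a Flink job the closest equivalent would probably be to log the subtask index from the runtime context of a rich function (getRuntimeContext().getIndexOfThisSubtask()) instead of the thread name, but that is not shown here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of Flink.
class ReadLogSketch {

    /** Reads each file in its own task; returns file name -> name of the thread that read it. */
    static Map<String, String> readAndRecordThreads(List<Path> files, int parallelism) {
        Map<String, String> readBy = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<Long> task = () -> {
                    String worker = Thread.currentThread().getName();
                    readBy.put(file.getFileName().toString(), worker);
                    try (Stream<String> lines = Files.lines(file)) {
                        long n = lines.count();
                        // The "custom log statement": which worker read which file.
                        System.out.printf("%s read %d lines from %s%n", worker, n, file);
                        return n;
                    }
                };
                pending.add(pool.submit(task));
            }
            // Wait for all reads to finish before returning the record.
            for (Future<Long> f : pending) {
                f.get();
            }
            return readBy;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Seeing more than one thread name in the output shows that the files were handed to different workers. It does not by itself prove that the reads overlapped in time; adding timestamps to the log line would be one way to check that.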



On Sat, May 28, 2016 at 1:49 PM, David Olsen <[hidden email]> wrote:
Now I have a new question. I read in the source[1] that the streaming env uses FileSourceFunction, which inherits from RichParallelSourceFunction, to create input splits[2]. I know I can set the parallelism in the streaming env, but is there any way to verify at runtime that the file, or the split files, are actually read in parallel?
