Flink's WordCount at scale of 1BLN of unique words

Flink's WordCount at scale of 1BLN of unique words

Xtra Coder
Hello, 

A newbie question about how Flink's WordCount will actually work at scale.

I've read/seen quite a few high-level presentations, but I haven't found more-or-less clear answers to the following …

Use-case:
--------------
There is a huge text stream with a highly variable set of words – let's say 1 billion unique words. Storing them just as raw text, without supplementary data, would take roughly 16 TB of RAM. How does Flink approach this internally?

Here I'm mostly interested in the following:
1.  How are individual words spread across a cluster of Flink nodes?
Will each word land on exactly one node and be counted there, or ... I'm not sure about the variants.

2.  As far as I understand, while a job is running, all of its intermediate aggregation results are stored in memory across the cluster nodes (and may be partially spilled to local disk).
Wild guess – what cluster size is required to run the above task efficiently?

And two functional questions on top of this ...

1. Since intermediate results are in memory, I guess it should be possible to get the "current" counter for any word being processed.
Is this possible?

2. After I've streamed 100 TB of text, what is the right way to save the result to HDFS? For example, I want to save the list of words ordered by key, in portions of 10 million words per file, compressed with bzip2.
Which APIs should I use?
Since Flink uses intermediate snapshots for fault tolerance, is it possible to save the whole "current" state without stopping the stream?

Thanks.

Re: Flink's WordCount at scale of 1BLN of unique words

Matthias J. Sax-2
Are you talking about a streaming or a batch job?

You mention a "text stream" but also say you want to stream 100 TB --
which suggests you have a finite data set and would use the DataSet API.

-Matthias


Re: Flink's WordCount at scale of 1BLN of unique words

Xtra Coder
Mentioning 100 TB in my context means something more like "saving the current state" at some point in time to "backup" or "direct access" storage, and then continuing with the next 100 TB / hours / days of streamed data.
So no, it is not about a finite data set.


Re: Flink's WordCount at scale of 1BLN of unique words

Aljoscha Krettek
Hi,
first, regarding your use-case questions:

1. If you do a keyBy(...) on the word, then all occurrences of the same word will end up on the same machine – see the sketch after these two points.
2. This depends on the state backend that you use. For example, there is the FsStateBackend, which keeps state in memory and checkpoints it to a file system, and there is the RocksDBStateBackend, which keeps all state in RocksDB instances on disk. It's hard to say what the required cluster size will be, but I would imagine it has to be somewhat bigger if you want good performance. See the docs on state backends: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/state_backends.html
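
To make these two points concrete, here is a minimal sketch of a keyed streaming word count with the RocksDB state backend. It assumes the DataStream API and the RocksDB backend dependency on the classpath; the socket source, port, and checkpoint path are placeholders, not anything discussed above:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class KeyedWordCount {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep keyed state in RocksDB on local disk and checkpoint it to a
        // durable file system (placeholder path).
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000);

        // Placeholder source; in practice this would be Kafka or similar.
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Long>> counts = text
                .flatMap(new Tokenizer())
                // keyBy hash-partitions on the word, so every occurrence of
                // a given word is routed to the same parallel task instance.
                .keyBy(0)
                .sum(1);

        counts.print();
        env.execute("Keyed WordCount");
    }

    public static class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Long>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Long>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1L));
                }
            }
        }
    }
}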

Now, for the functional questions:

1. This is not possible right now, but it will be once "queryable state" is implemented. That work is tracked in this issue: https://issues.apache.org/jira/browse/FLINK-3779

2. I think as a sink you could use the RollingSink. I'm not sure about the bzip2 compression; it could be that you have to implement a custom Writer for that. A rough sketch follows below.
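
Continuing inside the main method of the sketch above (with these additional imports), writing the counts out could look roughly like this. The base path and batch size are placeholders, it assumes the flink-connector-filesystem dependency, and note that the RollingSink produces rolling part files, not key-ordered portions of exactly 10 million words – and bzip2 would need that custom Writer:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;

// Format each (word, count) pair as one line of text.
DataStream<String> lines = counts.map(
        new MapFunction<Tuple2<String, Long>, String>() {
            @Override
            public String map(Tuple2<String, Long> value) {
                return value.f0 + "," + value.f1;
            }
        });

// Roll over to a new part file once the current one reaches ~256 MB.
RollingSink<String> sink = new RollingSink<String>("hdfs:///output/word-counts");
sink.setBatchSize(256L * 1024 * 1024);
sink.setWriter(new StringWriter<String>());

lines.addSink(sink);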

Regarding your last question: yes, this is possible with savepoints, which can be triggered while the job keeps running (e.g. via bin/flink savepoint <jobID>). They are covered here: https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html

Cheers,
Aljoscha


Re: Flink's WordCount at scale of 1BLN of unique words

Xtra Coder
Thanks, things are clear so far.