I was trying to write a simple streaming Flink program that counts the total number of words (not the frequency of each word) in a file.
I was thinking along the lines of:
counts = text.flatMap(new Tokenizer())
.count();
// count() isn't part of the streaming API (but it is supported in the batch API)
Any suggestions on how to do this? I just want a continuous count (not a windowed count).
-roshan
Can't you use a KeyedStream, I mean keyBy with the same key for every record? Something like this:
source.flatMap(new Tokenizer()).keyBy(0).sum(2).project(2).print();
This assumes the tokenizer emits Tuple3<String,String,Integer>, with the fields (numbered from 1 here; keyBy(0) and sum(2) use 0-based positions):
1 -> always the same key, say "test"
2 -> the actual word
3 -> 1
There might be other good choices, but this is the first thing that came to mind :-)
Hari
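To make the effect of the constant-key suggestion concrete, here is a minimal plain-Java sketch of the behavior it aims for. It deliberately does not use the Flink API: the class RunningWordCount and its methods are hypothetical names invented for this illustration, and the whitespace tokenizer is an assumption about what Tokenizer does. The point is only that a single running total is updated, and the new total emitted, for every word that arrives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names are Flink API.
class RunningWordCount {

    // Splits a line on whitespace (assumed Tokenizer behavior) and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Keeps one running total for all words and records the updated total
    // after each word, which is what a sum over a single constant key would do.
    static List<Long> runningTotals(List<String> lines) {
        List<Long> emitted = new ArrayList<>();
        long total = 0;
        for (String line : lines) {
            for (String word : tokenize(line)) {
                total += 1;          // every word contributes 1, regardless of its text
                emitted.add(total);  // continuous count: emit after every update
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4, 5, 6]
        System.out.println(runningTotals(Arrays.asList("to be or", "not to be")));
    }
}
```

The last emitted value is the total so far; there is no window, so the count simply keeps growing as more lines arrive.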
Seems a bit convoluted for such a simple problem. I am thinking a custom streaming count() operator would simplify this. I wasn't able to find examples of custom streaming operators.
-roshan
On 7/21/16, 8:00 PM, "hrajaram" <[hidden email]> wrote:
> Can't you use a KeyedStream, I mean keyBy with the same key? [...]
It is complicated:
1. If you have a file, you should consider using the DataSet API. Using DataStream with files is more complicated, as you have to simulate a stream from the file.
2. You need a tokenizer for the map operator unless you have one word per line.
3. The sum operator is fine and will count, but unless you groupBy you will get as many counts as your default parallelism. So you need to set the parallelism to 1 explicitly, and that is something you cannot do with a map-only job unless your file is small. But if it is small, why use Flink?
On Jul 22, 2016, at 6:35 PM, Roshan Naik <[hidden email]> wrote:
> Seems a bit convoluted for such a simple problem. [...]
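Point 3 above can be illustrated with a small plain-Java sketch. It is not Flink code: PartialCounts and countPerInstance are hypothetical names, and the round-robin distribution of lines is an assumption made only to mimic independent parallel instances. Each instance keeps its own sum, so with a parallelism of N you end up with N partial counts rather than one total.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these names are Flink API.
class PartialCounts {

    // Hands lines out round-robin to `parallelism` independent counters
    // (an assumed distribution) and returns each counter's word count.
    static long[] countPerInstance(List<String> lines, int parallelism) {
        long[] partial = new long[parallelism];
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                partial[i % parallelism] += line.split("\\s+").length;
            }
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be", "that is the question");
        // Two instances give two partial counts: prints [7, 3]
        System.out.println(Arrays.toString(countPerInstance(lines, 2)));
        // One instance gives the single total: prints [10]
        System.out.println(Arrays.toString(countPerInstance(lines, 1)));
    }
}
```

Routing every record to the same key, or running the counting step with a parallelism of 1, both amount to the second case: one counter sees every word.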