Hi, I have a dataset in which almost 99% of the data is correct. Currently, if a record is bad I just log it, ignore it, and return only the correct data; I do this inside a map function. Also, is there a way to create a DataStream that gets its data from inside the map function (not sure whether this is feasible at the moment)?
You can use a split operator, generating 2 streams.
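A minimal sketch of that approach using the legacy split/select API, assuming a DataStream<String> input and a hypothetical isValid() check (neither is from the thread):

    import org.apache.flink.streaming.api.collector.selector.OutputSelector;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SplitStream;

    import java.util.Collections;

    public class SplitExample {

        // Hypothetical validity check; replace with your own rule.
        private static boolean isValid(String record) {
            return record != null && !record.isEmpty();
        }

        public static DataStream<String> goodRecords(DataStream<String> input) {
            // Tag every record with a named output, then pick the streams out by name.
            SplitStream<String> split = input.split(new OutputSelector<String>() {
                @Override
                public Iterable<String> select(String value) {
                    return Collections.singletonList(isValid(value) ? "good" : "bad");
                }
            });

            DataStream<String> bad = split.select("bad"); // log or store these separately
            return split.select("good");
        }
    }

Note that split/select was later deprecated in favor of the side outputs discussed in the next reply.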
Hi Darshan,
You can use side outputs [1] and a process function to split the data into as many streams as you want, e.g. correct, fixable, and wrong. Each side output is a separate stream that you can process individually.

You can also send the "bad data" directly from your process function to Kafka or wherever: just override the open() method, create a connection to the external storage system, and use that connection to store the records whenever you see them. Keep in mind, though, that your process function is executed on a single thread, so it may be beneficial to split your computation across multiple functions (although this is up to you to benchmark and see if it fits your case).

Thanks,
Kostas
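For reference, a minimal sketch of the side-output pattern described above, assuming String records and a hypothetical isValid() check (neither is from the thread):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class SideOutputExample {

        // Anonymous subclass so the generic type is preserved at runtime.
        static final OutputTag<String> BAD_DATA = new OutputTag<String>("bad-data") {};

        public static SingleOutputStreamOperator<String> route(DataStream<String> input) {
            return input.process(new ProcessFunction<String, String>() {
                @Override
                public void processElement(String value, Context ctx, Collector<String> out) {
                    if (isValid(value)) {
                        out.collect(value);          // main output: correct records
                    } else {
                        ctx.output(BAD_DATA, value); // side output: bad records
                    }
                }
            });
        }

        // Hypothetical validity check; replace with your own rule.
        private static boolean isValid(String record) {
            return record != null && !record.isEmpty();
        }
    }

    // Usage:
    // SingleOutputStreamOperator<String> main = SideOutputExample.route(input);
    // DataStream<String> bad = main.getSideOutput(SideOutputExample.BAD_DATA);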
Thanks. As of now I have decided to write it to HDFS from within the function.
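For reference, a minimal sketch of that approach, assuming a RichFlatMapFunction (so bad records can simply be dropped), the Hadoop FileSystem API, and an illustrative output path and isValid() check:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class ValidatingFlatMap extends RichFlatMapFunction<String, String> {

        private transient FSDataOutputStream badRecords;

        @Override
        public void open(Configuration parameters) throws Exception {
            // One file per parallel subtask so writers do not clash.
            int subtask = getRuntimeContext().getIndexOfThisSubtask();
            Path path = new Path("hdfs:///tmp/bad-records/part-" + subtask); // illustrative path
            FileSystem fs = path.getFileSystem(new org.apache.hadoop.conf.Configuration());
            badRecords = fs.create(path, true);
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            if (isValid(value)) {
                out.collect(value); // forward only correct records
            } else {
                badRecords.write((value + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }

        @Override
        public void close() throws Exception {
            if (badRecords != null) {
                badRecords.close();
            }
        }

        // Hypothetical validity check; replace with your own rule.
        private boolean isValid(String record) {
            return record != null && !record.isEmpty();
        }
    }

Records written this way are not covered by Flink's checkpointing, so a proper sink (e.g. the BucketingSink available at the time) is safer if delivery guarantees matter.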