Hi guys,
in my use case I have bursts of data coming into my system (RDF triples generated from a CSV that I need to process in a further step) and I was trying to figure out the best way to save them on HDFS. Do you suggest saving them in HBase, or using a serialization format like Avro/Parquet and similar? Do I need Flume as well, or is there a Flink solution for that?

Best,
Flavio
Hi there,

the right answer depends on (at least) two aspects:

a) Do you have an actual streaming case or is it batch, i.e., does the data come from a potentially infinite stream or not? This basically determines the system to handle your data.
- Stream: I don't have much experience here, but Flink's new streaming feature, Kafka, or Flume might be worth looking at.
- Batch: A regular Flink job might work (see the sketch below).

b) How do you want to access your data? This influences the format in which to store it.
- Full scans of some columns (a large fraction of the tuples) -> Parquet or ORC in HDFS
- Point access to certain tuples (also subsets of columns, few or many tuples) -> HBase
- Always reading all full tuples -> Avro or ProtoBufs in HDFS

I don't know how much throughput these systems are able to handle, though...

Hope this helps,
Fabian
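For illustration, here is a minimal sketch of what such a "regular Flink job" could look like for the batch case, using the Java DataSet API. The HDFS paths, the CSV layout, and the row-to-triple mapping are only assumptions; a Parquet or Avro OutputFormat could be plugged in instead of the plain-text sink.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class CsvToTriples {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // read the raw CSV lines of one batch from HDFS (path is an example)
        DataSet<String> lines = env.readTextFile("hdfs:///data/input/batch-001.csv");

        // derive one RDF triple (subject, predicate, object) per line;
        // the real mapping depends on the CSV schema
        DataSet<Tuple3<String, String, String>> triples = lines.map(
            new MapFunction<String, Tuple3<String, String, String>>() {
                @Override
                public Tuple3<String, String, String> map(String line) {
                    String[] f = line.split(",");
                    return new Tuple3<String, String, String>(f[0], f[1], f[2]);
                }
            });

        // write the triples back to HDFS as CSV; an Avro/Parquet OutputFormat
        // could be used here instead of plain text
        triples.writeAsCsv("hdfs:///data/triples/batch-001.csv");

        env.execute("CSV to RDF triples");
    }
}

Running one such job per incoming batch keeps each job small and leaves the batches side by side in HDFS for a later full scan.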
Thanks Fabian for the support. See inline for answers:
On Mon, Sep 29, 2014 at 6:12 PM, Fabian Hueske <[hidden email]> wrote:
Stream: the triples are generated by an external program in batches of a certain size.
Full scans of some columns. Is it possible to append a batch of rows to an existing Parquet file, or do I need to create a new file for each batch? And can I then read an entire directory containing those files at once?
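For what it's worth, Parquet files are generally written once and not appended to, so writing a new file per batch and scanning the whole directory later is the usual pattern. Below is a sketch of that pattern with Flink, using plain text files for simplicity (the paths and naming scheme are assumptions); Flink's file input formats accept a directory path and read every file directly inside it.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadAllBatches {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // each batch was written as a single file, e.g.
        // hdfs:///data/triples/batch-001.csv, batch-002.csv, ...
        // pointing the input format at the parent directory reads all of them
        DataSet<String> allTriples = env.readTextFile("hdfs:///data/triples/");

        // further processing of the combined data set would go here;
        // this sketch just writes everything back out to one location
        allTriples.writeAsText("hdfs:///data/triples-merged");

        env.execute("Read all triple batches");
    }
}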