Persist streams of data

Persist streams of data

Flavio Pompermaier
Hi guys,

In my use case I have bursts of data coming into my system (RDF triples generated from a CSV that I need to process in a further step), and I am trying to figure out what the best way is to save them on HDFS.
Do you suggest saving them in HBase, or using a serialization format like Avro/Parquet and the like? Do I need Flume as well, or is there a Flink solution for that?

Best,
Flavio

Re: Persist streams of data

Fabian Hueske
Hi,

the right answer depends on (at least) two aspects:

a) Do you have an actual streaming case or is it batch, i.e., does the data come from a potentially infinite stream or not? This basically determines which system should handle your data.
  - Stream: I don't have much experience here, but Flink's new Streaming feature, Kafka, or Flume might be worth looking at.
  - Batch: A regular Flink job might work (see the sketch below the next list).
b) How do you want to access your data? This influences the format in which to store the data.
  - Full scans of some columns (a large fraction of the tuples) -> Parquet or ORC in HDFS
  - Point access to certain tuples (also subsets of columns, few or many tuples) -> HBase
  - Always reading full tuples -> Avro or ProtoBufs in HDFS
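For the batch case, just to illustrate, a rough sketch with Flink's Java DataSet API. The HDFS paths and the (subject, predicate, object) layout are only placeholders, not something from your setup:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class PersistTriples {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the triples produced from the CSV as (subject, predicate, object) records.
        // The input path is a placeholder.
        DataSet<Tuple3<String, String, String>> triples = env
                .readCsvFile("hdfs:///data/incoming/triples.csv")
                .types(String.class, String.class, String.class);

        // Persist them to HDFS; with parallelism > 1 the output path becomes a
        // directory containing one file per parallel task.
        triples.writeAsCsv("hdfs:///data/triples/batch-0001");

        env.execute("persist triples");
    }
}

For a column-oriented format like Parquet or ORC you would swap the output format (e.g. via Flink's Hadoop compatibility wrappers), but the overall job structure stays the same.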

I don't know how much throughput these systems are able to handle, though...

Hope this helps,
Fabian


Re: Persist streams of data

Flavio Pompermaier
Thanks for the support, Fabian. See inline for answers:

On Mon, Sep 29, 2014 at 6:12 PM, Fabian Hueske <[hidden email]> wrote:
Hi,

the right answer depends on (at least) two aspects:

a) Do you have an actual streaming case or is it batch, i.e., does the data come from a potentially infinite stream or not? This basically determines which system should handle your data.
  - Stream: I don't have much experience here, but Flink's new Streaming feature, Kafka or Flume might be worth looking at.
  - Batch: A regular Flink job might work.
 
Stream: the triples are generated by an external program in batches of a certain size.

b) How do you want to access your data? This influences the format in which to store the data.
  - Full scans of some columns (a large fraction of the tuples) -> Parquet or ORC in HDFS
  - Point access to certain tuples (also subsets of columns, few or many tuples) -> HBase
  - Always reading full tuples -> Avro or ProtoBufs in HDFS

Full scans of some columns. Is it possible to append a batch of rows to a Parquet file, or do I need to create a new file for each batch?
And can I then read an entire directory containing those files at once?
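
Just to make what I mean concrete, this is the kind of job I had in mind (only a sketch with made-up paths, and plain CSV instead of Parquet since I don't know the Parquet side yet): each burst would be written as its own file into one directory, and a later job would read the whole directory at once:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class ReadAllBatches {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The directory is supposed to contain one file per burst, e.g.
        //   hdfs:///data/triples/batch-0001.csv
        //   hdfs:///data/triples/batch-0002.csv
        // Pointing readCsvFile at the directory should pick up all files inside it.
        DataSet<Tuple3<String, String, String>> allTriples = env
                .readCsvFile("hdfs:///data/triples/")
                .types(String.class, String.class, String.class);

        // The next processing step would go here; for now just write everything back out merged.
        allTriples.writeAsCsv("hdfs:///data/triples-merged");

        env.execute("read all triple batches");
    }
}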
 