Hi guys,
in my use case I have bursts of data coming into my system (RDF triples generated from a CSV that I need to process in a further step) and I was trying to figure out the best way to save them on HDFS. Do you suggest saving them in HBase, or using a serialization format like Avro/Parquet and similar? Do I need Flume as well, or is there a Flink solution for that?

Best,
Flavio
Hi there,

the right answer depends on (at least) two aspects:

a) Do you have an actual streaming case or is it batch, i.e., does the data come from a potentially infinite stream or not? This basically determines the system to handle your data.
- Stream: I don't have much experience here, but Flink's new streaming feature, Kafka, or Flume might be worth looking at.
- Batch: A regular Flink job might work (see the sketch below).

b) How do you want to access your data? This influences the format in which to store it.
- Full scans of some columns (a large fraction of the tuples) -> Parquet or ORC in HDFS
- Point access to certain tuples (also subsets of columns, few or many tuples) -> HBase
- Always reading all full tuples -> Avro or ProtoBufs in HDFS

I don't know how much throughput these systems are able to handle, though...

Hope this helps,
Fabian
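For illustration, here is a minimal sketch of what such a "regular Flink job" could look like for the batch case, using the Java DataSet API. The HDFS paths, the CSV layout, and the row-to-triple mapping are only assumptions; a Parquet or Avro OutputFormat could be plugged in instead of the plain-text sink.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple3;

public class CsvToTriples {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // read the raw CSV lines of one batch from HDFS (path is an example)
        DataSet<String> lines = env.readTextFile("hdfs:///data/input/batch-001.csv");

        // derive one RDF triple (subject, predicate, object) per line;
        // the real mapping depends on the CSV schema
        DataSet<Tuple3<String, String, String>> triples = lines.map(
            new MapFunction<String, Tuple3<String, String, String>>() {
                @Override
                public Tuple3<String, String, String> map(String line) {
                    String[] f = line.split(",");
                    return new Tuple3<String, String, String>(f[0], f[1], f[2]);
                }
            });

        // write the triples back to HDFS as CSV; an Avro/Parquet OutputFormat
        // could be used here instead of plain text
        triples.writeAsCsv("hdfs:///data/triples/batch-001.csv");

        env.execute("CSV to RDF triples");
    }
}

Running one such job per incoming batch keeps each job small and leaves the batches side by side in HDFS for a later full scan.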
Thanks Fabian for the support. See inline for answers:
On Mon, Sep 29, 2014 at 6:12 PM, Fabian Hueske <[hidden email]> wrote:
Stream: the triples are generated by an external program in batches of a certain size.
Full scans of some columns. Is it possible to append a batch of rows to an existing Parquet file, or do I need to create a new file for each batch? And can I then read an entire directory containing those files at once?
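For what it's worth, Parquet files are generally written once and not appended to, so writing a new file per batch and scanning the whole directory later is the usual pattern. Below is a sketch of that pattern with Flink, using plain text files for simplicity (the paths and naming scheme are assumptions); Flink's file input formats accept a directory path and read every file directly inside it.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadAllBatches {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // each batch was written as a single file, e.g.
        // hdfs:///data/triples/batch-001.csv, batch-002.csv, ...
        // pointing the input format at the parent directory reads all of them
        DataSet<String> allTriples = env.readTextFile("hdfs:///data/triples/");

        // further processing of the combined data set would go here;
        // this sketch just writes everything back out to one location
        allTriples.writeAsText("hdfs:///data/triples-merged");

        env.execute("Read all triple batches");
    }
}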