Hi all,
I am writing some jobs intended to run with the DataStream API using a Kafka source. However, we also have a lot of data in Avro archives (of the same Kafka source). I would like to be able to run the processing code over parts of the archive so I can generate some "example output".

I've written the transformations needed to read the data from the archives and process it, but now I'm trying to figure out the best way to write the results to storage.

At the moment I can easily write JSON or CSV using the bucketing sink (although I'm curious about using the watermark/event time rather than system time to name the buckets), but I'd really like to store the output in something smaller like Avro.

However, I'm not sure this makes sense: writing to a compressed file format in this way from a streaming job doesn't feel intuitively right. What would make the most sense? I could write to some temporary database and then pipe that into an archive, but that seems like a lot of trouble. Is there a way to pipe the output directly into Flink's batch API?

Thanks
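For reference, this is roughly the kind of event-time bucketing I had in mind. It's just a sketch assuming the Bucketer interface from flink-connector-filesystem; MyRecord and getEventTime() are placeholder names for my record type, and as far as I can tell the Bucketer only sees the element itself, not the current watermark, so using the record's own timestamp is the closest I can get:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;

    // Sketch: name buckets after the record's own event time instead of the system clock.
    // MyRecord and getEventTime() are placeholders for my actual record type.
    public class EventTimeBucketer implements Bucketer<MyRecord> {

        private static final String FORMAT = "yyyy-MM-dd--HH";

        @Override
        public Path getBucketPath(Clock clock, Path basePath, MyRecord element) {
            // SimpleDateFormat is not thread-safe, so create one per call in this sketch
            String bucket = new SimpleDateFormat(FORMAT).format(new Date(element.getEventTime()));
            return new Path(basePath, bucket);
        }
    }

which I would then set on the sink with sink.setBucketer(new EventTimeBucketer()).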
If you want to write in batches from a streaming source, you will always need some state, i.e. a state database (here a NoSQL database such as a key-value store makes sense). Then you can grab the data at certain points in time and convert it to Avro. You need to make sure that the state is logically consistent (e.g. all events from the last day), so that events arriving later than expected do not end up missing from the files.
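As a rough illustration of the "convert it to Avro" step only (a sketch; MyRecord is a placeholder for a generated Avro class, and how you fetch the day's records from your state database is up to you):

    import java.io.File;
    import java.util.List;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.specific.SpecificDatumWriter;

    // Sketch: once a logically consistent slice (e.g. one full day) is complete in the
    // state database, dump it into a single Avro container file.
    public class DailyAvroExport {

        public static void exportDay(List<MyRecord> recordsForDay, File target) throws Exception {
            DataFileWriter<MyRecord> writer =
                    new DataFileWriter<>(new SpecificDatumWriter<>(MyRecord.class));
            writer.create(MyRecord.getClassSchema(), target); // schema comes from the generated class
            for (MyRecord record : recordsForDay) {
                writer.append(record);
            }
            writer.close();
        }
    }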
You can write your own sink, but it would require some state database to hold the data before writing it out as a batch. Maybe this could be a generic Flink component, i.e. writing to a state database and later writing a logically consistent state (where "consistent" is defined by the application) into other sinks (a CSV sink, an Avro sink, etc.).
Hi,

Avro provides a schema for the data and can be used to serialize individual records in a binary format. It does not compress the data (although compression can be layered on top), but it is more space efficient than text formats due to the binary serialization. I think you can implement a Writer for the BucketingSink that writes records encoded as Avro.

Formats such as Parquet or ORC, which group records into batches and write them compressed in a columnar layout, are a different story and are not supported by the BucketingSink. However, there are plans to support these formats in the future as well. Avro is different from Parquet and ORC in that it writes individual records rather than batches of records.

Best,
Fabian
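A minimal sketch of such a Writer, assuming the Writer interface from flink-connector-filesystem (Flink 1.4/1.5) and a generated Avro SpecificRecord class (MyRecord below is a placeholder):

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.specific.SpecificDatumWriter;
    import org.apache.avro.specific.SpecificRecord;
    import org.apache.flink.core.fs.FSDataOutputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.connectors.fs.Writer;

    // Sketch: a Writer that writes each record into an Avro container file.
    // The schema is kept as a String because org.apache.avro.Schema is not Serializable.
    public class AvroSinkWriter<T extends SpecificRecord> implements Writer<T> {

        private final String schemaString;
        private transient Schema schema;
        private transient FSDataOutputStream outputStream;
        private transient DataFileWriter<T> dataFileWriter;

        public AvroSinkWriter(Schema schema) {
            this.schemaString = schema.toString();
        }

        @Override
        public void open(FileSystem fs, Path path) throws IOException {
            schema = new Schema.Parser().parse(schemaString);
            outputStream = fs.create(path, FileSystem.WriteMode.NO_OVERWRITE);
            dataFileWriter = new DataFileWriter<>(new SpecificDatumWriter<T>(schema));
            dataFileWriter.create(schema, outputStream);
        }

        @Override
        public void write(T element) throws IOException {
            dataFileWriter.append(element);
        }

        @Override
        public long flush() throws IOException {
            dataFileWriter.flush();
            return outputStream.getPos();
        }

        @Override
        public long getPos() throws IOException {
            return outputStream.getPos();
        }

        @Override
        public void close() throws IOException {
            if (dataFileWriter != null) {
                dataFileWriter.close(); // also closes the underlying stream
            }
        }

        @Override
        public Writer<T> duplicate() {
            return new AvroSinkWriter<>(new Schema.Parser().parse(schemaString));
        }
    }

You would plug it in with something like sink.setWriter(new AvroSinkWriter<MyRecord>(MyRecord.getClassSchema())). One caveat: the BucketingSink's recovery truncates files to the position reported by flush()/getPos(), which interacts awkwardly with Avro's internal buffering, so treat this as a starting point rather than a production-ready sink. If I remember correctly there is also an AvroKeyValueSinkWriter in flink-connector-filesystem that you could look at for inspiration.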