Amazon Athena

Amazon Athena

Madhukar Thota
Has anyone used Amazon Athena with Apache Flink?

I have a use case where I want to write streaming data (in Avro format) from Kafka to S3, converting it to Parquet along the way, and update the S3 location with daily partitions on an Athena table.

Any guidance is appreciated.

Re: Amazon Athena

Aljoscha Krettek
Hi,

I don’t have any experience with Athena, but this sounds doable. It seems that you only need some way of writing into S3, and Athena will then pick up the data in S3 when running queries. Multiple folks have used Flink to write data from Kafka into S3; the most recent case I know of from the mailing lists is probably Seth (in cc). Seth, could you maybe comment if you find some time?
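For the daily partitions mentioned in the question, a partitioned Athena table generally needs each new partition registered before queries see its files. Below is a minimal sketch of building that statement, assuming a Hive-style `dt=YYYY-MM-DD` layout. The table name `events`, the bucket `example-bucket`, and the helper names are hypothetical and not from this thread; submitting the statement to Athena is left out.

```python
from datetime import date


def partition_location(base: str, day: date) -> str:
    """S3 prefix for one daily partition (assumed layout: <base>/dt=YYYY-MM-DD/)."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def add_partition_ddl(table: str, base: str, day: date) -> str:
    """Build the DDL that registers one daily partition.

    IF NOT EXISTS makes the statement safe to run more than once per day.
    """
    location = partition_location(base, day)
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{day.isoformat()}') LOCATION '{location}'"
    )


# Hypothetical table and bucket names, for illustration only.
print(add_partition_ddl("events", "s3://example-bucket/events", date(2017, 5, 31)))
```

Running the statement once per day, after the first file for that day has landed, would be one way to keep the table in step with the S3 layout.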

Best,
Aljoscha

> On 31. May 2017, at 04:10, Madhukar Thota <[hidden email]> wrote:
>
> Has anyone used Amazon Athena with Apache Flink?
>
> I have a use case where I want to write streaming data (in Avro format) from Kafka to S3, converting it to Parquet along the way, and update the S3 location with daily partitions on an Athena table.
>
> Any guidance is appreciated.
>

Re: Amazon Athena

swiesman
Seems straightforward. The biggest challenge is that you don’t want Athena picking up partially written or otherwise corrupt files. The issue with S3 is that you cannot let Flink perform delete, truncate, or rename operations, because Flink moves faster than S3 can become consistent. I think the simplest solution would be to use the bucketing sink to write files out to HDFS and then add an additional operator or auxiliary process that copies them to S3 when they move from pending to complete. If you do this, you only need at-least-once copies to S3, because overwriting a file with itself is the only consistent overwrite condition.
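The auxiliary copy process could be sketched roughly as follows. This is only an illustration under stated assumptions: a local directory stands in for HDFS, unfinished files are assumed to be recognizable by a leading underscore or an `.in-progress` / `.pending` suffix (check how your sink is actually configured), and `upload` is a hypothetical callable that would wrap whatever S3 client you use.

```python
import os
from typing import Callable, List

# Assumed markers for files the sink has not finished yet.
UNFINISHED_SUFFIXES = (".in-progress", ".pending")


def is_complete(name: str) -> bool:
    """True if the file name carries none of the assumed unfinished markers."""
    return not (name.startswith("_") or name.endswith(UNFINISHED_SUFFIXES))


def copy_completed(src_dir: str, upload: Callable[[str, str], None]) -> List[str]:
    """Upload every completed file under src_dir and return the sorted keys.

    The key is the path relative to src_dir, so a file always maps to the
    same key. Re-running after a failure just overwrites each object with
    identical content, which is why at-least-once copying is enough.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if not is_complete(name):
                continue  # skip in-progress and pending files
            path = os.path.join(root, name)
            key = os.path.relpath(path, src_dir).replace(os.sep, "/")
            upload(path, key)
            copied.append(key)
    return sorted(copied)
```

Because the function keeps no state of its own, it can be rerun on a schedule or after a crash; tracking which files were already copied would only be an optimization.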

Seth  

On 6/6/17, 10:03 AM, "Aljoscha Krettek" <[hidden email]> wrote:

    Hi,
   
    I don’t have any experience with Athena, but this sounds doable. It seems that you only need some way of writing into S3, and Athena will then pick up the data in S3 when running queries. Multiple folks have used Flink to write data from Kafka into S3; the most recent case I know of from the mailing lists is probably Seth (in cc). Seth, could you maybe comment if you find some time?
   
    Best,
    Aljoscha
   