Batch reading from Cassandra. How to?

Batch reading from Cassandra. How to?

Lasse Nedergaard
Hi.

We would like to do some batch analytics on our data set stored in Cassandra and are looking for an efficient way to load data from a single table. Not by key, but a random 15%, 50%, or 100% of the rows.
Databricks has created an efficient way to load Cassandra data into Apache Spark; they do it by reading from the underlying SSTables so the load can run in parallel.
Do we have something similar in Flink, or what is the most efficient way to load all, or a large random subset of, the data from a single Cassandra table into Flink?
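
To illustrate what we mean by parallel loading, here is a minimal sketch of the token-range idea, assuming the Murmur3 partitioner, a hypothetical table my_ks.my_table with partition key id, and the DataStax Java driver. The split count and sampling fraction are made-up parameters, not anything Flink or Spark provides out of the box:

import java.math.BigInteger;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokenRangeScanSketch {

    public static void main(String[] args) {
        int splits = 8;                 // how many parallel slices to read (made up)
        double sampleFraction = 0.15;   // keep roughly 15% of the rows (made up)

        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // The Murmur3 token space is [Long.MIN_VALUE, Long.MAX_VALUE].
            // Split it into equal slices; each slice is an independent range scan
            // and could be handed to a separate task for parallel reading.
            BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
            BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
            BigInteger width = max.subtract(min).divide(BigInteger.valueOf(splits));

            for (int i = 0; i < splits; i++) {
                long start = min.add(width.multiply(BigInteger.valueOf(i))).longValue();
                long end = (i == splits - 1)
                        ? Long.MAX_VALUE
                        : min.add(width.multiply(BigInteger.valueOf(i + 1))).longValue();

                // Boundaries are approximate in this sketch (the single minimum
                // token is skipped by the first slice).
                ResultSet rs = session.execute(
                        "SELECT id, payload FROM my_ks.my_table "
                                + "WHERE token(id) > ? AND token(id) <= ?",
                        start, end);

                for (Row row : rs) {
                    // Client-side random sampling; good enough for a rough 15%/50% cut.
                    if (Math.random() <= sampleFraction) {
                        System.out.println(row.getString("id"));
                    }
                }
            }
        }
    }
}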

Any suggestions and/or recommendations are highly appreciated.

Thanks in advance

Lasse Nedergaard

Re: Batch reading from Cassandra. How to?

Lasse Nedergaard
Any good suggestions?

Lasse


Re: Batch reading from Cassandra. How to?

Till Rohrmann
Hi Lasse,

as far as I know, the best way to read from Cassandra is to use the CassandraInputFormat [1]. Unfortunately, at the moment there is no optimized way to read large amounts of data comparable to what Spark offers. But if you want to contribute this feature to Flink, the community would highly appreciate it.
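
In case it helps, a minimal sketch of wiring the CassandraInputFormat into a DataSet job, assuming the flink-connector-cassandra dependency is on the classpath; the contact point, keyspace, table, and column types below are placeholders, not anything from the original thread:

import com.datastax.driver.core.Cluster;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;
import org.apache.flink.batch.connectors.cassandra.CassandraInputFormat;
import org.apache.flink.streaming.connectors.cassandra.ClusterBuilder;

public class CassandraBatchReadJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // How to reach the cluster; the contact point is a placeholder.
        ClusterBuilder clusterBuilder = new ClusterBuilder() {
            @Override
            protected Cluster buildCluster(Cluster.Builder builder) {
                return builder.addContactPoint("127.0.0.1").build();
            }
        };

        // One CQL query per input format; hypothetical keyspace, table, and columns.
        CassandraInputFormat<Tuple2<String, Long>> inputFormat =
                new CassandraInputFormat<>(
                        "SELECT id, value FROM my_ks.my_table;", clusterBuilder);

        // The whole query result is read through this single source, which
        // reflects the limitation described above (no token-range parallelism).
        DataSet<Tuple2<String, Long>> rows = env.createInput(
                inputFormat,
                TupleTypeInfo.getBasicTupleTypeInfo(String.class, Long.class));

        rows.first(10).print();
    }
}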


Cheers,
Till
