Re: Apache Flink transactions


Re: Apache Flink transactions

Chiwan Park
Hi.

Flink is not a DBMS; there is no equivalent of the insert, update, or remove operation.
But you can use the map [1] or filter [2] operation to create a modified dataset.

I recommend some slides [3][4] for understanding Flink's concepts.

Regards,
Chiwan Park

[1] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#map
[2] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#filter
[3] http://www.slideshare.net/robertmetzger1/introduction-to-apache-flink-palo-alto-meetup
[4] http://www.slideshare.net/dataArtisans/flink-training-dataset-api-basics


> On Jun 4, 2015, at 2:48 PM, Hawin Jiang <[hidden email]> wrote:
>
> Hi Admin
>
> Do we have insert, update and remove operations on Apache Flink?
>
> For example:  I have 10 million records in my test file.  I want to add one
> record, update one record and remove one record from this test file.
>
> How to implement it by Flink?
>
> Thanks.
>
> Best regards
>
> Email: [hidden email]
>





Re: Apache Flink transactions

Till Rohrmann

What you can do to simulate an insert is to read the new data into a separate DataSet and then apply a union operator to the new and the old DataSet.

Cheers,
Till








Re: Apache Flink transactions

Stephan Ewen
In reply to this post by Chiwan Park
To run operations like insert, update, and delete on persistent files, you need support from the storage engine.

The Apache ORC data format recently added support for transactional inserts, updates, and deletes: http://orc.apache.org/

ORC has a Hadoop input format, and you can use it with Flink through the Hadoop wrappers.






Re: Apache Flink transactions

hawin
In reply to this post by Chiwan Park
Hi  Chiwan

Thanks for your information.  I know Flink is not a DBMS. I want to know the Flink way to select, insert, update and delete data on HDFS.


@Till
Maybe union is a way to insert data, but I think it may cause some performance issues.


@Stephan
Thanks for your suggestion.  I have checked the Apache Flink roadmap; SQL on Flink is planned for Q3/Q4 2015. Will it support inserting, deleting and updating data on HDFS?
You already provide nice examples for selecting data on HDFS, such as TPCHQuery10 and TPCHQuery3.
Do you have other examples for inserting, updating and removing data on HDFS with Apache Flink?

Thanks

Re: Apache Flink transactions

Chiwan Park
Basically, Flink uses a functional programming data model. Every DataSet is immutable: we cannot modify a DataSet in place, we can only derive a new DataSet with the modification applied. Update and delete queries are therefore expressed by writing out a filtered or mapped DataSet.
The following Scala sample shows select, insert, update, and delete queries in Flink. (I'm not sure this is best practice.)

case class MyType(id: Int, value1: String, value2: String)

// load data (you can use readCsvFile, or something else.)
val data = env.fromElements(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))

// selecting
// same as SELECT * FROM data WHERE id = 1
val selectedData1 = data.filter(_.id == 1)
// same as SELECT value1 FROM data WHERE id = 1
val selectedData2 = data.filter(_.id == 1).map(_.value1)

// removing is the same as selecting, as follows
// same as DELETE FROM data WHERE id = 1, but DataSet data is not changed; the result is removedData
val removedData = data.filter(_.id != 1)

// inserting
// same as INSERT INTO data (id, value1, value2) VALUES (3, "new", "data")
val newData = env.fromElements(MyType(3, "new", "data"))
val insertedData = data.union(newData)

// updating
// same as UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1, but DataSet data is not changed
val updatedData = data.map { x => if (x.id == 1) MyType(x.id, "updated", "data") else x }
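The same semantics can be checked without a cluster. Below is a minimal runnable plain-Scala sketch (no Flink dependency; an immutable List stands in for the DataSet) showing that each "query" derives a new collection and leaves the original untouched:

```scala
// Plain Scala stand-in for the DataSet example above: List plays the role of
// the immutable DataSet, so every "query" produces a new collection.
case class MyType(id: Int, value1: String, value2: String)

val data = List(
  MyType(0, "test", "test2"),
  MyType(1, "hello", "flink"),
  MyType(2, "flink", "good"))

// DELETE FROM data WHERE id = 1 -> a new list without the matching element
val removedData = data.filter(_.id != 1)

// INSERT INTO data (id, value1, value2) VALUES (3, "new", "data") -> union
val insertedData = data ++ List(MyType(3, "new", "data"))

// UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1 -> map
val updatedData = data.map(x =>
  if (x.id == 1) x.copy(value1 = "updated", value2 = "data") else x)

println(removedData.map(_.id))                   // List(0, 2)
println(insertedData.size)                       // 4
println(updatedData.find(_.id == 1).get.value1)  // updated
println(data.size)                               // 3 -- the original is unchanged
```

Note that `data` itself is the same after all three operations, which is exactly the behavior of a Flink DataSet.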

Regards,
Chiwan Park






Re: Apache Flink transactions

Aljoscha Krettek
Yes, this code seems very reasonable. :D

The way to use this to "modify" a file on HDFS is to read the file, filter out some elements, and write a new, modified file that does not contain the filtered-out elements. As said before, neither Flink nor HDFS allows in-place modification of files.


Re: Apache Flink transactions

hawin
Hi Aljoscha

Thanks for your reply.
Do you have any tips for Flink SQL?
I know that Spark supports the ORC format. How about Flink SQL?
BTW, the TPCHQuery10 example is implemented in 231 lines of code.
How can that be made as simple as possible with Flink?
I am going to use Flink in my future project.  Sorry for so many questions.
I believe that you guys will make a world of difference.


@Chiwan
You made a very good example for me.  
Thanks a lot


Re: Apache Flink transactions

Aljoscha Krettek
Hi,
I think the example could be made more concise by using the Table API.
http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

Please let us know if you have questions about it; it is still quite new.


Re: Apache Flink transactions

Fabian Hueske-2
If you want to append data to a data set that is stored as files (e.g., on HDFS), you can use a directory structure as follows:

dataSetRootFolder
  - part1
    - 1
    - 2
    - ...
  - part2
    - 1
    - ...
  - partX
 
Flink's file formats support recursive directory scans, so you can add new subfolders to dataSetRootFolder and still read the full data set.
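To illustrate that layout, here is a small runnable plain-Scala sketch (no Flink involved; a hand-rolled recursive walk stands in for Flink's recursive file enumeration). It writes a dataSetRootFolder with two part subfolders into a temp directory and reads everything back as one logical data set:

```scala
// Build dataSetRootFolder/part1/1 and dataSetRootFolder/part2/1 in a temp dir,
// then read all files below the root as one logical data set.
import java.io.File
import java.nio.file.Files

val root = Files.createTempDirectory("dataSetRootFolder").toFile
for (part <- Seq("part1", "part2")) {
  val dir = new File(root, part)
  dir.mkdir()
  Files.write(new File(dir, "1").toPath, s"record-from-$part\n".getBytes("UTF-8"))
}

// Recursive enumeration: collect every regular file under the root.
def allFiles(dir: File): Seq[File] =
  dir.listFiles.toSeq.flatMap(f => if (f.isDirectory) allFiles(f) else Seq(f))

val records = allFiles(root)
  .flatMap(f => new String(Files.readAllBytes(f.toPath), "UTF-8").trim.split("\n").toSeq)

println(records.sorted)  // List(record-from-part1, record-from-part2)
```

Appending then amounts to dropping a new part folder under the root and re-reading the whole tree.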



Re: Apache Flink transactions

hawin
Thanks all

Actually, I want to know more about Flink SQL and Flink performance.
Here is the Spark benchmark; maybe you already saw it before: https://amplab.cs.berkeley.edu/benchmark/

Thanks.



Best regards
Hawin






Re: Apache Flink transactions

Aljoscha Krettek
Hi,
actually, what do you want to know about Flink SQL?

Aljoscha


Re: Apache Flink transactions

hawin
Hi Aljoscha

I want to know what the Apache Flink performance would be if I ran the same SQL as below.
Do you have any Apache Flink benchmark information?
Thanks.



SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Query 1A: 32,888 results
Query 1B: 3,331,851 results
Query 1C: 89,974,976 results

Median response time in seconds (Query 1A / 1B / 1C):

Redshift (HDD) - Current: 2.49 / 2.61 / 9.46
Impala - Disk - 1.2.3: 12.015 / 12.015 / 37.085
Impala - Mem - 1.2.3: 2.17 / 3.01 / 36.04
Shark - Disk - 0.8.1: 6.6 / 7 / 22.4
Shark - Mem - 0.8.1: 1.7 / 1.8 / 3.6
Hive - 0.12 YARN: 50.49 / 59.93 / 43.34
Tez - 0.2.0: 28.22 / 36.35 / 26.44
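For reference, that scan query maps directly onto filter/map in the DataSet style. This runnable plain-Scala sketch (a List instead of a Flink DataSet, with made-up toy data) shows the shape it would take:

```scala
// SELECT pageURL, pageRank FROM rankings WHERE pageRank > X, DataSet-style.
case class Ranking(pageURL: String, pageRank: Int)

val rankings = List(
  Ranking("a.example.com", 10),
  Ranking("b.example.com", 55),
  Ranking("c.example.com", 80))
val x = 50

val result = rankings
  .filter(_.pageRank > x)             // WHERE pageRank > X
  .map(r => (r.pageURL, r.pageRank))  // SELECT pageURL, pageRank

println(result)  // List((b.example.com,55), (c.example.com,80))
```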



Re: Apache Flink transactions

Aljoscha Krettek
Hi,
we don't have any current performance numbers, but the queries mentioned on the benchmark page should be easy to implement in Flink. It would be interesting if someone ported these queries and ran them with exactly the same data on the same machines.

Bill Sparks wrote on the mailing list some days ago (http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@...%3e). He seems to be running tests comparing Flink, Spark and MapReduce.

Regards,
Aljoscha







Re: Apache Flink transactions

hawin
In reply to this post by Aljoscha Krettek
Hey Aljoscha,

I also sent an email to Bill asking for the latest test results. From Bill's email, Apache Spark's performance looks better than Flink's.
What are your thoughts?



Best regards
Hawin







Re: Apache Flink transactions

Fabian Hueske-2
Comparing the performance of systems is not easy, and the results depend on many things, such as the configuration, the data, and the jobs.

That being said, the numbers that Bill reported for WordCount make absolute sense, as Stephan pointed out in his response (Flink does not feature hash-based aggregations yet).
So there are definitely use cases where Spark outperforms Flink, but there are also cases where both systems perform similarly or where Flink is faster. For example, more complex jobs benefit a lot from Flink's pipelined execution, and Flink's built-in iterations are very fast, especially delta iterations.

Best, Fabian


