Re: Apache Flink transactions


Re: Apache Flink transactions

Chiwan Park
Hi.

Flink is not a DBMS; there is no equivalent of the insert, update, or remove operation.
But you can use the map [1] or filter [2] operation to create a modified dataset.

I recommend some slides [3][4] for understanding Flink's concepts.

Regards,
Chiwan Park

[1] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#map
[2] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#filter
[3] http://www.slideshare.net/robertmetzger1/introduction-to-apache-flink-palo-alto-meetup
[4] http://www.slideshare.net/dataArtisans/flink-training-dataset-api-basics


> On Jun 4, 2015, at 2:48 PM, Hawin Jiang <[hidden email]> wrote:
>
> Hi Admin
>
> Do we have insert, update and remove operations on Apache Flink?
>
> For example:  I have 10 million records in my test file.  I want to add one
> record, update one record and remove one record from this test file.
>
> How to implement it by Flink?
>
> Thanks.
>
> Best regards
>
> Email: [hidden email]
>





Re: Apache Flink transactions

Till Rohrmann

What you can do to simulate an insert is to read the new data into a separate DataSet and then apply a union operator to the new and the old DataSet.

Cheers,
Till








Re: Apache Flink transactions

Stephan Ewen
In reply to this post by Chiwan Park
To run operations like insert, update, and delete on persistent files, you need support from the storage engine.

The Apache ORC data format recently added support for transactional inserts, updates, and deletes: http://orc.apache.org/

ORC has a Hadoop input format, and you can use it with Flink through the Hadoop wrappers.






Re: Apache Flink transactions

hawin
In reply to this post by Chiwan Park
Hi  Chiwan

Thanks for your information.  I know Flink is not a DBMS. I want to know the Flink way to select, insert, update and delete data on HDFS.


@Till
Maybe union is a way to insert data, but I think it may cause some performance issues.


@Stephan
Thanks for your suggestion.  I have checked the Apache Flink roadmap; SQL on Flink is planned for Q3/Q4 2015. Will it support inserting, deleting and updating data on HDFS?
You already provide nice examples for selecting data on HDFS, such as TPCHQuery10 and TPCHQuery3.
Do you have other examples for inserting, updating and removing data on HDFS with Apache Flink?

Thanks

Re: Apache Flink transactions

Chiwan Park
Basically, Flink uses a functional programming data model. Every DataSet is immutable: we cannot modify a DataSet in place, we can only derive a new DataSet with the modification applied. Update and delete queries are therefore expressed by writing out a filtered or mapped DataSet.
The following Scala sample shows select, insert, update, and delete queries in Flink. (I'm not sure this is best practice.)

case class MyType(id: Int, value1: String, value2: String)

// load data (you can use readCsvFile, or something else.)
val data = env.fromElements(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))

// selecting
// same as SELECT * FROM data WHERE id = 1
val selectedData1 = data.filter(_.id == 1)
// same as SELECT value1 FROM data WHERE id = 1
val selectedData2 = data.filter(_.id == 1).map(_.value1)

// removing is the same as selecting, as follows
// same as DELETE FROM data WHERE id = 1, but DataSet data is not changed; the result is removedData
val removedData = data.filter(_.id != 1)

// inserting
// same as INSERT INTO data (id, value1, value2) VALUES (3, "new", "data")
val newData = env.fromElements(MyType(3, "new", "data"))
val insertedData = data.union(newData)

// updating
// same as UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1, but DataSet data is not changed
val updatedData = data.map { x => if (x.id == 1) MyType(x.id, "updated", "data") else x }
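The same semantics can be checked without a cluster. Below is a minimal runnable plain-Scala sketch (no Flink dependency; an immutable List stands in for the DataSet) showing that each "query" derives a new collection and leaves the original untouched:

```scala
// Plain Scala stand-in for the DataSet example above: List plays the role of
// the immutable DataSet, so every "query" produces a new collection.
case class MyType(id: Int, value1: String, value2: String)

val data = List(
  MyType(0, "test", "test2"),
  MyType(1, "hello", "flink"),
  MyType(2, "flink", "good"))

// DELETE FROM data WHERE id = 1 -> a new list without the matching element
val removedData = data.filter(_.id != 1)

// INSERT INTO data (id, value1, value2) VALUES (3, "new", "data") -> union
val insertedData = data ++ List(MyType(3, "new", "data"))

// UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1 -> map
val updatedData = data.map(x =>
  if (x.id == 1) x.copy(value1 = "updated", value2 = "data") else x)

println(removedData.map(_.id))                   // List(0, 2)
println(insertedData.size)                       // 4
println(updatedData.find(_.id == 1).get.value1)  // updated
println(data.size)                               // 3 -- the original is unchanged
```

Note that `data` itself is the same after all three operations, which is exactly the behavior of a Flink DataSet.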

Regards,
Chiwan Park






Re: Apache Flink transactions

Aljoscha Krettek
Yes, this code seems very reasonable. :D

The way to use this to "modify" a file on HDFS is to read the file, filter out some elements, and write a new, modified file that does not contain the filtered-out elements. As said before, neither Flink nor HDFS allows in-place modification of files.


Re: Apache Flink transactions

hawin
Hi Aljoscha

Thanks for your reply.
Do you have any tips for Flink SQL?
I know that Spark supports the ORC format. How about Flink SQL?
BTW, the TPCHQuery10 example is implemented in 231 lines of code.
How can that be made as simple as possible with Flink?
I am going to use Flink in my future project.  Sorry for so many questions.
I believe that you guys will make a world of difference.


@Chiwan
You made a very good example for me.  
Thanks a lot


Re: Apache Flink transactions

Aljoscha Krettek
Hi,
I think the example could be made more concise by using the Table API.
http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

Please let us know if you have questions about it; it is still quite new.


Re: Apache Flink transactions

Fabian Hueske-2
If you want to append data to a data set that is stored as files (e.g., on HDFS), you can use a directory structure as follows:

dataSetRootFolder
  - part1
    - 1
    - 2
    - ...
  - part2
    - 1
    - ...
  - partX
 
Flink's file formats support recursive directory scans, so you can add new subfolders to dataSetRootFolder and still read the full data set.
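To illustrate that layout, here is a small runnable plain-Scala sketch (no Flink involved; a hand-rolled recursive walk stands in for Flink's recursive file enumeration). It writes a dataSetRootFolder with two part subfolders into a temp directory and reads everything back as one logical data set:

```scala
// Build dataSetRootFolder/part1/1 and dataSetRootFolder/part2/1 in a temp dir,
// then read all files below the root as one logical data set.
import java.io.File
import java.nio.file.Files

val root = Files.createTempDirectory("dataSetRootFolder").toFile
for (part <- Seq("part1", "part2")) {
  val dir = new File(root, part)
  dir.mkdir()
  Files.write(new File(dir, "1").toPath, s"record-from-$part\n".getBytes("UTF-8"))
}

// Recursive enumeration: collect every regular file under the root.
def allFiles(dir: File): Seq[File] =
  dir.listFiles.toSeq.flatMap(f => if (f.isDirectory) allFiles(f) else Seq(f))

val records = allFiles(root)
  .flatMap(f => new String(Files.readAllBytes(f.toPath), "UTF-8").trim.split("\n").toSeq)

println(records.sorted)  // List(record-from-part1, record-from-part2)
```

Appending then amounts to dropping a new part folder under the root and re-reading the whole tree.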



Re: Apache Flink transactions

hawin
Thanks all

Actually, I want to know more about Flink SQL and Flink performance.
Here is the Spark benchmark; maybe you already saw it before: https://amplab.cs.berkeley.edu/benchmark/

Thanks.



Best regards
Hawin






Re: Apache Flink transactions

Aljoscha Krettek
Hi,
actually, what do you want to know about Flink SQL?

Aljoscha


Re: Apache Flink transactions

hawin
Hi Aljoscha

I want to know what the Apache Flink performance would be if I ran the same SQL as below.
Do you have any Apache Flink benchmark information?
Thanks.



SELECT pageURL, pageRank FROM rankings WHERE pageRank > X

Query 1A: 32,888 results
Query 1B: 3,331,851 results
Query 1C: 89,974,976 results

Median response time in seconds (Query 1A / 1B / 1C):

Redshift (HDD) - Current: 2.49 / 2.61 / 9.46
Impala - Disk - 1.2.3: 12.015 / 12.015 / 37.085
Impala - Mem - 1.2.3: 2.17 / 3.01 / 36.04
Shark - Disk - 0.8.1: 6.6 / 7 / 22.4
Shark - Mem - 0.8.1: 1.7 / 1.8 / 3.6
Hive - 0.12 YARN: 50.49 / 59.93 / 43.34
Tez - 0.2.0: 28.22 / 36.35 / 26.44
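For reference, that scan query maps directly onto filter/map in the DataSet style. This runnable plain-Scala sketch (a List instead of a Flink DataSet, with made-up toy data) shows the shape it would take:

```scala
// SELECT pageURL, pageRank FROM rankings WHERE pageRank > X, DataSet-style.
case class Ranking(pageURL: String, pageRank: Int)

val rankings = List(
  Ranking("a.example.com", 10),
  Ranking("b.example.com", 55),
  Ranking("c.example.com", 80))
val x = 50

val result = rankings
  .filter(_.pageRank > x)             // WHERE pageRank > X
  .map(r => (r.pageURL, r.pageRank))  // SELECT pageURL, pageRank

println(result)  // List((b.example.com,55), (c.example.com,80))
```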



Re: Apache Flink transactions

Aljoscha Krettek
Hi,
we don't have any current performance numbers, but the queries mentioned on the benchmark page should be easy to implement in Flink. It would be interesting if someone ported these queries and ran them with exactly the same data on the same machines.

Bill Sparks wrote on the mailing list some days ago (http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@...%3e). He seems to be running tests comparing Flink, Spark and MapReduce.

Regards,
Aljoscha







Re: Apache Flink transactions

hawin
In reply to this post by Aljoscha Krettek
Hey Aljoscha,

I also sent an email to Bill asking for the latest test results. From Bill's email, Apache Spark's performance looks better than Flink's.
What are your thoughts?



Best regards
Hawin







Re: Apache Flink transactions

Fabian Hueske-2
Comparing the performance of systems is not easy, and the results depend on many things, such as the configuration, the data, and the jobs.

That being said, the numbers that Bill reported for WordCount make absolute sense, as Stephan pointed out in his response (Flink does not feature hash-based aggregations yet).
So there are definitely use cases where Spark outperforms Flink, but there are also cases where both systems perform similarly or where Flink is faster. For example, more complex jobs benefit a lot from Flink's pipelined execution, and Flink's built-in iterations are very fast, especially delta iterations.

Best, Fabian


