Hi.
Flink is not a DBMS. There is no direct equivalent of the insert, update, and remove operations. But you can use the map [1] or filter [2] transformation to create a modified DataSet. I recommend some slides [3][4] to understand the Flink concepts.

Regards,
Chiwan Park

[1] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#map
[2] http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#filter
[3] http://www.slideshare.net/robertmetzger1/introduction-to-apache-flink-palo-alto-meetup
[4] http://www.slideshare.net/dataArtisans/flink-training-dataset-api-basics

> On Jun 4, 2015, at 2:48 PM, Hawin Jiang <[hidden email]> wrote:
>
> Hi Admin
>
> Do we have insert, update, and remove operations in Apache Flink?
> For example: I have 10 million records in my test file. I want to add one record, update one record, and remove one record from this test file.
> How can I implement this with Flink?
>
> Thanks.
>
> Best regards
> Email: [hidden email]
But what you can do to simulate an insert is to read the new data in a separate DataSet and union it with the existing data.

Cheers,

On Thu, Jun 4, 2015 at 9:00 AM, Chiwan Park <[hidden email]> wrote:
To run operations like insert, update, and delete on persistent files, you need support from the storage engine. The Apache ORC data format recently added support for transactional inserts, updates, and deletes: http://orc.apache.org/

ORC has a Hadoop input format, and you can use that one with Flink through the Hadoop input format wrappers.
Hi Chiwan
Thanks for your information. I know Flink is not a DBMS. I want to know the Flink way to select, insert, update, and delete data on HDFS.

@Till
Maybe union is a way to insert data, but I think it will come with some performance cost.

@Stephan
Thanks for your suggestion. I have checked the Apache Flink roadmap. SQL on Flink will be released in Q3/Q4 2015. Will it support inserting, deleting, and updating data on HDFS?
You already provide nice examples for selecting data on HDFS, such as TPCHQuery10 and TPCHQuery3. Do you have other examples for inserting, updating, and removing data on HDFS with Apache Flink?

Thanks
Basically, Flink follows the functional programming model: every DataSet is immutable. This means we cannot modify a DataSet; we can only create a new DataSet that contains the modification. Update and delete queries are represented by writing out a filtered DataSet.
The following Scala sample shows select, insert, update, and remove queries in Flink. (I'm not sure this is best practice.)

case class MyType(id: Int, value1: String, value2: String)

// load data (you can use readCsvFile, or something else)
val data = env.fromElements(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))

// selecting
// same as SELECT * FROM data WHERE id = 1
val selectedData1 = data.filter(_.id == 1)
// same as SELECT value1 FROM data WHERE id = 1
val selectedData2 = data.filter(_.id == 1).map(_.value1)

// removing is the same as selecting, as follows
// same as DELETE FROM data WHERE id = 1, but the DataSet data is not changed; the result is removedData
val removedData = data.filter(_.id != 1)

// inserting
// same as INSERT INTO data (id, value1, value2) VALUES (3, "new", "data")
val newData = env.fromElements(MyType(3, "new", "data"))
val insertedData = data.union(newData)

// updating
// same as UPDATE data SET value1 = "updated", value2 = "data" WHERE id = 1, but the DataSet data is not changed
val updatedData = data.map { x => if (x.id == 1) MyType(x.id, "updated", "data") else x }

Regards,
Chiwan Park
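The semantics of Chiwan's example can be checked locally with plain Scala collections, which share the same immutable, copy-on-transform model as a DataSet. This is only a rough local analogy, not Flink itself:

```scala
// Local analogy of the Flink example: Scala's immutable List, like a
// DataSet, never changes in place -- every operation returns a new value.
case class MyType(id: Int, value1: String, value2: String)

val data = List(MyType(0, "test", "test2"), MyType(1, "hello", "flink"), MyType(2, "flink", "good"))

// SELECT * FROM data WHERE id = 1
val selected = data.filter(_.id == 1)

// DELETE FROM data WHERE id = 1 -> keep everything else
val removed = data.filter(_.id != 1)

// INSERT INTO data (id, value1, value2) VALUES (3, "new", "data")
val inserted = data :+ MyType(3, "new", "data")

// UPDATE data SET value1 = "updated" WHERE id = 1
val updated = data.map(x => if (x.id == 1) x.copy(value1 = "updated") else x)
```

In every case `data` itself is left untouched; the "modified" result is a new collection, just as each Flink transformation yields a new DataSet.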
Yes, this code seems very reasonable. :D
The way to use this to "modify" a file on HDFS is to read the file, filter out some elements, and write a new modified file that does not contain the filtered-out elements. As said before, Flink (or HDFS) does not allow in-place modification of files.
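The read-then-write-a-new-file pattern described above can be sketched with ordinary local files standing in for HDFS (a plain Scala I/O analogy; the temp-file paths are illustrative, not Flink API):

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Stand-in for a file on HDFS: a local temp file with three CSV records
val original = File.createTempFile("records", ".csv")
val writer = new PrintWriter(original)
writer.write("0,test,test2\n1,hello,flink\n2,flink,good\n")
writer.close()

// Step 1: read the existing file
val lines = Source.fromFile(original).getLines().toList

// Step 2: filter out the records to "delete" (here: id = 1)
val kept = lines.filterNot(_.startsWith("1,"))

// Step 3: write a NEW file -- the original is never modified in place
val modified = File.createTempFile("records-modified", ".csv")
val out = new PrintWriter(modified)
kept.foreach(out.println)
out.close()
```

After this runs, the original file still holds all three records, and the new file holds the two surviving ones, mirroring how a Flink job would write its filtered output to a fresh HDFS path.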
Hi Aljoscha
Thanks for your reply. Do you have any tips for Flink SQL. I know that Spark support ORC format. How about Flink SQL? BTW, for TPCHQuery10 example, you have implemented it by 231 lines of code. How to make that as simple as possible by flink. I am going to use Flink in my future project. Sorry for so many questions. I believe that you guys will make a world difference. @Chiwan You made a very good example for me. Thanks a lot |
Hi,
I think the example could be made more concise by using the Table API: http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html

Please let us know if you have questions about it; it is still quite new.
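As a pseudocode sketch only (written from memory of the early Table API and not verified against any release, so the exact syntax may differ from the linked documentation), the filter-plus-map select from Chiwan's example might shrink to something like:

```scala
// Pseudocode sketch, NOT verified against a Flink release.
// The Table API converts a DataSet into a Table supporting SQL-like
// operators, collapsing the filter + map pair into one expression.
import org.apache.flink.api.scala.table._

val table = data.toTable
// cf. SELECT value1 FROM data WHERE id = 1
val result = table.where('id === 1).select('value1)
```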
If you want to append data to a data set that is stored as files (e.g., on HDFS), you can go for a directory structure as follows:

dataSetRootFolder
  - part1
    - 1
    - 2
    - ...
  - part2
    - 1
    - ...
  - partX

Flink's file format supports recursive directory scans, so you can add new subfolders to dataSetRootFolder and still read the full data set.
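The append-by-subfolder layout can be sketched locally with plain Java/Scala file APIs standing in for HDFS (directory and file names are illustrative; this models the layout, not Flink's scan itself):

```scala
import java.io.{File, PrintWriter}
import java.nio.file.Files
import scala.io.Source

// Stand-in for dataSetRootFolder on HDFS
val root = Files.createTempDirectory("dataSetRootFolder").toFile

def writePart(part: String, name: String, content: String): Unit = {
  val dir = new File(root, part); dir.mkdirs()
  val w = new PrintWriter(new File(dir, name)); w.write(content); w.close()
}

// initial data set
writePart("part1", "1", "a\nb\n")
writePart("part1", "2", "c\n")

// "insert": add a new subfolder instead of touching existing files
writePart("part2", "1", "d\n")

// a recursive scan reads the full data set across all parts
def readAll(dir: File): List[String] =
  dir.listFiles.sortBy(_.getName).toList.flatMap { f =>
    if (f.isDirectory) readAll(f) else Source.fromFile(f).getLines().toList
  }

val all = readAll(root)
```

The recursive read picks up the newly added part2 alongside the original files, which is exactly the effect of Flink's recursive directory scan over the root folder.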
Thanks all.

Actually, I want to know more about Flink SQL and Flink performance. Here is the Spark benchmark; maybe you already saw it before: https://amplab.cs.berkeley.edu/benchmark/

Thanks.

Best regards
Hawin
Hi,
Actually, what do you want to know about Flink SQL?

Aljoscha
Hi Aljoscha,

I want to know what Apache Flink's performance would be if I ran the same SQL queries as in that benchmark. Do you have any Apache Flink benchmark information? Thanks.
Hi,

we don't have any current performance numbers, but the queries mentioned on the benchmark page should be easy to implement in Flink. It would be interesting if someone ported these queries and ran them with exactly the same data on the same machines. Bill Sparks wrote on the mailing list some days ago (http://mail-archives.apache.org/mod_mbox/flink-user/201506.mbox/%3cD1972778.64426%25jsparks@...%3e); he seems to be running some tests to compare Flink, Spark, and MapReduce.

Regards,
Aljoscha
Hey Aljoscha,

I also sent an email to Bill asking for the latest test results. From Bill's email, Apache Spark's performance looks better than Flink's. What are your thoughts?

Best regards
Hawin
Comparing the performance of systems is not easy, and the results depend on a lot of things, such as the configuration, the data, and the jobs. That being said, the numbers that Bill reported for WordCount make absolute sense, as Stephan pointed out in his response (Flink does not feature hash-based aggregations yet). So there are definitely use cases where Spark outperforms Flink, but there are also other cases where both systems perform similarly or Flink is faster. For example, more complex jobs benefit a lot from Flink's pipelined execution, and Flink's built-in iterations are very fast, especially delta iterations.

Best, Fabian