Using Flink in an university course

Wouter Zorgdrager-2
Hi all,

I'm working on a setup to use Apache Flink in an assignment for a Big Data (bachelor) university course, and I'm interested in your views on this. To sketch the situation:
- more than 200 students follow this course
- students have to write some (simple) Flink applications using the DataStream API; the focus is on writing the transformation code
- students need to write Scala code
- we provide a dataset and a template (Scala class) with function signatures and a detailed description per application,
  e.g.: def assignment_one(input: DataStream[Event]): DataStream[(String, Int)] = ???
- we provide some setup code, such as parsing the data and setting up the streaming environment
- assignments need to be auto-graded, based on correct results

In last year's edition of the course we approached this with a custom Docker container. This container first compiled the students' code, ran all the Flink applications against a different dataset, and then verified the output against our solutions. The result was turned into a grade and reported back to the student. Although this approach worked, I think we can do better.
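The verification step described above could be sketched roughly as follows. This is only an illustration, assuming the job output has already been collected into an in-memory collection; `ExactGrader`, `asMultiset`, and `grade` are hypothetical names and not part of the actual course container.

```scala
object ExactGrader {
  // Count occurrences of each record, so that record order does not matter
  // but duplicates still do.
  def asMultiset[A](records: Seq[A]): Map[A, Int] =
    records.groupBy(identity).map { case (record, copies) => (record, copies.size) }

  // All-or-nothing grading: full points only if the student's output contains
  // exactly the same records as the reference solution.
  def grade[A](expected: Seq[A], actual: Seq[A], maxPoints: Int): Int =
    if (asMultiset(expected) == asMultiset(actual)) maxPoints else 0
}
```

Comparing multisets rather than sequences is a design choice made here because a parallel streaming job generally does not guarantee the order in which results are emitted.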

I'm wondering if any of you have experience with using Apache Flink in a university course (or have seen this somewhere), or with assessing Flink code.

Thanks a lot!

Kind regards,
Wouter Zorgdrager

Re: Using Flink in an university course

Jörn Franke
It would help to understand what issues you currently have with this approach. I used a similar approach (not with Flink, but with a similar big data technology) some years ago.

> On 04.03.2019 at 11:32, Wouter Zorgdrager <[hidden email]> wrote:

Re: Using Flink in an university course

Fabian Hueske-2
Hi Wouter,

We are using Docker Compose setups (Flink JM, Flink TM, Kafka, Zookeeper) for our trainings, and this is working very well.
We have an additional container that feeds a Kafka topic via the command-line producer to simulate somewhat realistic behavior.
Of course, you can also do it without Kafka and use some kind of data-generating source that reads from a file, which is then replaced for evaluation.

The biggest benefit that I see in using Docker is that, for development and testing, the students have an environment that is close to the grading situation.
You do not need to provide infrastructure; everyone runs it locally in a well-defined context.

So, as Joern said, what problems do you see with Docker?

Best,
Fabian

On Mon, 4 Mar 2019 at 13:44, Jörn Franke <[hidden email]> wrote:

Re: Using Flink in an university course

Wouter Zorgdrager-2
Hey all,

Thanks for the replies. The issues we were running into (which are not specific to Docker):
- Students who changed the template incorrectly caused the container to fail.
- We give full points if the output matches our solutions (and none otherwise), but it would be nice if we could give partial grades per assignment (and better feedback). This would require looking not only at the results but also at the operators used. The pitfall is that in many cases a correct solution can be reached in multiple ways. I came across a Flink test library [1] which allows testing Flink code more extensively, but it seems to be available only in Java.
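To make the idea of partial grades concrete, here is one possible output-based sketch. Note that it is not the operator-based assessment mentioned above but a simpler alternative that only compares results; `PartialGrader`, `score`, and the scoring formula are assumptions made for illustration.

```scala
object PartialGrader {
  // Count occurrences of each record (order-insensitive, duplicate-sensitive).
  private def counts[A](records: Seq[A]): Map[A, Int] =
    records.groupBy(identity).map { case (record, copies) => (record, copies.size) }

  // Score = maxPoints * matched / max(|expected|, |actual|).
  // Dividing by the larger size means that both missing and spurious
  // records lower the score.
  def score[A](expected: Seq[A], actual: Seq[A], maxPoints: Double): Double = {
    val exp = counts(expected)
    val act = counts(actual)
    val matched = exp.toSeq.map { case (record, n) => math.min(n, act.getOrElse(record, 0)) }.sum
    val denom = math.max(expected.size, actual.size)
    if (denom == 0) maxPoints else maxPoints * matched / denom
  }
}
```

For example, if the reference output holds four records and the student output holds three, two of which are correct, this formula awards half of the points. Whether such a scheme is fair for a given assignment is a judgment call that this sketch does not settle.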

In retrospect, I do think using Docker is a good approach, as Fabian confirms. However, the way we currently assess student solutions might be improved. I assume that manual feedback is given in your trainings, but unfortunately this is quite difficult with so many students.

Cheers,
Wouter



On Mon, 4 Mar 2019 at 14:39, Fabian Hueske <[hidden email]> wrote:

Re: Using Flink in an university course

Addison Higham
Hi there,

As far as a runtime for students goes, it seems like Docker is your best bet. However, you could instead have them package a jar that implements some interface (for example, see https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/packaging.html, which details the `Program` interface) and then execute it inside a custom runner. That *might* be less prone to breakage, since each submission would need to conform to an interface, but it may require a fair amount of custom code, both to reduce the boilerplate needed to build up a program plan and for the custom runner itself. The code for how Flink loads a jar and turns it into something it can execute is mostly encapsulated in org.apache.flink.client.program.PackagedProgram, which might be worth reading and understanding if you go down this route.
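The general shape of "conform to an interface, run inside a custom runner" can be sketched in plain Scala as below. This deliberately does not use Flink's `Program` interface or `PackagedProgram`; the `Assignment` trait, the `WordCount` submission, and `Runner.runSubmission` are hypothetical names, and plain collections stand in for DataStreams to keep the example self-contained.

```scala
import scala.util.Try

// Hypothetical contract that every submission would have to implement.
trait Assignment {
  def name: String
  def run(input: Seq[String]): Seq[(String, Int)]
}

// Stand-in for a student submission; in practice it would come from their jar.
class WordCount extends Assignment {
  val name = "word-count"
  def run(input: Seq[String]): Seq[(String, Int)] =
    input.groupBy(identity).map { case (word, copies) => (word, copies.size) }.toSeq
}

object Runner {
  // Load a submission by class name and run it. Non-fatal failures (missing
  // class, wrong type, exception in student code) are captured in the Try
  // instead of crashing the whole grading run.
  def runSubmission(className: String, input: Seq[String]): Try[Seq[(String, Int)]] =
    Try {
      val instance = Class.forName(className).getDeclaredConstructor().newInstance()
      instance.asInstanceOf[Assignment].run(input)
    }
}
```

A call such as `Runner.runSubmission(classOf[WordCount].getName, Seq("a", "b", "a"))` yields a `Success` holding the counts, while an unknown class name yields a `Failure` that the grader can turn into feedback for the student.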

If you want to give more insight, you could build some tooling to traverse the underlying graphs that the students build up in their DataStream applications. For example, calling `StreamExecutionEnvironment.getStreamGraph` after the data stream is built returns a graph of the current job, which you can then traverse to see which operators and edges are in use. This is very similar to the process Flink uses to build the job DAG it renders in the UI. I am not sure what you could do as an automated analysis, but the StreamGraph API is quite low level and exposes a lot of information about the program.
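A rough sketch of such a traversal is shown below. It assumes the Flink 1.7 Scala DataStream API on the classpath (the StreamGraph classes are internal and may differ in other versions); the `Event` class, the `assignmentOne` body, and the `GraphInspector` object are hypothetical stand-ins, since the thread does not show the real template.

```scala
import org.apache.flink.streaming.api.scala._
import scala.collection.JavaConverters._

object GraphInspector {
  // Hypothetical record type; the real Event class is not shown in the thread.
  case class Event(user: String, action: String)

  // Stand-in for a student solution: count events per user.
  def assignmentOne(input: DataStream[Event]): DataStream[(String, Int)] =
    input.map(e => (e.user, 1)).keyBy(_._1).sum(1)

  // Names of all operators in the job graph, ordered by node id.
  def operatorNames(env: StreamExecutionEnvironment): Seq[String] =
    env.getStreamGraph.getStreamNodes.asScala.toSeq.sortBy(_.getId).map(_.getOperatorName)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val events = env.fromElements(Event("a", "click"), Event("b", "view"))
    assignmentOne(events).print()

    // The graph is built here without executing the job.
    for (node <- env.getStreamGraph.getStreamNodes.asScala) {
      val targets = node.getOutEdges.asScala.map(_.getTargetId).mkString(", ")
      println(s"${node.getId}: ${node.getOperatorName} -> [$targets]")
    }
  }
}
```

A grader could compare the returned operator names against a set of expected operators, keeping in mind that a correct solution may legitimately use different operators.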

Hopefully that is somewhat helpful. Good luck, and it sounds like a fun course!


On Mon, Mar 4, 2019 at 7:16 AM Wouter Zorgdrager <[hidden email]> wrote:

Re: Using Flink in an university course

Wouter Zorgdrager-2
Hi all,

Thanks for the input. Much appreciated.

Regards,
Wouter

On Mon, 4 Mar 2019 at 20:40, Addison Higham <[hidden email]> wrote: