Runtime generated (source) datasets

13 messages
Runtime generated (source) datasets

Flavio Pompermaier
Hi guys,

I have a big question for you about how Flink handles job plan generation:
let's suppose I want to write a job that takes as input a description of a set of datasets to work on (for example, a CSV file and its path, 2 HBase tables, a Parquet directory and its path, etc.).
From what I know, Flink generates the job's plan at compile time, so I was wondering whether this is currently possible or not.

Thanks in advance,
Flavio
Re: Runtime generated (source) datasets

Till Rohrmann
Hi Flavio,

If your question was whether you can write a Flink job that reads input from different sources depending on user input, then the answer is yes. Flink job plans are actually generated at runtime, so you can easily write a method that generates a user-dependent input/data set.

You could do something like this:

DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
  // compare String arguments with equals(), not ==
  if ("csv".equals(args[0])) {
    return env.readCsvFile(...);
  } else {
    return env.createInput(new AvroInputFormat<ElementType>(...));
  }
}

as long as the element type of the data set is the same for all possible data sources. I hope that I understood your problem correctly.
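If the number of supported input formats grows, the if/else chain can be replaced by a lookup table keyed on the argument. Here is a minimal self-contained sketch of that pattern, using plain Java lists as stand-ins for Flink data sets (the names SOURCES and getInput are illustrative, not Flink API):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class SourceDispatch {
    // Hypothetical stand-in: each supplier plays the role of
    // env.readCsvFile(...) / env.createInput(...) returning a data set.
    static final Map<String, Supplier<List<String>>> SOURCES = Map.of(
        "csv", () -> List.of("row-from-csv"),
        "avro", () -> List.of("record-from-avro"));

    static List<String> getInput(String[] args) {
        Supplier<List<String>> source = SOURCES.get(args[0]);
        if (source == null) {
            throw new IllegalArgumentException("unknown source type: " + args[0]);
        }
        return source.get();
    }

    public static void main(String[] args) {
        System.out.println(getInput(new String[]{"csv"})); // [row-from-csv]
    }
}
```

The same registry approach works with real Flink sources, as long as every supplier produces a data set of the same element type.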

Greets,

Till


Re: Runtime generated (source) datasets

Flavio Pompermaier
Hi Till,
thanks for the reply. However, my problem is that I'll have something like:

List<DataSet<ElementType>> getInput(String[] args, ExecutionEnvironment env) { ... }

So I don't know in advance how many of them I'll have at runtime. Does it still work?

Re: Runtime generated (source) datasets

Till Rohrmann
Yes, this will also work. You only have to make sure that the list of data sets is processed properly later on in your code.
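One common way to process such a runtime-sized list is to fold it pairwise into a single data set, e.g. with DataSet.union in a loop. As a self-contained sketch of that fold, using plain Java lists as a stand-in for DataSet (foldAll is an illustrative name, not Flink API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

public class CombineInputs {
    // Left-fold a runtime-sized list of inputs into one result, pairwise.
    // With Flink's DataSet API the same shape would be:
    //   DataSet<T> all = dataSets.get(0);
    //   for (int i = 1; i < dataSets.size(); i++) all = all.union(dataSets.get(i));
    static <T> T foldAll(List<T> inputs, BinaryOperator<T> combine) {
        T acc = inputs.get(0);
        for (int i = 1; i < inputs.size(); i++) {
            acc = combine.apply(acc, inputs.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<List<Integer>> sources =
            List.of(List.of(1, 2), List.of(3), List.of(4, 5));
        // combine by concatenation, the list analogue of union
        List<Integer> union = foldAll(sources, (a, b) -> {
            List<Integer> merged = new ArrayList<>(a);
            merged.addAll(b);
            return merged;
        });
        System.out.println(union); // [1, 2, 3, 4, 5]
    }
}
```

Since the plan is only built when the program runs, the number of iterations of such a loop can freely depend on the user's input.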


Re: Runtime generated (source) datasets

Flavio Pompermaier
Great! Could you explain a little bit about the internals of how and when Flink generates the plan, and how the execution environment is involved in this phase?
Just to better understand this step!

Thanks again,
Flavio


Re: Runtime generated (source) datasets

Fabian Hueske
The program is compiled when the ExecutionEnvironment.execute() method is called. At that moment, the ExecutionEnvironment collects all data sources that were previously created and traverses them towards connected data sinks. All sinks found this way are remembered and treated as execution targets. The sinks and all connected operators and data sources are given to the optimizer, which analyzes the plan, compiles an execution plan, and submits it to the execution system that the ExecutionEnvironment refers to (local, remote, ...).

Therefore, your code can build arbitrary data flows with as many sources as you like. Once you call ExecutionEnvironment.execute(), all data sources and operators that are required to compute the result of all data sinks are executed.
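To make the traversal described above concrete, here is a toy model in plain Java (PlanNode and collectReachable are invented names, not Flink classes): starting from the registered sinks, a backwards walk over the operator inputs finds exactly the sources and operators that contribute to some sink; anything not connected to a sink is left out of the executed plan.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy operator-graph node: a name plus the operators it reads from.
class PlanNode {
    final String name;
    final List<PlanNode> inputs;
    PlanNode(String name, PlanNode... inputs) {
        this.name = name;
        this.inputs = List.of(inputs);
    }
}

public class PlanTraversal {
    // Walk from the sinks towards the sources, collecting every node
    // that contributes to at least one sink.
    static Set<String> collectReachable(List<PlanNode> sinks) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<PlanNode> pending = new ArrayDeque<>(sinks);
        while (!pending.isEmpty()) {
            PlanNode n = pending.pop();
            if (seen.add(n.name)) {
                pending.addAll(n.inputs);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        PlanNode src1 = new PlanNode("csv-source");
        PlanNode src2 = new PlanNode("hbase-source");
        PlanNode map = new PlanNode("map", src1);
        PlanNode join = new PlanNode("join", map, src2);
        PlanNode sink = new PlanNode("sink", join);
        PlanNode orphan = new PlanNode("unused-source"); // not connected to any sink
        // only operators connected to a sink end up in the plan
        System.out.println(collectReachable(List.of(sink)));
    }
}
```

This is only a sketch of the traversal idea, not of Flink's actual optimizer data structures.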



Re: Runtime generated (source) datasets

Flavio Pompermaier
Thanks Fabian, that makes a lot of sense :)

Best,
Flavio


Re: Runtime generated (source) datasets

Stephan Ewen
There is a common misunderstanding between the "compile" phase of the Java/Scala compiler (which does not generate the Flink plan) and the Flink "compile/optimize" phase (happening when calling env.execute()).

The Flink compile/optimize phase is not a compile phase in the sense that source code is parsed and translated to byte code. It is only a set of transformations on the program's data flow.

We should probably stop calling the Flink phase "compile" and instead call it "pre-flight", "optimize", or "prepare". Otherwise it creates frequent confusion...


Re: Runtime generated (source) datasets

Plamen L. Simeonov
Absolutely correct, and a good idea! Why not call it a "digester", taking the term from chemistry/medicine:

Chemistry: A vessel ("pipeline") in which substances are softened or decomposed, usually for further processing.

All the best!


___________________________________

Dr.-Ing. Plamen L. Simeonov
Department 1: Geodäsie und Fernerkundung
Sektion 1.5: Geoinformatik
Tel.: +49 (0)331/288-1587
Fax:  +49 (0)331/288-1732
email: [hidden email]
http://www.gfz-potsdam.de/
___________________________________

Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum - GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg A 20, 14473 Potsdam
**************************************************





Re: Runtime generated (source) datasets

rmetzger0
In reply to this post by Stephan Ewen
How about renaming the "flink-compiler" to "flink-optimizer" ?


Re: Runtime generated (source) datasets

Plamen L. Simeonov
Renaming is fine as long as it is also understood by newcomers to the list.
The problem with "optimiser" is that it is not clear what exactly an "optimiser" optimises, whereas terms like "compiler", "interpreter", and "processor" are clearly defined in software engineering. I would suggest that whatever term is chosen, it should be defined: 1) somewhere on the Flink web site, perhaps in a glossary or a simple BNF-like spec (this holds for all special terms used); and 2) in Wikipedia, so that everyone can check what is what.

I hope this helps. 

Plamen

PS: By the way, terms starting with "flink-" are a good idea.






Re: Runtime generated (source) datasets

Till Rohrmann
In reply to this post by rmetzger0
But it does not only optimize the data flow; it also translates it into a different representation.

On Thu, Jan 22, 2015 at 3:34 PM, Robert Metzger <[hidden email]> wrote:
How about renaming the "flink-compiler" to "flink-optimizer" ?

On Wed, Jan 21, 2015 at 8:21 PM, Stephan Ewen <[hidden email]> wrote:
There is a common misunderstanding between the "compile" phase of the Java/Scala compiler (which does not generate the Flink plan) and the Flink "compile/optimize" phase (happening when calling env.execute()).

The Flink compile/optimize phase is not a compile phase in the sense that source code is parsed and translated to byte code. It is only a set of transformations on the program's data flow.

We should probably stop calling the Flink phase "compile", but simply "pre-flight", "optimize", or "prepare". Otherwise, it creates frequent confusion...

On Wed, Jan 21, 2015 at 6:05 AM, Flavio Pompermaier <[hidden email]> wrote:
Thanks Fabian, that makes a lot of sense :)

Best,
Flavio

On Wed, Jan 21, 2015 at 2:41 PM, Fabian Hueske <[hidden email]> wrote:
The program is compiled when the ExecutionEnvironment.execute() method is called. At that moment, the ExecutionEnvironment collects all data sources that were previously created and traverses them towards connected data sinks. All sinks that are found this way are remembered and treated as execution targets. The sinks and all connected operators and data sources are given to the optimizer, which analyses the plan, compiles an execution plan, and submits it to the execution system which the ExecutionEnvironment refers to (local, remote, ...).

Therefore, your code can build arbitrary data flows with as many sources as you like. Once you call ExecutionEnvironment.execute(), all data sources and operators which are required to compute the result of all data sinks are executed.
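
A minimal sketch of this (the input paths, the text-file format, and the job name are placeholder assumptions):

```java
// Build a plan with a runtime-determined number of sources and sinks;
// nothing runs until env.execute() is called once at the end.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
for (String path : inputPaths) {      // inputPaths is known only at runtime
    env.readTextFile(path)
       .writeAsText(path + ".out");   // each source gets its own sink
}
env.execute("Multi-source job");      // optimizes and executes the whole plan
```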


2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Great! Could you explain to me a little about the internals of how and when Flink generates the plan, and how the execution environment is involved in this phase?
Just to better understand this step!

Thanks again,
Flavio


On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <[hidden email]> wrote:
Yes this will also work. You only have to make sure that the list of data sets is processed properly later on in your code.
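
One way to process such a list later on, assuming all data sets share the same element type, would be to union them into a single data set (a sketch, not from the thread; ElementType is the thread's placeholder type):

```java
// Fold a runtime-sized list of data sets into one via DataSet.union()
DataSet<ElementType> unionAll(List<DataSet<ElementType>> inputs) {
    DataSet<ElementType> result = inputs.get(0);
    for (DataSet<ElementType> next : inputs.subList(1, inputs.size())) {
        result = result.union(next);
    }
    return result;
}
```

Alternatively, each data set in the list can be wired to its own sink before calling execute(), as described above.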

On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <[hidden email]> wrote:
Hi Till,
thanks for the reply. However my problem is that I'll have something like:

List<DataSet<ElementType>> getInput(String[] args, ExecutionEnvironment env) { ... }

So I don't know in advance how many of them I'll have at runtime. Does it still work?

On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <[hidden email]> wrote:
Hi Flavio,

if your question was whether you can write a Flink job which can read input from different sources, depending on the user input, then the answer is yes. The Flink job plans are actually generated at runtime so that you can easily write a method which generates a user dependent input/data set.

You could do something like this:

DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env) {
  // Compare string contents with equals(), not ==
  if ("csv".equals(args[0])) {
    return env.readCsvFile(...);
  } else {
    return env.createInput(new AvroInputFormat<ElementType>(...));
  }
}

as long as the element types of the data sets are equal for all possible data sources. I hope that I understood your problem correctly.

Greets,

Till

On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <[hidden email]> wrote:
Hi guys,

I have a big question for you about how Flink handles job plan generation:
let's suppose that I want to write a job that takes as input a description of a set of datasets that I want to work on (for example a csv file and its path, 2 hbase tables, 1 parquet directory and its path, etc.).
From what I know, Flink generates the job's plan at compile time, so I was wondering whether this is possible right now or not...

Thanks in advance,
Flavio


Re: Runtime generated (source) datasets

Plamen L. Simeonov
That’s fine. All this should come in the definition of a "flink-something".

___________________________________

Dr.-Ing. Plamen L. Simeonov
Department 1: Geodäsie und Fernerkundung
Sektion 1.5: Geoinformatik
Tel.: +49 (0)331/288-1587
Fax:  +49 (0)331/288-1732
email: [hidden email]
http://www.gfz-potsdam.de/
___________________________________

Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum - GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg A 20, 14473 Potsdam
**************************************************

On 22 Jan 2015, at 15:50, Till Rohrmann <[hidden email]> wrote:

But it not only optimizes the data flow. It also translates it into a different representation.