bigpetstore flink : parallelizing collections

6 messages

bigpetstore flink : parallelizing collections

jay vyas
Hi Flink,

I'm happy to announce that I've done a small bit of initial hacking on bigpetstore-flink, in order to reproduce in Flink what we do in Spark.

TL;DR: the main question is at the bottom!

Currently, I want to generate transactions for a list of customers. The transaction generation is a parallel process, and the customers are generated beforehand.

In Hadoop, we can create an InputFormat with custom splits if we want to split a data set up; otherwise, we can break it into files.

In Spark, there is a convenient parallelize() that we can run on a list; we can then capture the resulting RDD and run a parallelized transform.
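
Roughly, the pattern we use looks like this (a minimal sketch; Customer, Transaction, generateTransactions(), and customerList are stand-ins for our actual BigPetStore types and data):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("bigpetstore");
JavaSparkContext sc = new JavaSparkContext(conf);

// Distribute the pre-generated customers across the cluster ...
JavaRDD<Customer> customerRdd = sc.parallelize(customerList);

// ... and run the (placeholder) transaction generator in parallel.
// In Spark 1.x, flatMap takes a function returning an Iterable.
JavaRDD<Transaction> transactions =
    customerRdd.flatMap(c -> generateTransactions(c));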

In Flink, I have an array of customers, and I want to run our transaction generator in parallel, once per customer. How would I do that?

--
jay vyas

Re: bigpetstore flink : parallelizing collections

Stephan Ewen
Hi Jay!

You can use the fromCollection() or fromElements() methods to create a DataSet or DataStream from a Java/Scala collection. That moves the data into the cluster and allows you to run parallel transformations on the elements.

Make sure you set the parallelism of the operation that you want to be parallel.


Here is a code sample:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// fromElements() is varargs, so you can pass an array directly
DataSet<MyType> data = env.fromElements(myArray);

// makes sure you have 80 parallel mapper instances
data.map(new TransactionMapper()).setParallelism(80);
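
A fuller (untested) sketch for your customer-to-transaction case could look like this, where Customer, Transaction, and makeTransactionsFor() are placeholders for your actual types, customerList is your pre-generated java.util.List<Customer>, and the output path is just an example:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Ship the pre-generated customers into the cluster as a DataSet
DataSet<Customer> customers = env.fromCollection(customerList);

// Generate transactions in parallel, one customer at a time
DataSet<Transaction> transactions = customers
    .flatMap(new FlatMapFunction<Customer, Transaction>() {
        @Override
        public void flatMap(Customer customer, Collector<Transaction> out) {
            for (Transaction t : makeTransactionsFor(customer)) { // placeholder generator
                out.collect(t);
            }
        }
    })
    .setParallelism(80);

transactions.writeAsText("/tmp/transactions"); // hypothetical output path
env.execute("bigpetstore transaction generation");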


Stephan




Re: bigpetstore flink : parallelizing collections

jay vyas
Awesome, thanks! I'll try it out.

This is part of a wave of JIRAs for Bigtop/Flink integration. If your distro/packaging folks collaborate with us, it will save you time in the long run, because you can piggyback on the Bigtop infrastructure for RPM/DEB packaging, smoke testing, and HDFS interop testing:

https://issues.apache.org/jira/browse/BIGTOP-1927

Just FYI, it's great to connect, Stephan and others. Will keep you posted!

--
jay vyas

Re: bigpetstore flink : parallelizing collections

Maximilian Michels
Hi Jay,

Great to hear there is an effort to integrate Flink with Bigtop. Please let us know if any questions come up in the course of the integration!

Best,
Max




Re: bigpetstore flink : parallelizing collections

jay vyas
OK. Now, my thoughts on this: the work should be synergistic with Flink's needs, rather than an orthogonal task that you help us with, so please keep us updated on what your needs are: https://issues.apache.org/jira/browse/BIGTOP-1927

--
jay vyas

Re: bigpetstore flink : parallelizing collections

Maximilian Michels
Absolutely. I see it as a synergistic process too. I just learned about Bigtop. As for the packaging, I don't think Flink has very different demands from the other frameworks already integrated. As for the rest, I'm not familiar enough with Bigtop yet. Currently, Henry is the only Flink committer looking into it, but I'm sure we will find other Flink contributors to help out as well. I can even see myself getting into it.

Thanks for your effort so far, and I'm sure we'll have a good collaboration between the projects.

Cheers,
Max
