Random Shuffling

Maximilian Alber
Hi Flinksters,

I would like to shuffle the elements of my data set and then split it in two according to some ratio. Each element in the data set has a unique ID. Is there a nice way to do this with the Flink API?
(It would be nice to have guaranteed random shuffling.)
Thanks!

Cheers,
Max

Re: Random Shuffling

Matthias J. Sax
I think you need to implement your own Partitioner and pass it via
DataSet.partitionCustom(partitioner, field)

(Just specify any field you like; since you don't want to group by key, it
doesn't matter which.)

When implementing the partitioner, you can ignore the key parameter and
compute the output channel randomly.

This is something of a workaround, but it should work.


-Matthias

In reply to Maximilian Alber's message of 06/15/2015 01:49 PM.

Re: Random Shuffling

Till Rohrmann
In reply to this post by Maximilian Alber

Hi Max,

you can always shuffle your elements using the rebalance method. What Flink does here is distribute the elements of each partition among all available TaskManagers. This happens in a round-robin fashion and is thus not completely random.

Another option is the partitionCustom method, which lets you specify for each element the partition it is sent to. You have to provide a Partitioner for this.

For the splitting there is at the moment no syntactic sugar. What you can do, though, is assign each item a split ID and then use a filter operation to extract the individual splits. Depending on your split ID distribution, the splits will have different sizes.

Cheers,
Till
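
A minimal sketch of the split-ID-plus-filter idea, written with plain Java collections rather than Flink operators so that it runs on its own. The class SplitByIdSketch, its method names, and the 70/30 ratio are assumptions made for illustration only. Each element draws a random split ID (0 with probability ratio, otherwise 1), and each split is then obtained by filtering on that ID. The resulting sizes match the ratio only approximately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of the Flink API.
final class SplitByIdSketch {

    /** An element together with the split ID it was assigned. */
    static final class Tagged<T> {
        final T value;
        final int splitId;

        Tagged(T value, int splitId) {
            this.value = value;
            this.splitId = splitId;
        }
    }

    /** Step 1: assign split ID 0 with probability ratio, otherwise split ID 1. */
    static <T> List<Tagged<T>> assignSplitIds(List<T> data, double ratio, Random rnd) {
        List<Tagged<T>> tagged = new ArrayList<>(data.size());
        for (T element : data) {
            tagged.add(new Tagged<>(element, rnd.nextDouble() < ratio ? 0 : 1));
        }
        return tagged;
    }

    /** Step 2: keep only the elements that carry the requested split ID. */
    static <T> List<T> filterSplit(List<Tagged<T>> tagged, int splitId) {
        return tagged.stream()
                .filter(t -> t.splitId == splitId)
                .map(t -> t.value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        List<Tagged<Integer>> tagged = assignSplitIds(data, 0.7, new Random());
        // The two sizes are close to 700 and 300 but vary from run to run.
        System.out.println(filterSplit(tagged, 0).size() + " / " + filterSplit(tagged, 1).size());
    }
}
```

In a Flink job the two steps would presumably become a map that attaches the split ID and one filter per split; the sketch only shows the logic.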


Re: Random Shuffling

Maximilian Alber
Thanks!

OK, so for a random shuffle I need partitionCustom. But in that case the data might end up unbalanced?

As for the splitting: is there no way to get exact sizes?

Cheers,
Max

In reply to Till Rohrmann's message of Mon, Jun 15, 2015 at 2:26 PM.

Re: Random Shuffling

Matthias J. Sax
Hi,

With partitionCustom, the data distribution depends only on the
probability distribution you draw from. If it is uniform, you should be
fine; i.e., choosing the channel like

> // requires: import java.util.Random;
> private final Random random = new Random(System.currentTimeMillis());
> // the key is ignored; every element gets a uniformly random channel
> public int partition(K key, int numPartitions) {
>   return random.nextInt(numPartitions);
> }

should do the trick.

-Matthias
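
To make the balance question concrete, here is a small stand-alone sketch that applies the same random channel choice to many elements and counts how many land on each channel. The class RandomChannelSketch and the method channelSizes are hypothetical names, and nothing here depends on Flink. With a uniform draw the channels come out roughly, but not exactly, equal in size.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-alone illustration; it mimics the partition(key, numPartitions)
// logic from the message above without depending on Flink.
final class RandomChannelSketch {

    private final Random random;

    RandomChannelSketch(long seed) {
        this.random = new Random(seed);
    }

    /** Ignores the key and picks one of numPartitions channels uniformly at random. */
    int partition(Object key, int numPartitions) {
        return random.nextInt(numPartitions);
    }

    /** Counts how many of numElements elements land on each channel. */
    int[] channelSizes(int numElements, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (int id = 0; id < numElements; id++) {
            sizes[partition(id, numPartitions)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] sizes = new RandomChannelSketch(System.currentTimeMillis()).channelSizes(100_000, 4);
        // Each count is near 25,000, but the exact values differ between runs.
        System.out.println(Arrays.toString(sizes));
    }
}
```

So a uniform random partitioner keeps the load even on average, but it cannot guarantee exact partition sizes.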

In reply to Maximilian Alber's message of 06/15/2015 05:41 PM.

Re: Random Shuffling

Maximilian Alber
Thank you!

Still, I cannot guarantee the size of each partition, or can I?
I am looking for something like randomSplit in Spark.

Cheers,
Max

In reply to Matthias J. Sax's message of Mon, Jun 15, 2015 at 5:46 PM.

Re: Random Shuffling

Stephan Ewen
If you call rebalance(), it will redistribute the elements in round-robin fashion, which should give you very even partition sizes.
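
A minimal sketch of why round-robin distribution yields even sizes, assuming a single sender that hands element i to partition i % numPartitions. The class RoundRobinSketch is a hypothetical illustration, not Flink code. With several parallel senders each doing its own round-robin, the sizes would presumably still be very even, though not necessarily within one element of each other.

```java
import java.util.Arrays;

// Hypothetical stand-alone illustration of round-robin distribution; not Flink code.
final class RoundRobinSketch {

    /** Sends element i to partition i % numPartitions and returns the partition sizes. */
    static int[] partitionSizes(int numElements, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (int i = 0; i < numElements; i++) {
            sizes[i % numPartitions]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 10 elements over 4 partitions: the sizes differ by at most one.
        System.out.println(Arrays.toString(partitionSizes(10, 4))); // prints [3, 3, 2, 2]
    }
}
```

Note that the assignment is deterministic given the input order, which is why round-robin is even but not random.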


In reply to Maximilian Alber's message of Tue, Jun 23, 2015 at 11:51 AM.

Re: Random Shuffling

Maximilian Alber
That's not the point. In machine learning, one often divides a data set X into, for example, three sets: one for training, one for validation, and one for final testing. The sets are usually created randomly according to some ratio. It is therefore important both to keep the ratio and to do the whole split randomly.

Cheers,
Max

In reply to Stephan Ewen's message of Wed, Jun 24, 2015 at 9:51 AM.

Re: Random Shuffling

Sebastian Schelter-2
A very simple way to achieve this is to generate a random variate on the
driver that describes a mapping of data points to samples. Then you
simply join the data set with this mapping to generate the samples.

This approach requires you to know the size of the data set in advance,
but it has the advantage that you can guarantee the sizes of the samples
and can easily support more involved techniques such as sampling with
replacement.

--sebastian
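
One way this could look, as a stand-alone sketch under stated assumptions: the element IDs are known up front, the mapping is built from a random permutation of those IDs, and the "join" is an in-memory map lookup. The class ExactSplitSketch, its method names, and the 60/20/20 ratio are hypothetical. Because split k receives exactly sizes[k] IDs, the split sizes are guaranteed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical stand-alone illustration; the "join" is an in-memory map lookup here.
final class ExactSplitSketch {

    /**
     * Builds a random mapping id -> split index in which split k receives
     * exactly sizes[k] ids. The sizes must sum to ids.size().
     */
    static Map<Long, Integer> randomMapping(List<Long> ids, int[] sizes, Random rnd) {
        List<Long> shuffled = new ArrayList<>(ids);
        Collections.shuffle(shuffled, rnd); // uniform random permutation of the ids
        Map<Long, Integer> mapping = new HashMap<>();
        int next = 0;
        for (int split = 0; split < sizes.length; split++) {
            for (int i = 0; i < sizes[split]; i++) {
                mapping.put(shuffled.get(next++), split);
            }
        }
        return mapping;
    }

    /** "Joins" the ids with the mapping and collects the ids assigned to one split. */
    static List<Long> selectSplit(List<Long> ids, Map<Long, Integer> mapping, int split) {
        List<Long> result = new ArrayList<>();
        for (Long id : ids) {
            Integer assigned = mapping.get(id);
            if (assigned != null && assigned == split) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 10; id++) {
            ids.add(id);
        }
        // 60% training, 20% validation, 20% test: sizes 6, 2, 2.
        Map<Long, Integer> mapping = randomMapping(ids, new int[] {6, 2, 2}, new Random());
        System.out.println(selectSplit(ids, mapping, 0).size()); // prints 6
    }
}
```

In a Flink job the map lookup would be replaced by a join of the data set with the mapping on the element ID; only the mapping itself needs to be generated in one place.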


In reply to Maximilian Alber's message of 24.06.2015 10:38.

Re: Random Shuffling

Maximilian Alber
Thanks, Sebastian!
What do you mean by "driver"? Do you mean before submitting to the cluster?
Knowing the data set size in advance is OK.

In reply to Sebastian's message of Wed, Jun 24, 2015 at 11:08 AM.