Best way to trigger dataset sampling

Flavio Pompermaier
Mon, Sep 26, 2016 at 6:13 PM
Hi to all,

I have a use case where I need to ask a Flink cluster for a sample of X records, using parametrizable sampling functions. Is there a best practice or any advice for doing that?

Should I create a remote ExecutionEnvironment, or should I use the Flink client (I don't know whether it uses REST services, RPC, or something else)?
Is there a Java snippet for that?

Best,
Flavio
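
For readers unsure what a "parametrizable sampling function" might look like, here is a minimal sketch in plain Java, independent of Flink. The class name ReservoirSample and its parameters (the sample size x and a seed) are assumptions made for illustration; nothing in the thread says which sampler is actually used. It draws at most x records uniformly from a single pass over the input (reservoir sampling).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Flink: uniform sample of at most x records. */
final class ReservoirSample {

    /** Returns up to x records chosen uniformly at random; the seed makes runs repeatable. */
    static <T> List<T> sample(Iterable<T> records, int x, long seed) {
        if (x < 0) {
            throw new IllegalArgumentException("x must be non-negative");
        }
        Random rnd = new Random(seed);
        List<T> reservoir = new ArrayList<>(x);
        long seen = 0;
        for (T record : records) {
            seen++;
            if (reservoir.size() < x) {
                // The first x records fill the reservoir directly.
                reservoir.add(record);
            } else {
                // Pick a slot in [0, seen); the record is kept with probability x / seen.
                long slot = (long) (rnd.nextDouble() * seen);
                if (slot < x) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Sample size and seed are the "parameters" of the sampling function.
        System.out.println(sample(data, 3, 42L));
    }
}
```

In a real job the same parameters would be passed to whatever sampling operator the job uses; this sketch only shows the shape of such a function.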

Re: Best way to trigger dataset sampling

Maximilian Michels
Tue, Sep 27, 2016 at 12:06 PM
Hi Flavio,

Do you want to sample from a running batch job? That would be like
Queryable State in streaming jobs but it is not supported in batch
mode.

Cheers,
Max


Re: Best way to trigger dataset sampling

Flavio Pompermaier
Tue, Sep 27, 2016 at 12:25 PM
Hi Max,
Actually, I have a jar containing sampling jobs, and I need to collect the results from a client.
I've tried ExecutionEnvironment.createRemoteEnvironment, but I fear it's not the right way to do it, because
I just need to tell the cluster the main class and the parameters to run the job (and where the jar file is on HDFS).

Best,
Flavio

Re: Best way to trigger dataset sampling

Maximilian Michels
Tue, Sep 27, 2016 at 2:35 PM
Hi Flavio,

This is not really possible at the moment, though there is a workaround. You can create a dummy jar file (it may be empty) and then use

./flink run -C hdfs:///path/to/cluster.jar -c org.package.SampleClass /path/to/dummy.jar

That way Flink will include your cluster jar and you can load all classes necessary.
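
If the command has to be issued from a client program rather than typed in a shell, one option is to assemble the same argument list in Java and hand it to ProcessBuilder. The sketch below only builds and prints the command; the class name FlinkRunCommand and the trailing program arguments (--size 100) are hypothetical, while the -C and -c options are the ones used in the command above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the "flink run" command shown above. */
final class FlinkRunCommand {

    /** Builds: flinkBin run -C classpathUrl -c mainClass dummyJar [jobArgs...] */
    static List<String> build(String flinkBin, String classpathUrl, String mainClass,
                              String dummyJar, String... jobArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add(flinkBin);
        cmd.add("run");
        cmd.add("-C");           // extra classpath entry (the jar on the cluster)
        cmd.add(classpathUrl);
        cmd.add("-c");           // entry-point class
        cmd.add(mainClass);
        cmd.add(dummyJar);       // the (possibly empty) dummy jar
        cmd.addAll(Arrays.asList(jobArgs)); // parameters passed on to the job
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("./flink", "hdfs:///path/to/cluster.jar",
                "org.package.SampleClass", "/path/to/dummy.jar", "--size", "100");
        System.out.println(String.join(" ", cmd));
        // To actually launch it (assumes the flink script is reachable from here):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting issues when the job parameters contain spaces.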

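Alternatively, with a RemoteEnvironment it looks like this:

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.RemoteEnvironment;
import org.apache.flink.configuration.Configuration;

public static void main(String[] args) throws Exception {

    // Keep the classpath URLs in a local variable so that the environment
    // and the class loader below can share them
    final URL[] classpaths = new URL[]{
        new URL("file:///path/to/sample.jar"),
        new URL("file:///Users/max/Dev/flink/build-target/lib/flink-dist_2.10-1.2-SNAPSHOT.jar")};

    final RemoteEnvironment env = new RemoteEnvironment(
        "remoteHost",
        6123,
        new Configuration(),
        new String[0],
        classpaths);

    // Load the sampling class locally from the same URLs
    URLClassLoader classLoader = new URLClassLoader(classpaths);

    Class<?> clazz = classLoader.loadClass("org.package.sample.SampleClass");

    // sampleMethod must be static, because it is invoked without an instance
    Method sampleMethod = clazz.getDeclaredMethod("sampleMethod", ExecutionEnvironment.class);

    // pass environment as an argument to your sample method
    // the method should return the results of the execution
    Object sampleResult = sampleMethod.invoke(null, env);
}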

Beware, this is extremely hacky. We should have a better way to invoke jar files remotely. Honestly, the best option is to keep a local copy of your sampling jars and work with them directly.

Cheers,
Max
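
The reflection step in the snippet above (load a class by name, look up a static method, invoke it with null as the receiver) can be tried without a cluster. This is a minimal sketch using only the JDK: the nested SampleJobs class is a hypothetical stand-in for a class that would normally come from the sampling jar, and it takes an Integer instead of an ExecutionEnvironment so that the example has no Flink dependency.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

final class ReflectiveSampleCall {

    /** Hypothetical stand-in for a class that would live in the sampling jar. */
    public static final class SampleJobs {
        /** Static, so it can be invoked without an instance. */
        public static List<String> sampleMethod(Integer size) {
            return Arrays.asList("a", "b", "c").subList(0, size);
        }
    }

    /** Loads className through the given loader and calls a static one-argument method. */
    static Object invokeStatic(ClassLoader loader, String className, String methodName,
                               Class<?> paramType, Object arg) throws Exception {
        Class<?> clazz = Class.forName(className, true, loader);
        Method method = clazz.getDeclaredMethod(methodName, paramType);
        // The receiver is null because the method is static.
        return method.invoke(null, arg);
    }

    public static void main(String[] args) throws Exception {
        Object result = invokeStatic(
                ReflectiveSampleCall.class.getClassLoader(),
                SampleJobs.class.getName(),
                "sampleMethod",
                Integer.class,
                2);
        System.out.println(result);
    }
}
```

With a real sampling jar, the loader would be a URLClassLoader over the jar URLs and the argument would be the environment, as in the snippet above.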

Re: Best way to trigger dataset sampling

Flavio Pompermaier
Tue, Sep 27, 2016 at 3:18 PM
Hi Max,
that's exactly what I was looking for. What do you mean by 'the best thing is if you keep a local copy of your sampling jars and work directly with them'?

Best,
Flavio

Re: Best way to trigger dataset sampling

Maximilian Michels
Wed, Sep 28, 2016 at 5:19 PM
I meant that you simply keep the sampling jar on the machine where you
want to sample. However, you mentioned that it is a requirement for it
to be on the cluster.

Cheers,
Max

Re: Best way to trigger dataset sampling

Flavio Pompermaier
I think I'll probably end up submitting the job through YARN, to have a more standard approach :)

Thanks,
Flavio
