Standalone Cluster vs YARN

Standalone Cluster vs YARN

tambunanw
Hi All, 

I would like to know if there are any feature differences between using a standalone cluster and YARN.

Until now we have been using a standalone cluster for our jobs.
Is there any added value in using YARN?

We don't have any Hadoop infrastructure in place right now, but we can provide it if there is value in that.


Cheers

Re: Standalone Cluster vs YARN

Ufuk Celebi
> On 25 Nov 2015, at 02:35, Welly Tambunan <[hidden email]> wrote:
>
> Hi All,
>
> I would like to know if there are any feature differences between using a standalone cluster and YARN.
>
> Until now we have been using a standalone cluster for our jobs.
> Is there any added value in using YARN?
>
> We don't have any Hadoop infrastructure in place right now, but we can provide it if there is value in that.

There are no features that work only on YARN or only in standalone clusters. YARN mode essentially starts a standalone cluster inside YARN containers.

In failure cases I find YARN more convenient because it takes care of restarting failed TaskManager processes/containers for you.

– Ufuk

Re: Standalone Cluster vs YARN

Fabian Hueske-2
A strong argument for YARN mode can be the isolation of multiple users and jobs. You can easily start a new Flink cluster for each job or user. However, this comes at the price of resource (memory) fragmentation: YARN mode does not use memory as effectively as standalone cluster mode.
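
As a rough back-of-the-envelope illustration of one source of that overhead: every Flink cluster runs its own JobManager, so splitting the same memory across many per-job clusters leaves less for TaskManagers. The function name and all numbers below are made up for illustration; they are not from this thread, and real fragmentation also depends on container sizes and idle reserved memory.

```python
# Hypothetical numbers throughout; none of them come from the thread.
def usable_worker_memory_mb(total_mb, num_clusters, jobmanager_mb):
    """Memory left for TaskManagers once each Flink cluster reserves its own JobManager."""
    return total_mb - num_clusters * jobmanager_mb


TOTAL_MB = 64 * 1024    # assumed memory available to Flink overall
JOBMANAGER_MB = 1024    # assumed size of one JobManager process/container

# One shared standalone cluster: a single JobManager.
shared = usable_worker_memory_mb(TOTAL_MB, num_clusters=1, jobmanager_mb=JOBMANAGER_MB)

# Eight per-job clusters on YARN: eight JobManagers.
per_job = usable_worker_memory_mb(TOTAL_MB, num_clusters=8, jobmanager_mb=JOBMANAGER_MB)

print(shared, per_job)
```

Under these assumed figures the per-job setup gives up seven extra JobManagers' worth of memory in exchange for isolation.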

Re: Standalone Cluster vs YARN

tambunanw
In reply to this post by Ufuk Celebi
Hi Ufuk

>In failure cases I find YARN more convenient, because it takes care of restarting failed task manager processes/containers for you.

So does this mean that we don't need ZooKeeper?


Cheers

Re: Standalone Cluster vs YARN

tambunanw
In reply to this post by Fabian Hueske-2
Hi Fabian, 

Interesting!

However, YARN is still tightly coupled to HDFS. Doesn't it seem wasteful to use only YARN without the rest of Hadoop?

Currently we are using Cassandra and CFS (the Cassandra File System).


Cheers

Re: Standalone Cluster vs YARN

Fabian Hueske-2
YARN is not a replacement for ZooKeeper. ZooKeeper is mandatory for running Flink in high-availability mode; it takes care of leader (JobManager) election and metadata persistence.

With YARN, Flink can automatically start new TaskManagers (and JobManagers) to compensate for failures. In standalone cluster mode, you need standby TMs and JMs, and you have to make sure manually that they are "filled up" again after a failure.

Re: Standalone Cluster vs YARN

Andreas Fritzler
In reply to this post by tambunanw
Hi Welly,

You will need ZooKeeper if you want to set up the standalone cluster in HA mode.

In the YARN case, you probably already have ZooKeeper in place if you are running YARN in HA mode.

Regards,
Andreas

Re: Standalone Cluster vs YARN

tambunanw
In reply to this post by Fabian Hueske-2
Hi Fabian, 

This makes sense now.

I would like to avoid ZooKeeper if possible. Is there any way to achieve HA without it?

I see that DataStax Enterprise achieves this availability for the Spark Master without using ZooKeeper.

Is this also possible in Flink?


Cheers

Re: Standalone Cluster vs YARN

tambunanw
In reply to this post by Andreas Fritzler
Hi Andreas, 

Yes, it seems I can't avoid ZooKeeper right now. It would be really nice if we could achieve HA via a gossip protocol, as Cassandra and Spark on DSE do.

Is this possible?


Cheers

Re: Standalone Cluster vs YARN

Maximilian Michels
In reply to this post by tambunanw
Hi Welly,

> However, YARN is still tightly coupled to HDFS. Doesn't it seem wasteful to use only YARN without the rest of Hadoop?

I wouldn't say tightly coupled. You can use YARN without HDFS. To work
with YARN properly, you would have to set up another distributed file
system such as XtreemFS, or use the one provided by AWS or Google
Cloud Platform. You can tell Hadoop which file system to use by
modifying "fs.default.name" in the Hadoop config.
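
As a minimal sketch of what that setting looks like, the snippet below renders a core-site.xml body with the default file system property. The helper name `core_site_xml` and the URI `s3a://example-bucket` are placeholders chosen for illustration; the right scheme depends on the file system and Hadoop version in use. Newer Hadoop releases call this key `fs.defaultFS` and treat `fs.default.name` as a deprecated alias, so the key is a parameter here.

```python
import xml.etree.ElementTree as ET


def core_site_xml(default_fs_uri, key="fs.default.name"):
    """Return the text of a minimal core-site.xml that sets the default file system."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = key
    ET.SubElement(prop, "value").text = default_fs_uri
    return ET.tostring(configuration, encoding="unicode")


# "s3a://example-bucket" is a placeholder URI, not a value from the thread.
xml_text = core_site_xml("s3a://example-bucket")
print(xml_text)
```

The generated text would go into core-site.xml in the Hadoop configuration directory that YARN reads.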

Cheers,
Max

Re: Standalone Cluster vs YARN

Till Rohrmann
Hi Welly,

At the moment, Flink supports HA only via ZooKeeper. However, nothing prevents the use of another system. The only requirement is that the system lets multiple participants reach consensus and lets you retrieve the agreed-upon decision. If that is possible, it can be integrated into Flink to serve as an alternative HA backend.
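
To make the two requirements concrete, here is a small sketch of what such a backend would have to offer: a way to propose candidates and a way to read back the agreed leader. The interface and class names are hypothetical and are not Flink's API; the in-memory implementation is a single-process stand-in that only shows the contract and provides no fault tolerance.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LeaderElectionBackend(ABC):
    """Hypothetical interface; not part of Flink's API."""

    @abstractmethod
    def propose(self, candidate_id: str) -> None:
        """Offer a candidate; all participants must end up agreeing on one leader."""

    @abstractmethod
    def current_leader(self) -> Optional[str]:
        """Retrieve the agreed-upon leader, or None if nothing has been decided."""


class InMemoryBackend(LeaderElectionBackend):
    """Single-process stand-in: the first proposal wins. Not fault tolerant."""

    def __init__(self) -> None:
        self._leader: Optional[str] = None

    def propose(self, candidate_id: str) -> None:
        # Only the first proposal is accepted; later ones are ignored.
        if self._leader is None:
            self._leader = candidate_id

    def current_leader(self) -> Optional[str]:
        return self._leader


backend = InMemoryBackend()
backend.propose("jobmanager-1")
backend.propose("jobmanager-2")  # ignored: a leader has already been decided
print(backend.current_leader())
```

A real backend would replace the in-memory field with a distributed consensus service, which is the role ZooKeeper plays in the setup described above.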

Cheers,
Till

Re: Standalone Cluster vs YARN

Andreas Fritzler
In reply to this post by Maximilian Michels
Hi Welly,

If you want to use Cassandra, you might want to look into running a Mesos cluster with frameworks for Cassandra and Spark.

Regards,
Andreas

