
Introduction

Our current Bigtop deployment code is showing its age (it originated in pre-2.x Puppet). That code currently serves as the main engine for dynamically deploying Bigtop clusters on EC2 for testing purposes. However, given that the Bigtop distro is the foundation for the commercial distros, we would like our Puppet code to be the go-to place for all Puppet-driven Hadoop deployment needs.

What we strive to build is a set of reusable/encapsulated/modular building blocks that can be arranged into any legal “Hadoop topology”, as opposed to automating one particular deployment use case. Our goal is to offer capabilities, not to dictate policies.

Thus at the highest level, our Puppet code needs to be:

  1. self-contained, to the point where we might end up forking somebody else's module(s) and maintaining them ourselves (subject to licensing)
  2. useful across as many versions of Puppet as possible, from the code (manifest) compatibility point of view. Obviously, we shouldn't obsess too much over something like Puppet 0.24, but we should keep the goal in mind. Also, we are NOT solving the problem of mixing different versions of Puppet master and Puppet agents here.
  3. useful in a classical Puppet-master-driven setup where one has access to modules, hiera/extlookup, etc., all nicely set up and maintained under /etc/puppet
  4. useful in a masterless mode, so that things like the Puppet/Whirr integration (WHIRR-385) can be utilized. This is the use case where the Puppet classes are guaranteed to be delivered to each node out of band and --modulepath is given to puppet apply (see the sketch right after this list). Everything else (hiera/extlookup files, etc.) is likely to require additional out-of-band communication that we would like to minimize.
  5. useful in orchestration scenarios (Apache Ambari, Reactor8), although this could be viewed as a subset of the previous one. We need to be absolutely clear, though, that we are NOT proposing to use Puppet itself as an orchestration platform. What is being proposed is to treat Puppet as a tool for predictably stamping out the state of each node when, and only when, instructed by a higher-level orchestration engine. The same orchestration engine will also make sure to provide all the data required by the Puppet classes. We leave it up to the orchestration implementation whether to require a Puppet master or not. The important part is the temporal sequence in which either puppet agent or puppet apply gets invoked on every node.
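To make the masterless mode in item 4 concrete, here is a minimal sketch of what an out-of-band-driven run could look like; the module path and the class name bigtop::hdfs::datanode are purely hypothetical, only puppet apply and --modulepath come from the use case itself:

$ cat site.pp
# delivered to the node out of band, together with the modules
# bigtop::hdfs::datanode is a hypothetical class name used for illustration
node default {
  include bigtop::hdfs::datanode
}
$ puppet apply --modulepath=/path/to/delivered/modules site.pp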

Proof of concept implementation (ongoing)

The next generation Puppet implementation will be a pretty radical departure from what we currently have in Bigtop. It feels like a POC covering a subset of Bigtop components would be a useful first step. HBase deployment is one such subset that, on the one hand, is still small enough, but on the other hand requires Puppet deployment of two supporting components:

  1. Zookeeper
  2. HDFS

before HBase itself gets fully bootstrapped. For those unfamiliar with the HBase architecture, the following provides a good introduction: http://hbase.apache.org/book.html#distributed
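As a sketch of the within-node bootstrap ordering (the class names are hypothetical, and cross-node sequencing is out of scope here), the POC could simply chain the classes so that HBase converges last on a node:

include bigtop::zookeeper::server
include bigtop::hdfs::namenode
include bigtop::hbase::master

# Zookeeper and HDFS must be in place before HBase is bootstrapped
Class['bigtop::zookeeper::server'] -> Class['bigtop::hbase::master']
Class['bigtop::hdfs::namenode']    -> Class['bigtop::hbase::master']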

Also, there's now a GitHub repository for the code to evolve: https://github.com/rvs/bigtop-puppet

A detailed use case scenario

Testing release candidates of HBase

A user, Alice, wants to test a release candidate of HBase against the stable Zookeeper and HDFS. Alice uses a Jenkins job to build packages for the HBase release candidate and wants to deploy them on EC2. Alice decides that she wants a single topology to be tested and that the underlying OS should be Ubuntu Lucid. She uses Whirr to dynamically spin up EC2 machines, and she uses the Whirr-Puppet integration to deploy Zookeeper, HDFS and the latest build of the HBase packages and to configure them suitably. She invokes Whirr with the attached properties file and expects a fully functioning HBase cluster at the end of the Whirr run:

$ cat puppet-hbase.properties
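The attached file itself is not reproduced on this page. Purely as an illustration of its shape (the Puppet role syntax and class names below are assumptions, not the actual attachment), a Whirr properties file for this scenario could look roughly like:

whirr.cluster-name=hbase-rc-test
whirr.provider=aws-ec2
whirr.identity=<AWS access key>
whirr.credential=<AWS secret key>
# one master node plus three workers; the puppet:<class> role syntax is assumed
whirr.instance-templates=1 puppet:bigtop::zookeeper::server+puppet:bigtop::hdfs::namenode+puppet:bigtop::hbase::master,3 puppet:bigtop::hdfs::datanode+puppet:bigtop::hbase::regionserver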

An external wizard/orchestrator use case

Another related perspective: suppose you had a wizard-style deployment tool that asks a few key questions to determine the characteristics of your Hadoop deployment (i.e., which services you are running, the desired resiliency, whether to enable secure mode, etc.). You would want to make it as easy as possible for such a tool to generate a site manifest that would then drive the automated Puppet configuration.
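As a sketch of what such a generated site manifest could look like (the class names and node patterns are hypothetical, not an agreed-upon interface):

# site.pp emitted by the wizard from the user's answers
node 'nn.cluster.company.com' {
  include bigtop::hdfs::namenode
}

node /^dn\d+\.cluster\.company\.com$/ {
  include bigtop::hdfs::datanode
  include bigtop::hbase::regionserver
}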

Issues to be resolved

Collection of modules vs. collection of classes in a single Bigtop module

A typical Puppet module strives for the UNIX mantra of 'doing one thing and doing it well'. So it seems natural for the Bigtop Puppet code to be split into independent modules, each corresponding to a single Bigtop component. After all, that is how the current Bigtop Puppet code is organized – as a collection of modules. Yet I'd like to argue that, given the level of coupling between the different components of Bigtop, we might as well be honest and admit that there's no way you can use them in an independent fashion, and we may as well create a single module called bigtop with a collection of tightly coupled classes.

This also has the nice added benefit of simplifying the eventual publishing of our code on Puppet Forge – it is way easier to manage the versioning of a single module as opposed to multiple ones.
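For illustration only (the class names are hypothetical), a single bigtop module would simply namespace every component under one tree, following the standard Puppet autoloading layout:

bigtop/
  manifests/
    init.pp                   # class bigtop
    zookeeper/server.pp       # class bigtop::zookeeper::server
    hdfs/namenode.pp          # class bigtop::hdfs::namenode
    hdfs/datanode.pp          # class bigtop::hdfs::datanode
    hbase/master.pp           # class bigtop::hbase::master
    hbase/regionserver.pp     # class bigtop::hbase::regionserver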

Picking an overall 'theme' or 'paradigm' for the design of our modules

The Node->Role->Profile->Module->Resource approach articulated here http://www.craigdunn.org/2012/05/239/ looks pretty promising.
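Applied to Bigtop, that approach boils down to nodes declaring a single role, roles composing profiles, and profiles wiring the reusable component classes together with site data. A minimal sketch with hypothetical class names:

# node: only states which role it plays
node 'hbase-master-01.cluster.company.com' {
  include role::hbase_master
}

# role: the business-level description of the box
class role::hbase_master {
  include profile::zookeeper::server
  include profile::hdfs::namenode
  include profile::hbase::master
}

# profile: glues a reusable bigtop class to site-specific data
class profile::hbase::master {
  include bigtop::hbase::master
}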

Parameterized classes vs. external lookups

Which style is preferred:

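# all the data required for the class is
# passed in explicitly as class parameters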
class { "bigtop::hdfs::secondarynamenode": 
   hdfs_namenode_uri => "hdfs://nn.cluster.company.com:8020/"
} 

vs.

   # all the data required for the class gets
   # to us via external lookups
   include bigtop::hdfs::secondarynamenode 
jcbollinger says:

If you are going to rely on hiera or another external source for all class data, then you absolutely should use the 'include' form, not the parametrized form. The latter carries the constraint that it can be used only once for any given class, and that use must be the first one parsed. There are very good reasons, however, why sometimes you would like to declare a given class in more than one place. You can work around any need to do so with enough effort and code, but that generally makes your manifest set more brittle, and / or puts substantially greater demands on your ENC. I see from your subsequent comment (elided) that you recognize that, at least to some degree.

Source of external configuration: extlookup/hiera

Should we rely on extlookup or hiera? Should we somehow push that decision onto the consumer of our Puppet code so that they can use either one?

rvs says:

My biggest fear of embracing hiera 100% is compatibility concerns with older Puppets.
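For context, the two lookup styles would look roughly like this inside a class (the class and key names are hypothetical; extlookup() ships with older Puppet releases, while hiera() is built in only from Puppet 3 and needs the hiera add-on on older versions):

class bigtop::hdfs::secondarynamenode {
  # hiera-style lookup
  $hdfs_namenode_uri = hiera('bigtop::hdfs::namenode_uri')

  # extlookup-style lookup (second argument is the default value)
  # $hdfs_namenode_uri = extlookup('hdfs_namenode_uri', 'hdfs://localhost:8020/')

  # ... resources that consume $hdfs_namenode_uri go here
}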

What is the best tool for modeling the data that characterize the configuration to be deployed?

It seems like class parameters are the obvious choice here.

jcbollinger says:

You are supposing that these will be modeled as class parameters in the first place. Certainly they are data that characterize the configuration to be deployed, and it is possible to model them as class parameters, but that is not the only – and maybe not the best – alternative available to you. Class parametrization is a protocol for declaring and documenting that data on which a class relies, and it enables mechanisms for obtaining that data that non-parametrized classes cannot use, but the same configuration objectives can be accomplished without them.
