Apache Labs > Home > Clouds > CloudsApacheCloudComputingEdition
Added by Robert Burrell Donkin, last edited by Robert Burrell Donkin on May 19, 2009  (view change)

Apache Cloud Computing Edition

Annotation Web Version

of the presentation .
A call to arms for the open source cloud annotated for the web.

The Cloud

Notes

This is a talk on application architecture for the Cloud 1

So, this is about how to exploit Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) rather than Software as a Service (SaaS) 2 (though SaaS offerings may well exploit PaaS or IaaS). This is about how to build a software stack for the cloud.

Apache is in ideal position to influence this evolution since it already develops key elements for the LAMP and Java stacks.

Footnotes
Reference Notes
1 More in CloudsPictureTheJigsaw
2 More in CloudsEverythingAsAService

Why Move To The Cloud?

  • Good Reasons
    • Cost
      • Replace capital costs by pay-as-you-go
        • For a business plan, that might work
    • Outsource hardware problems
    • To handle petabytes of data
  • Bad Reasons
    • To avoid hiring an operations team
Notes

You will need an operations team more than ever. It's just they'll be the ones chillin' in the fashionable Cafe making small talk. They no longer need to be co-located since there's no "co" to be located with. The trade-off is that the cloud is 24-7.

Cloud computing is complex. Managing the cloud is tough. So challenging that you'll need a Ninja Ops Team .

Deployment and management of on-demand resources is an area where tooling is particularly lacking at the moment. Open source technologies and tools need to be created to fill this gap.

What Is A Cloud Application?

  • The program that is execute
  • The code/data needed to run it in a data centre
  • Anything needed to configure, manage and monitor the system
Notes

We've often tended to split the "application" from the installation, because that installation usually included buying hardware, connecting it together, etc. Not any more. Now the servers and storage are just procedural or declarative instructions in a different file in the repository. Which means you need to look at everything together.

See CloudsNinjaOpsTeam and CloudsComputingOnDemand

Not A Cloud Application

Notes

This is not a cloud application.

You can still run it on the cloud if you host it in a VM infrastructure that make the "cloud" look like physical machines. This is what enterprises do as it reduces hardware costs of maintaining machines for single apps, especially ones that don't get used very often. But you still have software maintenance and OS update costs, the latter is now a function of the #of OS images you have, not the #of physical machines.

See CloudsUtilityComputing#UtilityVsCluster

Not A Cloud Application

Notes

This is not a cloud application.

Not picking on any particular server here - just the small-cluster architecture at which Java EE excels.

  1. Lots of beans hiding the complexity of remoting 3 from the developer 4 .
  2. The database handles replication with data security delivered through hardware 5
  3. The container keeps the beans consistent and hides concurrency from the developer
  4. WS-* and IIOP deliver good integration with other applications which are somewhat controlled 6
  5. In house - your customers can't walk away. Externally, you may have scale problems at times of popularity.
  6. The Ops team probably spends their life fighting development
  7. Hardware is probably ordered before the software is delivered and before the likely demand is known 7
  8. Scaling by added hardware as demand grows is not part of the Ops culture 8

Footnotes
Reference Notes
3 Databases, Web services, CORBA, RMI, Legacy system and so on
4 As XML, SQL and so on are considered too complicated
5 Typically RAID-5
6 It is possible to get all the architects into a phone conference and for them to agree on which WS-* specs to use, what the single-signon protocol will be (often WinNT auth, maybe via LDAP), etc.
7 Utilitisation and performance are therefore often poor in practice
8 The relative cost benefits of code changes verses hardware changes are therefore not considered when performance issues are tackled

Things Must Change

  • Web UI for users, affiliates, marketing, operations
  • Agile machine management as part of the API
  • Scale up and down
  • Live upgrade of running system
  • Persistence with key-value stores
  • A Petabyte filesystem is part of the application
  • MapReduce jobs close the loop
  • Developers deploy to the cloud to test
Notes

A call to arms

See:

The Agile Will Survive

Enter The Cloud

Notes

This is a cloud application.

  • Scale: many users -or down to nearly none.
  • Collaboration: if there are live feeds of all friends/colleagues, it is hard to shard
  • Business model may include affiliate applications -web based
  • Management, developers, ops behind the scenes
  • Pay as you go infrastructure
  • See the possible inter-connections? That why it needs to be REST. Baking conversational state into protocols is simply not going to work they there are so many flexible interconnections.
  • This diagram is a simplification. The nodes are not fixed but dynamic and may hop in and out of existence as demands change.

EC2 As The Cloud Provider

  • S3 for persistence, public downloads
  • SimpleDB provides key-value storage, but query costs unpredictable.
  • Typica for EC2 services API
  • AWS IP rental or Dyndns for hostnames
  • No billing or test APIs
  • No image management services

But No Standard Apache Stack

Notes

Image management: provide a list of RPMs or other requirements of a machine, send a message with this list to a provider, get back the machine address to log in to. People like rightscale and similar are providing this for a fee, but AWS could do this just as easily. They are not doing it yet, but if your startup's business model depends on AWS not doing it, then your business model is the same as those people who provided add-ons for Windows 3.1. Provided. The only ones left now provide extra security for the OS, because MS now do nearly everything else themselves.

Typica is a good Java front end API to this service

The problem is that the tooling is patchy. You can't just change a few settings and use your standard stack. You can't trial on a locally then use the same scripts and tools to scale out. See CloudsBlending

Sun, IBM, HP as the cloud providers

  • S3 -like filestore
  • EC2 and Sun RESTy APIs
  • Unknown queue and keystore services
  • More secure networking?
  • Billing and monitoring?
  • Testing?
  • Image management services?

But No Standard Apache Stack

Notes

Who else can do this? Vendors with capital can afford to roll out infrastructure. Different business model from Amazon (who invest in datacentres for the xmas peak), they are all driven by a need to sell hardware into a world where hardware goes into datacentres.

It makes sense for the hardware manufactures to fill this niche. Build big data centres with fat pipes where power is cheap and fill them with hardware. No expensive sales teams just great engineering. Expect the majors to pioneer this new land, followed a little later by no-name, no frills second tier corporations based in the developing world.

See Also CloudsStandardsTangle

Private cloud

  • Eucalyptus for deploying Xen images
  • Various persistence options
  • Private filestore: HDFS, kfs, Lustre
  • Kickstart for image management?

But No Standard Apache Stack

Notes

Private clouds are an interesting idea. You can do it today with VMWare, but it uses a different machine API, and is fairly biased towards humans and GUIS.

Eucalyptus is the OSS tool for managing a few thousand servers. "Small" datacentres -but enough for many organisations. The API gives you dynamic machines, but you are left with all the other details.
The problem with this is that you have to be quite a big enterprise to benefit.

More likely are blended approaches where enterprises own some physical boxes, rent some private clouds and use the public ones to cope with variability in demand. See CloudsBlending

Ownership

  1. Who owns the API owns the App
  2. Who owns the Data owns You
Notes

Two observations based on the previous years of the PC business. Some people may think of Microsoft and Oracle, but in fact IBM probably invented both of these first, those two companies just executed it better.

It's about power: who owns you, and who do you own.

But we're coming into this with both eyes open...

See CloudsStandardsTangle

Apache Cloud Computing Edition

  • Diverse mix of high-level technologies
  • Very large filestore at the bottom :
  • Hadoop APIs, Java 7 NIO, Fuse, WebDAV
  • MapReduce phase for post-processing
  • We need stories for
    • persistence
    • configuration
    • resource management

A Standard Apache Stack

Notes

Here's a different idea. How about Apache becoming the Apache for the cloud, with our own stack, an evolution of what we have today in terms of Apache HTTPD, the Apache Java ecosystem, and what we are doing in other parts of the community?

Apache has a history of swimming with the sharks - we like it: it keeps us on our toes.

No one truth. Apache has a history of encouraging competition and diversity. An open stack - a menu from which individuals and vendors could choose a particular set of solutions from those on offer to best match their requirements.

See

Front End

  • The existing Web Front ends should work: servlets, JSP, wicket, PSP, grails
    • (maybe with memcached)
  • Glue: queues, scatter-gather, tuple-space, events
  • Everything needs to handle an agile world
  • Everything needs instrumenting for management
Notes

The good news: the front end still works, mostly. Where things get into trouble is if they cache changing hostnames, or contain other assumptions about where data lives on other machines, hard-coded JDBC paths, etc.

Better Glue is needed

Queues. AWS provides something built in; the competitors need them too. Again, standard APIs are nice here.
Scatter-Gather. Doug Cutting can explain what this is. It probably work bests on networks with multicast, which means not-on-EC2.
Tuple-spaces. People who remember JINI and Java Spaces may remember these, but they are in fact quite useful. Any machine can assert a fact into the T-Space, other machines can look for them, act on them. A nice way to loosely couple machines. We use this for some of our resource management, though again multicast and assumptions about linear clocks can create fun in a virtual world.

Persistence

Notes

This is a real troublespot. Because one of the goals of classic O/R mapping was 1:1 mapping of entries in a db to Java objects. You can't get that if you scale out the front to 200 machines, you have to deal with eventual consistence, and have more of a model of read-only views versus things you can write back to.

Options

  • JBDC-like API to Hadoop: Hive
  • Things built atop Hadoop DFS: HBase, Hypertable, Cassandra
  • Other cloud-scale apache code: CouchDB
  • layers to hide cloud-specific databases, such as SimpleDB and SimpleJPA

Open Questions:

  • Do we want keystore databases to be retrofitted to look like RDMS systems, or should we do something cleaner, with less locking and more eventual consistency?
  • What should the back end be?
  • What makes a good API for Java, other languages
  • Integration with Hadoop makes database work inside MR jobs easier, but we also need fast read-access (=fast DB or slow DB+memcached), and sometimes you really do need transactions.

See:

Everything needs a REST API

  • REST is the long-haul API -why have a separate internal one?
  • JAX-RS is very nice: CXF, Jersey, RESTEasy, Restlet implement it
  • Client API evolving
  • Http Components/HttpClient can be the foundation for the Apache client; needs AWS support.
Notes

What is nice about JAX-RS is that it is an API that is nice to use (unlike, say, JAX-WS), and there are multiple implementations. I can and should be the foundation for all RESTy service endpoints in Apache unless you are tightly coupled to existing applications

Restlet is worth a play with, even though Apache folk won't like the license. It has a very clean transport-neutral model, one in which the client API is a mirror of the server one, as opposed to the java.net and Servlet APIs, that are completely different.

See CloudsGlossary#REST

Events and Messages

Examples:

  • "disk 3436 is failing"
  • Bluetooth phone 04:5a:1f:c2:87:91 entered cell 56 in London NW2
  • Queued purchases with card numbers

Internal and external events: reliability, scalability, triggered actions

Notes

Eventing and messaging. Are they the same or different?

What to do with them: some you can queue, others should trigger immediate actions, others can be dealt with later, just add more data to the filestore for mining. That phone, for example: is it a triggered action, or is it something important.

This is an interesting problem, and there's a lot of work required here. Clouds need to be able to push out into the real world. If your Ninja Ops team really is sipping Expressos in Venice then they need to be able to tool the cloud so that it can tell them when stuff goes wrong.

This is going to mean more flexible approach to messaging, bridging between protocols to find the most suitable. Intelligent message processing is also going to be required. The number of messages is just too big.

Also, a statistical approach is going to be needed. For example, some messages are only important when there are too many of them. Quite possible that machine learning approaches are going to be needed.

Resource Management?

  • HA resource manager to monitor front end/back end load and request/release machines on demand
  • Kill unhealthy nodes (liveness, performance)
  • Programmable policies (money vs. load)
  • Choreograph live upgrade/migration
  • Resource Manager as a service
Notes

Resource Management is the problem of allocating enough machines for demand and your budget, but not too many. You need to create machines when load is high (within constraints), free them when low, but taking into account rules about minimum per-hour cost of machines; time to instantiate.

Also need to be able to shut down webapps in a way that they can handle. If they are crash-only this is easy, but if not, then the RM needs to shut them down gracefully.

Health also needs monitoring. In a VM-world, killing and reinstantiating is powerful -it avoids problems with overloaded racks, it avoids problems with one specific machine in the rack. But: IPAddresses can change, restart costs higher. Best to have the RM manage this gracefully with a restart followed by a full VM-termination.

The RM does not need to be something every single user of a cloud needs to host for themselves.

Just as approaching to clustering need revision for the cloud, so too will approaches to high availability .

See Also

Configuration & Management

  • Lots of questions in this area
  • LDAP (and APIs)?
  • key-value stores?
  • SmartFrog moving to Apache license
  • What is Spring planning?
  • Tooling for existing solutions?
Notes

You need to worry a lot more about configuration here, because everything is agile. No more hard coded hostnames in Java pages, JDBC URLs with server hostnames in them in your jJava source

Ninja ops require killer configuration apps. Too little work has been put into Agile configuration and control tooling. Too often, this is seem as the work of an expensive fixed integration team that come in and fix all your problems in place. Not agile enough for the cloud.

See CloudsConfigurationAndManagementMissingFromJigsaw

Development

  • How to build and test in this world?
  • How to step through a program running on a remote datacentre?
  • How to control testing costs?
Notes

Let's agree not to argue about Ant vs Maven here

Eclipse is becoming the standard API, which is good in one way - one thing to target UI plugins for - bad in others. Good for uniformity, bad if you find it painful to use.

Testing is very different here. You can create machines on demand, but then your tests run up bills. Development budgeting is going to need to change. Technologies which allowing blending between local and remote clouds will have advantages for development.

A good practise here is for the developers to start their cluster when they come in in the morning; for it to be shut down when they go idle. And to use a separate credit card from production, so if they go over budget, your service doesn't go off-line.

See CloudsDevelopmentMissingFromJigsaw
See Also CloudsBlending

Testing - Your first terabyte of data

MapReduce UnitTest
protected void map(Text key, Text test, Context context)
        throws IOException, InterruptedException {
  TestResult result = new TestResult();
  Class<?> testClass = loadClass(context, test);
  Test testSuite = JUnitMRUtils.extractTest(testClass);
  TestSuiteRun tsr = new TestSuiteRun();
  result.addListener(tsr);
  testSuite.run(result);
  for (SingleTestRun singleTestRun : tsr.getTests()) {
    context.write(new Text(singleTestRun.name),
            singleTestRun);
  }
}

9

Lots of opportunities here!

Notes

Testing is not merely "different", it is an opportunity to use data mining within your own process. Now you can keep all the old test run info, compare performance over time, use CM tools to explore more of the configuration space (which cluster/app options give best value for money), integrate this with the test runs.

Needed:

  • test runners to feed data into Hadoop filesystems
  • analysis algorithms
  • presentation of results.

Last.fm have been doing lots of this stuff; we've been playing with different Hadoop configs, with automated configuration generators on the plans.

See CloudsApacheHadoop

Footnotes
Reference Notes
9 Here is something else of mine, something that runs a list of tests. takes a text file listing all test suites/classes to execute, each is a separate job. The results are pushed back as a new entry for every test in the suite

Testing - Infrastructure can help

Pseudo-RNG driven cluster configuration

Notes

This is how HP tests some of its infrastructure-on-demand services: with a pseduo-RNG configuration generator generating more of the configuration space than humans could do themselves.

One of the challenges when integration testing Cloud applications is to simulate nebulus complexity. Testing against a single fixed configuration created by a human just isn't going to be realistic enough. Variable capacity, network connectivity and demands needs to be included. Tooling these simulations is going to be a challenge.

Cirrus Cloud Testbed?

  • HP, Intel, Yahoo!, universities
  • Heterogeneous, multiple datacentres
  • Offering datacentre time, not specific apps
  • Low-level API for physical machines
  • Cloud-API for virtual machines
  • Paying customers? No, not yet
  • Open source projects? We hope so
Notes

What's up with the HP-Intel-Yahoo! Cirrus testbed?

Cirrus is more than just Hadoop and friends in a datacentre -it is low-level physical machine access letting you install physical OS images for periods of time, and layers on top. Hadoop is one key application, but not the only one.

Hopefully, it will be possible to get some OSS time in here too - because it helps ensure our code runs across multiple machines. For open source, reliabilty of supply isn't necessary - in fact, unreliability makes for more interesting testing. Also, open source usage is a way of increasing visibility and (potentially) attracting paying customers. So maybe there's some sort of deal to be made about exploitation of spare capacity.

HP: 1000 cores, 256+ boxes, some disk heavy, some RAM-heavy

See CloudsApacheHadoop

What Next?

Apache has the core of a Cloud Computing stack

How do we take this and:

  • Integrate the various pieces?
  • Extend them where appropriate?
  • Provide an alternative to AppEngine and Azureus?

A Call To Arms

  • Stop focussing on EJB
  • Start collecting as much data as you can and feeding that MapReduce mining-phase.
  • Design for: distributed not-quite-Posix filesystems, message queues, name-value databases and resource centric data stores

Apache: let's build our own cloud platform

Notes

This leads into the Clouds project, of which this documentation is a part.

See Clouds

Let's Build The Cloud