Following are the results of performance runs done pre-4.x.

CONFIGURATIONS

1.      Management server

Processor

Dual-core Intel(R) Xeon(R) processor, 2.27 GHz, HT enabled, 4 processors

Operating System

CentOS release 5.5 (Final), x86_64

Configuration Parameters

The following configuration parameters were used on both management servers:

-        Java heap size = 5 GB

-        db.cloud.maxActive = 250

-        db.cloud.url.params=prepStmtCacheSize=517&cachePrepStmts=true&prepStmtCacheSqlLimit=4096&includeInnodbStatusInDeadlockExceptions=true&logSlowQueries=true

Java version

java version "1.6.0"

OpenJDK  Runtime Environment (build 1.6.0-b09)

OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)

2.      Database

 Processor

Quad-core AMD Opteron(tm) processor, 2.1 GHz, HT enabled, 8 processors

Operating System

CentOS release 6.2 (Final), x86_64

Configuration Parameters

The DB configuration for this run is detailed in the attached my.cnf.

MySQL version

MySQL-server-5.5.21-1.linux2.6.x86_64

TEST ENVIRONMENT SET UP

The test setup for this run consists of 1 zone with ~1800 simulated hosts across over a hundred pods. 4000 accounts were created, each having one network.

Following is the detailed configuration of the infrastructure:

1 Zone

112 Pods [Each Pod having 2 Clusters]

224 Clusters [Each cluster having 8 hosts and one primary storage]

1782 Hosts

4000 User accounts [Each account having one network]

12000 User instances

8000 Virtual Routers [Since we are using Redundant Virtual Router offering]

This run was carried out with induced delay using simulator for the following agent commands:

DhcpEntryCommand - 10 s

CreateCommand - 20 s

StartCommand - 20 s

ClusterDeltaSyncCommand - 3 s

PingCommand - 300 ms

PingTestCommand - 300 ms

CheckRouterCommand - 5 and 10 s

ManageSnapshotCommand

BackupSnapshotCommand

TEST ENVIRONMENT SET UP

The test setup for this run consists of 1 zone with ~1800 simulated hosts across over a hundred pods. 4000 accounts were created, each having one network.

Following is the detailed configuration of the infrastructure:

1 Zone

115 Pods [Each Pod having 2 Clusters]

230 Clusters [Each cluster having 8 hosts and one primary storage]

1840 Hosts

4000 User accounts [Each account having one network]

12000 User instances

8000 Virtual Routers [Since we are using Redundant Virtual Router offering]

USE CASES

  1. Deploy VM 
    CPU Utilization
    No. of DB Connections
    Time for async job to complete
    Time to return job id 
  2. Steady state Measures
    CPU Utilization
    No. of DB Connections
  3. Restart Management Server (agent load size 500, 1000, 1500)
    Time to Stop MS
    Time taken to Start MS and rebalance hosts
  4. Restart MS measures with Host in maintenance mode (agent load size 500, 1000, 1500)
    Time to Stop
    Time taken to Start MS and rebalance hosts 
  5. List* API Response Time
  6. Creation of Snapshots for all VMs

RESULTS

Use case 1: Deploy VM

CPU UTILIZATION

The following graph shows the CPU utilization of one of the management servers while deploying simulator VMs. The total time for all VMs to complete deployment was ~3 hrs.

No. OF DB CONNECTIONS

The following shows the number of DB connections to the MySQL DB during Deploy VM.

Observation:

There are spikes in the number of DB connections roughly every 8 minutes, reaching almost 250 connections. The frequency of the spikes increases with time.
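The spike cadence above was read off a monitoring graph; a minimal sketch of how such spike intervals could be extracted programmatically from sampled connection counts (the sample data below is hypothetical, not from this run):

```python
from datetime import datetime, timedelta

def spike_times(samples, threshold=200):
    """Return timestamps where the connection count crosses above threshold."""
    spikes = []
    prev = 0
    for ts, count in samples:
        if count >= threshold and prev < threshold:
            spikes.append(ts)
        prev = count
    return spikes

# Hypothetical samples: (timestamp, Threads_connected) pairs, e.g. collected
# once a minute via `SHOW STATUS LIKE 'Threads_connected'` on the MySQL server.
start = datetime(2012, 1, 1)
samples = [(start + timedelta(minutes=m), 250 if m % 8 == 0 else 40)
           for m in range(1, 25)]

spikes = spike_times(samples)
gaps = [(b - a).total_seconds() / 60 for a, b in zip(spikes, spikes[1:])]
print(gaps)  # intervals between spikes, in minutes
```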

ASYNC JOB RESPONSE TIME

The following shows the time taken for the Deploy VM async job to complete. The measures are derived from the DB for each job id.

Observation:

As the number of VMs increases, the time taken for the async job to complete also grows, the longest being 51 sec. As seen from the graph, the first few VMs took around 5-10 sec, while the last VMs deployed (> 11000) took almost 50 sec.
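Per-job completion times like these can be derived with a query along the following lines (a sketch only: it assumes CloudStack's `cloud.async_job` table with `created`, `last_updated`, and `job_cmd` columns, and the `LIKE` pattern is hypothetical — verify the schema and command names for your version):

```sql
-- Hypothetical sketch: completion time per Deploy VM async job, in seconds.
SELECT id,
       TIMESTAMPDIFF(SECOND, created, last_updated) AS seconds_to_complete
FROM   cloud.async_job
WHERE  job_cmd LIKE '%DeployVM%'
ORDER  BY id;
```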

TIME TAKEN BY ASYNC JOB TO RETURN JOB ID

This shows the time taken for the job id to be returned in response to the Deploy VM async call. The average across Deploy VM API calls is 0.7 sec and the median is 0.418 sec; that is, half of the API calls returned the job id in under 0.418 sec.
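The mean/median split quoted above can be computed from raw per-call timings with the standard library (the timing list below is hypothetical, chosen to show the effect, not the measured data):

```python
import statistics

# Hypothetical per-call times (seconds) to get a job id back from deploy VM calls.
times = [0.3, 0.35, 0.4, 0.418, 0.45, 0.5, 3.0]

mean = statistics.mean(times)
median = statistics.median(times)
print(f"mean={mean:.3f}s median={median:.3f}s")
```

A few slow outliers pull the mean well above the median, which is why the median is the better "typical call" figure here.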

 
Graphs for Deploy VM:

Use case 2: Steady State Measures

 

CPU UTILIZATION

 

The highlighted area shows the readings taken during Deploy VM. The graphs cover a total time of around 9 hours (including Deploy VM, which took ~3 hours).

MS RESTARTS

 

| direct.agent.load.size | Time for all hosts to connect to MS2 after stopping MS1 | Time for all hosts to get disconnected after stopping MS2 | Time for all hosts to connect to MS1 after it is started | Time for rebalancing the hosts between the two MSs |
|---|---|---|---|---|
| 500 | 460 s | 135 s | 120 s | 265 s |
| 1000 | 140 s | 50 s | 100 s | 202 s |

 

MS RESTARTS WITH HOSTS IN MAINTENANCE MODE

 

| direct.agent.load.size | Time for all hosts to connect to MS2 after stopping MS1 | Time for all hosts to get disconnected after stopping MS2 | Time for all hosts to connect to MS1 after it is started | Time for rebalancing the hosts between the two MSs |
|---|---|---|---|---|
| 500 | 135 s | 52 s | 110 s | 213 s |
| 1000 | 92 s | 82 s | 120 s | 248 s |

 

MEASURING THE DELAY BETWEEN SENDING AND EXECUTING AGENT COMMANDS

 

The delay between "Sending..." and "Executing..." was measured for various agent commands (which also had simulated delays induced). The following commands were measured:

 

DhcpEntryCommand

CreateCommand

StartCommand

CheckRouterCommand

ManageSnapshotCommand 

BackupSnapshotCommand

 

The delay was mostly well within 100 ms for all commands, at times going up to 400 ms.
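One way to measure such Sending-to-Executing gaps is to pair up the two log lines per command sequence number. A minimal sketch, assuming log lines shaped like the hypothetical excerpt below (real management-server log formats differ):

```python
import re
from datetime import datetime

# Hypothetical log excerpt; real agent log lines differ in format.
LOG = """\
2012-01-01 10:00:00,100 Seq 7: Sending DhcpEntryCommand
2012-01-01 10:00:00,180 Seq 7: Executing DhcpEntryCommand
2012-01-01 10:00:01,000 Seq 8: Sending StartCommand
2012-01-01 10:00:01,420 Seq 8: Executing StartCommand
"""

LINE = re.compile(r"(\S+ \S+) Seq (\d+): (Sending|Executing) (\w+)")

def delays(log):
    """Map (seq, command) -> Executing minus Sending, in milliseconds."""
    sent, out = {}, {}
    for m in LINE.finditer(log):
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        key = (m.group(2), m.group(4))
        if m.group(3) == "Sending":
            sent[key] = ts
        elif key in sent:
            out[key] = (ts - sent[key]).total_seconds() * 1000
    return out

result = delays(LOG)
print(result)
```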

The VMs were deployed in 3 iterations of 4K VMs each. Recurring snapshots were also set up for 1000 volumes.

Use case 5: List* API Response Time

Following are the results of a first attempt at measuring the List* API response time for a few APIs:

Observations:

  1. For all APIs (except listVirtualMachines, which was fixed recently), beyond a pagesize of 5000 it takes too long (> 10 min at times) to return the results. In many cases, I also get an error which says: "HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers"
  2. In some cases the JSON response takes longer than the XML response.
  3. For some cases like listStoragePools, although the total count = 224, the time taken for pagesize=5000 is greater than for pagesize=1000. Not sure why this should happen.
  4. Randomly observed that calls on port 8096 take much longer to return. Is this expected? Shouldn't 8080 take longer due to the authentication?
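The response times in the table below come from simple wall-clock measurement of each API call. A generic sketch of that measurement (the commented-out URL is illustrative only, using the conventional unauthenticated integration port 8096; adjust host, port, and parameters for your setup):

```python
import time
import urllib.request

def timed_get(url):
    """Fetch url and return (seconds_elapsed, bytes_received)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    return time.perf_counter() - start, len(body)

# Hypothetical usage against a reachable management server:
# elapsed, size = timed_get(
#     "http://mgmt-server:8096/client/api?command=listHosts&pagesize=100")
# print(f"{elapsed:.1f} s for {size} bytes")
```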

The following table shows an initial measurement done for a few APIs. Cases that failed with the error message "HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers" are marked "F".

 

| API | pagesize | Response time XML (sec) | Response time JSON (sec) | Comments |
|---|---|---|---|---|
| listHosts (count: 1794) | 100 | 6 | 12 | |
| | 1000 | 117 | 114 | |
| | 5000 | 160 | 196 | |
| | 10000 | 175 | 176 | |
| | no pagesize | 209 | 168 | |
| listVolumes (count: 12K) | 100 | 6 | 5 | |
| | 1000 | 104 | 85 | |
| | 5000 | F | 833 | Failed - XML |
| | 10000 | F | F | Failed |
| | no pagesize | Didn't try | Didn't try | |
| listVirtualMachines (count: 12K) | 100 | 2 | 2 | |
| | 1000 | 35 | 14 | |
| | 5000 | 193 | 145 | |
| | 10000 | 330 | 269 | |
| | no pagesize | | | |
| listRouters (count: 8K) | 100 | 32 | 39 | |
| | 1000 | 374 | F | Failed |
| | 5000 | F | | |
| | 10000 | Didn't try | Didn't try | |
| | no pagesize | | | |
| listAccounts (count: 4K) | 100 | 62 | 59 | |
| | 1000 | | F | Failed |
| | 5000 | F | | Failed |
| | 10000 | NA | NA | since count = 4K |
| | no pagesize | Didn't try | Didn't try | |
| listUsers (count: 4K) | 100 | | 13 | |
| | 1000 | 49 | 37 | |
| | 5000 | 136 | 74 | |
| | 10000 | NA | NA | since count = 4K |
| | no pagesize | NA | NA | since count = 4K |
| listAsyncJobs | 100 | 6 | 11 | |
| | 1000 | 68 | 96 | |
| | 5000 | F | | Failed |
| | 10000 | | | |
| | no pagesize | | | |
| listStoragePools (count: 224) | 100 | 2 | 5 | |
| | 1000 | 15 | 7 | |
| | 5000 | 25 | 32 | |
| | 10000 | NA | NA | since count = 224 |
| | no pagesize | | | |

Use case 6: Snapshots 

 

This use case covers snapshots: the measures taken while snapshots were being triggered by the MS, and the CPU load during that time.

snapshot.poll.interval was set to default value of 300 sec.

Following are the results:

  1. Hourly snapshots for 1000 volumes: snapshots were triggered for all 1000 volumes, and the job ids were all generated within the 300 sec interval, before the next poll began.
  2. Hourly snapshots for 10000 volumes: snapshots were triggered for all 10000 volumes, but the time taken went beyond 300 sec, so polling continued only after all snapshots were triggered.
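The second result above follows from simple arithmetic: if triggering one snapshot takes t seconds, n volumes fit inside the 300 sec poll interval only when n * t <= 300. A sketch of that check (the per-trigger time below is hypothetical, not measured in this run):

```python
POLL_INTERVAL_S = 300   # snapshot.poll.interval default, per the text above
PER_TRIGGER_S = 0.05    # hypothetical time to trigger one snapshot

def fits_in_interval(num_volumes, per_trigger=PER_TRIGGER_S,
                     interval=POLL_INTERVAL_S):
    """True if all snapshot triggers complete before the next poll."""
    return num_volumes * per_trigger <= interval

print(fits_in_interval(1000))   # 50 s of triggering fits in the interval
print(fits_in_interval(10000))  # 500 s overruns the 300 s interval
```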

 

The following graph shows the CPU utilization while snapshots were being triggered (for the 10000-volume case).
