The following are the results of performance runs carried out before the 4.x release.
Processor
Dual-core Intel(R) Xeon(R) CPU, 2.27 GHz, HT enabled, 4 processors
Operating System
CentOS release 5.5 (Final), x86_64
Configuration Parameters
The following configuration parameters were used on both management servers:
- Java heap size = 5 GB
- db.cloud.maxActive = 250
- db.cloud.url.params=prepStmtCacheSize=517&cachePrepStmts=true&prepStmtCacheSqlLimit=4096&includeInnodbStatusInDeadlockExceptions=true&logSlowQueries=true
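The db.cloud.url.params value is a single `&`-separated string, which is hard to scan. As a minimal sketch, the snippet below splits the string quoted above into one setting per line. The helper name `parse_url_params` is hypothetical and exists only for this illustration.

```python
from urllib.parse import parse_qsl

# Parameter string copied verbatim from the configuration listed above.
URL_PARAMS = (
    "prepStmtCacheSize=517&cachePrepStmts=true&prepStmtCacheSqlLimit=4096"
    "&includeInnodbStatusInDeadlockExceptions=true&logSlowQueries=true"
)


def parse_url_params(params: str) -> dict:
    """Split an '&'-separated key=value string into a dictionary (hypothetical helper)."""
    return dict(parse_qsl(params, keep_blank_values=True))


jdbc_params = parse_url_params(URL_PARAMS)
for key, value in jdbc_params.items():
    print(f"{key} = {value}")
```

All values come back as strings; nothing here validates them against the JDBC driver.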
Java version
java version "1.6.0"
OpenJDK Runtime Environment (build 1.6.0-b09)
OpenJDK 64-Bit Server VM (build 1.6.0-b09, mixed mode)
Processor
Quad-Core AMD Opteron(tm) Processor, 2.1 GHz, HT enabled, 8 processors
Operating System
CentOS release 6.2 (Final), x86_64
Configuration Parameters
The DB configuration for this run is detailed in the attached my.cnf.
MySQL version
MySQL-server-5.5.21-1.linux2.6.x86_64
The test setup for this run consists of 1 zone with about 1800 simulated hosts spread across more than a hundred pods. 4000 accounts were created, each with one network.
The detailed configuration of the infrastructure is as follows:
1 Zone
112 Pods [Each Pod having 2 Clusters]
224 Clusters [Each cluster having 8 hosts and one primary storage]
1782 Hosts
4000 User accounts [Each account having one network]
12000 User instances
8000 Virtual Routers [Since we are using Redundant Virtual Router offering]
This run was carried out with delays induced through the simulator for the following agent commands:
DhcpEntryCommand - 10s
CreateCommand - 20s
StartCommand - 20s
ClusterDeltaSyncCommand - 3s
PingCommand - 300 ms
PingTestCommand - 300 ms
CheckRouterCommand - 5 and 10s
ManageSnapshotCommand
BackupSnapshotCommand
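The induced delays listed above can be pictured as a lookup table consulted by a simulated agent before it answers. The following is only a sketch under stated assumptions, not the simulator's actual code: the names `INDUCED_DELAYS_S`, `induced_delay` and `handle` are hypothetical, commands listed without a value are left empty, and because the source gives two values for CheckRouterCommand without saying when each applies, the caller chooses between them.

```python
import time
from typing import Callable, Optional

# Delays in seconds, as listed above. An empty tuple means the source
# gives no value for that command.
INDUCED_DELAYS_S = {
    "DhcpEntryCommand": (10.0,),
    "CreateCommand": (20.0,),
    "StartCommand": (20.0,),
    "ClusterDeltaSyncCommand": (3.0,),
    "PingCommand": (0.3,),
    "PingTestCommand": (0.3,),
    "CheckRouterCommand": (5.0, 10.0),  # two values listed; usage not specified
    "ManageSnapshotCommand": (),
    "BackupSnapshotCommand": (),
}


def induced_delay(command: str, pick: Callable = max) -> Optional[float]:
    """Return the delay for a command, or None if no value is listed."""
    values = INDUCED_DELAYS_S.get(command, ())
    return pick(values) if values else None


def handle(command: str, sleep: Callable[[float], None] = time.sleep) -> str:
    """Hypothetical handler: wait for the induced delay, then answer."""
    delay = induced_delay(command)
    if delay is not None:
        sleep(delay)
    return f"{command} done"
```

Passing a different `sleep` callable (for example a list's `append`) lets the table be exercised without actually waiting.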
The test setup for this run consists of 1 zone with about 1800 simulated hosts spread across more than a hundred pods. 4000 accounts were created, each with one network.
The detailed configuration of the infrastructure is as follows:
1 Zone
115 Pods [Each Pod having 2 Clusters]
230 Clusters [Each cluster having 8 hosts and one primary storage]
1840 Hosts
4000 User accounts [Each account having one network]
12000 User instances
8000 Virtual Routers [Since we are using Redundant Virtual Router offering]
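The totals in the two infrastructure listings follow from a few multiplications, which the sketch below reproduces as a cross-check. The function name is hypothetical, and the figure of two routers per account is an assumption inferred from 8000 routers for 4000 single-network accounts with the redundant offering. For 112 pods the product gives 1792 hosts, while the first listing states 1782; the source does not explain the difference.

```python
def derived_totals(pods, clusters_per_pod=2, hosts_per_cluster=8,
                   accounts=4000, routers_per_account=2):
    """Derive cluster, host, storage and router counts from the pod count."""
    clusters = pods * clusters_per_pod
    return {
        "clusters": clusters,
        "hosts": clusters * hosts_per_cluster,
        "primary_storage": clusters,  # one primary storage per cluster
        "virtual_routers": accounts * routers_per_account,  # assumed redundant pair
    }


first = derived_totals(pods=112)   # clusters: 224, hosts: 1792 (listing says 1782)
second = derived_totals(pods=115)  # clusters: 230, hosts: 1840
print(first)
print(second)
```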
CPU UTILIZATION
The following graph shows the CPU utilization of one of the management servers while the simulator VMs were being deployed. The total time taken for all the VMs to complete deployment was ~3 hrs.
No. OF DB CONNECTIONS
The following shows the number of connections to the MySQL DB during Deploy VM.
Observation:
There are spikes in the number of DB connections, to almost 250 connections, approximately every 8 minutes. The frequency of the spikes increases with time.
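An observation like this can be checked from sampled connection counts by locating where the count climbs close to the pool limit and measuring the gap between successive spikes. The sketch below is illustrative only: the function names, the threshold of 240 and the sample series are made up and are not data from this run.

```python
def spike_starts(samples, threshold):
    """Return the timestamps at which the count rises to or above the threshold.

    samples: iterable of (time_in_seconds, connection_count), in time order.
    """
    starts = []
    above = False
    for t, count in samples:
        if count >= threshold and not above:
            starts.append(t)  # a new spike begins here
        above = count >= threshold
    return starts


def spike_intervals(starts):
    """Seconds between consecutive spike starts."""
    return [b - a for a, b in zip(starts, starts[1:])]


# Synthetic samples, NOT measured values.
samples = [(0, 40), (60, 245), (120, 248), (180, 50),
           (540, 246), (600, 45), (960, 249)]
starts = spike_starts(samples, threshold=240)
print(starts)                   # [60, 540, 960]
print(spike_intervals(starts))  # [480, 420]
```

Shrinking intervals in the second list would correspond to spikes becoming more frequent over time.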
ASYNC JOB RESPONSE TIME
The following shows the time taken for the Deploy VM async job to complete. The measurements are derived from the DB for each job id.
Observation:
As the number of VMs increases, the time taken for the async job to complete also increases, the longest being 51 sec. As seen from the graph, the first few VMs took around 5-10 sec, while the last VMs deployed (> 11000) took almost 50 sec.
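Deriving a per-job completion time from the DB amounts to subtracting a job's creation timestamp from its completion timestamp. The sketch below shows that step in isolation. The row layout (job id, created, completed), the timestamp format and the sample rows are assumptions made for illustration and do not reflect the actual schema or data.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format


def job_durations(rows):
    """Map job id -> elapsed seconds for rows of (job_id, created, completed)."""
    durations = {}
    for job_id, created, completed in rows:
        start = datetime.strptime(created, TIMESTAMP_FORMAT)
        end = datetime.strptime(completed, TIMESTAMP_FORMAT)
        durations[job_id] = (end - start).total_seconds()
    return durations


# Made-up rows, NOT taken from the test run.
rows = [
    (1, "2012-01-01 10:00:00", "2012-01-01 10:00:07"),
    (11500, "2012-01-01 12:55:00", "2012-01-01 12:55:50"),
]
durations = job_durations(rows)
print(durations)
print("longest:", max(durations.values()), "sec")
```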
TIME TAKEN BY ASYNC JOB TO RETURN JOB ID
This shows the time taken for the job id to be returned in response to the Deploy VM async call. The average time across Deploy VM API calls is 0.7 sec and the median is 0.418 sec. This means that half of the API calls took 0.418 sec or less to return the job id.
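A mean well above the median usually indicates a few slow calls pulling the average up, while the median marks the point at or below which half of the calls fall. The following sketch illustrates this with a small made-up sample; the values are not measurements from this run, and `share_at_or_below` is a hypothetical helper.

```python
import statistics


def share_at_or_below(values, limit):
    """Fraction of values that are less than or equal to limit."""
    return sum(1 for v in values if v <= limit) / len(values)


# Made-up response times in seconds: four fast calls and one slow outlier.
times = [0.2, 0.3, 0.4, 0.45, 2.15]

mean = statistics.mean(times)
median = statistics.median(times)
print(f"mean={mean:.3f} median={median:.3f}")
print(f"share at or below median: {share_at_or_below(times, median):.0%}")
```

Here a single outlier is enough to push the mean above every other value in the sample, which is why the median is the more representative figure for a typical call.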
Graphs for Deploy VM:
CPU UTILIZATION
The highlighted area shows the readings taken during Deploy VM. The graphs cover a total of around 9 hours (including Deploy VM, which took ~3 hours).
| direct.agent.load.size | Time for all hosts to connect to MS2 | Time for all hosts to get disconnected | Time for all hosts to connect to | Time for rebalancing the hosts between |
|---|---|---|---|---|
| 500 | 460 s | 135 s | 120 s | 265 s |
| 1000 | 140 s | 50 s | 100 s | 202 s |
| direct.agent.load.size | Time for all hosts to connect to MS2 | Time for all hosts to get disconnected | Time for all hosts to connect to | Time for rebalancing the hosts between |
|---|---|---|---|---|
| 500 | 135 s | 52 s | 110 s | 213 s |
| 1000 | 92 s | 82 s | 120 s | 248 s |
The delay between Sending... and Executing... was measured for various agent commands, which also had simulated delays induced. The following commands were measured:
DhcpEntryCommand
CreateCommand
StartCommand
CheckRouterCommand
ManageSnapshotCommand
BackupSnapshotCommand
The delay for all of them was well within 100 ms, though at times it went up to 400 ms.
The VMs were deployed in 3 iterations of 4K VMs each. Recurring snapshots were also set up for 1000 volumes.
The following are the results of a first attempt at measuring the List* API response time for a few APIs:
Observations:
The following table shows an initial measurement for a few APIs. Cases that failed with the error message "HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers" are marked "F".
| API | pagesize | Response time - XML | Response time - JSON | Comments |
|---|---|---|---|---|
| listHosts - count:1794 | 100 | 6 | 12 | |
| | 1000 | 117 | 114 | |
| | 5000 | 160 | 196 | |
| | 10000 | 175 | 176 | |
| | no pagesize | 209 | 168 | |
| listVolumes - count:12K | 100 | 6 | 5 | |
| | 1000 | 104 | 85 | |
| | 5000 | F | 833 | Failed - XML |
| | 10000 | F | F | Failed |
| | no pagesize | Didn’t try | Didn’t try | |
| listVirtualMachines - count:12K | 100 | 2 | 2 | |
| | 1000 | 35 | 14 | |
| | 5000 | 193 | 145 | |
| | 10000 | 330 | 269 | |
| | no pagesize | | | |
| listRouters - count:8K | 100 | 32 | 39 | |
| | 1000 | 374 | F | Failed |
| | 5000 | F | | |
| | 10000 | Didn’t try | Didn’t try | |
| | no pagesize | | | |
| listAccounts - count:4K | 100 | 62 | 59 | |
| | 1000 | | F | Failed |
| | 5000 | F | | Failed |
| | 10000 | NA | NA | since count=4K |
| listUsers - count:4K | no pagesize | Didn’t try | Didn’t try | |
| | 100 | | 13 | |
| | 1000 | 49 | 37 | |
| | 5000 | 136 | 74 | |
| | 10000 | NA | NA | since count=4K |
| | no pagesize | NA | NA | since count=4K |
| listAsyncJobs | 100 | 6 | 11 | |
| | 1000 | 68 | 96 | |
| | 5000 | F | | Failed |
| | 10000 | | | |
| | no pagesize | | | |
| listStoragePools - count:224 | 100 | 2 | 5 | |
| | 1000 | 15 | 7 | |
| | 5000 | 25 | 32 | |
| | 10000 | NA | NA | since count=224 |
| | no pagesize | | | |
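Measurements like those in the table above could be collected with a small timing script that varies the page size and response format per call. The sketch below is one possible approach, not the tooling used for this run. The base URL is a placeholder, the helper names are hypothetical, request authentication is left out, and a failed read is returned as `None`, which would correspond to an "F" entry.

```python
import http.client
import time
import urllib.request
from urllib.parse import urlencode


def build_list_url(base_url, command, pagesize=None, response_format="json", page=1):
    """Build a List* API URL; paging parameters are added only when pagesize is given."""
    params = {"command": command, "response": response_format}
    if pagesize is not None:
        params["page"] = page
        params["pagesize"] = pagesize
    return f"{base_url}?{urlencode(params)}"


def time_request(url, timeout=600.0):
    """Return the seconds taken to fetch the full response, or None on failure."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
    except (OSError, http.client.HTTPException):
        return None  # e.g. connection reset by peer
    return time.perf_counter() - start


# Example usage (placeholder endpoint, assumed to need no request signing):
# base = "http://ms.example.com:8096/client/api"
# for fmt in ("xml", "json"):
#     for size in (100, 1000, 5000, 10000, None):
#         url = build_list_url(base, "listHosts", pagesize=size, response_format=fmt)
#         print(fmt, size, time_request(url))
```

Timing includes reading the whole body, so large page sizes are charged for transfer as well as server-side processing.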
This use case covers snapshots: measurements were taken while snapshots were being triggered by the MS, together with the CPU load during that time.
snapshot.poll.interval was set to its default value of 300 sec.
Following are the results:
The following graph shows the CPU utilization while snapshots were being triggered (for the 10000-volume case).