TBD
Over the past few years, several issues have been observed with the usage server, most of them caused by limitations and bugs in its implementation. For example, when duplicate entries are detected for events, the usage server stops working and cannot be started again, requiring manual intervention.
It has also been observed that the performance of the current usage server implementation degrades as the database grows in size, especially with shorter aggregation periods such as hourly. When this happens, the execution times of the usage jobs exceed the aggregation period, and a delay is observed in the dates of the generated usage records.
A major refactor of the usage server is proposed to improve its performance and robustness.
Document History
Version | Author/Reviewer | Date
---|---|---
1.0 | Nicolas Vazquez | 
This feature proposes a refactor of the usage server, ensuring that the following requirements are met for administrators:
The administrators must be able to re-generate usage data for a certain date range for all the accounts or specific account(s) of a domain:
If neither an account ID nor a domain ID is set, then all the accounts’ usage data will be removed and re-generated
The API execution will create and persist a new job with a type = ‘REGENERATE’ to differentiate it from other usage jobs.
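The scope resolution and job persistence described above can be sketched as follows. This is an illustrative sketch only: the names `UsageJob` and `create_regenerate_job`, and the account-resolution helper, are hypothetical, not actual CloudStack classes or APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UsageJob:
    job_type: str                  # e.g. 'RECURRING', 'SINGLE', or the new 'REGENERATE'
    start_date: str
    end_date: str
    account_ids: List[int] = field(default_factory=list)  # empty list = all accounts

def create_regenerate_job(start_date: str, end_date: str,
                          account_id: Optional[int] = None,
                          domain_id: Optional[int] = None,
                          accounts_by_domain: Optional[dict] = None) -> UsageJob:
    """Resolve the account scope and build a REGENERATE job to persist."""
    if account_id is not None:
        scope = [account_id]                                # one specific account
    elif domain_id is not None:
        scope = (accounts_by_domain or {}).get(domain_id, [])  # accounts of a domain
    else:
        scope = []  # neither set: all accounts' usage data is removed and re-generated
    return UsageJob('REGENERATE', start_date, end_date, scope)
```

The empty-scope convention mirrors the requirement above: omitting both IDs targets every account.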
The existing generateUsageRecords API must be fixed to allow re-generation when data is missing from cloud_usage.cloud_usage for the specified start/end dates but, according to the job metadata, is found to be available. The existing API does not regenerate the usage records properly: it only works if no usage data has been generated after the start time passed to the API.
The usage job’s last step is the parsing of the helper tables into ‘cloud_usage’ records. This parsing is currently performed sequentially, one account after another, and can be parallelised for better CPU utilisation. The proposed approach parallelises the processing across accounts: a pool of threads is used, and each thread processes one account sequentially. The processing of the events remains sequential and is performed before the parallelised account processing.
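A minimal sketch of the proposed parallelisation, assuming helper records have already been grouped by account. The names `parse_account` and `parse_all_accounts` are hypothetical; in the real implementation each worker would parse one account's helper table rows into ‘cloud_usage’ records.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_account(account_id, records):
    # Sequential parsing of a single account's helper records.
    # Here we just tag each record with its account for illustration.
    return [(account_id, r) for r in records]

def parse_all_accounts(helper_records_by_account, threads=4):
    """Each worker thread handles one account sequentially; accounts run in parallel."""
    usage_records = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(parse_account, account_id, records)
                   for account_id, records in helper_records_by_account.items()]
        for future in futures:
            usage_records.extend(future.result())
    return usage_records
```

Because each account is confined to a single thread, per-account ordering guarantees are preserved while overall CPU utilisation improves.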
The usage job must be more robust in its execution. One known issue with the current usage server implementation comes from duplicate event records, which cause the usage job to fail and never succeed until the duplicates are removed. To increase robustness, an additional step can be added after the helper records are created and before they are parsed. The additional step must perform the following actions:
If failures still occur in the usage jobs after the additional step, then the job must fail detailing the reason, log the errors, and send an alert to the administrator so that action can be taken.
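The duplicate-handling part of the additional step could look like the sketch below: duplicates are detected and set aside for logging and alerting rather than aborting the job. The event shape (dicts keyed by `id`) and the function name are assumptions for illustration.

```python
def check_duplicates(events):
    """Split events into unique records and duplicates (kept for logging/alerting)."""
    seen, unique, duplicates = set(), [], []
    for event in events:
        key = event['id']
        if key in seen:
            duplicates.append(event)  # reported instead of failing the whole job
        else:
            seen.add(key)
            unique.append(event)
    return unique, duplicates
```

Only the unique records proceed to the parsing phase; the duplicates feed the error log and the administrator alert described above.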
A new global setting will be created to control the number of years that old usage data is kept in the ‘cloud_usage’ table of the ‘cloud_usage’ database.
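The retention logic driven by that setting can be sketched as below. The setting name and the record shape are assumptions; the real implementation would delete rows from the ‘cloud_usage’ table rather than filter a Python list.

```python
from datetime import datetime, timedelta

def apply_retention(records, retention_years, now=None):
    """Return only the records newer than the configured retention window.

    'records' are dicts with an 'end_date' datetime; 'retention_years' comes
    from the proposed global setting (name TBD in this spec).
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=365 * retention_years)
    return [r for r in records if r['end_date'] >= cutoff]
```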
The sanity check implementation must be refactored to improve the robustness and completeness of the checks.
N/A
The following test must be included in a separate marvin class (e.g. with its own tag) to prevent it from running automatically on production environments, as it changes configurations in the system, unless it is explicitly executed by an administrator. However, the test must be included in the Trillian automation.
Start the usage server
Wait for 2 or 3 times the aggregation range (in minutes) so that usage records can be generated.
Stop the usage server
List the usage records for the account and ensure records have been generated for the created resources and a given usage type, for example running time for VMs
Ensure the number of records matches the expected number based on the time spent generating records
Wait for 2 or 3 times the aggregation range (in minutes), then set the aggregation range to daily
Start the usage server
List the usage records for the account and check the number of records per a certain resource and usage type
Regenerate the usage records for the account on that date
List the usage records for the account for the same resource and usage type, and compare the count to the number obtained above. Ensure the number of usage records is higher than in the previous check.
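The final comparison step can be sketched as follows. `count_records` stands in for filtering the response of a listUsageRecords call; the record shape is an assumption for illustration.

```python
def count_records(records, resource_id, usage_type):
    """Count usage records for one resource and usage type."""
    return sum(1 for r in records
               if r['resource_id'] == resource_id and r['usage_type'] == usage_type)

def regeneration_increased_records(before, after, resource_id, usage_type):
    """True if regeneration produced more records for the same resource/type."""
    return count_records(after, resource_id, usage_type) > \
           count_records(before, resource_id, usage_type)
```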