discuss updates to metering roadmap #1706

loomis · 2018-11-09T07:19:59Z

The experience with using the metering infrastructure for HNSciCloud has revealed various limitations in the current implementation. These limitations should be discussed and a roadmap created to plan the evolution of the feature.

Some of the takeaways from the HNSciCloud experience are:

Billable/non-billable usage. Monitored resources may be in states where they are visible, but not (fully) billed. For example, a suspended virtual machine may be billed only for storage but not for CPU or RAM. The states and detailed billing policies will probably vary from provider to provider.
Inaccurate pricing. A single price is used to calculate the cost within a metering record. Consequently, variations in the price of a resource, for example when a VM is suspended, cannot be properly reflected in the calculated cost.
Inaccurate resource usage values. Similarly, the resource usage totals may not be correct when the resource is in an inactive state. For example, the CPU and RAM values should not be included in the totals for a suspended VM.
Workarounds for some of these problems have been added to the UI (e.g. the billable flag). This really goes against the spirit of the metering infrastructure where the client should simply be able to sum values of the metering records to receive correct totals. In general, the client should not need to know about the detailed pricing calculation from the providers.
The current solution produces a large volume of documents within the database. This requires that a retention and/or consolidation policy be put in place (or an acceptance that the storage costs will continuously increase).
Outages of SlipStream or the underlying providers directly affect the accuracy of the metering, as metering records are not produced (or not correctly produced) in these situations. "Backfilling" is possible, but this has never been done in practice.
Tying resource usage to a particular user, group, or role has been done with ad hoc changes to the system. A general mechanism for doing this needs to be developed.
The monitoring system collects usage information through active probing of the cloud resources. This works reasonably well for virtual machines, even though it puts a large load on the job execution framework. It does not work for S3 resources, as collecting bucket size information requires scanning all objects within the bucket, and the resulting latency is too large to be useful. For S3, exclusive use of ExternalObject resources would avoid these issues, but that requires buy-in from users and avoiding direct use of the underlying S3 cloud services/APIs. Other resources may have similar problems.
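
To make the first few points concrete, here is a minimal sketch of what "the client simply sums the records" could look like if the state-dependent pricing were resolved when each metering record is written. Everything in it is hypothetical: the `PRICES` table, the state names, the integer price units, and the `MeteringRecord` / `make_record` names are illustrative assumptions, not the existing SlipStream schema or any provider's actual rates.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical price table: integer price units per resource unit per hour,
# keyed by VM state. A resource that is absent for a state is not billed
# (and not counted) in that state. Real providers would define their own.
PRICES: Dict[str, Dict[str, int]] = {
    "running": {"cpu": 20, "ram_gb": 10, "disk_gb": 1},
    "suspended": {"disk_gb": 1},
}


@dataclass(frozen=True)
class MeteringRecord:
    """One metering interval for one VM, with billing already resolved."""
    vm_id: str
    state: str
    hours: int
    usage: Dict[str, int]  # billable resource-hours for this interval
    cost: int              # price units for this interval


def make_record(vm_id: str, state: str, hours: int,
                allocated: Dict[str, int]) -> MeteringRecord:
    """Server side: apply the state's pricing policy when the record is written."""
    rates = PRICES[state]
    # Only resources billed in this state contribute to usage and cost.
    usage = {name: allocated[name] * hours for name in rates if name in allocated}
    cost = sum(rates[name] * amount for name, amount in usage.items())
    return MeteringRecord(vm_id, state, hours, usage, cost)


def total_cost(records: Iterable[MeteringRecord]) -> int:
    """Client side: a plain sum, with no knowledge of states or prices."""
    return sum(r.cost for r in records)


def total_usage(records: Iterable[MeteringRecord], resource: str) -> int:
    """Client side: billable resource-hours for one resource."""
    return sum(r.usage.get(resource, 0) for r in records)


# Example: the same VM runs for 3 hours, then is suspended for 5 hours.
vm = {"cpu": 2, "ram_gb": 4, "disk_gb": 10}
records = [
    make_record("vm-1", "running", 3, vm),
    make_record("vm-1", "suspended", 5, vm),
]
```

With records shaped like this, `total_cost(records)` and `total_usage(records, "cpu")` need no billable flag in the UI: the suspended interval contributes storage only. Integer price units are used here purely to keep the arithmetic exact in the sketch.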

loomis added T4 Discussion S04 Proposed for Sprint P00 Internal labels Nov 9, 2018
