discuss updates to metering roadmap #1706

Open
loomis opened this issue Nov 9, 2018 · 0 comments

The experience with using the metering infrastructure for HNSciCloud has revealed various limitations in the current implementation. These limitations should be discussed and a roadmap created to plan the evolution of the feature.

Some of the takeaways from the HNSciCloud experience are:

  • Billable/non-billable usage. Monitored resources may be in states where they are visible, but not (fully) billed. For example, a suspended virtual machine may be billed only for storage but not for CPU or RAM. The states and detailed billing policies probably will vary from provider to provider.
  • Inaccurate pricing. A single price is used to calculate the cost within a metering record. Consequently, variations in the price of a resource, for example when a VM is suspended, cannot be properly reflected in the calculated cost.
  • Inaccurate resource usage values. Similarly, the resource usage totals may not be correct when the resource is in an inactive state. For example, the CPU and RAM values should not be included in the totals for a suspended VM.
  • Workarounds for some of these problems have been added to the UI (e.g. the billable flag). This really goes against the spirit of the metering infrastructure, where the client should simply be able to sum the values of the metering records to obtain correct totals. In general, the client should not need to know the details of the providers' pricing calculations.
  • The current solution produces a large volume of documents within the database. This requires that a retention and/or consolidation policy be put in place (or an acceptance that the storage costs will continuously increase).
  • Outages of SlipStream or the underlying providers directly affect the accuracy of the metering, as metering records are not produced (or not correctly produced) in these situations. "Backfilling" is possible, but this has never been done in practice.
  • Tying resource usage to a particular user, group, or role has been done with ad hoc changes to the system. A general mechanism for doing this needs to be developed.
  • The monitoring system collects usage information through active probing of the cloud resources. This works reasonably well for virtual machines, even though it puts a large load on the job execution framework. This does not work for S3 resources, as collecting bucket size information requires scanning all objects within the bucket. The latency is too large to be useful. For S3, exclusive use of ExternalObject resources would avoid these issues, but that requires buy-in from users and avoiding direct use of the underlying S3 cloud services/APIs. Other resources may have similar problems.
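
To make the billable/non-billable, pricing, and usage points above more concrete, here is a rough sketch of one possible direction: apply the state-dependent policy when the metering record is created, so that a client can obtain correct totals by summing records. This is only an illustration for discussion, not the current implementation. The record fields, the `BILLED_RESOURCES` table, the state names, and the prices are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-state policy: which resources are billed in each state.
# Real policies would probably vary from provider to provider.
BILLED_RESOURCES = {
    "running": {"cpu", "ram", "disk"},
    "suspended": {"disk"},  # e.g. storage only, no CPU or RAM
}


@dataclass(frozen=True)
class MeteringRecord:
    vm_id: str
    state: str
    minutes: int
    usage: dict  # billed resource-minutes, keyed by resource name
    cost: float


def make_record(vm_id, state, minutes, capacity, unit_prices):
    """Build a record whose usage and cost already reflect the VM state."""
    billed = BILLED_RESOURCES.get(state, set())
    # Resources that are not billed in this state contribute zero usage.
    usage = {
        name: (amount * minutes if name in billed else 0)
        for name, amount in capacity.items()
    }
    cost = sum(usage[name] * unit_prices[name] for name in usage)
    return MeteringRecord(vm_id, state, minutes, usage, cost)


def totals(records):
    """Client-side view: plain sums, no knowledge of provider policy."""
    usage = {}
    for record in records:
        for name, value in record.usage.items():
            usage[name] = usage.get(name, 0) + value
    return usage, sum(record.cost for record in records)


# Made-up capacity (2 CPU, 4 RAM units, 10 disk units) and prices
# per resource-minute, purely for illustration.
capacity = {"cpu": 2, "ram": 4, "disk": 10}
unit_prices = {"cpu": 0.5, "ram": 0.25, "disk": 0.125}

records = [
    make_record("vm-1", "running", 60, capacity, unit_prices),
    make_record("vm-1", "suspended", 30, capacity, unit_prices),
]
usage, cost = totals(records)
print(usage, cost)
```

With a scheme like this, the suspended period adds only disk usage and disk cost to the totals, and the UI would not need a billable flag to filter records. It does not address the other points (document volume, outages, attribution to users/groups, or S3 probing latency).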