Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Bill for network egress #11

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
222 changes: 222 additions & 0 deletions rfc/0000-batch-network-usage.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
==================================
Billing for Network Usage in Batch
==================================

.. author:: Jackie Goldstein
.. date-accepted:: Leave blank. This will be filled in when the proposal is accepted.
.. ticket-url:: Leave blank. This will eventually be filled with the
ticket URL which will track the progress of the
implementation of the feature.
.. implemented:: Leave blank. This will be filled in with the first Hail version which
implements the described feature.
.. header:: This proposal is `discussed at this pull request <https://github.com/hail-is/hail-rfc/pull/0>`_.
**After creating the pull request, edit this file again, update the
number in the link, and delete this bold sentence.**
.. sectnum::
.. contents::
.. role:: python(code)


**********
Motivation
**********

Hail Batch is a multitenant elastic batch job processing system that
executes user pipelines on a fleet of worker VMs. Jobs execute
arbitrary commands inside of a container on a worker VM. Users are
billed for their share of the cost for cpu, memory, storage costs, IP
fees, and an additional service fee. However, Hail Batch does not
currently bill for network usage from the external internet in the
user's container. These costs are billed to the operators of Batch and
not to the users leaving a financial vulnerability that users can
exploit. To address this vulnerability, we will add a new resource for
bytes sent and received by the user over the network within our
existing billing system.


****************************
How the Current System Works
****************************

The current billing system works by having six tables in the database
that are used in concert to determine how much a user has spent.

1. ``products``
2. ``latest_product_versions``
3. ``resources``
4. ``attempts``
5. ``attempt_resources``
6. ``aggregated_{batch,job,billing_project_user,billing_project_user_by_date}_resources_v2``


Resource Identity
=================

A resource is composed of a product and a version. A product is a
specific item (i.e. SKU) that we bill for. An example product is
"compute/n1-preemptible/us-central1". Each product has multiple
versions associated with it that track how much the product cost at a
given timepoint. The version is usually the timestamp in which the
product was first billed at that rate. The ``latest_product_versions``
table keeps track of the current version for each product in the
database. A resource is the combination of a product + version and has
a rate associated with it. The rate is in a relevant unit for the
product over time in milliseconds. For example, CPU is tracked in
mcpu * msec. Certain products are tracked by the share of the VM to
compute the cost. These values are stored as integers where 100% is
stored as 1024 and the rate is computed to take shares out of 1024. We
use integers to avoid floating point imprecision as much as
possible. Resources are stored as unique integer IDs and are
referenced by their integer ID throughout the database.


Computing Resource Usage
========================

To compute how much of a resource an attempt has used, we store a row
for each resource used by each attempt that has an associated
quantity. The units of the quantity are specific to the resource
associated with it. For example, a compute resource has a quantity
that is in units of mCPU. Memory has a quantity that is in units of
MiB. All quantities must be integer values. To get the usage of the
resource, we store the start time and end time in the ``attempts``
table for each attempt. Therefore, the total ``usage`` of a resource
can be computed by ``quantity * duration``. There are four aggregated
billing tables that then store the overall usage across billing
projects, batches, and jobs. Two database triggers
(``attempts_after_update`` and ``attempt_resources_after_insert``) are
responsible for making sure the aggregated resource tables are up to
date by taking the change in duration and multiplying it by the
quantity of resource.


Computing Cost
==============

The aggregated billing tables are tokenized such that multiple
attempts can add their usage to the aggregation table
simultaneously. Therefore, an example query to compute the overall
cost used is shown below for a job:

.. code::text

SELECT batch_id, job_id, resource, COALESCE(CAST(COALESCE(SUM(`usage`), 0) AS SIGNED) * rate, 0) AS cost
FROM aggregated_job_resources_v2
LEFT JOIN resources ON aggregated_job_resources_v2.resource_id = resources.resource_id
WHERE aggregated_job_resources_v2.batch_id = 2 AND aggregated_job_resources_v2.job_id = 1
GROUP BY batch_id, job_id, resource;


Real-Time Billing
=================

Real time billing is implemented by having the worker VM send updates
every minute to the driver with the last known timestamp an attempt
has been running for. The driver then updates the ``rollup_time`` in
the ``attempts`` table which then updates the aggregation tables
within the ``attempts_after_update`` trigger.


Network Bandwidth Tracking
==========================

We use iptable to mark packets with which network namespace they
originated from or are destined to. We track how many bytes have been
transferred by polling iptables periodically. We then display the
upload and download bandwidths to the user on the UI page.

The relevant iptable commands are:

.. code::text

iptables -w {IPTABLES_WAIT_TIMEOUT_SECS} -t mangle -A PREROUTING --in-interface {self.veth_host} -j MARK --set-mark 10 && \
iptables -w {IPTABLES_WAIT_TIMEOUT_SECS} -t mangle -A POSTROUTING --out-interface {self.veth_host} -j MARK --set-mark 11

iptables -t mangle -L -v -n -x -w | grep "{self.veth_host}" | awk '{{ if ($6 == "{self.veth_host}" || $7 == "{self.veth_host}") print $2, $6, $7 }}'


*****************************
Proposed Change Specification
*****************************

We will make the following changes to the system to support billing
for network traffic:

1. There will be a new network bandwidth product added to the
``products`` table.
2. The rate in the ``resources`` table for the network bandwidth
product will depend on the cloud provider and be in cost per byte
transferred. We will add a new column to the ``resources`` table to
distinguish resources where the rate is `by_time` or `by_unit`.
3. The quantity stored for the new product in the
``attempt_resources`` table will be equal to the current total
number of bytes transferred for that attempt.
4. The ``attempts_after_update`` and
``attempt_resources_after_insert`` trigger will be modified such
that resources that are `by_time` will use the change in duration
when updating the usage while resources that are `by_unit` will
instead have ``usage = quantity``.
5. The billing updates from the worker VM to the driver will contain
the total number of bytes egressed (from VM to outside world) over
the network for each attempt since the attempt began. The driver
will then try and update the quantity for these attempts using
``INSERT ... ON DUPLICATE KEY UPDATE``.

Computing costs remains the same since `cost = usage * rate` and the
egress bytes are correctly accounted for in the `usage` and `rate`.

To start, we will only bill for outbound bytes in the main container
regardless of the destination of the packet. To protect ourselves from
extra charges for data transferred in the output container, we will
only allow output files to be copied to the same cloud as the worker
is currently running in.


***********************
Effect and Interactions
***********************

My proposed change addresses the problem of billing for network egress
usage in the main container in a coarse manner by ignoring the packet
destination.


*******************
Costs and Drawbacks
*******************

There are no performance costs to this plan on the Batch system. We
already track bytes sent out of the container's network and already
have the billing infrastructure in place. The only problem with this
plan is we are overcharging in the following cases:

1. We are overcharging users who write files to Google Cloud Storage
within the main container.

We are also limiting the ability of users to write output files from
one cloud to a different cloud.


************
Alternatives
************

There are no other alternative designs within our current framework.


********************
Unresolved Questions
********************

What is our strategy for tracking the destination of outbound packets
to bill only those destined for the external internet? What iptables
marks can we add and how do we distinguish packet destinations? This
plan is punting on how to get more fine-grained egress information for
now.


************
Endorsements
************

Not applicable.