Lambdas for processing podcast downloads/impressions for PRX Dovetail.
These lambdas subscribe to kinesis streams, containing metric event records. The types of records seen are:
antebytes- From Dovetail Router, this record has information about a request, which ads were inserted, and listener metadata. These requests are normally redirected to the CDN, so we don't know if the listener downloaded the actual bytes of the file yet.bytes- From the Counts Lambda, this indicates that the listener downloaded more than 1 minute of audio and can be counted as an IAB download.segmentbytes- Similar to bytes, but indicates 100% of a specific ad segment in the audio file was downloaded, and can be counted as an IAB impression.postbytes- The intersection of the antebytes and bytes/segmentbytes records, indicating we can count downloads/impressions using this upfront redirect metadata.
Because of differences in retry logic, this repo is deployed as 4 different lambdas with different handlers - DynamoDB, Frequency, Pingbacks and BigQuery. This diagram generally describes the data flow around these 4 lambdas (along with the separate dovetail-counts-lambda:
This lambda temporarily stores Dovetail Router redirect data, and unites it with the
actual counted bytes of CDN downloads. Once all or part of the audio file is downloaded,
we filter the redirect download/impressions to what actually happened and fire a postbytes
back onto kinesis. Which is consumed by the other 3 lambdas.
DynamoDB rows are keyed on the listenerEpisode (a hash of IP address, user agent, and the episode guid)
and digest (a hash of the specific segments in the audio arrangement they got). The row also
has a payload which is just the compressed data receieved from Dovetail Router.
+-----------+-----------------------+-------------------------+
| id | payload | segments |
+-----------+-----------------------+-------------------------+
| 1234.abcd | <binary gzipped json> | 1624299980 1624299942.2 |
| 1234.efgh | | 1624300094.1 |
| 5678.efgh | <binary gzipped json> | |
+-----------+-----------------------+-------------------------+
The segments String Set
contains the epoch timestamp that came in on each byte or segmentbyte
record (the time the bytes were actually downloaded from the CDN). And
optionally a . and the segment number. This field acts as a gatekeeper, so we
never double-count the same bytes/segmentbytes on the same UTC day.
(NOTE: a single antebytes record could legally be counted twice on 2
different UTC days, if the listener downloaded the episode from the CDN twice
just before and after midnight).
Any failures/errors writing to DynamoDB will result in the entire kinesis batch retrying. But since
the output postbytes are logged to STDOUT synchronously (and picked up via a
Subscription Filter)
it's fine to rerun them - they'll no-op the second time.
After requests are counted, this lambda inserts them into the dt_downloads and dt_impressions tables
in BigQuery using streaming inserts.
Any errors inserting will result in the entire kinesis batch retrying. To avoid duplicate inserted rows, they each have a unique insertId which BigQuery will de-duplicate on a best effort basis.
After requests are counted, this lambda looks for impressions occurring on frequency capped campaigns. These are inserted into a DynamoDB table that Dovetail Router uses to restrict how often a listener gets a campaign across any number of requests/episodes.
The frequency table is keyed on Listener ID and Campaign ID, and contains an epoch-milliseconds set of each impression the listener got this campaign.
+----------+---------------+-------------------------+
| campaign | listener | impressions |
+----------+---------------+-------------------------+
| 1234 | <some-hash-A> | [1624299980 1624299942] |
| 5678 | <some-hash-B> | [1624300094] |
| 5678 | <some-hash-B> | [1624300094] |
+----------+---------------+-------------------------+
Since that list of timestamps is a Set, operations here are idempotent, and any failures will retry the entire kinesis batch.
After requests are counted, this lambda looks for any non-duplicate impressions.
The Flight IDs and counts of those impressions are POST-ed to Dovetail Router, to give it a very up-to-the-minute idea of how often things served. So it can precisely reach flight goals, and balance between which flights it serves throughout the day.
Additionally, records with an impression[].pings url array will be pinged via
an HTTP GET. This "ping" does follow redirects, but expects to land on a 200
response afterwards. Although 500 errors will be retried internally in the
code, any ping failures will be allowed to fail after error/timeout.
Unlike the other lambdas, these operations are not idempotent, so we don't want to over-ping a url. All errors will be handled internally so Kinesis doesn't attempt to re-exec the batch of records.
Pingback urls should be valid RFC 6570 URI template. Valid parameters are:
| Parameter Name | Description |
|---|---|
ad |
Ad id (intersection of creative and flight) |
agent |
Requester user-agent string |
agentmd5 |
An md5'd user-agent string |
campaign |
Campaign id |
creative |
Creative id |
episode |
Feeder episode guid |
flight |
Flight id |
ip |
Request ip address |
ipmask |
Masked ip, with the last octet changed to 0s |
ipv4 |
Request ip address, but blank for any ipv6 requests |
listener |
Unique string for this "listener" |
listenerepisode |
Unique string for "listener + url" |
podcast |
Feeder podcast id |
randomstr |
Random string |
randomint |
Random integer |
referer |
Requester http referer |
timestamp |
Epoch milliseconds of request |
url |
Full url of request, including host and query parameters, but without the protocol https:// |
See the Contributing file.