Best practices for metering cloud resources

Dan Elman, Ofer Shterling, Puneet Gupta

August 1, 2022

Metering sounds simple enough at first glance, but a closer look makes it clear why companies like AWS, Snowflake, and Twilio have dedicated entire engineering teams to building and maintaining their internal usage metering systems. As we have written before, metering is a heavy lift, and there is very little margin for error. One of the key design principles that we have upheld since day one is building a flexible, scalable, and domain-agnostic metering solution for the world. To accomplish this, our Metering Cloud must be able to accurately track and aggregate any resource, infrastructure, or user action possible.

One of the critical aspects of this is handling different types of usage events and reporting methodologies. 

Cloud Resources (single vs long-lasting)

Amberflo allows you to meter both instantaneous, discrete events and long-lasting events. Long-lasting events describe non-momentary resource consumption at a certain frequency or scale (known as the usage rate).

The two types of events can be thought of in graphical terms as below:

Figure: Single and Long-lasting Events

Single events plotted over time can be thought of as points in the mathematical sense: they have no length, since they do not take place over a span of time but at a single moment. As an example, suppose you want to meter the count of API calls for some endpoint. The endpoint sends a meter event to Amberflo for each completed API call, indicating that it took place. These events can be plotted as points over time (based on the timestamp received at ingestion), and then counted up when queried to return the number of calls handled over a given time.

On the other hand, long-lasting events do take place over time, so when graphed over time they can be represented as curves with nonzero length. It is important to note that for long-lasting events, the rate of usage may change throughout the time that an event is taking place. For example, consider a data storage solution; from the moment the first dataset is ingested for storage lasting until the last data is removed from the system, resources are being consumed. That said, the rate of consumption changes as data is ingested and extracted from the system and the amount of storage in use changes.

Reporting Single Events

Compared with long-lasting events, reporting single events is straightforward.

In Amberflo, simply define the meter and its associated dimensions (if any), and send in the count as the value of the meter. Select Sum as the meter aggregation type, and Amberflo automatically ingests, persists, and aggregates the events, slices the aggregate values into a time series, and presents the data back to you via dashboard and API in real time.
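As a rough illustration of what Sum aggregation over a time series means, the sketch below groups single events into hourly buckets and adds up their values. The event shape and the `sum_by_hour` helper are assumptions made for this example; they are not Amberflo's API or its internal implementation.

```python
from collections import defaultdict
from datetime import datetime


def sum_by_hour(events):
    """Sum event values into buckets keyed by the start of each hour."""
    totals = defaultdict(int)
    for event in events:
        bucket = event["time"].replace(minute=0, second=0, microsecond=0)
        totals[bucket] += event["value"]
    return dict(totals)


# Hypothetical API-call events: one event per completed call, each with value 1.
api_calls = [
    {"time": datetime(2022, 8, 1, 9, 5), "value": 1},
    {"time": datetime(2022, 8, 1, 9, 40), "value": 1},
    {"time": datetime(2022, 8, 1, 10, 15), "value": 1},
]

hourly = sum_by_hour(api_calls)
```

Here `hourly` holds one total per hour, which is the kind of series a dashboard would chart for a count-style meter.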

Reporting Long-lasting Events

Reporting long-lasting events presents one challenge that isn’t present for single instantaneous events: correctly reporting stop events. In the instantaneous case, if a record is not sent to Amberflo, the consequence is simply that one record is missing, so the total is off by the value of that event. In the long-lasting case, if a start event for a resource is received but for some reason the stop event fails to arrive, the meter will be tracked as if that usage goes on forever (or until a stop event is eventually received). If you are billing the customer for that usage, this clearly has the potential to cause massive problems very quickly.

To address this, we provide a timeout period. If a start event is received and no stop event is received within that timeout period, then the meter is automatically reset as if a stop event were received. We allow you to set a timeout on two levels: globally at the meter definition level (the default is set to one year), and at the usage event level by setting “aflo.expiration_time_seconds” equal to the max anticipated (or allowed) usage for that meter event.
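The two timeout levels can be pictured with a small sketch. It assumes that an event-level value, when present, takes precedence over the meter-level one, that the event is represented as a flat dictionary, and that one year means 365 days; the `effective_timeout_seconds` helper is hypothetical and only illustrates the idea.

```python
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # assumed length of the one-year default


def effective_timeout_seconds(event, meter_timeout_seconds=ONE_YEAR_SECONDS):
    """Return the event-level timeout if the event carries one, else the meter-level one."""
    # Assumption: the event-level setting overrides the meter-level setting.
    return event.get("aflo.expiration_time_seconds", meter_timeout_seconds)


# Hypothetical events: one with its own timeout, one relying on the meter default.
with_override = {"value": 8, "aflo.expiration_time_seconds": 1800}
without_override = {"value": 8}

short_timeout = effective_timeout_seconds(with_override)
default_timeout = effective_timeout_seconds(without_override)
tight_meter_timeout = effective_timeout_seconds(without_override, meter_timeout_seconds=3 * 3600)
```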

For further built-in resilience, we recommend a “heartbeat” approach to reporting usage for long-lasting events when using the momentary reporting approach (see below). That is, choose a time interval shorter than the timeout period and report the current usage (or value) at every interval. For example, if you have set a timeout period of 30 minutes, you might use heartbeat intervals of 5 minutes; every 5 minutes, the system would send an event to Amberflo with the current usage rate, even if it hadn’t changed since the previous report. This way, if a stop event is missed for any reason and no subsequent heartbeat events are received, Amberflo will automatically mark the end of the long-lasting event. This lets you put an upper bound on the maximum time for which incorrect usage could be recorded, based on the length of the heartbeat intervals. If you want to keep that bound lower, set a shorter timeout and send heartbeats more frequently within it.

Momentary Reporting Method

As a best practice, we recommend a momentary reporting method, in which the account sends the value of the usage rate for the resource at the moment of reporting. Using the same storage example from above, the momentary usage events sent to Amberflo would be:

  1. {time: 9:00am, value: 8}
  2. {time: 11:00am, value: 11}
  3. {time: 11:30am, value: 7}
  4. {time: 11:50am, value: 0}
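To see how such momentary events could translate into usage, the sketch below treats each reported value as holding until the next event and integrates over time. This step-function model, the `unit_hours` helper, and the calendar date are assumptions for illustration; the article does not state how Amberflo aggregates these events internally.

```python
from datetime import datetime


def unit_hours(events):
    """Integrate a step-function usage rate: each value holds until the next event."""
    total = 0.0
    for current, following in zip(events, events[1:]):
        hours = (following["time"] - current["time"]).total_seconds() / 3600
        total += current["value"] * hours
    return total


day = (2022, 8, 1)  # arbitrary date; only the times of day come from the example
events = [
    {"time": datetime(*day, 9, 0), "value": 8},
    {"time": datetime(*day, 11, 0), "value": 11},
    {"time": datetime(*day, 11, 30), "value": 7},
    {"time": datetime(*day, 11, 50), "value": 0},
]

total = unit_hours(events)  # 8 * 2h + 11 * 0.5h + 7 * (20/60)h
```

Under this model the four events amount to roughly 23.8 unit-hours of consumption.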

Using the events from the example above, we can demonstrate the momentary reporting method with added heartbeat events sent every 30 minutes (events 2 through 4 below) as follows:

  1. {time: 9:00am, value: 8}
  2. {time: 9:30am, value: 8}
  3. {time: 10:00am, value: 8}
  4. {time: 10:30am, value: 8}
  5. {time: 11:00am, value: 11}
  6. {time: 11:30am, value: 7}
  7. {time: 11:50am, value: 0}
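The heartbeat-augmented list above can be derived mechanically from the four value-change events. The following is a minimal sketch under the assumption that a heartbeat simply repeats the last reported value and is skipped when a real event falls on the same instant; `with_heartbeats` is a hypothetical helper, not part of any Amberflo SDK.

```python
from datetime import datetime, timedelta


def with_heartbeats(events, interval):
    """Insert repeats of the last reported value every `interval` between events."""
    if not events:
        return []
    result = []
    for current, following in zip(events, events[1:]):
        result.append(current)
        beat = current["time"] + interval
        # Emit heartbeats strictly before the next real event.
        while beat < following["time"]:
            result.append({"time": beat, "value": current["value"]})
            beat += interval
    result.append(events[-1])
    return result


day = (2022, 8, 1)  # arbitrary date for illustration
changes = [
    {"time": datetime(*day, 9, 0), "value": 8},
    {"time": datetime(*day, 11, 0), "value": 11},
    {"time": datetime(*day, 11, 30), "value": 7},
    {"time": datetime(*day, 11, 50), "value": 0},
]

reported = with_heartbeats(changes, timedelta(minutes=30))
summary = [(e["time"].strftime("%H:%M"), e["value"]) for e in reported]
```

In a live system the heartbeats would come from a timer rather than being computed after the fact; the sketch only shows which events end up being sent.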

You can see in this example how heartbeat events make it possible to lose a meter event and still meter accurately. If event 1 is dropped and never reaches Amberflo, then without heartbeats the customer's consumption of 8 metered units goes untracked until the next event arrives at 11:00am. In effect, the customer would receive 8 units for 2 hours for free. With heartbeat events, a missed event 1 allows only 30 minutes of free usage, since the heartbeat sent at 9:30am with value 8 corrects the state. If any of the heartbeat events between 9:00am and 11:00am is missed, it causes no error, since the other heartbeat events provide built-in redundancy.
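The arithmetic behind this comparison can be sketched as follows. It assumes the meter reads 0 before the first event arrives, so the usage lost when an event is dropped is its value multiplied by the gap until the next event that does arrive; `untracked_unit_hours` and `at` are hypothetical helpers.

```python
from datetime import datetime


def untracked_unit_hours(events, dropped_index):
    """Usage missed when one event is lost: its value times the gap to the next event."""
    dropped_time, value = events[dropped_index]
    next_time, _ = events[dropped_index + 1]
    return value * (next_time - dropped_time).total_seconds() / 3600


def at(hour, minute):
    return datetime(2022, 8, 1, hour, minute)  # arbitrary date for illustration


plain = [(at(9, 0), 8), (at(11, 0), 11), (at(11, 30), 7), (at(11, 50), 0)]
heartbeat = [
    (at(9, 0), 8), (at(9, 30), 8), (at(10, 0), 8), (at(10, 30), 8),
    (at(11, 0), 11), (at(11, 30), 7), (at(11, 50), 0),
]

lost_without = untracked_unit_hours(plain, 0)   # 8 units for 2 hours
lost_with = untracked_unit_hours(heartbeat, 0)  # 8 units for 30 minutes
```

Dropping event 1 costs 16 unit-hours without heartbeats and 4 unit-hours with them, a fourfold reduction that follows directly from the shorter reporting gap.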

As another example, consider event 7, the stop event (event 4 in the list without heartbeats). Let’s first take the case where only start, stop, and updated-value events are sent (no timeout set and no heartbeat events). If that event is missed and never reported to Amberflo, then with no stop event incoming, the customer will be billed as if they consumed those 7 metered units forever.

With a timeout period in place, the customer cannot be erroneously billed ‘forever’ if a stop event is never received. The length of the timeout period is the maximum amount of time for which a customer can be billed erroneously if a stop event is not sent. In the above example, suppose we define a timeout period of 3 hours; then in the worst-case scenario, event 7 is missed and no stop event is forthcoming. At 2:30pm (3 hours after the most recent event, sent at 11:30am), the meter value would be reset to 0 as if a stop event had been received.
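The timeout arithmetic in this scenario can be checked with a short sketch. It assumes the timeout is counted from the most recent event received, as in the example above; `meter_end` is a hypothetical helper and not Amberflo's implementation.

```python
from datetime import datetime, timedelta


def meter_end(received, timeout):
    """Return when metering stops: at a received stop event, else when the timeout expires."""
    last = received[-1]
    if last["value"] == 0:
        return last["time"]  # an explicit stop event arrived
    return last["time"] + timeout  # reset as if a stop event were received


day = (2022, 8, 1)  # arbitrary date for illustration
# Events that actually arrived: the 11:50am stop event was lost.
received = [
    {"time": datetime(*day, 9, 0), "value": 8},
    {"time": datetime(*day, 11, 0), "value": 11},
    {"time": datetime(*day, 11, 30), "value": 7},
]

reset_at = meter_end(received, timedelta(hours=3))
actual_stop = datetime(*day, 11, 50)
overbilled_for = reset_at - actual_stop
```

With a 3-hour timeout the meter resets at 2:30pm, so the customer is over-billed for 2 hours and 40 minutes, which stays within the 3-hour bound.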

A best practice is to employ both a timeout period and heartbeat events. This minimizes the impact of missed meter events and lets you set an upper bound on the potential reporting error of the system. In the above example, with the 3-hour timeout period and 30-minute heartbeat intervals, if event 7 is missed, a heartbeat event with value 0 would arrive at 12:00pm, 30 minutes after the most recent event (event 6). If that event were also missed, additional heartbeats would arrive every 30 minutes until the timeout at 2:30pm, so there would be several chances to correct the missed event, and this approach would result in the lowest error of all the examples we discussed.

Best practices

To tie it all together, we have three key best practices for reporting long-lasting events.

  1. Wherever possible, employ the momentary reporting method with regular heartbeat events. Send an event with the momentary value whenever the usage rate changes, and send heartbeat events every X minutes (with X chosen to fit within the timeout period) to minimize inaccuracy in case of a missed event.
  2. Set a tight global timeout at the meter level. Consider the maximum amount of time for which each resource you are metering might realistically be consumed. This timeout ensures that a missed stop event cannot turn into a long-lasting event that is erroneously recorded as never stopping.
  3. Add a timeout at the event level wherever possible. For each event, it may be possible to calculate a more precise timeout value; for example, you might correlate the event-level timeout for a storage solution to the amount of data being ingested. Larger ingests might have longer timeouts while smaller files would time out more quickly. By monitoring your system over time you can calculate these event-level values for your own resources and usage patterns.