AWS Burst Credits Issue
This document uses the findings from an actual root cause analysis to give a generic view that you can use if you run into the same or a similar problem. It describes the situation, the problem, the analysis, and the result.
Should you have any questions, please get in touch with academy@emagiz.com.
1. Situation
When you develop your integration solution with the eMagiz platform and run it in the AWS cloud, your allotted burst credits on EFS can become depleted. When this happens, you need to analyze the problem, preferably before the credits run out. If you do not do this in time, your interaction with EFS falls back to a baseline throughput that may be too slow for your environment.
2. Problem
The problem, in this case, was that a loop of messages was created, generating incorrect data that the system then processed. This triggered a large amount of interaction with EFS, which depleted the burst credits.
3. Analysis
3.1 Analyze Burst Credits Balance
Should you receive an alert that the burst credit balance is depleting, you can check this in AWS under CloudWatch -> Metrics. In the metrics section, select the "EFS" option and then "File System Metrics." Here you can activate the graph that shows the "BurstCreditBalance."
If this graph goes down continuously, it is time to take action and analyze why the burst credits are depleting. In the remainder of this section, we will discuss this further.
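To judge how urgent a falling graph is, you can extrapolate the trend. The snippet below is a minimal sketch, assuming you have exported a few BurstCreditBalance datapoints (CloudWatch reports this metric in bytes) as timestamp/value pairs. The function name and the sample values are hypothetical, and the straight-line extrapolation is only a rough estimate.

```python
from datetime import datetime, timedelta


def hours_until_depleted(samples):
    """Estimate the hours left until the burst credit balance reaches zero.

    samples: list of (timestamp, balance_in_bytes) tuples, oldest first.
    Returns None when the balance is flat or rising.
    """
    (t_first, b_first), (t_last, b_last) = samples[0], samples[-1]
    elapsed_hours = (t_last - t_first).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("need at least two samples spread over time")
    # Average change in bytes per hour between the first and last sample.
    rate = (b_last - b_first) / elapsed_hours
    if rate >= 0:
        return None
    return b_last / -rate


# Hypothetical datapoints: the balance drops by 0.3 TB every six hours.
start = datetime(2024, 1, 1)
samples = [
    (start, 2_000_000_000_000),
    (start + timedelta(hours=6), 1_700_000_000_000),
    (start + timedelta(hours=12), 1_400_000_000_000),
]
print(hours_until_depleted(samples))
```

A result of None means the balance is stable or recovering. Any other value is an approximate deadline for finishing the analysis.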
3.2 (EFS) Burst Credits Explained
Before we can analyze the problem, it is good to explain how AWS handles reading and writing data from and to EFS (Elastic File System). You are allowed to execute these actions within a specific bandwidth (throughput). This bandwidth depends on the size of your EFS and is evaluated on a per-hour basis.
AWS offers two options for configuring this bandwidth.
- Provisioned throughput
- This option allows you to determine how much bandwidth you will use. Note that the option can only be configured by Support as it involves costs.
- Bursting throughput
- This option gives you a balance (a number of credits) that you can use to temporarily increase your bandwidth when needed. This is the standard method when using eMagiz. When your actual usage is higher than the baseline, the burst credits go down. When it is lower, they go up. The larger the difference between the baseline and your usage, the faster the balance changes.
- For more information, see:
Logically the next steps of inquiry are:
- What is my baseline bandwidth (i.e., throughput)?
- How much bandwidth do I use in my environment?
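The credit mechanics described in this section can be sketched as a simple calculation. This is a minimal illustration under simplifying assumptions: usage is constant over the interval, the balance is expressed in MiB, and the upper limit that AWS places on the balance is left out. The function name and the numbers are hypothetical.

```python
def simulate_credits(balance_mib, baseline_mib_s, usage_mib_s, seconds):
    """Return the burst credit balance (in MiB) after `seconds` of constant usage.

    Usage above the baseline spends credits, usage below it earns credits.
    The balance cannot drop below zero. The AWS maximum balance is not modeled.
    """
    change = (baseline_mib_s - usage_mib_s) * seconds
    return max(0.0, balance_mib + change)


# One hour at 1.5 MiB/s against a 1 MiB/s baseline spends 1800 MiB of credits.
print(simulate_credits(10_000, 1.0, 1.5, 3600))   # 8200.0
# One hour at 0.25 MiB/s against the same baseline earns 2700 MiB of credits.
print(simulate_credits(10_000, 1.0, 0.25, 3600))  # 12700.0
```

The two questions above map directly onto the `baseline_mib_s` and `usage_mib_s` parameters.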
3.3 EFS Burst Performance & Calculation
In the table above, the number of MiB/s for your baseline is shown (depending on the size of your EFS). The size of your EFS is determined dynamically (per hour) but will be relatively low in eMagiz, as we don't store much data on EFS. We can assume the size is below 256 GiB. Based on the table, this means that your default throughput is 0.5 MiB/s. However, this is only partially accurate, as AWS states that the minimal throughput for EFS solutions under 20 GB (the vast majority of eMagiz environments) is set to 1 MiB/s. Therefore it is safe to assume that the baseline of your eMagiz environment is 1 MiB/s.
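As a rough illustration of how the baseline follows from the file system size, the sketch below assumes the bursting rate that AWS documents of 50 KiB/s per GiB of storage (a figure not stated in the text above) and applies the 1 MiB/s minimum mentioned in this section. The function and constant names are hypothetical.

```python
KIB_PER_MIB = 1024
BASELINE_KIB_S_PER_GIB = 50  # assumption: AWS bursting rate per GiB stored
MINIMUM_MIB_S = 1.0          # minimum throughput mentioned in the text


def size_based_throughput_mib_s(size_gib):
    """Throughput in MiB/s that follows from the file system size alone."""
    return size_gib * BASELINE_KIB_S_PER_GIB / KIB_PER_MIB


def baseline_throughput_mib_s(size_gib):
    """Effective baseline in MiB/s after applying the 1 MiB/s minimum."""
    return max(MINIMUM_MIB_S, size_based_throughput_mib_s(size_gib))


print(size_based_throughput_mib_s(10))  # roughly 0.5 MiB/s
print(baseline_throughput_mib_s(10))    # raised to the 1 MiB/s minimum
```

Under these assumptions, any file system smaller than roughly 20 GiB ends up at the 1 MiB/s minimum.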
To determine what the proper EFS size is for you based on the throughput of your environment, you need to execute the following steps:
- Identify your need for bandwidth based on historical usage. To do this, you can navigate to the metric section under CloudWatch and look at the sum of the "TotalIOBytes" over the past fourteen days.
- Click on the "Math Expression" option and select "Start with empty expression."
- Fill in the following expression: (m1/1048576)/PERIOD(m1)
- Make sure the results are displayed in a graph. This graph should give you the throughput in MiB/s. Based on this graph, you can look at the peak performance. In the example below, this is 2.7 MiB/s.
- Should you size the file system for peak performance, it would need to be 60 GiB (6 * 0.5 MiB/s). But as you can see in the following graph, the eMagiz environment is busy for a short amount of time and has little to do for the rest of the time. Therefore the standard setting works for this environment.
- If we compare this with a much busier environment, we see different behavior in both the bandwidth (throughput) and the development of the burst credit balance.
- Based on the above example, a bandwidth of 1.5 MiB/s should be sufficient to stabilize the Burst balance.
- More information on the calculation can be found here:
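The calculation steps above can be sketched in Python as follows. The first function mirrors the CloudWatch expression (m1/1048576)/PERIOD(m1). The second reflects the sizing arithmetic in the 60 GiB example, read here as steps of 10 GiB that each add 0.5 MiB/s of baseline. That step size is an interpretation of the example, and the function names and sample values are hypothetical.

```python
import math

BYTES_PER_MIB = 1048576


def throughput_mib_s(total_io_bytes, period_seconds):
    """Python equivalent of the expression (m1/1048576)/PERIOD(m1)."""
    return (total_io_bytes / BYTES_PER_MIB) / period_seconds


def required_size_gib(target_mib_s, step_mib_s=0.5, step_gib=10):
    """File system size needed for a target baseline throughput.

    Assumes every `step_gib` of storage adds `step_mib_s` of baseline,
    and rounds up to whole steps.
    """
    steps = math.ceil(target_mib_s / step_mib_s)
    return steps * step_gib


# Hypothetical datapoint: 849,346,560 bytes summed over a 300-second period.
peak = throughput_mib_s(849_346_560, 300)
print(round(peak, 2))            # 2.7
print(required_size_gib(peak))   # 60
print(required_size_gib(1.5))    # 30
```

Sizing for the peak gives the 60 GiB from the first example, while the 1.5 MiB/s from the busier environment corresponds to 30 GiB under the same assumption.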
3.4 EFS Burst Performance & Calculation
In case of a structural problem with the throughput being used, additional investigation is warranted. There are several causes for this:
- Excessive logging -> You can check this under CloudWatch -> "Log Groups". By clicking on the cogwheel and selecting the option "Stored bytes" AWS will show you the size of each log. If the number you see seems unexpectedly large, you can dig deeper into the log to see which logs could explain this behavior.
- Excessive 'metadata' -> In eMagiz, you can build your solutions the way you like. There are, however, constructions that take many steps to finish. As a result, a lot of 'metadata' can be added to your message (such as the channels it has traversed). This metadata can even grow larger than your actual message. Read this microlearning for context and possible solutions to fix the issue.
- Repeating messages -> Input is not cleaned up, which means that messages keep coming in. This, in turn, increases traffic and results in a higher throughput on EFS. If this is unnecessary, you can change it to save much bandwidth.
- A high number of messages compared to file system size -> This is something only you, as a user, can address. In this case, there is a lot of data traffic between various systems. You should either increase the file system size or activate provisioned throughput to solve this.
3.5 Change EFS Burst Credits to Provisioned
4. Result
Once the loop was removed, the EFS burst credit balance stabilized again.