I tried AWS DevOps Guru for RDS!

Udaara Jayawardana
5 min read · Aug 11, 2022

Hello Readers! I recently started to explore AWS services that I had not previously worked with, and my enthusiasm for MLOps and familiarity with DevOps culture made AWS DevOps Guru the most intriguing offering to investigate.

The definition is as follows:

Amazon DevOps Guru is a machine learning (ML) powered service that makes it easy to improve an application’s operational performance and availability. DevOps Guru detects behaviours that are different from normal operating patterns so you can identify operational issues long before they impact your customers. DevOps Guru automatically ingests operational data from your AWS applications and provides a single dashboard to visualize issues in your operational data. You can get started with DevOps Guru to improve application availability and reliability with no manual setup or machine learning expertise.

DevOps Guru’s ML models detect abnormalities in application behaviour, such as slowness, increasing error rates, resource constraints, and a variety of other issues that could lead to an outage. It gives insight into each issue and helps you resolve it with as low an MTTR as possible.

To be honest, that sounds too good to be true, so I decided to give it a go. Since DevOps Guru covers a variety of AWS services, I picked DevOps Guru for RDS as my trial. I’ve spent a good amount of time on RDS reliability, which should be extremely useful in assessing DevOps Guru for RDS.

This image has nothing to do with the post. But I reaaaally wanted a ‘Guru’ for my cover image! :)

DevOps Guru (DOG!) for RDS detects anomalies and problematic performance in Amazon Aurora MySQL- and PostgreSQL-compatible instances. According to the AWS documentation, DevOps Guru examines the anomaly to determine the likely causes. These may include:

  • DB wait states or events
  • Memory, IO, and other problematic metrics
  • SQL queries or statements

Once performance bottlenecks or operational issues are identified, DOG displays its findings in the DevOps Guru console and sends notifications so that you can take action before they become customer-impacting outages. This helps you respond to a brewing issue in real time.

DOG for RDS primarily monitors database load; that is, the number of active connections at any one moment. The term active is critical here, since it only counts connections that are actively carrying out a transaction and ignores idle ones. This gives a reasonably accurate assessment of the database’s current state.
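
To make the "active" distinction concrete, here is a minimal Python sketch. The `Session` type and its `state` values are hypothetical, made up for this illustration; they are not part of any Performance Insights or DevOps Guru API. Averaging the active count over several samples is also my own assumption about how to smooth the number.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One database connection in a sample (hypothetical shape)."""
    pid: int
    state: str  # assumed values: "active" or "idle"


def db_load(samples: list[list[Session]]) -> float:
    """Average number of active sessions across the given samples."""
    if not samples:
        return 0.0
    active_counts = [
        sum(1 for s in sample if s.state == "active") for sample in samples
    ]
    return sum(active_counts) / len(active_counts)


# Two samples of the same three connections; idle ones are ignored.
samples = [
    [Session(1, "active"), Session(2, "idle"), Session(3, "active")],
    [Session(1, "active"), Session(2, "idle"), Session(3, "idle")],
]
print(db_load(samples))  # 1.5
```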

You need to enable Performance Insights on your Aurora cluster to use DOG. If you are unfamiliar with this, please refer to this AWS Guide.
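
If you prefer to script it, something like the following should work with boto3. This is only a sketch: the instance identifier in the usage comment is a placeholder, the 7-day retention is my own choice, and I pass the client in as a parameter so the function itself does not need AWS credentials to be imported.

```python
def enable_performance_insights(rds_client, instance_id: str,
                                retention_days: int = 7) -> dict:
    """Turn on Performance Insights for one DB instance.

    `rds_client` is expected to be a boto3 RDS client.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        EnablePerformanceInsights=True,
        PerformanceInsightsRetentionPeriod=retention_days,
        ApplyImmediately=True,
    )


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   enable_performance_insights(boto3.client("rds"), "my-aurora-instance-1")
```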

DOG ingests metrics from RDS. It gathers only numerical metrics and does not need custom SQL commands or other sensitive data. It then monitors the database load metrics for anomalies, comparing historical baselines with current activity to identify unexpected behaviour. DOG for RDS is clever enough to tell anomalies apart from regular database workloads, such as periodic spikes in activity caused by jobs, data ETL reporting, and other scheduled operations.

Once an anomaly is recognised, DOG for RDS analyses the metrics that make it up, noting the most common wait states, SQL operations, and other metrics that correspond with the anomaly. A rule-based algorithm then processes the outcome and generates simple, straightforward explanations and recommendations for resolving the anomaly. You can see these on the DOG console page.
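
The idea of comparing current activity against a historical baseline can be sketched with a plain z-score. To be clear, this is not DevOps Guru's actual model (AWS does not publish it in this form); the function name, the threshold of 3 and the numbers are all made up for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag `current` if it is more than `threshold` standard
    deviations above the mean of the baseline values."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold


# DB load at the same hour on previous days (made-up numbers).
# Comparing like with like keeps a regular scheduled spike
# from being flagged.
history = [4.0, 5.0, 4.5, 5.5, 5.0]
print(is_anomalous(history, 5.2))   # False
print(is_anomalous(history, 40.0))  # True
```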

DevOps Guru for RDS — How it Works

By the way, the RDS Performance Insights console will pop up the typical AWS notification banner if your RDS instance supports DOG.

Turn on DevOps Guru for RDS

Let’s investigate a defective RDS instance. We have an issue where there are over 100 sessions on an RDS instance with 2 vCPUs. This alone should be enough to tell us that something is wrong.
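
A quick back-of-the-envelope check shows why. The helper below is hypothetical and only does the division; the rule of thumb that sustained load above the vCPU count means sessions are queuing is my own framing, based on the Max vCPU line that Performance Insights draws on its load chart.

```python
def sessions_per_vcpu(active_sessions: int, vcpus: int) -> float:
    """Ratio of active sessions to vCPUs; above 1.0 sessions are
    competing for CPU or waiting on something else."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return active_sessions / vcpus


# 100 sessions is the lower bound from the example above.
print(sessions_per_vcpu(100, 2))  # 50.0
```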

This is where a Database Reliability Engineer (DBRE) or a Database Administrator (DBA) may come in to help. So, why use DOG for RDS?

Well, Performance Insights does not notify the engineering team of this problem, so there’s a good chance you’ll overlook it. And once you do notice it, you still have to resolve it. If you are unfamiliar with the DBRE/DBA domains, you may have difficulty doing so. This is where DOG comes in. It not only detects these types of issues, but also alerts the team via SNS and gives a diagnosis and advice on how to resolve them.

The Insights page of the DOG console displays the various anomalies it has discovered. By clicking on an anomaly, you can get a detailed summary of it. Then, in its Aggregated Metrics section, you can get a detailed analysis of the issue by clicking View analysis.
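
Wiring up the SNS alert can also be scripted. The sketch below assumes you already have an SNS topic and a boto3 `devops-guru` client; the topic ARN in the usage comment is a placeholder, and the function name is my own.

```python
def add_sns_channel(devops_guru_client, topic_arn: str) -> str:
    """Register an SNS topic as a DevOps Guru notification channel
    and return the ID of the new channel."""
    response = devops_guru_client.add_notification_channel(
        Config={"Sns": {"TopicArn": topic_arn}}
    )
    return response["Id"]


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   client = boto3.client("devops-guru")
#   add_sns_channel(client, "arn:aws:sns:us-east-1:123456789012:my-topic")
```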
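
The same list can be pulled outside the console. This is a rough sketch that assumes the RDS anomaly shows up as an ongoing reactive insight and that a single page of results is enough (it ignores `NextToken` pagination); the function name is hypothetical.

```python
def ongoing_reactive_insights(devops_guru_client) -> list[tuple[str, str]]:
    """Return (name, severity) pairs for ongoing reactive insights.

    `devops_guru_client` is expected to be a boto3 devops-guru client.
    """
    response = devops_guru_client.list_insights(
        StatusFilter={"Ongoing": {"Type": "REACTIVE"}}
    )
    return [
        (insight["Name"], insight["Severity"])
        for insight in response.get("ReactiveInsights", [])
    ]


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   for name, severity in ongoing_reactive_insights(boto3.client("devops-guru")):
#       print(severity, name)
```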

RDS Performance Anomaly

In this view, you can see the anomaly, its duration and a very friendly update on what happened there :)

The Analysis and Recommendations panel gives you a drilled-down analysis of the issue, along with insights and recommendations on what you can do to resolve it.

I really liked its simple explanations (especially of the metrics).

DevOps Guru for RDS — Analysis and Recommendations

Overall, I really liked DevOps Guru for RDS. It definitely helps identify abnormalities that may cause outages, and its incredibly clear explanations and insights into correcting the issue are really beneficial. DevOps Guru for RDS is a must-have tool for your account if you are new to the DBRE domain yet want to keep your RDS healthy!

Udaara Jayawardana

A DevOps Engineer who specialises in the design and implementation of AWS and Containerized Infrastructure.