It was 4:45 PM and I was packing up my laptop to leave the office when my desk phone rang. That’s seldom a good thing.
One of the applications teams got an alert when one of their development servers crashed after the swap partition filled up. They wanted me to investigate and go to their standup the next morning with the results.
I looked into it, went home, and showed up to their meeting early the next morning.
“Well, the swap filling up is a red herring. This machine has 512 GB of memory and 48 cores. You have a memory leak in your application.”
They wanted proof.
We don’t have good tooling in place to detect problems in our development environment, so I wrote a little Bash script that collected memory usage stats on the top 10 memory consumers every 5 minutes and let it run for several days. I ran the data through some Python tooling and made a simple graph showing memory utilization for one particular process growing over time, then sent it to the application team.
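For the curious, here is a rough sketch of that kind of collector. It is not the original Bash script, just a minimal Python approximation of the same idea. It assumes a `ps` that accepts `-eo pid,rss,comm` (true of Linux procps, where RSS is reported in KiB). The function names and the `memstats.csv` path are made up for illustration.

```python
import csv
import subprocess
import time


def top_memory_consumers(ps_output, n=10):
    """Parse `ps -eo pid,rss,comm` text and return the n largest
    (pid, rss_kib, command) tuples, biggest resident set first."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that does not look like a process row.
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


def sample(n=10):
    """Take one snapshot of the current top-n memory consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    return top_memory_consumers(result.stdout, n)


def collect(path="memstats.csv", interval_s=300, n=10):
    """Append a timestamped snapshot to a CSV file every interval_s seconds.
    Runs until interrupted. The path and interval are illustrative defaults."""
    while True:
        timestamp = int(time.time())
        with open(path, "a", newline="") as handle:
            writer = csv.writer(handle)
            for pid, rss_kib, command in sample(n):
                writer.writerow([timestamp, pid, rss_kib, command])
        time.sleep(interval_s)
```

Start `collect()` in a `screen` or `tmux` session, let it run for a few days, and you end up with a CSV that any plotting tool can turn into a graph.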
“Oh, yeah, we recently deployed some new code. We’ll revert it.”
And problem solved with just a little data and a simple graph.
Can we do better?
Yes. Monitoring that sends an alert when a threshold is reached isn’t enough. You want tooling that shows trends and does anomaly detection for everything in your stack.
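As a small illustration of the difference between a threshold and a trend, here is a minimal sketch that fits a straight line to a process's memory samples and flags steady growth. It is a plain least-squares slope, not real anomaly detection, and the function names and the 1024 KiB per hour cutoff are arbitrary assumptions.

```python
def growth_rate(samples):
    """Least-squares slope of (timestamp_s, rss_kib) samples, in KiB per second."""
    count = len(samples)
    if count < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / count
    mean_m = sum(m for _, m in samples) / count
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp, so no trend can be fitted
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def is_leaking(samples, min_kib_per_hour=1024.0):
    """Flag a process whose memory grows at least min_kib_per_hour on average.
    The default cutoff is a hypothetical value, not a recommendation."""
    return growth_rate(samples) * 3600 >= min_kib_per_hour
```

A process can sit far below any alert threshold and still trip a check like this, which is the whole point.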
You want to find problems before they’re problems.