July 3, 2018

How I Used CheckItOn.Us to Monitor My Internal Services

A technical tutorial demonstrating how to create custom monitoring scripts for database replication health.

I built CheckItOn.Us as an uptime monitoring tool. Hit an endpoint, check if it responds, alert if it doesn't. Standard stuff.

Then I realized I had a bigger problem than website uptime.

Things Were Breaking Silently

Database replication was falling behind and nobody knew until queries started returning stale data. Backups were failing overnight and I'd only find out when I actually needed one. Disk space would fill up on a weekend and by Monday morning the app was throwing errors that had nothing obvious to do with storage.

The pattern was always the same. Something breaks silently. Time passes. Someone notices a symptom. I trace it back to the root cause. The fix takes five minutes. The not-knowing-about-it cost hours.

I needed a way to know about these things when they happened, not when the damage was already done.

The Idea

CheckItOn.Us already knew how to hit an HTTP endpoint and check the response. So what if I exposed health check endpoints for the things I actually cared about?

Not complex. Not an agent running on every server. Just simple HTTP endpoints that return a status, and a monitoring tool that knows how to check them.

The Endpoints

I set up lightweight health check routes on each service. Each one checks one thing and returns a clear answer.

Database replication: query the replica, compare its position to the primary, return the lag in seconds. If it's under a threshold, healthy. If not, unhealthy.
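A minimal sketch of that check in Python. It assumes the position on each side can be read as a timestamp (the last commit on the primary, the last replayed change on the replica); how those values are fetched depends on the database and is left out. The function names and the 30-second default are illustrative, not the actual code behind the endpoint.

```python
from datetime import datetime

# Assumed default; tune it to how much staleness the application tolerates.
MAX_LAG_SECONDS = 30.0


def replication_lag_seconds(primary_commit_at: datetime, replica_replay_at: datetime) -> float:
    """Return how far the replica trails the primary, in seconds (never negative)."""
    lag = (primary_commit_at - replica_replay_at).total_seconds()
    return max(lag, 0.0)


def replication_status(lag_seconds: float, max_lag: float = MAX_LAG_SECONDS) -> dict:
    """Classify the lag. Lag exactly at the threshold counts as healthy here (an assumption)."""
    healthy = lag_seconds <= max_lag
    return {
        "status": "healthy" if healthy else "unhealthy",
        "message": f"replication lag {lag_seconds:.1f}s (threshold {max_lag:.0f}s)",
    }
```

Clamping the lag at zero keeps small clock differences between the two servers from producing a negative number.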

Backup status: check when the last successful backup completed. If it's within the expected window, healthy. If the last backup is older than it should be, unhealthy.
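The backup check comes down to comparing one timestamp against a window. A rough sketch, assuming the backup job records the time of its last successful run somewhere the endpoint can read it; the 26-hour window (a nightly backup plus some slack) and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed window: a nightly backup plus two hours of slack.
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_status(
    last_success_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_BACKUP_AGE,
) -> dict:
    """Healthy if the last successful backup finished within max_age of now."""
    if last_success_at is None:
        # No successful backup has ever been recorded.
        return {"status": "unhealthy", "message": "no successful backup on record"}
    age = now - last_success_at
    hours = age.total_seconds() / 3600
    status = "healthy" if age <= max_age else "unhealthy"
    return {"status": status, "message": f"last successful backup {hours:.1f}h ago"}
```

Passing `now` in as an argument, rather than reading the clock inside the function, makes the check easy to test with fixed dates.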

Disk space: check the percentage used on the volumes that matter. Under 85%, healthy. Between 85% and 95%, warning. Over 95%, unhealthy.
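The disk thresholds map onto a few lines of Python using the standard library's `shutil.disk_usage`. This is a sketch, not the code from the actual endpoint: the function names are made up, and treating a reading of exactly 85% or 95% as the lower bucket is an assumption.

```python
import shutil

WARNING_PERCENT = 85.0
UNHEALTHY_PERCENT = 95.0


def classify_disk(percent_used: float) -> str:
    """Map a usage percentage onto healthy / warning / unhealthy."""
    if percent_used > UNHEALTHY_PERCENT:
        return "unhealthy"
    if percent_used > WARNING_PERCENT:
        return "warning"
    return "healthy"


def disk_status(path: str = "/") -> dict:
    """Measure usage of the volume containing `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent_used = usage.used / usage.total * 100
    return {
        "status": classify_disk(percent_used),
        "message": f"{path} is {percent_used:.1f}% full",
    }
```

Splitting the classification from the measurement means the thresholds can be checked without needing a disk that is actually full.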

Queue depth: check the job queue. If it's processing normally, healthy. If jobs are piling up faster than they're being worked, unhealthy.
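One way to turn "piling up faster than they're being worked" into a yes-or-no answer is to compare how many jobs arrived and how many finished over the same recent window. The sketch below assumes the queue can report those two counts plus its current depth; the parameter names and the `min_depth` floor (which keeps a tiny queue from tripping the check) are assumptions.

```python
def queue_status(enqueued: int, completed: int, depth: int, min_depth: int = 100) -> dict:
    """Classify the job queue from counts taken over one recent window.

    enqueued  -- jobs added during the window
    completed -- jobs finished during the window
    depth     -- jobs waiting right now
    min_depth -- below this depth the queue is treated as healthy regardless
    """
    falling_behind = enqueued > completed and depth >= min_depth
    status = "unhealthy" if falling_behind else "healthy"
    return {
        "status": status,
        "message": f"{depth} jobs waiting, {enqueued} in / {completed} out this window",
    }
```

A deep queue that is draining (more jobs completed than enqueued) still reads as healthy here, which matches the idea of alerting on the trend rather than on a raw number.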

Each endpoint returns a simple JSON response: a status, a message, and a timestamp. CheckItOn.Us hits each one on a schedule and alerts me when a check leaves the healthy state.
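A response with those three fields could be built like this. The field names and the ISO 8601 UTC timestamp format are assumptions; the post only says the response carries a status, a message, and a timestamp.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def health_response(status: str, message: str, now: Optional[datetime] = None) -> str:
    """Serialize one health check result as a JSON string."""
    if now is None:
        now = datetime.now(timezone.utc)
    return json.dumps(
        {
            "status": status,
            "message": message,
            "timestamp": now.isoformat(),  # e.g. 2018-07-03T04:00:00+00:00
        }
    )
```

For example, `health_response("unhealthy", "replication lag 45.0s")` yields a body a monitor can parse, and a person can also read it directly in a browser.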

What Changed

Before: I'd find out about problems when users reported symptoms or when I stumbled across them myself. Could be hours. Could be days.

After: I know within minutes. Replication falls behind by more than 30 seconds, I get an alert. Backup doesn't complete by 4am, I get an alert. Disk crosses 85%, I get an alert.
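On the monitoring side, the useful behavior is alerting once when a check leaves the healthy state, not on every poll while it stays broken. The sketch below illustrates that idea only; it is not how CheckItOn.Us is implemented, and the class name, the check names, and the choice to alert on a first-ever bad reading are all assumptions.

```python
from typing import Dict


class TransitionTracker:
    """Remember the last status per check and flag moves out of 'healthy'."""

    def __init__(self) -> None:
        self._last: Dict[str, str] = {}

    def observe(self, check_name: str, status: str) -> bool:
        """Record a new reading; return True if it should trigger an alert."""
        previous = self._last.get(check_name)
        self._last[check_name] = status
        # Alert only when the previous reading was healthy (or there was none)
        # and the new one is not. Repeated bad readings stay quiet.
        return status != "healthy" and previous in (None, "healthy")
```

With this rule a check that goes healthy, unhealthy, unhealthy, healthy, warning produces two alerts: one at the first unhealthy reading and one at the warning.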

The fixes are almost always quick. The expensive part was never the fixing. It was the not-knowing.

Why This Approach Worked

No agents to install. No software on the monitored servers beyond the health check endpoints I wrote myself. No vendor lock-in to a monitoring platform that wants to run a daemon on every machine.

The health check endpoints are just routes in the application. A few lines of code each. They check what I tell them to check and return a simple response. If I move to a different monitoring tool tomorrow, the endpoints still work. Any tool that can hit an HTTP endpoint can use them.
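To show how little is involved, here is a self-contained sketch using only Python's standard library `http.server`. In a real application these would be routes in whatever web framework the service already uses; the `/health/ping` path, the `CHECKS` registry, and returning HTTP 503 for anything other than healthy are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical registry: URL path -> function returning (status, message).
CHECKS: Dict[str, Callable[[], Tuple[str, str]]] = {
    "/health/ping": lambda: ("healthy", "service is up"),
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        check = CHECKS.get(self.path)
        if check is None:
            self.send_error(404, "unknown health check")
            return
        status, message = check()
        body = json.dumps(
            {
                "status": status,
                "message": message,
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
        ).encode("utf-8")
        # 200 when healthy, 503 otherwise, so tools that only look at the
        # status code can still tell the difference (an assumption).
        self.send_response(200 if status == "healthy" else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        pass  # keep polling traffic out of stderr


def make_server(host: str = "127.0.0.1", port: int = 8080) -> ThreadingHTTPServer:
    """Build the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Adding a new check is one more entry in `CHECKS`, and starting the server is `make_server().serve_forever()`. Nothing here depends on which monitoring tool does the polling.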

Cloud, on-premises, hybrid - doesn't matter. If it can serve an HTTP response, I can monitor it.

The Takeaway

The monitoring tools I'd looked at before CheckItOn.Us were either too simple (just ping a URL) or too heavy (install our agent on every server, give us SSH access, here's your $500/month bill). I wanted something in the middle. Hit my custom endpoints, check the response, alert me if something's wrong.

Building the health check endpoints took maybe an afternoon. Setting up the monitors in CheckItOn.Us took less than that. And it caught the next replication lag before anyone noticed.

That's the whole point. Knowing before anyone notices.

