Everything That Broke Along the Way

The commit history of this project tells a story. Not the one I planned. The one where every assumption I made about how backups work got corrected by production.

This post is the bugs. The ones that looked like working features until they weren't.

Silent Failures

The first version of the mysqldump wrapper redirected stderr to stdout with 2>&1. Seemed fine. Except when mysqldump fails, it writes the error message to stderr, and the redirect merged that into stdout, which means it ended up inside the SQL dump file. The backup "succeeded." The file had content. It just started with mysqldump: Got error: Access denied instead of CREATE TABLE.

Fix: capture stderr to a separate temp file, read it after the dump finishes, fail explicitly if it contains errors. Then I discovered that reading a temp file from a second SSH command was unreliable too - sometimes the file hadn't flushed yet. Final fix: echo stderr inline in the same command output and parse it with regex.
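A minimal local sketch of the idea, not the wrapper from this project: keep stdout and stderr apart, then fail on either a non-zero exit code or an error-looking stderr. The names `run_dump`, `DumpError`, and `ERROR_PATTERN` are hypothetical, and the regex is only a guess at what such a check might match. The real tool runs the dump over SSH and parses inline output instead.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical pattern; a real wrapper would tune this to the errors it has seen.
ERROR_PATTERN = re.compile(r"(Got error|Access denied|error:)", re.IGNORECASE)


class DumpError(RuntimeError):
    """Raised when the dump command reports a failure."""


def run_dump(cmd, out_path):
    """Run cmd, writing stdout to out_path and keeping stderr separate."""
    with open(out_path, "wb") as out:
        # stderr gets its own pipe, so it can never land inside the dump file.
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    stderr = proc.stderr.decode(errors="replace")
    if proc.returncode != 0 or ERROR_PATTERN.search(stderr):
        Path(out_path).unlink(missing_ok=True)  # do not keep a bogus "backup"
        raise DumpError(f"exit {proc.returncode}: {stderr.strip()}")
    return out_path
```

Checking both the exit code and the stderr text is deliberately redundant here: either signal alone should be enough to fail the backup.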

Three commits to get error capture right. For mysqldump. A tool that's been around for decades.

The Compression Corruption

The compression step used a fallback pattern:

pigz file.sql || gzip file.sql

If pigz failed partway through, it left a partial .gz file. Then gzip ran on the already-partially-compressed file and produced a corrupt double-compressed mess. The backup completed. The file had a reasonable size. It just couldn't be decompressed.

Fix: don't use || for fallback. Check whether pigz exists first, then run one or the other. If the chosen tool fails, fail the backup.
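The fix can be sketched roughly like this, assuming the compressors are run as local subprocesses. `choose_compressor` and `compress` are hypothetical names, and the `which` parameter exists only so the selection logic can be exercised without either tool installed.

```python
import shutil
import subprocess


def choose_compressor(which=shutil.which):
    """Pick exactly one compressor up front: pigz if installed, otherwise gzip."""
    for tool in ("pigz", "gzip"):
        if which(tool):
            return tool
    raise RuntimeError("neither pigz nor gzip is installed")


def compress(path, which=shutil.which):
    """Compress path in place with the chosen tool; any failure fails the backup."""
    tool = choose_compressor(which)
    # check=True raises CalledProcessError on a non-zero exit. There is
    # deliberately no second attempt with the other tool.
    subprocess.run([tool, path], check=True)
    return path + ".gz"
```

The point of the design is that the decision happens before any bytes are written, so a mid-run failure can never hand a half-written file to a second tool.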

Concurrent Backup Races

The scheduler runs every minute. If a backup takes longer than a minute, the next scheduler run would create a new backup record for the same database and dispatch a second job. Now you have two backups for the same database running simultaneously, writing to the same temp files on the remote server.

The result: corrupted files, random failures, and backups that contained fragments of two different dumps.

Fix: two layers. First, the scheduler checks for existing pending or running backups before creating a new one. Second, the job itself acquires a cache lock keyed to the server and database. If the lock is held, it skips silently.
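The second layer might look something like the sketch below. It swaps the cache lock for an atomically created lock file, which is an assumption made to keep the example self-contained; `backup_lock` and `lock_path` are hypothetical names. Unlike a cache lock with an expiry, this version has no timeout, so a crashed job would leave a stale lock behind.

```python
import os
import tempfile
from contextlib import contextmanager


def lock_path(server, database, lock_dir=None):
    """Build a lock file name keyed to the server and database."""
    lock_dir = lock_dir or tempfile.gettempdir()
    return os.path.join(lock_dir, f"backup-{server}-{database}.lock")


@contextmanager
def backup_lock(server, database, lock_dir=None):
    """Yield True if this caller holds the lock, False if someone else does."""
    path = lock_path(server, database, lock_dir)
    try:
        # O_EXCL makes creation atomic: only one process can create the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False  # another backup for this server/database is running
        return
    try:
        os.close(fd)
        yield True
    finally:
        os.remove(path)  # release the lock even if the backup raised
```

A job would wrap its work in `with backup_lock(server, database) as acquired:` and return early when `acquired` is False.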

Truncated Downloads

phpseclib's SFTP has a default timeout. For small files, no problem. For an 800MB database dump, the download would silently truncate at whatever it managed to transfer before the timeout hit. The file existed. It had content. It was just missing the last 200MB.

The backup completed successfully because nothing checked the file size after download.

flowchart LR
    A[Download via SFTP] --> B{Compare local vs remote size}
    B -->|Match| C[Continue to upload]
    B -->|Mismatch| D[Retry up to 3 times]
    D -->|Still failing| E[Fallback to scp]
    E --> B

Fix: set the SFTP timeout to unlimited, compare local file size to remote file size after every download, and retry with reconnection up to three times. If SFTP keeps failing, fall back to the system scp, which handles large transfers more reliably.
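The verify-and-retry loop is independent of the transfer library, so it can be sketched with plain callables. Everything named here (`fetch_verified`, `size_matches`, `TransferError`, and the two download callables) is hypothetical; the actual phpseclib and scp calls are not shown.

```python
import os


class TransferError(RuntimeError):
    """Raised when no transfer method produced a complete file."""


def size_matches(local_path, remote_size):
    """True only if the local file exists and has exactly remote_size bytes."""
    return os.path.exists(local_path) and os.path.getsize(local_path) == remote_size


def fetch_verified(local_path, remote_size, sftp_download, scp_download, retries=3):
    """Download with size verification, retrying SFTP before falling back to scp.

    sftp_download and scp_download are assumed callables that write the remote
    file to local_path; each call is expected to reconnect as needed.
    """
    for _ in range(retries):
        try:
            sftp_download(local_path)
        except OSError:
            continue  # treat a dropped connection like a failed attempt
        if size_matches(local_path, remote_size):
            return "sftp"
    scp_download(local_path)
    if size_matches(local_path, remote_size):
        return "scp"
    raise TransferError(f"{local_path}: size never matched {remote_size} bytes")
```

A size comparison only catches truncation. Comparing checksums would also catch corrupted content, at the cost of reading the file on both ends.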

Stale Backup Detection v1, v2, and v3

Version 1 used a flat 48-hour window. If a database hadn't been backed up in 48 hours, it was stale. This meant every weekly backup schedule triggered stale alerts five days out of seven.

Version 2 used a 7-day lookback window with expected-vs-actual counting. Better, but newly created schedules would look backwards in time before they existed and flag themselves as having "missed" 14 backups.

Version 3 uses the cron expression to determine when the last backup should have run. It only checks consecutive missed cycles from the most recent one backward. A weekly schedule doesn't alert until it's actually missed a week. A new schedule doesn't count cycles from before it was created.
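The counting rule in version 3 can be sketched as below. Parsing the cron expression is left out: the sketch assumes the expected run times have already been computed from it by some cron library, and `consecutive_missed` is a hypothetical name, not the function in the tool.

```python
from datetime import datetime
from typing import Iterable


def consecutive_missed(
    expected_runs: Iterable[datetime],
    backup_times: Iterable[datetime],
    created_at: datetime,
    now: datetime,
) -> int:
    """Count missed cycles, newest first, stopping at the first satisfied one."""
    backups = list(backup_times)
    # Ignore cycles from before the schedule existed and cycles not yet due.
    due = sorted(t for t in expected_runs if created_at <= t <= now)
    missed = 0
    window_end = now
    for run_at in reversed(due):
        # A cycle is satisfied if any backup landed between its due time
        # and the start of the next cycle (or now, for the newest cycle).
        if any(run_at <= b <= window_end for b in backups):
            break
        missed += 1
        window_end = run_at
    return missed
```

This version has no grace period, so a backup that is due and still running counts as missed until it finishes. A real check would probably allow some slack.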

Three versions of the same feature to get stale detection right.

The Retention Cleanup Bug

Retention was supposed to respect per-schedule overrides. A schedule could say "keep backups for 90 days" even if the server default was 30. The cleanup job ignored the schedule-level setting and used the server default for everything. So 90-day backups were getting deleted after 30.

The fix was straightforward, but the discovery wasn't. I noticed it because a client asked me to pull a backup from six weeks ago and it was already gone.

Same cleanup job had another bug: it would hard-delete backup records from the database. Seemed fine until I wanted to know why a backup was missing. Now it soft-deletes with a pruned_at timestamp so the history stays.

And one more: the cleanup would delete the last backup of a database if it was past retention. That means if a database stopped being backed up (schedule disabled, server offline), eventually all its backups would be pruned. Now it checks: is there at least one recent backup before deleting old ones? If not, keep the last one.
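Put together, the three retention fixes might look like this sketch. The `Backup` fields and the `prune` function are assumptions for illustration, not the tool's schema. The "keep the last one" rule is implemented here as "never prune the newest backup of a database", which is one reasonable reading of it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Backup:
    database: str
    created_at: datetime
    retention_days: Optional[int] = None  # schedule-level override, if any
    pruned_at: Optional[datetime] = None  # set instead of deleting the row


def prune(backups, server_default_days, now):
    """Soft-delete expired backups, always keeping the newest one per database."""
    by_db = {}
    for b in backups:
        if b.pruned_at is None:
            by_db.setdefault(b.database, []).append(b)
    pruned = []
    for items in by_db.values():
        items.sort(key=lambda b: b.created_at)
        newest = items[-1]
        for b in items:
            # The schedule override wins; the server default is only a fallback.
            days = b.retention_days if b.retention_days is not None else server_default_days
            expired = b.created_at < now - timedelta(days=days)
            if expired and b is not newest:
                b.pruned_at = now  # soft delete: the record stays for history
                pruned.append(b)
    return pruned
```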

.my.cnf Interference

Some servers use .my.cnf for MySQL credentials instead of passing them explicitly. When I added the --master-data flag for replication tracking, it required the RELOAD privilege - which the .my.cnf user didn't have. mysqldump would fail with exit code 5.

The error was captured correctly (thanks to the three commits from earlier), but it took a while to figure out why only some servers were failing. The fix: only add replication flags when explicit credentials are configured.
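As a rough illustration of that rule, here is a sketch of an argument builder that only adds the replication flag when a user is passed explicitly. `build_dump_args` is a hypothetical helper, and password handling is left out on purpose.

```python
def build_dump_args(database, user=None, track_replication=True):
    """Build a mysqldump argument list.

    With no explicit user, mysqldump falls back to .my.cnf, whose account may
    lack the RELOAD privilege, so the replication flag is left out in that case.
    """
    args = ["mysqldump"]
    explicit = user is not None
    if explicit:
        args.append(f"--user={user}")
    if track_replication and explicit:
        args.append("--master-data")
    args.append(database)
    return args
```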

Remote Temp File Cleanup

When a backup failed, the temp file on the remote server stayed behind. One bad day with ten failures would leave ten orphaned dump files in /tmp. On a server with limited disk space, this compounds.

Fix: attempt cleanup in the catch block too, not just the success path. Log whether it worked. If it didn't, at least I know the file is there.

What I Learned

Every one of these bugs has the same shape. Something that works in the simple case - one server, small files, everything healthy - breaks when you add scale, variety, or failure to the mix.

The tool is reliable now. Not because I designed it to be reliable from the start. Because production found every assumption I made and proved it wrong, one commit at a time.


That's the series. Four posts covering why I built it, how the architecture works, why restores are the hard part, and what broke along the way. The tool runs across eight servers, handles multiple database engines, replicates to four storage destinations, and verifies its own backups automatically. Most days I just glance at the dashboard and move on.

That's the whole point.


Part of the Laravel backup system series.
