The Pipeline Strategy Pattern
The first version had a big switch statement. Every strategy duplicated the compression logic. Then I realized what I actually had were composable steps.
The first version of this backup tool had a big switch statement. MySQL server? Run mysqldump and compress. PostgreSQL? Run pg_dump and compress. DirectAdmin server? Download the file.
Every strategy duplicated the compression logic. Every strategy duplicated the error handling around SSH. When I wanted to add "MySQL without compression" as an option, I copied the MySQL strategy and deleted the compression part. That's when I knew the architecture was wrong.
What I actually had were steps. Dump the database. Compress the file. Download an existing file. Each step is independent. The difference between strategies is which steps you chain together, not how each step works.
Steps
Every step gets a context object, does one thing to it, and hands it back. The key piece of state is the remote file path - where the file currently lives on the server. Each step reads it, does its work, and updates it to point at whatever it produced.
The mysqldump step SSHes into the server, runs the dump, and the context now points at a .sql file. The compression step reads whatever file the context points at, compresses it with pigz (falling back to gzip if pigz isn't installed), and the context now points at a .gz file. It doesn't know or care what it's compressing. SQL dump, tar archive, whatever.
The download-existing step doesn't create anything. It checks that a file exists at the expected path on the remote server and points the context at it. That's the DirectAdmin case - the backup already exists in a folder, I just need to pick it up. It also flags the file so the cleanup step doesn't delete it off the remote server afterward.
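Here's a minimal sketch of what two of those steps could look like. It's in Python for brevity, not the tool's actual code, and every name in it (BackupContext, CompressStep, DownloadExistingStep, the run callable that stands in for "execute this command over SSH") is made up for illustration.

```python
import shlex
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stand-in for SSH: takes a shell command, returns its exit code.
RemoteRunner = Callable[[str], int]


@dataclass
class BackupContext:
    run: RemoteRunner
    remote_path: Optional[str] = None  # where the file currently lives on the server
    keep_remote: bool = False          # True when cleanup must leave the file alone
    metadata: dict = field(default_factory=dict)


class CompressStep:
    """Compress whatever the context points at; prefer pigz, fall back to gzip."""

    def run(self, ctx: BackupContext) -> BackupContext:
        if ctx.remote_path is None:
            raise RuntimeError("nothing to compress")
        has_pigz = ctx.run("command -v pigz >/dev/null 2>&1") == 0
        tool = "pigz" if has_pigz else "gzip"
        if ctx.run(f"{tool} -f {shlex.quote(ctx.remote_path)}") != 0:
            raise RuntimeError(f"{tool} failed on {ctx.remote_path}")
        ctx.remote_path += ".gz"  # both tools replace the file with <name>.gz
        return ctx


class DownloadExistingStep:
    """Point the context at a file that already exists on the server."""

    def __init__(self, path: str) -> None:
        self.path = path

    def run(self, ctx: BackupContext) -> BackupContext:
        if ctx.run(f"test -f {shlex.quote(self.path)}") != 0:
            raise FileNotFoundError(self.path)
        ctx.remote_path = self.path
        ctx.keep_remote = True  # not ours to delete afterward
        return ctx
```

Because the runner is injected, the steps can be exercised with a fake function in tests without touching a real server.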
Strategies Are Just Step Lists
A strategy is a name and an ordered list of steps. That's it.
The runner loops through the steps in order, passing the same context to each one. After the last step, the context points at the final file on the remote server - compressed, ready to download.
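As a rough illustration of "a name and an ordered list of steps", here's a sketch in Python. The step bodies are placeholders that only update the path (the real ones would do work over SSH), and the pipeline names and the PIPELINES registry are assumptions, not the tool's actual identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Context:
    remote_path: Optional[str] = None


Step = Callable[[Context], Context]


# Placeholder steps: each does one thing and updates the remote path.
def mysqldump(ctx: Context) -> Context:
    ctx.remote_path = "/tmp/backup.sql"
    return ctx


def tar_directory(ctx: Context) -> Context:
    ctx.remote_path = "/tmp/backup.tar"
    return ctx


def compress(ctx: Context) -> Context:
    ctx.remote_path = f"{ctx.remote_path}.gz"
    return ctx


# A strategy is just a name mapped to an ordered list of steps.
PIPELINES: Dict[str, List[Step]] = {
    "mysql": [mysqldump, compress],
    "mysql_uncompressed": [mysqldump],
    "directory": [tar_directory, compress],
}


def run_pipeline(name: str, ctx: Optional[Context] = None) -> Context:
    """Pass the same context through every step, in order."""
    if ctx is None:
        ctx = Context()
    for step in PIPELINES[name]:
        ctx = step(ctx)
    return ctx
```

In this shape, "MySQL without compression" is a one-line registry entry instead of a copied strategy.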
Adding a new backup type means writing one step and registering one pipeline. When I needed directory backups, I wrote a tar step and paired it with the compression step that already existed. No duplication.
The Full Backup Lifecycle
The pipeline handles the remote work - creating the backup file on the server. But there's a whole lifecycle around it.
Each backup gets progress tracking through the whole lifecycle. The pipeline steps map to 5-40% progress, the download to 45-80%, and the upload to storage to 80-95%. The dashboard shows a progress bar for running backups so I can see where things are.
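One way to turn those ranges into a single number is to interpolate within each phase. This is a sketch under that assumption; the post only gives the ranges, so the linear mapping, the PHASES table, and the function names are mine.

```python
# Progress ranges from the text: phase name -> (start %, end %).
PHASES = {
    "pipeline": (5, 40),
    "download": (45, 80),
    "upload": (80, 95),
}


def overall_progress(phase: str, fraction: float) -> int:
    """Map a 0.0-1.0 fraction within a phase to an overall percentage."""
    start, end = PHASES[phase]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp bad input
    return int(start + (end - start) * fraction)


def pipeline_progress(steps_done: int, step_count: int) -> int:
    """Spread the pipeline's range evenly across its steps."""
    return overall_progress("pipeline", steps_done / step_count)
```

For a two-step pipeline, finishing the first step lands at 22% and finishing the second at 40%.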
The Download Problem
This is the part that took the most iteration. Downloading large backup files over SSH is unreliable.
phpseclib's SFTP consistently drops connections on files over 800 MB. The first version would just fail, and I'd see a failed backup in the dashboard. Not great when the failure is transient.
Now it retries SFTP three times, reconnecting between attempts. If all three fail, it falls back to system scp. For servers with bandwidth limits (slow connections where I don't want to saturate the link), it skips SFTP and uses rate-limited scp from the start.
After every download, it compares the local file size to the remote file size. A mismatch means a truncated download, and that's a failure - not a completed backup with a corrupt file.
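The retry, fallback, and size check fit in one small function. This is a Python sketch, not the tool's code: the transfer functions are passed in as callables, and the names (download_with_fallback, DownloadError) and the choice to treat OSError as "connection dropped" are assumptions for illustration.

```python
import os
from typing import Callable


class DownloadError(Exception):
    """Raised when the local copy does not match the remote file."""


def download_with_fallback(
    sftp_get: Callable[[], None],   # hypothetical: one SFTP download attempt
    scp_get: Callable[[], None],    # hypothetical: download via system scp
    reconnect: Callable[[], None],  # hypothetical: re-open the SSH session
    remote_size: int,
    local_path: str,
    sftp_attempts: int = 3,
    rate_limited: bool = False,
) -> None:
    if rate_limited:
        # Bandwidth-limited servers go straight to scp.
        scp_get()
    else:
        for attempt in range(1, sftp_attempts + 1):
            try:
                sftp_get()
                break
            except OSError:
                if attempt < sftp_attempts:
                    reconnect()
        else:
            # Every SFTP attempt failed: fall back to scp.
            scp_get()

    # A size mismatch means a truncated download, which is a failure.
    local_size = os.path.getsize(local_path)
    if local_size != remote_size:
        raise DownloadError(
            f"size mismatch: local {local_size} bytes, remote {remote_size} bytes"
        )
```

The size check runs no matter which transport delivered the file, so a truncated scp download is caught the same way as a truncated SFTP one.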
Storage and Replication
Once the file is downloaded, it goes to primary storage. Then the replication kicks in.
Each server has storage destinations assigned to it. The first one gets the file inline during the backup. The rest get it via background replication jobs that copy to each destination independently. If one destination is slow or temporarily unreachable, the others aren't blocked.
Each destination tracks its own copies - pending, completed, or failed. Each destination can also have its own retention: the NAS might keep 30 days, while rsync.net keeps a year.
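Here's a sketch of how per-destination tracking could look, in Python. A thread pool stands in for the background replication jobs, which is an assumption (the real system presumably uses a job queue), and the upload callable, CopyStatus, and store_and_replicate are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Callable, Dict, List


class CopyStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


def store_and_replicate(
    path: str,
    destinations: List[str],
    upload: Callable[[str, str], None],  # hypothetical: upload(destination, path)
) -> Dict[str, CopyStatus]:
    statuses = {name: CopyStatus.PENDING for name in destinations}
    primary, *replicas = destinations

    # The first destination gets the file inline; an error here propagates.
    upload(primary, path)
    statuses[primary] = CopyStatus.COMPLETED

    def replicate(name: str) -> None:
        # Each replica succeeds or fails on its own.
        try:
            upload(name, path)
            statuses[name] = CopyStatus.COMPLETED
        except Exception:
            statuses[name] = CopyStatus.FAILED

    # Stand-in for background jobs: run the replica copies independently.
    with ThreadPoolExecutor() as pool:
        list(pool.map(replicate, replicas))
    return statuses
```

A failed copy to one destination only marks that destination's record as failed; the other copies still complete.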
The retention chain resolves the most specific rule first: schedule-and-destination override, then server-and-destination override, then destination default, then schedule default, then server default, then global default. Sounds complicated, but it means I can say "this schedule's copies on rsync.net keep for 365 days" without affecting anything else.
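The chain is easier to see as a lookup that walks from most specific to least specific and stops at the first hit. This Python sketch assumes the rules live in a nested dict keyed by scope; that layout and the name resolve_retention are illustrative, not how the tool actually stores them.

```python
from typing import Dict


def resolve_retention(
    rules: Dict[str, dict],
    schedule: str,
    server: str,
    destination: str,
    global_default: int,
) -> int:
    """Return retention in days, taking the most specific matching rule."""
    lookups = [
        ("schedule_destination", (schedule, destination)),
        ("server_destination", (server, destination)),
        ("destination", destination),
        ("schedule", schedule),
        ("server", server),
    ]
    for scope, key in lookups:
        days = rules.get(scope, {}).get(key)
        if days is not None:
            return days
    return global_default


# Example rules (made-up names and values).
rules = {
    "schedule_destination": {("nightly", "rsync.net"): 365},
    "destination": {"nas": 30},
    "server": {"web1": 14},
}
```

With these rules, the nightly schedule's copies on rsync.net keep for 365 days, anything on the NAS keeps for 30, and other copies from web1 fall through to the 14-day server default.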
What I'd Do Differently
The step pattern works well. The one thing I'd change is how steps communicate with each other. Right now there's a generic metadata bag on the context - steps can stuff arbitrary keys in there and other steps read them out. It works, but it's the kind of thing that accumulates magic strings over time. For the current set of steps it's fine. If I add more, I'll formalize it.
That's the architecture. But building a system that backs things up is the easy half. The hard half is making sure those backups can actually be restored - and that's where I spent most of my time debugging.
This post is part of the backup system series. Related reading: I Built My Own Backup System · From Inbox to Database Without a Human in the Middle