Environments, CI/CD, and Security for a Live Rewrite

A rewrite is only as safe as the ground it lands on. You can write the cleanest code in the world, but if it deploys by hand to a server nobody can reproduce, with secrets in plaintext and no alarm when it falls over, you have not delivered a product. You have delivered a liability with good test coverage.

Taking over a live system means inheriting its environments too, and what we inherit is usually a single hand-tended machine: configured from memory, drifting from any documentation, impossible to rebuild, and invisible the moment it breaks at 2am. Production-grade delivery is the whole stack underneath the code: reproducible environments, a pipeline that refuses to ship anything unverified, a security posture worth the name, alerting that reaches a human, and backups you have actually restored. Here is how we redesign all of it.

#The tools, and the logic behind them

Before the detail, the choices: the tools are downstream of a few decisions, and the decisions are the part that matters. We optimize for three things: reproducibility, so an environment is never a one-off nobody can rebuild; automation, so the path to production has no manual step left to forget; and observability, so we hear about a problem before a customer does.

Concern	Our choice	Why
Provisioning	Ansible	Environments described as code, identical across test and production, rebuildable from nothing
Pipeline	GitHub Actions	Gates and deploys live beside the code and run on every push to main, with no human in the deploy path
Monitoring and alerting	Sentry	Real production errors with full stack traces and release context, routed to a human automatically
Durability	Binary logs plus off-host object storage	Point-in-time recovery that survives the loss of the server itself
Process supervision	supervisord	Background workers and the real-time server stay alive across crashes and deploys

Everything below is those decisions, in detail.

#Environments as code, not pets

A legacy server is a pet: hand-fed, irreplaceable, and subtly different from every assumption anyone makes about it. We replace it with infrastructure described as code, so an environment becomes a specification you can read, review, and rebuild from nothing.

# provision.yml: one specification builds every environment identically
- hosts: app
  become: true
  vars_files:
    - "vars/{{ env }}.yml"   # the only thing that differs between test and prod
  roles:
    - common                 # base packages, timezone, automatic security updates
    - hardening              # ssh, firewall, fail2ban
    - php                    # PHP 8.2 and extensions
    - app                    # release dirs, secrets from vault, nginx, supervisor

One playbook provisions every environment, and the only thing that changes between test and production is a variables file. That is what makes the test environment a real rehearsal rather than a hopeful approximation: both are built from the same code, so they cannot quietly drift apart. A server that dies can be rebuilt to a known-good state in minutes, because its state was never a secret living in one machine's filesystem. It was always written down.

#A pipeline that gates everything

Deployment should be boring, frequent, and hard to get wrong. Every push to the main branch runs through a pipeline that gates on the same checks an engineer runs locally, and only a green pipeline goes anywhere near a server.

name: Deploy
on:
  push:
    branches: [main]

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'
          coverage: pcov
      - run: composer install --prefer-dist --no-interaction
      - run: vendor/bin/pint --test                   # code style
      - run: vendor/bin/phpstan analyse --no-progress  # static analysis at level 10
      - run: php artisan test --coverage --min=80      # tests with a coverage floor
      - run: composer audit                            # fail on a known dependency CVE
      - run: npm ci && npm run lint && npm run build   # front-end lint and asset build

  deploy:
    needs: verify          # deploys only when every check above is green
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ansible-playbook deploy.yml -e env=production

Two things matter in that workflow. The deploy job needs the verify job, so nothing ships unless the style check, static analysis, the test suite with its coverage floor, the dependency vulnerability audit, and the front-end build all pass first. And the deploy itself runs through the same infrastructure-as-code that built the environment, so a release is just the playbook applied with new code, never a pile of manual SSH commands. Releases are zero-downtime: the new release is prepared alongside the running one and traffic switches over only once it is ready, so the application never goes dark during a deploy.
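
One common way to get that kind of switch-over is a `current` symlink that is repointed atomically once the new release directory is fully prepared. The sketch below assumes a `releases/<id>` plus `current` layout and GNU `mv`; the function name, the layout, and the release ID format are illustrative assumptions, not taken from the actual deploy playbook.

```bash
# Hypothetical layout: <app_root>/releases/<id>, with <app_root>/current
# pointing at the live release.
activate_release() {
  local app_root="$1" release_id="$2"
  local target="$app_root/releases/$release_id"

  # Refuse to switch to a release directory that does not exist.
  if [ ! -d "$target" ]; then
    echo "release $release_id not found under $app_root/releases" >&2
    return 1
  fi

  # Build the new symlink under a temporary name, then rename it over
  # `current`. The rename is atomic, so a request sees either the old
  # release or the new one, never a missing path.
  ln -sfn "releases/$release_id" "$app_root/current.tmp"
  mv -T "$app_root/current.tmp" "$app_root/current"
}
```

Under these assumptions, `activate_release /srv/app 20240501T1432` would repoint `/srv/app/current` in one step, and rolling back is the same call with the previous release ID.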

#Security is a posture, not a feature

Security is not something you bolt on at the end. It is a posture maintained at every layer, and most of it is unglamorous discipline.

Secrets never live in the repository; they are injected from an encrypted store at deploy time, so a leaked clone reveals nothing. Access follows least privilege: key-only SSH, no root login, a firewall that exposes only the ports that must be open, and fail2ban to blunt brute-force attempts. Every response carries sane security headers, and the server does not advertise its stack:

# Security defaults applied to every response
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "DENY" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Content-Security-Policy "default-src 'self'" always;

server_tokens off;   # do not advertise nginx or its version

On top of that, traffic is TLS-only with redirects, sensitive endpoints are rate-limited, inbound webhooks are signature-verified before they are trusted, and the dependency audit in the pipeline fails the build on a known vulnerability rather than waiting for someone to notice. None of these is clever. All of them together are the difference between a system that merely works and one that is defensible.
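
Headers like the ones above are easy to lose in a later config change, so it can help to check them from outside after a deploy. The following is a minimal sketch of such a check, not part of the original setup: it reads raw response headers on standard input and reports which of the five headers listed earlier are absent. The function name and the example URL are made up for illustration.

```bash
# Reads raw HTTP response headers on stdin and prints each expected
# security header that is absent. Returns non-zero if any is missing.
check_security_headers() {
  local headers name missing=0

  # Strip carriage returns and lowercase everything, so HTTP/1.1 and
  # HTTP/2 style header names compare the same way.
  headers="$(tr -d '\r' | tr '[:upper:]' '[:lower:]')"

  for name in strict-transport-security x-content-type-options \
              x-frame-options referrer-policy content-security-policy; do
    if ! printf '%s\n' "$headers" | grep -q "^$name:"; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

It could be fed from a live response, for example `curl -sSI https://app.example.test/ | check_security_headers`, and run as a post-deploy step so a dropped header fails loudly.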

#You cannot fix what you cannot see

Code in production behaves in ways no staging environment predicts, so the development loop only closes when production talks back. Errors stream into Sentry with full stack traces and release context, grouped and alerted so a spike pages a human instead of waiting in a log nobody reads. Uptime and health checks probe the application from outside on a schedule, and a failed deploy, a climbing queue backlog, or a sustained error rate each raises an alert rather than a surprise. Logs are aggregated and searchable, not trapped on a box. The goal is simple: we should learn that something is wrong from our own alerting, never from a customer's email.
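
As a rough illustration of an outside-in health check, the sketch below probes a URL with `curl` and raises an alert only after several consecutive failures, so a single dropped request does not page anyone. The URL, the `PROBE_DELAY` variable, and the `notify` function are placeholders; a real setup would route `notify` to whatever alert channel is in use.

```bash
# Succeeds only if the URL answers without an HTTP error (4xx/5xx),
# a connection failure, or a timeout.
probe() {
  curl --fail --silent --show-error --max-time 5 --output /dev/null "$1"
}

# Placeholder alert hook: replace with the real paging or chat integration.
notify() {
  echo "ALERT: $*" >&2
}

# Probe up to <attempts> times and alert only if every attempt fails.
check_health() {
  local url="$1" attempts="${2:-3}" delay="${PROBE_DELAY:-5}" i=1

  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      return 0
    fi
    i=$((i + 1))
    # Wait between attempts, but not after the last one.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done

  notify "health check failed $attempts times: $url"
  return 1
}
```

Run from a scheduler on a machine other than the application server, something like `check_health https://app.example.test/health 3` would cover the case where the whole host is down.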

#Backups you have actually restored

A backup you have never restored is not a backup. It is a hope, and a nightly logical dump is barely even that: it strains the live database, it is slow to take and slower to restore, and at best it returns you to last night while losing every transaction since. Production data earns a real strategy, and ours has two layers that together deliver point-in-time recovery.

The foundation is binary logging with global transaction IDs, configured for durability rather than left at defaults:

[mysqld]
server_id                       = 1
log_bin                         = /var/lib/mysql/binlog
binlog_format                   = ROW
binlog_row_image                = FULL
gtid_mode                       = ON
enforce_gtid_consistency        = ON
sync_binlog                     = 1            # flush every commit to disk
innodb_flush_log_at_trx_commit  = 1            # flush the redo log at every commit, so an acknowledged transaction survives a crash

A periodic hot full backup is then taken without locking the live database and streamed straight off the host, encrypted before it leaves the machine, into object storage in a separate account and region, so a lost or compromised server cannot take its own backups down with it:

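# Hot, non-locking full backup, encrypted, written off-host in one stream.
# pipefail makes a failed xtrabackup or openssl fail the whole job, instead of
# reporting success for a truncated upload.
set -o pipefail
xtrabackup --backup --stream=xbstream --compress \
  | openssl enc -aes-256-cbc -pbkdf2 -pass env:BACKUP_KEY \
  | rclone rcat "offsite:db/full/$(date +%FT%H%M).xb.enc"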

Between those full backups, the binary logs are shipped continuously off-host by a supervised streamer rather than a nightly job, so the archived recovery point trails production by seconds, not a day. Recovery is then deterministic: restore the most recent full backup, replay the archived binary logs up to the exact transaction or timestamp you need, and stop. That is genuine point-in-time recovery, and it means an accidental mass-delete at 14:32 is undone to 14:31:59 instead of rolled back to midnight.
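
To make that recovery sequence concrete, here is a hedged sketch that prints the restore steps for review rather than executing them. It assumes the backup was produced by a pipeline like the one shown earlier, that the starting binlog file and position are read from the `xtrabackup_binlog_info` file inside the backup, and that `/var/tmp/restore` is a scratch directory; the function name and its arguments are illustrative, and a real runbook would need to account for the GTID state and the layout of the target instance.

```bash
# Prints a point-in-time restore plan. Nothing is executed.
# Arguments: <backup object> <stop datetime> <binlog file(s)> <start position>
pitr_plan() {
  local backup="$1" stop_at="$2" binlogs="$3" start_pos="$4"
  local restore_dir="/var/tmp/restore"   # hypothetical scratch directory

  cat <<EOF
# 1. Fetch, decrypt, and unpack the full backup (the backup pipeline in reverse)
rclone cat "offsite:db/full/$backup" | openssl enc -d -aes-256-cbc -pbkdf2 -pass env:BACKUP_KEY | xbstream -x -C "$restore_dir"
# 2. Decompress and prepare the backup so its data files are consistent
xtrabackup --decompress --target-dir="$restore_dir"
xtrabackup --prepare --target-dir="$restore_dir"
# 3. With mysqld stopped and the data directory empty, copy the files into place
xtrabackup --copy-back --target-dir="$restore_dir"
# 4. Start mysqld, then replay the archived binary logs and stop just before the incident
mysqlbinlog --start-position=$start_pos --stop-datetime="$stop_at" $binlogs | mysql
EOF
}
```

For the mass-delete example, `pitr_plan 2024-05-01T0200.xb.enc '2024-05-01 14:31:59' binlog.000042 157` would print a plan that stops replay one second before the incident; the object name, binlog file, and position in that call are invented.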

None of this is hand-tended. The full backups and the binlog stream run as monitored jobs that alert on failure exactly like the application does, retention is enforced by the object store's lifecycle policy rather than a fragile delete command, and a scheduled drill restores the latest backup into a disposable instance and checksums it, so the restore path is exercised continuously instead of discovered during an incident.

The same operational care extends to the processes that keep the app alive. Background workers and the real-time server are supervised, so a crash restarts cleanly and a deploy drains work gracefully instead of killing it mid-flight:

[program:queue-worker]
command=php /srv/app/artisan queue:work redis --queue=mail,default --tries=3
autostart=true
autorestart=true
stopwaitsecs=30        ; let a running job finish before stopping on deploy

[program:reverb]
command=php /srv/app/artisan reverb:start
autostart=true
autorestart=true

And maintenance is continuous, not a someday. Dependencies are patched on a schedule, security advisories are acted on, and the boring upkeep that keeps a system healthy is part of the engagement rather than an afterthought nobody owns.

#Delivery does not end at launch

All of this is what end-to-end actually means. We do not hand over a repository and wish the client luck. We take over the environments along with the code and redesign them to this standard, wire into the pipeline the same gates the engineers and the AI work behind, run the queues and email that depend on this infrastructure, and stay on to operate, monitor, and maintain the result. It is one continuous responsibility, and the Submitit rebuild was delivered exactly that way: live, gated, observed, and backed up, from the first commit through ongoing operations.

#Ship it like it matters

Production-grade is not a buzzword. It is reproducible environments, a pipeline that refuses to ship anything unverified, security treated as a posture, alerting that reaches a human, and backups proven by restoration. Most of it is discipline rather than cleverness, which is exactly why it is so often skipped, and exactly why it is worth doing properly.

If your application runs on a server only one person understands, with no pipeline, no alerting, and backups nobody has ever tested, that is a risk you can retire. Let's talk.