Back to Blog
Testing a Proprietary Algorithm with Confidence

Testing a Proprietary Algorithm with Confidence

May 31, 2026
Stefan Mentovic
testingpestcharacterization-testingquality-assurancerefactoring

When an algorithm is the product, looks-right is not a test. How we reimplement a tuned, proprietary scoring engine with golden-master and parity tests, kept opaque.

When an algorithm is the product, looks-right is not a test. The matching engine at the center of the platform we rebuilt turns a writer's work and a journal's profile into a single match score, and years of editorial judgment are encoded in how it weighs one signal against another. Reimplement it during a rewrite and get it ninety-five percent right, and you have not shipped a small bug. You have quietly started sending writers' work to the wrong journals, and nobody will notice until the wrong things are already happening in the world.

You cannot eyeball correctness like that. You have to prove it. Here is how we test a proprietary algorithm, one we deliberately keep opaque in public, with enough rigor to reimplement it on a live system and tune it afterward without holding our breath.

#When the algorithm is the product

Most code is tested against what it should do. A scoring engine tuned over years against real editorial outcomes is different: a large part of what it should do is simply what it already does. The accumulated tuning is the asset, and it lives only in the implementation. There is no specification to check against, because the specification is the running system.

That changes the goal of testing. We are not only asking whether the engine is correct in the abstract, we are asking whether the new implementation produces exactly what the trusted one produced. Treat the scorer as a black box, a pairing in and a number out, and the problem becomes tractable without ever exposing or even fully re-deriving the internals. Which is convenient, because the internals are the part that does not go in a blog post.

#Capture it before you change it

Before a single line of the engine was reimplemented, we made it tell us what it does across thousands of real pairings, and wrote those answers down as fixtures. This is characterization testing, the same discipline we bring to any legacy takeover, focused here on the one piece of code that matters most.

// Captured from the production engine and version-controlled as the spec:
dataset('captured_scores', fn () => require database_path('fixtures/captured_scores.php'));

it('reproduces the score the production engine already returns', function (array $pairing, int $expected) {
    expect((new MatchScorer())->score($pairing))->toBe($expected);
})->with('captured_scores');

The captured outputs become the specification, and the new implementation is correct, by definition, when it reproduces them. Note what these tests do not assert: anything about whether the legacy score was good. That is a separate question. These lock the behavior so it cannot drift while we move it.

#Parity is the real safety net

A handful of representative fixtures keeps the readable cases honest. The real net is parity against the entire captured corpus: every pairing the production engine has scored, replayed through the new implementation and checked for an exact match.

it('matches the reference engine across the entire captured corpus', function () {
    $scorer = new MatchScorer();

    foreach (CapturedScores::all() as $case) {
        $this->assertSame(
            $case->expectedScore,
            $scorer->score($case->pairing),
            "score drift on captured case {$case->id}",
        );
    }
})->group('parity');

Any divergence is one of two things: a regression we introduced, or a change we made on purpose and must now sign off on by updating the expected value deliberately. There is no third category of probably-fine. Reimplementing a black box safely is exactly this: make it produce identical output over a corpus large enough that identical-everywhere means equivalent.

#The edges are where reimplementations break

Drift rarely happens in the middle of the input space. It happens at the edges: an empty submission, a single piece, tied attributes, a missing profile, every value at its maximum. These are the cases the original author handled with a quiet special case that nobody wrote down, and they are where a faithful-looking reimplementation silently diverges.

it('holds at the boundaries', function () {
    $scorer = new MatchScorer();

    expect($scorer->score(pairing(pieces: 0)))->toBe(0);          // nothing to score
    expect($scorer->score(pairing(profile: null)))->toBe(0);      // absent data, not a crash
    expect($scorer->score(pairing(allAttributesMaxed: true)))
        ->toBeGreaterThanOrEqual(0)
        ->toBeLessThanOrEqual(100);                               // always in range
});

A pairing with nothing to score returns zero rather than throwing. Absent data degrades gracefully instead of crashing. A maxed-out pairing stays inside the valid range. Each of these is a place the algorithm could plausibly behave several ways, so each one is pinned explicitly rather than left to chance.

#Properties, not just examples

Fixed cases only cover the inputs someone thought to write down. To catch the ones nobody imagined, we also assert the properties that must hold for every valid pairing, whatever the numbers turn out to be.

it('respects the invariants for any valid pairing', function (array $pairing) {
    $scorer = new MatchScorer();
    $score = $scorer->score($pairing);

    expect($score)->toBeGreaterThanOrEqual(0)->toBeLessThanOrEqual(100)
        ->and($scorer->score($pairing))->toBe($score); // deterministic on repeat
})->with('generated_pairings'); // hundreds of randomized, valid inputs

Three invariants carry most of the weight: the score always lands in its valid range, the same input always produces the same output, and an obviously weaker pairing never outscores a stronger one. Point those assertions at a few hundred generated pairings and they probe corners of the input space no hand-written fixture would reach, without anyone needing to know the formula to state what should always be true about it.

#Two jobs: engineers and QA

This is where dedicated QA earns its place alongside engineering, because the two roles catch different failures.

Engineers own the code-level net: the characterization fixtures, the parity corpus, and the boundary tests, all running in continuous integration so nothing merges without clearing them. That net proves the implementation did not drift from the reference.

QA engineers own the question the code-level net cannot answer: does the result still make sense? They design the test matrix that decides which scenarios count as representative, hunt the edge cases the engineers did not imagine, run exploratory passes against live behavior, and validate the ranked output against what a domain expert would actually expect. A parity test proves the new engine matches the old one. Only a human checking real results proves the old one was worth matching. Both nets are necessary, and neither substitutes for the other.

#Tuning without fear

The payoff of all this is not only a safe migration. It is that the algorithm stops being a black box nobody dares touch and becomes a system you can improve. Once the corpus and the boundary tests exist, a tuning change is made test-first: adjust the expected scores for the cases the change should affect, watch them fail, make the change, watch them pass, and let the parity corpus surface every unintended ripple.

That is what lets us tune with confidence instead of fear, which mattered most for the combination-aware matching the engine had to learn during the rebuild. The tests did not just protect the move. They turned the most sensitive code in the system into some of the safest to change.

#Prove it, do not eyeball it

The instinct with a hairy, important algorithm is to be careful and hope. Careful-and-hope is how subtle drift ships. The alternative is not heroics, it is method: capture the current behavior, prove parity over a real corpus, pin the edges, and put human judgment on top of the machine's. Do that, and a proprietary engine you are afraid to breathe near becomes something you can reimplement, move, and tune on a live system without a single held breath.

If you depend on an algorithm you cannot afford to get subtly wrong, that is a testing problem with a known shape. Let's talk.