System / project name: Checkout Service
Run type: combined technical review
Decision to support: Is this change safe for staged rollout?

Primary review questions:
1. What are the most likely regression risks?
2. Are there architecture stress points around retries, idempotency, or state handling?
3. What validation is still missing before rollout?

Product / service:
Checkout API used by web and mobile clients.

Languages / frameworks:
Python, FastAPI, PostgreSQL, Redis, async worker.

Runtime / platform:
Containerized API plus background jobs.

Environment:
Staging heading toward production rollout.

What is being reviewed:
A change that adds retry handling for payment-provider timeouts and moves some order-finalization behavior into an asynchronous worker.

Why now:
The team saw intermittent timeout spikes during peak traffic and wants to reduce customer-visible failures.

What changed:
- timeout retry logic added
- worker now finalizes some orders after provider acknowledgement
- Redis used for short-lived request correlation

What worries us most:
- duplicate charge risk
- race conditions between API and worker
- state divergence between payment status and order status
- incomplete monitoring for retry storms

What would make this unsafe to ship:
- missing idempotency guarantees
- retries that can duplicate side effects
- no proof that the worker cannot finalize stale or already-failed orders

Expected behavior:
- transient provider timeouts should retry safely
- successful charges should map to exactly one order finalization
- failed charges should not leave paid-looking orders
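
One way to express the finalization rules above is a guarded state transition, sketched here under stated assumptions. The `orders` table, its `status` values, and `finalize_order` are hypothetical, and SQLite stands in for PostgreSQL so the example runs on its own. The same conditional `UPDATE` pattern applies in PostgreSQL.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table for the demo (schema is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        [(1, "pending"), (2, "failed")],
    )
    conn.commit()
    return conn


def finalize_order(conn, order_id):
    """Move an order from 'pending' to 'finalized' at most once.

    The WHERE clause is the guard: an order that is already finalized,
    or already failed, matches zero rows and is left untouched.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE orders SET status = 'finalized' "
            "WHERE id = ? AND status = 'pending'",
            (order_id,),
        )
    return cur.rowcount == 1  # True only for the caller that won the transition
```

Because the check and the write happen in one statement, the API and the worker can both attempt finalization and only one of them will see `True`.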

Observed behavior:
In staging, some retried requests finish successfully, but the logs show duplicated worker events for a small subset of timeout scenarios.
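
A common way to make duplicated worker events harmless is to record each event ID under a uniqueness constraint in the same transaction as the side effect. The sketch below is illustrative only: the actual suppression path in this change is not known, the `processed_events` table, `finalize_count` column, and `handle_confirmation` are hypothetical, and SQLite stands in for PostgreSQL (where `INSERT ... ON CONFLICT DO NOTHING` plays the role of `INSERT OR IGNORE`).

```python
import sqlite3


def make_events_db():
    """Build in-memory demo tables (schema is hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "finalize_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO orders (id) VALUES (1)")
    conn.commit()
    return conn


def handle_confirmation(conn, event_id, order_id):
    """Apply the side effect once per event_id, even if delivered twice."""
    with conn:  # claim and side effect commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return "duplicate"  # event_id was already recorded
        # Stand-in for the real finalization write.
        conn.execute(
            "UPDATE orders SET finalize_count = finalize_count + 1 WHERE id = ?",
            (order_id,),
        )
    return "processed"
```

This only works if the provider sends a stable event identifier across redeliveries, which would need to be confirmed.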

Code / change evidence:
- new retry helper around payment submit
- worker consumes provider confirmation events
- order status writes now happen in two execution contexts

Architecture context:
- API receives checkout request
- provider call occurs synchronously
- worker handles delayed confirmation and finalization
- Redis correlation keys expire after a short TTL
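
The short TTL on correlation keys means a delayed confirmation can arrive after its key has expired. The sketch below shows one possible handling, assuming a durable mapping from provider reference to order also exists. `TtlStore` is an in-memory stand-in for Redis set-with-expiry and get, and `resolve_order_id`, the `corr:` key prefix, and `durable_lookup` are hypothetical.

```python
import time


class TtlStore:
    """In-memory stand-in for Redis SET with expiry / GET (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if the key never existed
            return None
        return value


def resolve_order_id(store, durable_lookup, provider_ref):
    """Map a provider reference to an order, surviving correlation-key expiry.

    Returns (order_id, source). A miss in both places is reported as
    'unknown' so the caller can park the event instead of guessing.
    """
    order_id = store.get(f"corr:{provider_ref}")
    if order_id is not None:
        return order_id, "cache"
    order_id = durable_lookup(provider_ref)  # assumed database lookup
    if order_id is not None:
        return order_id, "database"
    return None, "unknown"
```

The design choice is that the short-lived key is treated as an optimization, so expiry slows a late event down rather than dropping it or attaching it to the wrong order.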

Validation material:
- unit tests added for retry helper
- no end-to-end idempotency test
- limited staging logs
- no chaos test around delayed provider callbacks

Key risks:
- duplicate side effects
- race between API response and worker completion
- stale correlation keys
- poor visibility into retry amplification
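
Retry amplification can be made visible with a simple ratio of provider attempts to logical checkout requests. The sketch below is a hypothetical illustration: `RetryStats`, `is_retry_storm`, and the threshold values are assumptions, not existing metrics or agreed alert levels.

```python
from dataclasses import dataclass


@dataclass
class RetryStats:
    """Counts logical requests and total provider attempts in one window."""

    requests: int = 0
    attempts: int = 0

    def record(self, attempts_used):
        """Record one finished checkout request and how many attempts it took."""
        self.requests += 1
        self.attempts += attempts_used

    @property
    def amplification(self):
        """Average provider calls per request; 1.0 means no retries at all."""
        return self.attempts / self.requests if self.requests else 0.0


def is_retry_storm(stats, threshold=1.5, min_requests=20):
    """Flag a window whose amplification exceeds the (assumed) threshold.

    min_requests avoids alerting on a handful of requests in a quiet window.
    """
    return stats.requests >= min_requests and stats.amplification > threshold
```

For example, a window of 20 requests where half needed 3 attempts gives 40 attempts in total, an amplification of 2.0, which this check would flag.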

Runtime assumptions:
- provider callbacks usually arrive in the expected order, but this is not guaranteed
- Redis keys survive long enough for delayed events
- order status writes from the API and the worker are serialized well enough to avoid inconsistent finalization

Known weak spots:
- idempotency assumptions are not fully documented
- worker ownership of final status looks ambiguous
- rollback behavior is unclear if the worker fails after the provider succeeds

Missing information:
- exact code path for duplicate-event suppression
- database constraint strategy
- alerting for retry storm behavior

What a useful AXIOM output should focus on:
- highest-risk correctness issues
- hidden coupling between API and worker
- missing tests that block safe rollout