System / project name: Checkout Service
Run type: combined technical review
Decision to support: Is this change safe for staged rollout?

Primary review questions:
1. What are the most likely regression risks?
2. Are there architecture stress points around retries, idempotency, or state handling?
3. What validation is still missing before rollout?

Product / service: Checkout API used by web and mobile clients.
Languages / frameworks: Python, FastAPI, PostgreSQL, Redis, async worker.
Runtime / platform: Containerized API plus background jobs.
Environment: Staging heading toward production rollout.

What is being reviewed: A change that adds retry handling for payment-provider timeouts and moves some order-finalization behavior into an asynchronous worker.

Why now: The team saw intermittent timeout spikes during peak traffic and wants to reduce customer-visible failures.

What changed:
- timeout retry logic added
- worker now finalizes some orders after provider acknowledgement
- Redis used for short-lived request correlation

What worries us most:
- duplicate charge risk
- race conditions between API and worker
- state divergence between payment status and order status
- incomplete monitoring for retry storms

What would make this unsafe to ship:
- missing idempotency guarantees
- retries that can duplicate side effects
- no proof that the worker cannot finalize stale or already-failed orders

Expected behavior:
- transient provider timeouts should retry safely
- successful charges should map to exactly one order finalization
- failed charges should not leave paid-looking orders

Observed behavior: In staging, some retried requests finish successfully, but the logs show duplicated worker events for a small subset of timeout scenarios.
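Illustrative sketch (not taken from the change under review): one way to make "exactly one order finalization" hold even when the worker sees a duplicated event is a guarded status transition, where the UPDATE only matches an order that is still in the expected pre-final state. The table layout, the status names `payment_confirmed` and `finalized`, and the function `finalize_order` are assumptions made for illustration. `sqlite3` stands in for PostgreSQL so the example runs on its own.

```python
import sqlite3


def finalize_order(conn: sqlite3.Connection, order_id: int) -> bool:
    """Finalize an order at most once.

    Returns True only for the call that actually performed the transition.
    Duplicate events, and events for orders in any other state (for example
    a failed payment), match zero rows and return False.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            # Hypothetical status names; the WHERE clause is the guard.
            "UPDATE orders SET status = 'finalized' "
            "WHERE id = ? AND status = 'payment_confirmed'",
            (order_id,),
        )
        return cur.rowcount == 1
```

Because the check and the write happen in a single statement, two execution contexts (API and worker) racing on the same order cannot both see a successful transition. Whether the real schema allows such a guard is one of the open questions listed under missing information.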
Code / change evidence:
- new retry helper around payment submit
- worker consumes provider confirmation events
- order status writes now happen in two execution contexts

Architecture context:
- API receives checkout request
- provider call occurs synchronously
- worker handles delayed confirmation and finalization
- Redis correlation keys expire after a short TTL

Validation material:
- unit tests added for retry helper
- no end-to-end idempotency test
- limited staging logs
- no chaos test around delayed provider callbacks

Key risks:
- duplicate side effects
- race between API response and worker completion
- stale correlation keys
- poor visibility into retry amplification

Runtime assumptions:
- provider callback order is mostly stable
- Redis keys survive long enough for delayed events
- database writes are serialized enough to avoid inconsistent finalization

Known weak spots:
- idempotency assumptions are not fully documented
- worker ownership of final status looks ambiguous
- rollback behavior is unclear if the worker fails after the provider succeeds

Missing information:
- exact code path for duplicate-event suppression
- database constraint strategy
- alerting for retry storm behavior

What a useful AXIOM output should focus on:
- highest-risk correctness issues
- hidden coupling between API and worker
- missing tests that block safe rollout
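Illustrative sketch (the actual retry helper was not provided): the duplicate-side-effect and retry-amplification risks above usually come down to two properties of the retry loop, namely that every attempt reuses the same idempotency key and that the number of attempts is bounded. The names `submit_with_retry`, `ProviderTimeout`, and the `submit(payload, idempotency_key)` callable are hypothetical, and the sketch assumes the payment provider honors idempotency keys, which the review material does not confirm.

```python
import asyncio
import random


class ProviderTimeout(Exception):
    """Hypothetical exception raised when the payment provider times out."""


async def submit_with_retry(
    submit,
    payload,
    idempotency_key,
    *,
    max_attempts=3,
    base_delay=0.2,
    sleep=asyncio.sleep,
):
    """Call `submit` until it succeeds or `max_attempts` is reached.

    The caller supplies `idempotency_key` (for example derived from the
    checkout attempt) so that every retry carries the same key.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await submit(payload, idempotency_key)
        except ProviderTimeout:
            if attempt == max_attempts:
                raise  # bounded: surface the failure instead of retrying forever
            # Exponential backoff with full jitter to avoid synchronized retries.
            await sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

A test that injects a `submit` stub failing twice and then succeeding, and asserts that all three calls received the same key, would cover part of the missing idempotency validation. It would not replace an end-to-end test that includes the worker and delayed provider callbacks.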