Webhook Reliability for Billing Events: Retries, Ordering, and Poison Pills

Webhooks are the standard mechanism for billing systems to notify downstream consumers about state changes: invoice generated, payment succeeded, payment failed, subscription upgraded. Every payment processor, every billing platform, and every metering infrastructure tool uses them. Most teams treat billing webhooks the same way they treat any other webhook — write a handler, acknowledge with a 200, move on.

This works until it doesn't. Billing webhooks have a specific failure mode that's worse than most event types: if a payment.succeeded webhook is processed twice, you might provision twice. If an invoice.finalized webhook is never processed, the customer's account is never activated. The downstream state machines that billing webhooks feed are typically financial in nature, and incorrect state is costly to recover from.

Delivery Guarantees: What You're Actually Getting

Practically every webhook delivery system provides at-least-once delivery, not exactly-once. This is a fundamental property of HTTP-based event delivery: the sender sends the event, waits for a 2xx response, and retries on timeout or non-2xx response. If the receiver processes the event and the network then drops before the response reaches the sender, the sender retries, and the receiver processes the event a second time.

Your webhook handlers must be idempotent. Processing the same billing event twice should produce the same result as processing it once. The most common implementation pattern:

Write-once with event ID guard. On receipt, check a processed-events store (a database table, or Redis with a TTL) for the event ID. If the ID is already present, return 200 without processing. If it is not, process the event and record the ID. The critical detail: recording the event ID must be atomic with the business operation, not a separate second step. If you commit the business result and then fail before recording the event ID, the next retry will process the event again. In practice this favors a table in the same database as the business data, since a Redis write cannot join a database transaction.

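def handle_invoice_paid(event_id: str, invoice_data: dict) -> bool:
    # db, ProcessedEvent, activate_subscription, send_payment_receipt, and now()
    # stand in for your own persistence layer and business logic.
    with db.transaction():
        if ProcessedEvent.exists(event_id):
            return True  # Already handled; acknowledge without reprocessing

        # Business logic: activate subscription, send receipt, etc.
        activate_subscription(invoice_data["subscription_id"])
        # Caution: a sent email cannot be rolled back. If the receipt goes out
        # directly, a later rollback plus retry sends it twice; prefer enqueueing
        # it in a table inside this same transaction.
        send_payment_receipt(invoice_data["customer_id"])

        # Mark as processed within the same transaction
        ProcessedEvent.create(event_id=event_id, processed_at=now())

    return True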

The transaction boundary is the key detail. If activate_subscription succeeds but ProcessedEvent.create fails, the transaction rolls back entirely and the next retry will reprocess correctly. If you commit the business operation and then fail to write the event ID, you've created an undetectable double-processing risk.
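To make the transaction boundary concrete, here is a minimal runnable sketch using Python's built-in sqlite3 module. The table names, the activations counter, and the pluggable activate callback are assumptions made for illustration, not part of any billing platform's API. It also claims the event ID with an insert against a primary key rather than a check-then-write, so two concurrent deliveries of the same event cannot both pass the guard.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE subscriptions (id TEXT PRIMARY KEY, activations INTEGER NOT NULL)"
    )
    return conn


def activate_subscription(conn: sqlite3.Connection, subscription_id: str) -> None:
    """Stand-in business logic: count how many times a subscription was activated."""
    conn.execute(
        "INSERT OR IGNORE INTO subscriptions (id, activations) VALUES (?, 0)",
        (subscription_id,),
    )
    conn.execute(
        "UPDATE subscriptions SET activations = activations + 1 WHERE id = ?",
        (subscription_id,),
    )


def handle_invoice_paid(
    conn: sqlite3.Connection,
    event_id: str,
    subscription_id: str,
    activate: Callable[[sqlite3.Connection, str], None] = activate_subscription,
) -> bool:
    """Return True if the event was applied, False if it was a duplicate.

    Either way the HTTP layer should acknowledge with a 2xx.
    """
    try:
        with conn:  # commits on success, rolls back if anything inside raises
            # Claim the event ID first; the primary key rejects a second claim.
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            activate(conn, subscription_id)
    except sqlite3.IntegrityError:
        # Duplicate event ID. Real business logic that can raise its own
        # IntegrityError would need a narrower check than this.
        return False
    return True
```

If the business logic raises, the event ID claim is rolled back along with everything else, so the platform's next retry starts from a clean slate.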

Retry Queue Design

When your webhook handler fails (returns non-2xx, times out, or throws an exception the delivery system catches), the billing platform retries. The retry schedule matters — too aggressive and you create thundering-herd pressure on a downstream system that may already be struggling; too conservative and delayed state propagation causes real business problems (customer can't access their account because their invoice-paid webhook hasn't been processed for 4 hours).

A common retry schedule for billing webhooks is escalating backoff with jitter: 5 seconds, 30 seconds, 5 minutes, 30 minutes, 2 hours, 8 hours, 24 hours. Once the final attempt fails, the event is moved to a dead-letter queue (DLQ) and an alert is fired. This schedule makes seven retry attempts spread over roughly a day, which is enough to survive most transient downstream outages without being so aggressive that a partial outage becomes a full outage.

Jitter in the backoff is important when you have multiple webhook endpoints for different customers hosted on shared infrastructure. Without jitter, a temporary outage that causes all handlers to fail simultaneously produces a synchronized retry storm at each backoff boundary. With jitter (e.g., retry_delay × (1 + random(0, 0.3))), the retries spread out across a time window and the total retry load is manageable.
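The schedule and jitter formula above can be sketched as a small helper. The constant and function names are hypothetical; the base delays and the 0 to 0.3 jitter range are the ones quoted in the text.

```python
import random
from typing import Optional

# Base delays from the schedule above, in seconds.
BASE_DELAYS_SECONDS = [5, 30, 300, 1800, 7200, 28800, 86400]
MAX_JITTER_FRACTION = 0.3


def retry_delay(attempt: int, rng: Optional[random.Random] = None) -> Optional[float]:
    """Return the delay before retry number `attempt` (1-based).

    Returns None once the schedule is exhausted, meaning the event
    should be moved to the dead-letter queue instead of retried.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if attempt > len(BASE_DELAYS_SECONDS):
        return None
    source = rng or random
    base = BASE_DELAYS_SECONDS[attempt - 1]
    # retry_delay x (1 + random(0, 0.3)): jitter only ever lengthens the delay.
    return base * (1 + source.uniform(0, MAX_JITTER_FRACTION))
```

Because the jitter is additive only, no retry fires earlier than its base delay; handlers that failed at the same instant are spread across a window up to 30% wider than the base.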

Event Ordering: When Sequence Matters

HTTP-based webhook delivery does not guarantee event ordering. A subscription.upgraded event and a subsequent invoice.created event might arrive in reverse order, for example because deliveries are sent concurrently or because the earlier event failed once and its retry landed after the later one. For most event types, this is tolerable. For billing events, out-of-order delivery can cause subtle state corruption.

The canonical example: the billing platform emits invoice.payment_failed and then invoice.payment_succeeded for the same invoice, but your system receives them in the opposite order. The correct final state is "succeeded." But if your handler simply applies each event as it arrives, without checking which is more recent, and your state machine doesn't enforce valid transitions, you end up in a "failed" state after processing both events.

The solution is event sequence numbers or timestamps stored with each event, and handler logic that rejects state transitions that would move the object "backward" in its lifecycle. An invoice that has reached "paid" state should not transition to "failed" regardless of the event delivery order. Your handler should check the current object state before applying any transition and reject transitions that would violate the lifecycle order.
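A defensive transition check can be sketched as follows. The state names, their ranks, and the event-to-state mapping are assumptions for illustration; a real invoice lifecycle will have more states and should be derived from your billing platform's documentation.

```python
# Hypothetical lifecycle ranks: a higher rank is further along the lifecycle.
STATE_RANK = {"open": 0, "failed": 1, "paid": 2}

# Hypothetical mapping from event type to the state it would set.
EVENT_TARGET_STATE = {
    "invoice.payment_failed": "failed",
    "invoice.payment_succeeded": "paid",
}


def apply_event(current_state: str, event_type: str) -> str:
    """Return the invoice state after applying the event defensively.

    An event that would move the invoice backward (or sideways) in its
    lifecycle is treated as stale and leaves the state unchanged.
    """
    if event_type not in EVENT_TARGET_STATE:
        # Fail loudly so the event is retried or dead-lettered, not dropped.
        raise ValueError(f"unhandled event type: {event_type}")
    target = EVENT_TARGET_STATE[event_type]
    if STATE_RANK[target] <= STATE_RANK[current_state]:
        return current_state
    return target
```

With this guard, delivering payment_succeeded before payment_failed still leaves the invoice in "paid", and the in-order sequence behaves the same as before.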

We're not saying you need a full event-sourcing architecture for billing webhooks. We're saying the handlers that process billing state transitions should be aware that events arrive out of order and should apply transitions defensively — validate that the transition is valid given the current state, not just given the event type.

Poison Pills and DLQ Strategy

A poison pill is an event that your handler fails to process on every retry — not due to transient failures, but due to a structural problem with the event itself (malformed payload, unexpected schema version, reference to an object that doesn't exist in your system). If a poison pill is retried indefinitely, it blocks the queue for that customer and delays delivery of all subsequent events for the same partition.

The DLQ (dead-letter queue) is where poison pills go after exhausting retries. The DLQ is not a black hole — it's an actionable queue that requires monitoring and response. Every item in the DLQ represents a billing event that was never processed and whose effects have not been applied to your system.
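The sketch below shows how a bounded attempt count keeps one poison pill from blocking a partition. It is deliberately simplified: retries happen immediately rather than on a backoff schedule, and the event shape, the MAX_ATTEMPTS value, and the function name are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

MAX_ATTEMPTS = 3  # illustrative; a real system would follow its retry schedule


def drain_partition(
    events: List[Dict],
    handler: Callable[[Dict], None],
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[List[str], List[Dict]]:
    """Process events in order; return (processed_ids, dead_letter_items)."""
    processed: List[str] = []
    dead_letter: List[Dict] = []
    for event in events:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                handler(event)
                processed.append(event["id"])
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = exc
        else:
            # Every attempt failed: park the event with its error so the
            # rest of the partition can proceed and the item can be triaged.
            dead_letter.append(
                {"event": event, "error": repr(last_error), "attempts": max_attempts}
            )
    return processed, dead_letter
```

Storing the error alongside the event matters: the dead-letter item needs enough context for someone to decide whether to replay it or reconcile by hand.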

DLQ response procedures for billing events must be explicit and documented:

  • Schema mismatch: event payload doesn't match your handler's expected schema. Resolution: update handler to accept the new schema, then replay events from DLQ.
  • Missing reference: event references a customer ID or subscription ID that doesn't exist in your database. Resolution: investigate if it's a sync lag (the referenced object exists but wasn't synced yet) or a genuine inconsistency. For sync lag, re-queue with a delay. For genuine inconsistency, manual reconciliation is required.
  • Business logic exception: handler threw an unhandled exception during business logic. Resolution: fix the bug, then replay from DLQ.
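One way to keep these procedures explicit is to encode them as a triage function that an on-call tool or runbook script can call. The enum names, the DlqItem fields, and the caused_by_sync_lag flag (set after someone has investigated the item) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    SCHEMA_MISMATCH = "schema_mismatch"
    MISSING_REFERENCE = "missing_reference"
    BUSINESS_LOGIC_EXCEPTION = "business_logic_exception"


class Action(Enum):
    REPLAY_AFTER_FIX = "replay_after_fix"
    REQUEUE_WITH_DELAY = "requeue_with_delay"
    MANUAL_RECONCILIATION = "manual_reconciliation"


@dataclass(frozen=True)
class DlqItem:
    event_id: str
    kind: FailureKind
    # Only meaningful for MISSING_REFERENCE: outcome of the investigation.
    caused_by_sync_lag: bool = False


def triage(item: DlqItem) -> Action:
    """Map a dead-lettered billing event to its documented resolution."""
    if item.kind is FailureKind.MISSING_REFERENCE:
        if item.caused_by_sync_lag:
            return Action.REQUEUE_WITH_DELAY
        return Action.MANUAL_RECONCILIATION
    # Schema mismatches and handler bugs both resolve the same way:
    # ship the fix first, then replay the event from the DLQ.
    return Action.REPLAY_AFTER_FIX
```

Note that no branch returns a "discard" action; every item ends in a replay, a re-queue, or a reconciliation.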

DLQ items should never be silently discarded. Every billing event that ends up in the DLQ marks a point where your system's state diverges from what the billing infrastructure believes. Reconciling the two is a financial accuracy problem. Alert on DLQ growth, investigate each item, and either replay the event or explicitly reconcile the divergent state. An empty DLQ is a correctness signal as much as a queue health signal.

The teams that handle billing webhooks well think of the DLQ as a list of unresolved financial discrepancies, not as a queue of failed technical events. That framing changes how urgently they're investigated and how carefully the resolution procedures are documented.