Webhook delivery seems straightforward: create a record, queue a background job, send an HTTP request. But in production systems with high-velocity background workers and multi-tenant isolation, subtle timing issues emerge. Week 38's production deployment revealed race conditions where webhook jobs executed before database transactions committed, causing intermittent "record not found" errors that appeared only under load.
This article explores how we debugged these race conditions, understood their root causes, and implemented solutions that ensure reliable webhook delivery even in high-throughput scenarios.
The Problem: Intermittent Job Failures
Production monitoring showed webhook delivery jobs failing with ActiveRecord::RecordNotFound exceptions:
WebhookDeliveryJob failed:
Couldn't find WebhookDelivery with 'id'=12345
app/jobs/webhook_delivery_job.rb:8:in `perform'
The failures were intermittent—the same webhook might succeed, then fail, then succeed again. Development environments showed no issues. The failures only appeared in production with Heroku's responsive worker dynos processing jobs within milliseconds of creation.
Initial investigation suggested data corruption or database replication lag. But closer examination revealed the timing:
1. Property update triggers webhook creation
2. WebhookDelivery record saves to database
3. Job queues immediately: WebhookDeliveryJob.perform_later(delivery.id)
4. Sidekiq worker picks up the job in < 10ms
5. Job tries to load the record: WebhookDelivery.find(delivery.id)
6. The database transaction hasn't committed yet, so the record doesn't exist from the worker's perspective
7. Job fails with RecordNotFound
The race condition occurred when workers processed jobs faster than database transactions committed—a pattern that occurs in high-performance production environments but rarely in development.
Understanding Database Transaction Timing
Rails wraps controller actions and many model operations in database transactions. These transactions don't commit immediately—they wait until the outermost transaction block completes:
# Controller action (implicit transaction)
def update
  @property.update!(property_params) # 1. Updates property
  @property.trigger_webhooks         # 2. Creates webhook deliveries
                                     # 3. Queues background jobs
  # 4. Transaction commits here
end
When webhook delivery jobs queue during step 2, the transaction hasn't committed. If workers execute before step 4 completes, they can't find the webhook delivery records.
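The trigger_webhooks call is where the delivery records come from, though its implementation isn't shown above. A minimal sketch of what such a method might look like (the agency association, event name, and payload here are illustrative assumptions, not the actual code) makes it clearer that step 2 runs entirely inside the controller's transaction:
class Property < ApplicationRecord
  belongs_to :agency

  # Illustrative only: create one delivery per webhook subscribed to this event
  def trigger_webhooks
    agency.webhooks.where(event: 'property.updated').find_each do |webhook|
      WebhookService.new.deliver_webhook(webhook, 'property.updated', as_json)
    end
  end
end
Every delivery created this way shares the controller's transaction, which is exactly why queuing the job at creation time is unsafe.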
Solution 1: After-Commit Callbacks
The correct approach: queue jobs only after transactions commit. Rails provides after_commit callbacks for exactly this scenario:
class WebhookDelivery < ApplicationRecord
  # Queue the job AFTER the database transaction commits
  after_commit :queue_delivery_job, on: :create

  private

  def queue_delivery_job
    # Add a slight delay to ensure the transaction is fully visible.
    # Even with after_commit, some database replication scenarios need this.
    WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
  end
end
Moving job queuing from service classes to after_commit callbacks ensures jobs never queue before records exist in the database.
Refactoring Webhook Service
The webhook service previously queued jobs directly:
# Before: Service queues jobs immediately
class WebhookService
  def deliver_webhook(webhook, event, payload)
    delivery = WebhookDelivery.create!(
      webhook: webhook,
      event: event,
      request_body: payload.to_json
    )

    # This queues before the transaction commits!
    WebhookDeliveryJob.perform_later(delivery.id)
  end
end
After refactoring, the service only creates records:
# After: Model queues jobs after commit
class WebhookService
  def deliver_webhook(webhook, event, payload)
    # Just create the delivery record.
    # The after_commit callback handles job queueing once the transaction commits.
    WebhookDelivery.create!(
      webhook: webhook,
      event: event,
      request_body: payload.to_json
    )
  end
end
This separation of concerns improves reliability: services handle business logic; models handle persistence timing.
The One-Second Delay
Even with after_commit callbacks, we added a one-second delay before job execution:
WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
Why delay when transactions already committed? Several reasons:
Database replication lag: Production databases often use read replicas. Even after primary commits, replicas might lag by milliseconds. Workers connecting to replicas may not see just-committed records.
Distributed system clock skew: In distributed environments, worker server clocks might differ slightly from database server clocks. A one-second buffer accounts for minor time discrepancies.
Transaction visibility: Some database isolation levels don't make transactions immediately visible to other connections. The delay ensures complete visibility across all database connections.
Debugging clarity: If failures still occur, the delay proves they're not simple race conditions—they indicate deeper issues requiring investigation.
The one-second delay adds negligible latency to webhooks (which already involve network requests taking hundreds of milliseconds) whilst dramatically improving reliability.
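If replica lag ever turns out to be the dominant cause rather than in-flight transactions, one complementary safeguard, sketched here assuming a Rails 6+ multi-database setup with reading and writing roles (not part of the fix described above), is to pin the job's initial lookup to the primary:
class WebhookDeliveryJob < ApplicationJob
  def perform(webhook_delivery_id)
    # Read from the primary (writing) connection so a lagging replica
    # can never hide a freshly committed record from this lookup.
    delivery = ActiveRecord::Base.connected_to(role: :writing) do
      WebhookDelivery.unscoped.find(webhook_delivery_id)
    end

    # ...tenant setup and delivery continue as shown in the next section
  end
end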
Solution 2: Multi-Tenant Context Preservation
A second issue emerged: even when jobs found delivery records, they failed with tenant scoping errors:
ActsAsTenant::Errors::NoTenantSet:
ActsAsTenant::current_tenant is not set
Background jobs execute outside HTTP request context, so tenant information isn't automatically available. Jobs need explicit tenant context:
class WebhookDeliveryJob < ApplicationJob
  queue_as :high

  def perform(webhook_delivery_id)
    # Load the delivery record without tenant scope
    # (the job has no tenant context yet)
    delivery = WebhookDelivery.unscoped.find(webhook_delivery_id)

    # Extract the agency from the delivery
    agency_id = delivery.webhook.agency_id

    # Set tenant context for all subsequent queries
    ActsAsTenant.with_tenant(Agency.find(agency_id)) do
      # Now reload the delivery within tenant scope
      delivery = WebhookDelivery.find(webhook_delivery_id)

      # Perform webhook delivery with proper tenant context
      deliver_webhook(delivery)
    end
  end
end
This pattern ensures jobs have proper tenant context even when executing hours later during retry attempts.
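In practice that tenant-wrapping boilerplate repeats across every job that touches tenant-scoped data. One way to share it, sketched below with an assumed concern name and lookup logic rather than code from this codebase, is a small module that resolves the agency and yields with the tenant set:
# Hypothetical shared concern for tenant-scoped jobs (illustrative only)
module TenantScopedJob
  extend ActiveSupport::Concern

  private

  # Resolve the agency from the delivery and run the block with tenant context set
  def with_tenant_for(delivery)
    agency = Agency.find(delivery.webhook.agency_id)
    ActsAsTenant.with_tenant(agency) { yield }
  end
end

class WebhookDeliveryJob < ApplicationJob
  include TenantScopedJob

  def perform(webhook_delivery_id)
    delivery = WebhookDelivery.unscoped.find(webhook_delivery_id)

    with_tenant_for(delivery) do
      # Reload within tenant scope, then deliver as before
      deliver_webhook(WebhookDelivery.find(webhook_delivery_id))
    end
  end
end
The behaviour is identical to the explicit version above; the concern simply keeps the tenant logic in one place.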
Debugging Approach
Solving these race conditions required systematic debugging:
1. Add Debug Logging
First, we added comprehensive logging to understand timing:
def queue_delivery_job
  Rails.logger.info "[Webhook] After-commit callback triggered for delivery #{id}"
  Rails.logger.info "[Webhook] Queuing job at #{Time.current}"

  job = WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
  Rails.logger.info "[Webhook] Job queued with ID #{job.provider_job_id}"
end
This logging revealed the tight timing between record creation and job execution.
2. Monitor Transaction Lifecycle
We added instrumentation to track transaction commits:
class WebhookDelivery < ApplicationRecord
  after_create :log_creation
  after_commit :log_commit, on: :create

  private

  def log_creation
    Rails.logger.info "[Webhook] Delivery #{id} created (transaction not committed)"
  end

  def log_commit
    Rails.logger.info "[Webhook] Delivery #{id} committed to database"
  end
end
Logs showed "created" and "committed" messages within 5-10ms—explaining why fast workers encountered records not yet visible.
3. Reproduce Under Load
Development environments couldn't reproduce the issue due to slower processing. We created a load test simulating production traffic:
# Hammer the system with concurrent webhook deliveries
100.times.map do |i|
  Thread.new do
    property = Property.find(test_property_id)
    property.update!(price: 1000 + i)
    sleep 0.01
  end
end.each(&:join)
Running this against a production-like environment with fast workers reproduced the failures consistently, allowing debugging.
Testing Transaction-Aware Code
Testing code that depends on transaction timing requires careful test setup:
RSpec.describe WebhookDelivery do
  it "queues delivery job after transaction commits" do
    delivery = nil

    # Wrap in an explicit transaction to test timing
    ActiveRecord::Base.transaction do
      delivery = create(:webhook_delivery)

      # Job shouldn't be queued yet (transaction not committed)
      expect(WebhookDeliveryJob).not_to have_been_enqueued.with(delivery.id)
    end

    # After the transaction commits, the job should be queued
    expect(WebhookDeliveryJob).to have_been_enqueued
      .with(delivery.id)
      .at(1.second.from_now)
  end

  it "preserves tenant context in background jobs" do
    agency = create(:agency)
    ActsAsTenant.current_tenant = agency
    delivery = create(:webhook_delivery)

    # Execute the job inline
    WebhookDeliveryJob.perform_now(delivery.id)

    # The job should succeed with proper tenant context
    expect(delivery.reload.status).to eq('succeeded')
  end
end
These tests verify both transaction timing and tenant context preservation.
Monitoring and Alerts
Post-fix, we added monitoring to detect if race conditions reoccur:
class WebhookDeliveryJob < ApplicationJob
  rescue_from ActiveRecord::RecordNotFound do |exception|
    # Race condition detected!
    Rails.logger.error "[Webhook] Race condition: #{exception.message}"

    # Alert the operations team
    ErrorTracker.notify(exception,
      context: { webhook_delivery_id: arguments.first },
      severity: 'critical'
    )

    # Retry the job (the record might exist after a delay)
    retry_job wait: 5.seconds
  end
end
This monitoring catches race conditions if they appear, provides debugging context, and automatically retries jobs.
What's Next
These race condition fixes established patterns for all background job interactions: always use after_commit for job queuing, always preserve tenant context explicitly, and always add buffer time for distributed system timing variations.
Future improvements might include circuit breakers that detect persistent failures and stop retrying temporarily, automatic tenant context detection from job arguments, and distributed tracing that tracks webhook delivery timing end-to-end across multiple services.
The debugging experience reinforced that production environments reveal timing issues impossible to detect in development. Comprehensive logging, systematic reproduction, and careful transaction management are essential for building reliable distributed systems handling high-throughput asynchronous processing.
