Webhook delivery seems straightforward: create a record, queue a background job, send an HTTP request. But in production systems with high-velocity background workers and multi-tenant isolation, subtle timing issues emerge. Week 38's production deployment revealed race conditions where webhook jobs executed before database transactions committed, causing intermittent "record not found" errors that appeared only under load.
This article explores how we debugged these race conditions, understood their root causes, and implemented solutions that ensure reliable webhook delivery even in high-throughput scenarios.
The Problem: Intermittent Job Failures
Production monitoring showed webhook delivery jobs failing with ActiveRecord::RecordNotFound exceptions:
WebhookDeliveryJob failed:
Couldn't find WebhookDelivery with 'id'=12345
app/jobs/webhook_delivery_job.rb:8:in `perform'
The failures were intermittent—the same webhook might succeed, then fail, then succeed again. Development environments showed no issues. The failures only appeared in production with Heroku's responsive worker dynos processing jobs within milliseconds of creation.
Initial investigation suggested data corruption or database replication lag. But closer examination revealed the timing:
1. Property update triggers webhook creation
2. WebhookDelivery record saves to database
3. Job queues immediately: WebhookDeliveryJob.perform_later(delivery.id)
4. Sidekiq worker picks up the job in < 10ms
5. Job tries to load the record: WebhookDelivery.find(delivery.id)
6. The database transaction hasn't committed yet, so the record doesn't exist from the worker's perspective
7. Job fails with RecordNotFound
The race condition occurred when workers processed jobs faster than database transactions committed—a pattern that occurs in high-performance production environments but rarely in development.
Understanding Database Transaction Timing
Rails wraps controller actions and many model operations in database transactions. These transactions don't commit immediately—they wait until the outermost transaction block completes:
# Controller action (implicit transaction)
def update
  @property.update!(property_params) # 1. Updates property
  @property.trigger_webhooks         # 2. Creates webhook deliveries
                                     # 3. Queues background jobs
  # 4. Transaction commits here
end
When webhook delivery jobs queue during step 2, the transaction hasn't committed. If workers execute before step 4 completes, they can't find the webhook delivery records.
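The trigger_webhooks call is where the delivery records come from, though its implementation isn't shown above. A minimal sketch of what such a method might look like (the agency association, event name, and payload here are illustrative assumptions, not the actual code) makes it clearer that step 2 runs entirely inside the controller's transaction:
class Property < ApplicationRecord
  belongs_to :agency

  # Illustrative only: create one delivery per webhook subscribed to this event
  def trigger_webhooks
    agency.webhooks.where(event: 'property.updated').find_each do |webhook|
      WebhookService.new.deliver_webhook(webhook, 'property.updated', as_json)
    end
  end
end
Every delivery created this way shares the controller's transaction, which is exactly why queuing the job at creation time is unsafe.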
Solution 1: After-Commit Callbacks
The correct approach: queue jobs only after transactions commit. Rails provides after_commit callbacks for exactly this scenario:
class WebhookDelivery < ApplicationRecord
  # Queue the job AFTER the database transaction commits
  after_commit :queue_delivery_job, on: :create

  private

  def queue_delivery_job
    # Add a slight delay to ensure the transaction is fully visible.
    # Even with after_commit, some database replication scenarios need this.
    WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
  end
end
Moving job queuing from service classes to after_commit callbacks ensures jobs never queue before records exist in the database.
Refactoring Webhook Service
The webhook service previously queued jobs directly:
# Before: Service queues jobs immediately
class WebhookService
  def deliver_webhook(webhook, event, payload)
    delivery = WebhookDelivery.create!(
      webhook: webhook,
      event: event,
      request_body: payload.to_json
    )

    # This queues before the transaction commits!
    WebhookDeliveryJob.perform_later(delivery.id)
  end
end
After refactoring, the service only creates records:
# After: Model queues jobs after commit
class WebhookService
  def deliver_webhook(webhook, event, payload)
    # Just create the delivery record.
    # The after_commit callback handles job queueing once the transaction commits.
    WebhookDelivery.create!(
      webhook: webhook,
      event: event,
      request_body: payload.to_json
    )
  end
end
This separation of concerns improves reliability: services handle business logic; models handle persistence timing.
The One-Second Delay
Even with after_commit callbacks, we added a one-second delay before job execution:
WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
Why delay when transactions already committed? Several reasons:
Database replication lag: Production databases often use read replicas. Even after primary commits, replicas might lag by milliseconds. Workers connecting to replicas may not see just-committed records.
Distributed system clock skew: In distributed environments, worker server clocks might differ slightly from database server clocks. A one-second buffer accounts for minor time discrepancies.
Transaction visibility: Some database isolation levels don't make transactions immediately visible to other connections. The delay ensures complete visibility across all database connections.
Debugging clarity: If failures still occur, the delay proves they're not simple race conditions—they indicate deeper issues requiring investigation.
The one-second delay adds negligible latency to webhooks (which already involve network requests taking hundreds of milliseconds) whilst dramatically improving reliability.
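If replica lag ever turns out to be the dominant cause rather than in-flight transactions, one complementary safeguard, sketched here assuming a Rails 6+ multi-database setup with reading and writing roles (not part of the fix described above), is to pin the job's initial lookup to the primary:
class WebhookDeliveryJob < ApplicationJob
  def perform(webhook_delivery_id)
    # Read from the primary (writing) connection so a lagging replica
    # can never hide a freshly committed record from this lookup.
    delivery = ActiveRecord::Base.connected_to(role: :writing) do
      WebhookDelivery.unscoped.find(webhook_delivery_id)
    end

    # ...tenant setup and delivery continue as shown in the next section
  end
end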
Solution 2: Multi-Tenant Context Preservation
A second issue emerged: even when jobs found delivery records, they failed with tenant scoping errors:
ActsAsTenant::Errors::NoTenantSet:
ActsAsTenant::current_tenant is not set
Background jobs execute outside HTTP request context, so tenant information isn't automatically available. Jobs need explicit tenant context:
class WebhookDeliveryJob < ApplicationJob
  queue_as :high

  def perform(webhook_delivery_id)
    # Load the delivery record without tenant scope
    # (the job has no tenant context yet)
    delivery = WebhookDelivery.unscoped.find(webhook_delivery_id)

    # Extract the agency from the delivery
    agency_id = delivery.webhook.agency_id

    # Set tenant context for all subsequent queries
    ActsAsTenant.with_tenant(Agency.find(agency_id)) do
      # Now reload the delivery within tenant scope
      delivery = WebhookDelivery.find(webhook_delivery_id)

      # Perform webhook delivery with proper tenant context
      deliver_webhook(delivery)
    end
  end
end
This pattern ensures jobs have proper tenant context even when executing hours later during retry attempts.
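In practice that tenant-wrapping boilerplate repeats across every job that touches tenant-scoped data. One way to share it, sketched below with an assumed concern name and lookup logic rather than code from this codebase, is a small module that resolves the agency and yields with the tenant set:
# Hypothetical shared concern for tenant-scoped jobs (illustrative only)
module TenantScopedJob
  extend ActiveSupport::Concern

  private

  # Resolve the agency from the delivery and run the block with tenant context set
  def with_tenant_for(delivery)
    agency = Agency.find(delivery.webhook.agency_id)
    ActsAsTenant.with_tenant(agency) { yield }
  end
end

class WebhookDeliveryJob < ApplicationJob
  include TenantScopedJob

  def perform(webhook_delivery_id)
    delivery = WebhookDelivery.unscoped.find(webhook_delivery_id)

    with_tenant_for(delivery) do
      # Reload within tenant scope, then deliver as before
      deliver_webhook(WebhookDelivery.find(webhook_delivery_id))
    end
  end
end
The behaviour is identical to the explicit version above; the concern simply keeps the tenant logic in one place.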
Debugging Approach
Solving these race conditions required systematic debugging:
1. Add Debug Logging
First, we added comprehensive logging to understand timing:
def queue_delivery_job
  Rails.logger.info "[Webhook] After-commit callback triggered for delivery #{id}"
  Rails.logger.info "[Webhook] Queuing job at #{Time.current}"

  job = WebhookDeliveryJob.set(wait: 1.second).perform_later(id)
  Rails.logger.info "[Webhook] Job queued with ID #{job.provider_job_id}"
end
This logging revealed the tight timing between record creation and job execution.
2. Monitor Transaction Lifecycle
We added instrumentation to track transaction commits:
class WebhookDelivery < ApplicationRecord
  after_create :log_creation
  after_commit :log_commit, on: :create

  private

  def log_creation
    Rails.logger.info "[Webhook] Delivery #{id} created (transaction not committed)"
  end

  def log_commit
    Rails.logger.info "[Webhook] Delivery #{id} committed to database"
  end
end
Logs showed "created" and "committed" messages within 5-10ms—explaining why fast workers encountered records not yet visible.
3. Reproduce Under Load
Development environments couldn't reproduce the issue due to slower processing. We created a load test simulating production traffic:
# Hammer the system with concurrent webhook deliveries
100.times.map do |i|
  Thread.new do
    property = Property.find(test_property_id)
    property.update!(price: 1000 + i)
    sleep 0.01
  end
end.each(&:join)
Running this against a production-like environment with fast workers reproduced the failures consistently, allowing debugging.
Testing Transaction-Aware Code
Testing code that depends on transaction timing requires careful test setup:
RSpec.describe WebhookDelivery do
  it "queues delivery job after transaction commits" do
    delivery = nil

    # Wrap in an explicit transaction to test timing
    ActiveRecord::Base.transaction do
      delivery = create(:webhook_delivery)

      # Job shouldn't be queued yet (transaction not committed)
      expect(WebhookDeliveryJob).not_to have_been_enqueued.with(delivery.id)
    end

    # After the transaction commits, the job should be queued
    expect(WebhookDeliveryJob).to have_been_enqueued
      .with(delivery.id)
      .at(1.second.from_now)
  end

  it "preserves tenant context in background jobs" do
    agency = create(:agency)
    ActsAsTenant.current_tenant = agency
    delivery = create(:webhook_delivery)

    # Execute the job inline
    WebhookDeliveryJob.perform_now(delivery.id)

    # The job should succeed with proper tenant context
    expect(delivery.reload.status).to eq('succeeded')
  end
end
These tests verify both transaction timing and tenant context preservation.
Monitoring and Alerts
Post-fix, we added monitoring to detect if race conditions reoccur:
class WebhookDeliveryJob < ApplicationJob
  rescue_from ActiveRecord::RecordNotFound do |exception|
    # Race condition detected!
    Rails.logger.error "[Webhook] Race condition: #{exception.message}"

    # Alert the operations team
    ErrorTracker.notify(exception,
      context: { webhook_delivery_id: arguments.first },
      severity: 'critical'
    )

    # Retry the job (the record might exist after a delay)
    retry_job wait: 5.seconds
  end
end
This monitoring catches race conditions if they appear, provides debugging context, and automatically retries jobs.
What's Next
These race condition fixes established patterns for all background job interactions: always use after_commit for job queuing, always preserve tenant context explicitly, and always add buffer time for distributed system timing variations.
Future improvements might include circuit breakers that detect persistent failures and stop retrying temporarily, automatic tenant context detection from job arguments, and distributed tracing that tracks webhook delivery timing end-to-end across multiple services.
The debugging experience reinforced that production environments reveal timing issues impossible to detect in development. Comprehensive logging, systematic reproduction, and careful transaction management are essential for building reliable distributed systems handling high-throughput asynchronous processing.
