A notification system’s primary job is to deliver messages, but the real challenge is ensuring those messages arrive reliably and promptly, even when the volume spikes to millions per hour.
Let’s say we want to notify a user over push and email. The request might look like this:
{
  "userId": "user-12345",
  "message": {
    "title": "New Message!",
    "body": "You have a new message from Alice.",
    "type": "chat"
  },
  "deliveryChannels": ["push", "email"]
}
This JSON payload gets sent to our notification service.
The notification service is the central hub. It takes this request and figures out how to deliver it. For "push," it needs to talk to Apple’s APNS (Apple Push Notification Service) or Google’s FCM (Firebase Cloud Messaging). For "email," it’ll talk to an SMTP server or a third-party email provider like SendGrid. For "SMS," it’ll interface with an SMS gateway like Twilio.
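The routing step described above can be sketched as a lookup from channel name to a sender function. This is a minimal illustration, not the service's actual code: the handler names (`send_push`, `send_email`, `send_sms`) and the `dispatch` function are hypothetical, and the handlers only return a string where a real one would call APNS/FCM, an email provider, or an SMS gateway.

```python
from typing import Callable, Dict, List

# Hypothetical senders. Real implementations would call APNS/FCM,
# an email provider, or an SMS gateway instead of returning a string.
def send_push(user_id: str, message: dict) -> str:
    return f"push:{user_id}:{message['title']}"

def send_email(user_id: str, message: dict) -> str:
    return f"email:{user_id}:{message['title']}"

def send_sms(user_id: str, message: dict) -> str:
    return f"sms:{user_id}:{message['title']}"

# Channel name -> handler. Adding a channel means adding one entry here.
CHANNEL_HANDLERS: Dict[str, Callable[[str, dict], str]] = {
    "push": send_push,
    "email": send_email,
    "sms": send_sms,
}

def dispatch(request: dict) -> List[str]:
    """Route one notification request to a handler per requested channel."""
    results = []
    for channel in request["deliveryChannels"]:
        handler = CHANNEL_HANDLERS.get(channel)
        if handler is None:
            # Unknown channels are reported rather than silently dropped.
            results.append(f"unsupported:{channel}")
            continue
        results.append(handler(request["userId"], request["message"]))
    return results
```

Calling `dispatch` with the payload shown earlier would invoke the push and email handlers in that order.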
Here’s a look at the core components:
- Ingestion API: This is the entry point. It receives notification requests from other services (e.g., your chat app, your e-commerce backend). It needs to be highly available and scalable. Think Nginx or HAProxy in front of multiple instances of your API service.
- Message Queue: Once a request hits the API, it’s immediately pushed into a message queue like Kafka or RabbitMQ. This decouples the ingestion from the actual sending and acts as a buffer. If your sending workers get overloaded, messages just pile up in the queue, preventing data loss and allowing the system to catch up. A typical Kafka topic might have 10-20 partitions for high throughput.
- Worker Services: These are the independent services that consume messages from the queue. You’ll have separate worker pools for push, email, and SMS.
  - Push Workers: These connect to APNS/FCM. They maintain persistent connections or efficiently batch requests. They need to handle device token management (registration, unregistration, invalid tokens).
  - Email Workers: These connect to your SMTP server or an email API. They’ll handle templating, sending, and potentially retries for failed deliveries.
  - SMS Workers: These connect to your SMS gateway. They handle message formatting, carrier specifics, and delivery receipts.
- Device Token Management: For push notifications, you need a robust system to store and update device tokens. Users might have multiple devices, and tokens can expire or become invalid. A database like Redis or DynamoDB is good for fast lookups and updates.
- Rate Limiting & Throttling: To avoid overwhelming downstream providers (APNS, FCM, Twilio, SendGrid) or getting blocked, you need to implement rate limiting at the worker level and potentially at the ingestion API. For example, you might cap APNS requests at 10,000 per minute per connection, or Twilio at 50 messages per second. These figures are illustrative; the real limits depend on the provider and on your account, so check them before hard-coding anything.
- Retry Mechanism: Failures happen. Network glitches, provider downtime, temporary errors. Your workers must implement intelligent retry logic. Exponential backoff is key here: don’t hammer a failing service. Wait 5 seconds, then 10, then 20, etc., up to a maximum retry count (e.g., 5 retries).
- Analytics & Monitoring: You need to know what’s happening. Track delivery rates, latency, error rates per channel, and per provider. Tools like Prometheus for metrics and Grafana for dashboards are essential. Logging is critical; use a centralized logging system like Elasticsearch/Kibana.
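Two of the components above, rate limiting and retry with exponential backoff, are easier to see in code. The following is a minimal single-process sketch: `TokenBucket`, `backoff_delays`, and `send_with_retry` are hypothetical names, the token bucket is one common way to rate-limit (the text does not prescribe an algorithm), and the 5-second base and 5 retries simply mirror the example numbers in the list. A production worker would usually add jitter and share limits across processes.

```python
import time
from typing import Callable, List

class TokenBucket:
    """Allow up to `rate` sends per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def backoff_delays(base: float = 5.0, max_retries: int = 5) -> List[float]:
    """Delays before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(max_retries)]

def send_with_retry(send: Callable[[], bool],
                    sleep: Callable[[float], None] = time.sleep,
                    base: float = 5.0, max_retries: int = 5) -> bool:
    """Try once, then retry with exponential backoff until success or exhaustion."""
    if send():
        return True
    for delay in backoff_delays(base, max_retries):
        sleep(delay)
        if send():
            return True
    return False
```

With the defaults, `backoff_delays()` yields 5, 10, 20, 40, and 80 seconds, so a message that keeps failing is abandoned (or moved to a dead-letter queue, if you have one) after roughly two and a half minutes.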
Consider the flow for a single push notification:
- Your app server calls POST /notifications with the payload.
- The Ingestion API validates the payload and publishes it to the notifications Kafka topic.
- A Push Worker consumes the message.
- The worker looks up user-12345’s device tokens from Redis.
- It formats the message for APNS (or FCM).
- It sends the payload to APNS.
- APNS returns a success/failure response.
- The worker logs the outcome and, if it’s a permanent failure (e.g., invalid token), marks the token for deletion in Redis.
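The steps above can be walked through with in-memory stand-ins. This is a sketch under stated assumptions, not a real integration: a `queue.Queue` replaces the Kafka topic, a dict replaces Redis, and `fake_apns_send` is a hypothetical stub that rejects one hard-coded token. Only the `aps`/`alert` shape of the push body follows APNS's documented payload format.

```python
import json
import queue
from typing import Dict, List, Set, Tuple

# In-memory stand-ins (assumptions): a Queue for the Kafka topic,
# a dict for the Redis token store, a list for the delivery log.
notifications_topic: "queue.Queue[str]" = queue.Queue()
token_store: Dict[str, Set[str]] = {"user-12345": {"token-a", "token-stale"}}
delivery_log: List[Tuple[str, str, str]] = []

def ingest(payload: dict) -> None:
    """Steps 1-2: validate the request and publish it to the topic."""
    for field in ("userId", "message", "deliveryChannels"):
        if field not in payload:
            raise ValueError(f"missing field: {field}")
    notifications_topic.put(json.dumps(payload))

def fake_apns_send(token: str, body: dict) -> str:
    """Hypothetical provider stub: 'ok', or 'invalid_token' for a stale token."""
    return "invalid_token" if token == "token-stale" else "ok"

def push_worker_step() -> None:
    """Steps 3-8: consume one message, send to each device, record outcomes."""
    request = json.loads(notifications_topic.get())
    if "push" not in request["deliveryChannels"]:
        return
    user_id = request["userId"]
    msg = request["message"]
    # APNS-style body: the alert lives under the "aps" key.
    body = {"aps": {"alert": {"title": msg["title"], "body": msg["body"]}}}
    # sorted() copies the set, so discarding during the loop is safe.
    for token in sorted(token_store.get(user_id, set())):
        outcome = fake_apns_send(token, body)
        delivery_log.append((user_id, token, outcome))
        if outcome == "invalid_token":
            # Permanent failure: stop sending to this token.
            token_store[user_id].discard(token)
```

After one `ingest` and one `push_worker_step` with the sample payload, the log holds one success and one invalid-token entry, and only `token-a` remains in the store.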
The most challenging part of scaling is managing the device tokens and the state of deliveries. When APNS or FCM tells you a token is invalid, you must stop sending to it immediately. If you don’t, you’ll waste resources and potentially get your sender ID flagged. Furthermore, you need to act on the failure signals these services return. APNS’s legacy binary interface exposed a separate feedback service for this; the current HTTP/2 provider API instead reports an unregistered token in the response to the send request itself (for example, a 410 status with the reason "Unregistered"), and your workers need to process those responses to prune invalid device tokens.
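One way to make this concrete is to map each provider response to an action. The sketch below assumes APNS-style HTTP status codes and reason strings ("BadDeviceToken", "Unregistered") as I understand them from Apple's HTTP/2 provider API; treat the exact values as assumptions and confirm them against the provider's documentation. The `Action` enum and `classify` function are hypothetical names.

```python
from enum import Enum

class Action(Enum):
    DONE = "done"              # delivered to the provider
    DROP_TOKEN = "drop_token"  # permanent: delete the device token
    RETRY = "retry"            # transient: retry with backoff
    FAIL = "fail"              # permanent, but not the token's fault

# Reasons that mean the token itself is no longer usable (assumed values).
PERMANENT_TOKEN_REASONS = {"BadDeviceToken", "Unregistered"}

def classify(status: int, reason: str = "") -> Action:
    """Decide what a worker should do with one provider response."""
    if status == 200:
        return Action.DONE
    if reason in PERMANENT_TOKEN_REASONS:
        return Action.DROP_TOKEN
    if status == 429 or status >= 500:
        # Throttled or provider-side error: worth retrying later.
        return Action.RETRY
    return Action.FAIL
```

Keeping this decision in one function means the retry loop and the token store never have to interpret raw provider responses themselves.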
The next hurdle you’ll face is handling different user preferences for notification delivery, such as "only send me push notifications between 9 AM and 9 PM."