How We Send over One Million Notifications per Month to Learners around the World
by Jess Monroe
Whether it’s a teacher sending a message or a classmate posting a comment in the classroom, notifying learners about activity on the platform is important to their educational experience.
Before we built in-app notifications, the only way learners could find out about new activity was via email. However, only 2-3% of learners on Outschool had a confirmed email address associated with their accounts. Furthermore, most email services do not allow users under 13, so those learners were completely unable to receive notifications.
To further our goal of standing for learners, we decided to notify learners where they spend time and in a way that is convenient for them. As 90% of the learners attending classes on Outschool visit learner spaces, we realized that building in-app notifications into learner spaces would be an effective way to help learners stay updated.
As one of the senior engineers on the learner-focused team at Outschool, I was given the task of designing and leading this effort. In designing the system, we weighed many different factors and candidate designs. I want to walk you through how we arrived at the final one.
We started by doing some research on user behavior using our internal analytics and observability systems. Based on that research, we determined that we would generate between one and two million notifications per month. We also made an estimate for the maximum number of concurrent learners that are using our platform at any given time. This allowed us to make decisions with the scale of the problem in mind.
When building a new system, we like to evaluate multiple possible options and weigh the pros and cons of each option. This allows us to have a balanced perspective as well as a discussion about why we went a certain route and what we might want to consider in the future if things turn out differently from the way they appear on paper.
Outschool uses a service-based architecture under the hood. Functionality for various services is exposed through a unified GraphQL (GQL) API. For the notifications service, however, we decided to create a separate GQL API exposed under a different domain. We went in a different direction with the notifications service for a couple of reasons.
First of all, notifications are, by their nature, different from the other types of functionality we expose through our API. They change, potentially rapidly, due to background activity, and we want to be able to surface them in the UI soon after the triggering action takes place. This means a large volume of network traffic and many concurrently handled requests. They are also uniform in structure and sequential in nature. Notifications on our platform consist only of text, photos, a sender (i.e., a learner or teacher), and a URL to see the rest of the content. Thus, there isn’t a strong need to pull information from other services, which means the notifications service doesn’t need to be connected to the rest of the GQL API and can stand on its own.
Finally, we might want to experiment with different ways of sending notifications in the future. If we exposed notifications through the unified GQL API, all services would have to support things that are specific to notifications, such as persistent connections, WebSockets, or server-sent events. Separating out the notifications service gave us the freedom to experiment and find the right way to send notifications for us.
Persistence of Notifications
We already had a model in our DB representing a notification sent to a given user about a certain object. It is used internally for mobile push notifications, for preventing duplicate emails, and for allowing a user to “subscribe” to a certain action being performed on an object later. Thus, we decided to leverage this model for in-app notifications. We added a column that allows us to differentiate between email, push, and in-app notifications.
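As a rough illustration of the idea, here is a minimal sketch of what such a row and the new column might look like. The post does not show the real schema, so every field name below, including `channel`, is an assumption made for the example.

```typescript
// Hypothetical shape of a notification row. All field names are assumptions
// made for illustration; the post does not show the real schema.
type NotificationChannel = "email" | "push" | "in_app";

interface NotificationRow {
  id: string;
  userId: string; // recipient of the notification
  objectId: string; // the object (e.g. a classroom post) the notification is about
  channel: NotificationChannel; // the added column that tells the three kinds apart
  sentAt: Date;
}

// Return only the rows that an in-app notification list should show for a user.
function inAppNotificationsFor(
  rows: NotificationRow[],
  userId: string
): NotificationRow[] {
  return rows.filter((row) => row.userId === userId && row.channel === "in_app");
}
```

With a discriminating column like this, the existing email and push code paths can keep filtering on their own values while the in-app list reads only its own rows.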
Based on the research we conducted at the start of the project, we knew this table would grow rapidly once we rolled out this change. Thus, we decided to limit the overall growth of the table by having in-app notifications expire after a certain amount of time. After discussing this with stakeholders, we settled on a 30-day lifetime for in-app notifications, after which they are deleted from the table.
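The expiry rule can be sketched as a small pure function. This is an in-memory illustration only: in practice the cleanup would more likely be a scheduled delete against the DB, and the type and function names here are hypothetical.

```typescript
// 30-day lifetime for in-app notifications, as described above.
const IN_APP_LIFETIME_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical row shape; field names are assumptions for this sketch.
interface StoredNotification {
  id: string;
  channel: "email" | "push" | "in_app";
  sentAt: Date;
}

// Only in-app rows expire; email and push rows are left untouched.
function isExpired(row: StoredNotification, now: Date): boolean {
  if (row.channel !== "in_app") {
    return false;
  }
  const ageMs = now.getTime() - row.sentAt.getTime();
  return ageMs > IN_APP_LIFETIME_DAYS * MS_PER_DAY;
}

// Return the rows that survive a cleanup pass run at time `now`.
function pruneExpired(rows: StoredNotification[], now: Date): StoredNotification[] {
  return rows.filter((row) => !isExpired(row, now));
}
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test.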
Timing of Notifications
At Outschool, most of the interaction outside of the Zoom classroom is not real-time. For example, a teacher may post something to the classroom about homework but a notification sent in minutes is just as valuable as a notification sent in a few seconds. Thus, we decided to not require real-time notifications as part of the first pass. Instead, we chose a relatively loose SLA of no more than ten minutes between an action being taken and a notification being received by a learner.
Real-time notifications require a complex architecture and are generally implemented through persistent connections, which have different scaling requirements than request-based servers. Choosing a looser SLA gave us more flexibility in which infrastructure we could use.
Sending Notifications to Clients
We considered three different solutions for delivering notifications to clients: polling, WebSockets, and server-sent events (SSE).
The benefit of a polling-based solution is that it is much simpler to understand and to test. It uses technologies that we already use extensively in our application, so it doesn’t require any specialized knowledge. The biggest issue with polling is scaling in the long term. Each concurrent learner session increases the load on the server (and transitively the DB) by some amount.
The primary load caused by the polling solution described above comes from the regularly scheduled read queries that will likely return “no new notifications.” Thus, a simple optimization is to push from the server rather than have the client poll for notifications.
Our current GQL framework (Apollo) provides an easy-to-use set of tools for implementing GQL subscriptions. GQL subscriptions are similar to queries, but the results of the subscription are updated in near real time through a push from the server. Generally, this is implemented through a WebSocket connection between the client and server. Scaling WebSocket servers can be difficult since it entails always having a connection open between the client and the server. Scaling up a WebSocket server generally means spawning new instances of the server worker and load balancing between them. This generally requires some form of coordination server or shared store between workers.
Such a push-based service would effectively scale with the number of notifications sent rather than the number of concurrent learner sessions.
SSE is similar to the WebSocket approach outlined above, but uses server-sent events rather than WebSockets. SSE is generally considered a better fit than WebSockets when you only need unidirectional communication from the server to the client.
SSE uses HTTP instead of a distinct protocol, so it can leverage various features of HTTP such as headers, cookies, and multiplexing (for HTTP/2 servers). SSE is also better at traversing firewalls that perform packet inspection, such as the ones that generally exist in the networks of corporate and educational institutions.
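To make the "plain HTTP" point concrete, here is a minimal sketch of how a single event is serialized in the `text/event-stream` format that SSE uses: optional `id:` and `event:` fields, one `data:` line per line of payload, and a blank line to end the event. The `SseEvent` type and `formatSseEvent` function are hypothetical names for this example, not part of any library, and this is not something the system described here has implemented.

```typescript
// Hypothetical event shape for this sketch.
interface SseEvent {
  id?: string; // lets a reconnecting client resume via the Last-Event-ID header
  event?: string; // optional event type name
  data: string; // payload, often JSON text
}

// Serialize one event as text/event-stream: field lines followed by a blank line.
function formatSseEvent(e: SseEvent): string {
  const lines: string[] = [];
  if (e.id !== undefined) {
    lines.push("id: " + e.id);
  }
  if (e.event !== undefined) {
    lines.push("event: " + e.event);
  }
  // A multi-line payload must be sent as several consecutive data: lines.
  for (const line of e.data.split(/\r\n|\r|\n/)) {
    lines.push("data: " + line);
  }
  return lines.join("\n") + "\n\n";
}
```

A server would write strings like these to a long-lived HTTP response whose content type is `text/event-stream`, which is why ordinary headers and cookies keep working.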
To avoid having persistent connections, we decided to use a polling-based architecture. This meant we would be dealing with a request throughput that scales with the number of concurrent users, but it meant that the requests could be handled without any persistence at the server level. This made it much easier for us to scale the service horizontally with more web server workers.
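The trade-off can be put into rough numbers. The sketch below assumes every session polls at a fixed interval and that the worst-case delivery delay is roughly the worker's processing time plus one full poll interval. Both are simplifications, and the function names and example figures are illustrative, not our actual traffic.

```typescript
// Steady-state request rate if every concurrent session polls once per interval.
function pollRequestsPerSecond(
  concurrentSessions: number,
  pollIntervalSeconds: number
): number {
  if (pollIntervalSeconds <= 0) {
    throw new RangeError("pollIntervalSeconds must be positive");
  }
  return concurrentSessions / pollIntervalSeconds;
}

// Simplified worst case: the notification is stored just after a poll, so the
// learner waits for the worker to finish plus one full poll interval.
function withinSla(
  pollIntervalSeconds: number,
  workerDelaySeconds: number,
  slaSeconds: number = 600 // the ten-minute SLA described earlier
): boolean {
  return workerDelaySeconds + pollIntervalSeconds <= slaSeconds;
}
```

For example, under these assumptions 6,000 hypothetical sessions polling every 60 seconds produce 100 requests per second, and halving the interval doubles that rate. A loose SLA leaves room to lengthen the interval if the load becomes a concern.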
If the number of requests becomes problematic in the future, we can batch notification queries with other requests. If the DB performance becomes problematic, we can optimize through standard DB optimizations as well as caching in some way.
Once the effort spent on these optimizations exceeds the work required to implement server-sent events, we will switch over to server-sent events. We do not plan to use WebSockets because of the potential issues with supporting another protocol, as outlined above.
After considering all these options, we were able to decide on the final design for in-app notifications. Any time content that we wanted to notify a learner about was posted, we would send an async request to the notifications service worker via SQS. This worker would do the heavy lifting of determining who should receive a notification and creating the relevant rows in the DB.
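The worker's core step can be sketched as a pure function from one queued message to the rows it should create. The message shape, the field names, and the recipient rule (every classroom member except the sender) are assumptions made for this example; the real rules for who gets notified are not described here.

```typescript
// Hypothetical payload of the queued message; field names are assumptions.
interface ContentPostedMessage {
  contentId: string;
  senderId: string;
}

// Hypothetical row the worker would insert for each recipient.
interface InAppNotificationRow {
  userId: string;
  objectId: string;
  senderId: string;
  channel: "in_app";
  sentAt: Date;
}

// Assumed rule: notify each classroom member once, but never the sender.
function buildNotificationRows(
  message: ContentPostedMessage,
  classroomMemberIds: string[],
  now: Date
): InAppNotificationRow[] {
  return classroomMemberIds
    .filter((id, index, all) => id !== message.senderId && all.indexOf(id) === index)
    .map((userId) => ({
      userId,
      objectId: message.contentId,
      senderId: message.senderId,
      channel: "in_app" as const,
      sentAt: now,
    }));
}
```

Keeping this step free of queue and DB calls means the recipient logic can be tested without SQS, with the worker only handling message receipt and the final insert.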
In parallel, learners accessing the platform would send a request to the notifications service API to get the latest ‘N’ notifications upon page load, and then periodically poll for notifications whose sent time is after that of the most recent notification known to the client.
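The two kinds of read can be sketched against an in-memory list. In the real service these would be DB queries behind the notifications API; the type and function names below are hypothetical.

```typescript
// Hypothetical notification shape; sentAt is epoch milliseconds.
interface FeedNotification {
  id: string;
  sentAt: number;
  text: string;
}

// Page load: the latest N notifications, newest first.
function latestNotifications(all: FeedNotification[], n: number): FeedNotification[] {
  return all
    .slice()
    .sort((a, b) => b.sentAt - a.sentAt)
    .slice(0, n);
}

// Each poll: only notifications sent after the newest one the client knows.
function notificationsSince(
  all: FeedNotification[],
  lastSeenSentAt: number
): FeedNotification[] {
  return all
    .filter((item) => item.sentAt > lastSeenSentAt)
    .sort((a, b) => b.sentAt - a.sentAt);
}

// Client side: the cursor to send with the next poll (0 if nothing is known yet).
function newestSentAt(known: FeedNotification[]): number {
  return known.reduce((max, item) => Math.max(max, item.sentAt), 0);
}
```

One caveat of a cursor based purely on sent time is that two notifications with an identical timestamp can straddle a poll and one may be missed; adding a tie-breaker such as the row id to the cursor is a common way to avoid that.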
This architecture is shown in the figure below:
We also decided that if we outgrew this architecture we would migrate to using server-sent events driven by Kafka under the hood as illustrated below:
Notifications are wildly popular with our learners, but once we had lived with the system for a few weeks we discovered a few things that we wanted to improve on.
One thing that became a thorn in our side was our content moderation system. At Outschool, we scan all user generated content for harmful or malicious conduct. We use a combination of automatic and human moderation to flag and remove such content.
We didn’t originally consider moderation to be an issue, since the content would show up as deleted if you clicked into it and we could always omit those notifications in our UI.
However, we didn’t consider the unread count in our initial design. With the way we counted unread notifications, it was not trivial to omit them from the unread count. Thus, learners with notifications for content removed by moderation would have “phantom” unread notifications for up to 30 days after the content was removed.
To solve this, we leveraged the new change data capture (CDC) topics that had become available for engineers to use. We listened for changes to the tables related to notifications, and if content was removed by moderation, we simply deleted the associated rows in the notifications table.
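The effect of this fix on the unread count can be sketched in memory. The change event below is a deliberately simplified, hypothetical shape (real CDC events carry before and after images of a row), and the row fields and function names are assumptions for illustration.

```typescript
// Simplified, hypothetical view of a change event on a content table.
interface ContentChange {
  contentId: string;
  removedByModeration: boolean;
}

// Hypothetical notification row with a read flag.
interface UnreadTrackedRow {
  id: string;
  userId: string;
  objectId: string; // the content the notification points at
  read: boolean;
}

// Drop every notification that points at content removed by moderation.
function applyContentChange(
  rows: UnreadTrackedRow[],
  change: ContentChange
): UnreadTrackedRow[] {
  if (!change.removedByModeration) {
    return rows;
  }
  return rows.filter((row) => row.objectId !== change.contentId);
}

// Count of unread notifications for one learner.
function unreadCount(rows: UnreadTrackedRow[], userId: string): number {
  return rows.filter((row) => row.userId === userId && !row.read).length;
}
```

Because the rows themselves are deleted, the unread count needs no special case for moderated content, which is what removes the "phantom" unread notifications.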
Using the CDC topics to solve a data synchronization problem like this was so easy that it inspired us to convert the entire content moderation service to use CDC under the hood. We would also like to eventually use the CDC topics, rather than SQS tasks, to create notifications.
The notifications system is a vital component to help keep learners in the loop with what’s happening on our platform. Launching notifications was an interesting and challenging tech project. We found that by generating candidate architectures and considering the various pros and cons, we could come up with a flexible, robust, scalable, and maintainable solution while also documenting potential paths forward.