INC-12: Failed to receive messages from Kustomer for 20 hours

Incident Report for Notch CX System status

Postmortem

Summary

Between Oct 29th 14:04 GMT+2 and Oct 30th 10:25 GMT+2, webhooks received from Kustomer were not properly ingested, so our system did not receive or handle any conversations during that period.

Our monitoring system identified the issue, but no alert reached anyone because the relevant alerts had been silenced.

The issue was eventually raised and remediated through manual observation.

Following the remediation, we ran scripts to backfill the missing data and generate responses where still relevant.

Timeline

  • Oct 28th - The Notch dev team performs a day of “tech debt cleanup”. Two of the changes made during that day are relevant:

    • An alert responsible for tracking the “amount of inbound messages” is silenced due to too many false positive alerts. A task to fix the underlying issue is opened but deprioritized to be performed a few days later.
    • An issue with the Kustomer webhook handling logic is identified: webhooks are handled without a queue system. This is prioritized as “urgent” since, without a queue, there is no retry mechanism if some sort of failure occurs. Crucially, the fix introduces a bug that causes Kustomer webhooks to be written to a temporary cache used by the queue but never added as queue tasks. This means webhooks are “lined up” to be processed but never sent for processing.
  • Oct 29th 14:04 GMT+2 - The version pending deployment, which introduced the 2 regressions, passes CI and QA tests and is deployed. There is no test that ensures proper handling of queue processing for Kustomer webhooks; the tests only cover the handling logic, without the infrastructure that causes the logic to be triggered.

  • Oct 29th 14:10 GMT+2 - The deployed version is live. At this point in time, webhooks from Kustomer are being ignored by our system.

  • Oct 29th 14:48 GMT+2 - Our monitoring system identifies the statistically implausible drop in the amount of inbound messages and triggers an alert to the on-call dev, but the alert is silenced in our logging aggregator due to the work done on Oct 28th.

    • Monitoring system anomaly identification details (a sketch of this kind of check appears after the timeline):

      {
        "level": "error",
        "msg": "Alerts created for tenant maelys:",
        "msgMeta": {
          "alerts": [
            {
              "data": {
                "alertReason": "Statistical anomaly detected: highly significant decrease (z-score: -1.97) (p-value: 0.0488)",
                "channel": "Email (Kustomer)",
                "current": 49,
                "method": "statistical",
                "metric": "inbound",
                "percentChange": -44.9438202247191,
                "period": "hourly",
                "previous": 89
              },
              "type": "hourly_inbound_Email (Kustomer)"
            }
          ]
        },
        "name": "alertsHandler",
        "time": 1761742121275
      }
      
  • Oct 30th 09:00 GMT+2 - The Tenant Account Manager (@Tom Fadael) logs on to the tenant dashboard and identifies an issue with the amount of inbound conversations. An investigation is triggered.

  • Oct 30th 09:45 GMT+2 - Maelys QA personnel notify Notch on Slack that they noticed an issue

  • Oct 30th 09:57 GMT+2 - The root cause is identified. After a short discussion regarding the impact of reverting the change, work to revert begins

    • The discussion concerned the impact of starting to handle conversations midway in case messages were missing. The conclusion was that our system handles this gracefully
  • Oct 30th 10:02 GMT+2 - Notch notifies Maelys through WhatsApp of the incident

  • Oct 30th 10:21 GMT+2 - The fixed version is deployed.

  • Oct 30th 10:25 GMT+2 - The fixed version is live. Messages from Kustomer are now being properly processed by our system again. Work begins on a script to backfill the missing data

  • Oct 30th 16:00 GMT+2 - The backfill script is completed, reviewed, and ready to be run

  • Oct 30th 19:00 GMT+2 - The backfill script finishes running
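
For context on the alert payload above, here is a minimal sketch of how an hourly z-score check of this kind can work, assuming a rolling window of recent hourly counts; the window size, threshold, and function name are illustrative and do not reflect the monitoring system’s actual implementation.

```typescript
// Illustrative hourly anomaly check; window size and threshold are assumptions.
function detectHourlyAnomaly(history: number[], current: number) {
  const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
  const variance = history.reduce((sum, v) => sum + (v - mean) ** 2, 0) / history.length;
  const stdDev = Math.sqrt(variance);

  const zScore = stdDev === 0 ? 0 : (current - mean) / stdDev;
  const previous = history[history.length - 1];
  const percentChange = ((current - previous) / previous) * 100;

  // |z| > 1.96 corresponds to a two-tailed p-value below 0.05, which matches
  // the z-score / p-value pair reported in the alert above.
  return { zScore, percentChange, isAnomaly: zScore < -1.96 };
}

// Example in the spirit of the alert payload: previous hour 89, current hour 49.
console.log(detectHourlyAnomaly([92, 85, 90, 89], 49));
```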

Technical description

  • During the regular flow of webhook handling, the “Kustomer webhook ingestion” module (1) receives a webhook from Kustomer, (2) inserts it into our webhook cache, (3) queues a new queue item to handle it, and (4) returns 200 (OK) to Kustomer. Once the webhook is in the cache and a queue item is in the queue, the “webhook handler” picks up the queue item and handles the webhook. A sketch of this flow appears after this list.
  • During this incident, step 3 (queueing a new queue item) was broken: it did not queue a new queue item, but it also did not throw an error. As a result, the Kustomer webhook ingestion returned 200 (OK) to Kustomer while no actual work was being done.
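
As a minimal sketch of this flow (the endpoint path, module names, and in-memory cache/queue stand-ins below are assumptions for illustration, not Notch’s actual code), the ingestion step looks roughly like this; during the incident, the step-3 call silently did nothing:

```typescript
// Illustrative sketch of the Kustomer webhook ingestion flow. The route,
// webhookCache, and queue are assumed names; the cache and queue are in-memory
// stand-ins for the real infrastructure.
import express from "express";

const webhookCache = new Map<string, unknown>();
const queue: { name: string; payload: { webhookId: string } }[] = [];

const app = express();
app.use(express.json());

// Step 1: receive the webhook from Kustomer.
app.post("/webhooks/kustomer", (req, res) => {
  const webhook = req.body as { id: string };

  // Step 2: persist the raw webhook in the temporary cache used by the queue.
  webhookCache.set(webhook.id, webhook);

  // Step 3: enqueue a task so the "webhook handler" can pick it up later.
  // During the incident this step silently did nothing (no task, no error),
  // so webhooks sat in the cache but were never processed.
  queue.push({ name: "handle-kustomer-webhook", payload: { webhookId: webhook.id } });

  // Step 4: acknowledge to Kustomer regardless of downstream processing,
  // which is why Kustomer kept receiving 200 (OK) throughout the incident.
  res.sendStatus(200);
});

app.listen(3000);
```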

Data

  • Timespan of missing webhooks: Oct 29th 14:10 - Oct 30th 10:21 GMT+2
  • Amount of missing conversations (in the “Maelys” tenant):

    • According to this Kustomer search

      • 236 Chat conversations
      • 813 Email conversations

Post-Incident Analysis

This incident surfaced a number of issues with Notch’s internal and external procedures, tools & workflows.

The incident’s action items must address and resolve the following issues:

Feature development workflow

  • The development process did not include manual testing of the Kustomer webhook queue logic change.
  • Our unit / end-to-end test suite did not identify the issue. We do not have a test suite that checks the “Kustomer webhook > Queue > Webhook handler” flow end-to-end; we only have tests that check the business logic of each part, not the infrastructure that binds the parts together. A sketch of the missing check follows this list.
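
As a rough illustration of the missing coverage (using the same assumed names as the ingestion sketch in the technical description, with plain node:assert instead of our real test framework), the key assertion is that ingesting a webhook must produce a queue item, not just a cache entry:

```typescript
// Sketch of the missing end-to-end check; ingestKustomerWebhook, webhookCache,
// and queue are assumed names mirroring the ingestion sketch above.
import { strict as assert } from "node:assert";

const webhookCache = new Map<string, unknown>();
const queue: { name: string; payload: { webhookId: string } }[] = [];

function ingestKustomerWebhook(webhook: { id: string }) {
  webhookCache.set(webhook.id, webhook);
  queue.push({ name: "handle-kustomer-webhook", payload: { webhookId: webhook.id } });
}

// The buggy version would have passed the cache assertion but failed the
// queue assertions, which is exactly the gap our test suite did not cover.
ingestKustomerWebhook({ id: "wh_123" });
assert.ok(webhookCache.has("wh_123"), "webhook should be cached");
assert.equal(queue.length, 1, "a queue item should be created for the webhook");
assert.equal(queue[0].payload.webhookId, "wh_123");
```

A real version of this test would run against the actual queue infrastructure in a dev environment rather than in-memory stand-ins, which is what the “e2e testing of real SORs through queue” action item below covers.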

Incident identification

  • Our monitoring system’s alerting capabilities were switched off “temporarily” due to a large amount of “false positive” alerts. This was a wrong decision, made by CTO Yuval, to allow the system to be switched off while it is being revamped; a better decision would have been to switch off only the parts that cause the “false positive” behavior.
  • Related: our monitoring system produces false positives because it relies on a complex “statistical anomaly” methodology that is hard to control. This system was created to find subtle changes that cause issues like “coverage has been 15% lower than usual”. We need a much simpler “rule-of-thumb” monitoring system that is in charge of identifying catastrophic failures that cause our handling to crash to zero.
  • We do not have any monitoring on the “amount of queue items created”. This would have been a secondary layer of defense. A sketch of a simple rule-of-thumb check covering both of these points follows this list.
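
A minimal sketch of what such a rule-of-thumb check could look like, covering both the catastrophic-drop case and the missing queue-item monitoring mentioned above; the metric names and thresholds are assumptions for illustration:

```typescript
// Illustrative rule-of-thumb check: flag catastrophic failures with fixed,
// easy-to-reason-about rules instead of statistical anomaly detection.
interface HourlyMetrics {
  inboundMessages: number;
  queueItemsCreated: number;
  queueItemsProcessed: number;
}

function catastrophicFailures(current: HourlyMetrics, previousHour: HourlyMetrics): string[] {
  const failures: string[] = [];

  // Activity crashing to zero after a non-zero hour is always alert-worthy.
  if (current.inboundMessages === 0 && previousHour.inboundMessages > 0) {
    failures.push("no inbound messages received this hour");
  }

  // Webhooks arriving without queue items is the failure mode of this incident.
  if (current.inboundMessages > 0 && current.queueItemsCreated === 0) {
    failures.push("inbound messages arriving but no queue items created");
  }

  // Queue items created but nothing processed indicates a stuck worker.
  if (current.queueItemsCreated > 0 && current.queueItemsProcessed === 0) {
    failures.push("queue items created but none processed");
  }

  return failures;
}

// Example: an hour resembling the incident trips the second rule immediately,
// assuming the inbound metric is measured at ingestion time.
console.log(catastrophicFailures(
  { inboundMessages: 49, queueItemsCreated: 0, queueItemsProcessed: 0 },
  { inboundMessages: 89, queueItemsCreated: 89, queueItemsProcessed: 88 },
));
```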

Incident remediation

  • Communication with tenants: Notch identified the incident almost an hour before any communication was sent out to Notch’s tenants, and in Maelys’s case they were the first to ask us whether we were aware of an issue, instead of us notifying them. We need an easy “reach out to tenants” playbook that is run immediately when we identify an incident, before we perform the full impact and remediation analysis.
  • To preserve: Reverting the commit that caused the bug in handling was quick.

Post incident

  • It took too long to run the script that fetched the missing information into our system. We do not have appropriate backoffice tools for “incident recovery” for each ticketing system we integrate with. We will need to create such tools so that they can handle various cases like the ones below (a sketch of such a tool follows this list):

    • Webhooks were not received
    • Webhooks were received but no queue item was created
    • Webhooks were received, queue items were created, but the business logic failed
    • etc.
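
A rough sketch of the shape such a recovery tool could take for the “webhooks were received but no queue item was created” case; fetchMissedWebhooks, enqueueWebhook, and alreadyProcessed are hypothetical placeholders for the real ticketing-system client and queue client, not existing Notch code:

```typescript
// Sketch of a generic, idempotent backfill runner for missed webhooks.
// All injected functions are hypothetical placeholders.
interface MissedWebhook {
  id: string;
  conversationId: string;
  receivedAt: Date;
}

async function backfill(
  from: Date,
  to: Date,
  fetchMissedWebhooks: (from: Date, to: Date) => Promise<MissedWebhook[]>,
  enqueueWebhook: (webhookId: string) => Promise<void>,
  alreadyProcessed: (conversationId: string) => Promise<boolean>,
): Promise<void> {
  const missed = await fetchMissedWebhooks(from, to);
  let enqueued = 0;

  for (const webhook of missed) {
    // Skip conversations that were already handled after the fix went live,
    // so the backfill is safe to re-run if it is interrupted.
    if (await alreadyProcessed(webhook.conversationId)) continue;

    await enqueueWebhook(webhook.id);
    enqueued++;
  }

  console.log(
    `Backfill ${from.toISOString()}..${to.toISOString()}: re-enqueued ${enqueued}/${missed.length} webhooks`
  );
}
```

Per-case variants (webhooks never received, business logic failed, etc.) would differ mainly in how the missing items are discovered and where they are re-injected into the pipeline.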

Actions To Take:

Category | Task | Priority | Status
Monitoring | Alerts must be turned on 24/7, incl. the switched-off “monitoring” alerts | Critical | Done
Monitoring | Implement simpler alerts on catastrophic reductions in inbound message amount, coverage, tasks done, and queue items created/done; these must be on 24/7 | Critical | Done
Monitoring | Change the “hourly failures” alerts to run every 10 minutes for faster response times to issues | High | In progress
Monitoring | Alert on queue health: too many pending items, too large a lag, big drop in the amount of queue items, big drop in the amount of queue items done, big increase in the amount of queue errors | High | In progress
Tooling | Implement backoffice tooling with premade scripts that allow quick remediation of missing data from the ticketing systems | Medium | To do
QA | Implement e2e testing of real SORs through the queue in the dev env to identify cases of SOR infra breaking | High | In progress
Process | Implement a playbook for incident management that includes immediate notification of tenants | High | To do
Posted Nov 03, 2025 - 07:43 UTC

Resolved

Failure to receive messages from Kustomer for 20 hours
Posted Oct 29, 2025 - 12:04 UTC