Between Oct 29th 14:04 GMT+2 and Oct 30th 10:25 GMT+2, webhooks received from Kustomer were not properly ingested, so our system did not receive or handle any conversations during that period.
Our monitoring system identified the issue, but the resulting alert was silenced in our logging aggregator and never reached the on-call developer.
The issue was instead raised and remediated through manual observation.
Following the remediation, we ran scripts to backfill the missing data and generate responses where still relevant.
Oct 28th - The Notch dev team performs a day of “tech debt cleanup”. Two of the changes made during that day turn out to be relevant to this incident.
Oct 29th 14:04 GMT+2 - The version pending deployment, which introduced the two regressions, passes CI and QA tests and is deployed. No test ensures proper handling of queue processing for Kustomer webhooks: the existing tests cover the handling logic only, without the infrastructure that triggers it.
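For context, a minimal sketch of the kind of queue-level test that was missing. This is a hedged illustration, not our actual code: the `InMemoryQueue` stand-in and the handler wiring are assumptions; the point is that the message has to travel through the subscription path rather than being fed straight into the handler.

```typescript
// Hypothetical queue-level test: pushes a raw webhook through the queue
// wiring instead of calling the handler directly. All names are assumed.
import { deepStrictEqual } from "node:assert";

type KustomerWebhook = { id: string; type: string; data: unknown };

// Minimal in-memory stand-in for the real queue infrastructure.
class InMemoryQueue {
  private consumers: Array<(msg: KustomerWebhook) => Promise<void>> = [];
  subscribe(consumer: (msg: KustomerWebhook) => Promise<void>) {
    this.consumers.push(consumer);
  }
  async publish(msg: KustomerWebhook) {
    // A regression that drops the subscription leaves consumers empty,
    // which is exactly what a handler-only unit test cannot catch.
    await Promise.all(this.consumers.map((c) => c(msg)));
  }
}

async function main() {
  const processed: string[] = [];
  const queue = new InMemoryQueue();

  // Wiring under test: the same registration path used in production.
  queue.subscribe(async (webhook) => {
    processed.push(webhook.id); // stand-in for the real handling logic
  });

  await queue.publish({ id: "evt_1", type: "kustomer.conversation.create", data: {} });

  // Fails if the consumer was never registered, i.e. webhooks are ignored.
  deepStrictEqual(processed, ["evt_1"]);
  console.log("queue-level webhook test passed");
}

main();
```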
Oct 29th 14:10 GMT+2 - The deployed version is live. From this point, webhooks from Kustomer are being ignored by our system.
Oct 29th 14:48 GMT+2 - Our monitoring system identifies the statistically implausible drop in the number of inbound messages and triggers an alert to the on-call dev, but the alert is silenced in our logging aggregator due to the work done on Oct 28th.
Monitoring system anomaly identification details:
```json
{
  "level": "error",
  "msg": "Alerts created for tenant maelys:",
  "msgMeta": {
    "alerts": [
      {
        "data": {
          "alertReason": "Statistical anomaly detected: highly significant decrease (z-score: -1.97) (p-value: 0.0488)",
          "channel": "Email (Kustomer)",
          "current": 49,
          "method": "statistical",
          "metric": "inbound",
          "percentChange": -44.9438202247191,
          "period": "hourly",
          "previous": 89
        },
        "type": "hourly_inbound_Email (Kustomer)"
      }
    ]
  },
  "name": "alertsHandler",
  "time": 1761742121275
}
```
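For reference, a minimal sketch of the kind of hourly-inbound anomaly check that produces a payload like the one above. The history window, the 0.05 significance threshold, and the normal approximation are assumptions for illustration, not the exact implementation.

```typescript
// Hypothetical hourly-inbound anomaly check. Window size and thresholds are assumed.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1));
}

// Two-tailed p-value for a z-score, via the Abramowitz & Stegun erf approximation.
function pValue(z: number): number {
  const erf = (x: number): number => {
    const sign = x < 0 ? -1 : 1;
    const t = 1 / (1 + 0.3275911 * Math.abs(x));
    const poly =
      ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
    return sign * (1 - poly * Math.exp(-x * x));
  };
  return 1 - erf(Math.abs(z) / Math.SQRT2);
}

// history: inbound counts for the previous hours; current: this hour's count.
function checkHourlyInbound(history: number[], current: number) {
  const z = (current - mean(history)) / stdDev(history);
  const p = pValue(z);
  if (z < 0 && p < 0.05) {
    const previous = history[history.length - 1];
    return {
      alertReason: `Statistical anomaly detected: highly significant decrease (z-score: ${z.toFixed(2)}) (p-value: ${p.toFixed(4)})`,
      metric: "inbound",
      period: "hourly",
      current,
      previous,
      percentChange: ((current - previous) / previous) * 100,
    };
  }
  return null; // no alert
}

console.log(checkHourlyInbound([92, 85, 101, 78, 95, 89], 49));
```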
Oct 30th 09:00 GMT+2 - The tenant Account Manager (@Tom Fadael) logs on to the tenant dashboard and identifies an issue with the number of inbound conversations. An investigation is triggered.
Oct 30th 09:45 GMT+2 - Maelys QA personnel notify Notch on Slack that they have noticed an issue.
Oct 30th 09:57 GMT+2 - The root cause is identified. After a short discussion regarding the impact of reverting the change, work on the revert begins.
Oct 30th 10:02 GMT+2 - Notch notifies Maelys of the incident through WhatsApp.
Oct 30th 10:21 GMT+2 - The fixed version is deployed.
Oct 30th 10:25 GMT+2 - The fixed version is live. Messages from Kustomer are now being properly processed in our system again. Work begins on a script to backfill the missing data.
Oct 30th 16:00 GMT+2 - The backfill script is completed, reviewed, and ready to run.
Oct 30th 19:00 GMT+2 - The backfill script finishes running.
Number of missing conversations in the “Maelys” tenant (according to this Kustomer search):
This incident surfaced a number of issues with Notch’s internal and external procedures, tools, and workflows.
The incident’s action items must address and resolve the following issues:
It took too long to run the script that fetched the missing information into our system. We do not have appropriate backoffice tools for “incident recovery” for each ticketing system we integrate with, and we will need to build such tools to handle a variety of recovery cases; a sketch of the general shape follows.
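A hedged sketch of what such a recovery tool might look like: page through the ticketing system for conversations updated during the outage window and re-enqueue each one through the normal pipeline. The helper names, the page-based API shape, and the stubs are assumptions for illustration only; a real run would call the Kustomer API.

```typescript
// Hypothetical backoffice recovery sketch: re-ingest conversations updated
// during an outage window. All names and the API shape are assumed.
type Conversation = { id: string; updatedAt: string };
type Page = { items: Conversation[]; hasMore: boolean };

async function backfillOutageWindow(
  fetchPage: (start: Date, end: Date, page: number) => Promise<Page>,
  enqueue: (c: Conversation) => Promise<void>,
  start: Date,
  end: Date
): Promise<number> {
  let page = 1;
  let recovered = 0;
  for (;;) {
    const { items, hasMore } = await fetchPage(start, end, page);
    for (const conversation of items) {
      // Re-enqueue through the normal pipeline so idempotency/dedupe applies.
      await enqueue(conversation);
      recovered += 1;
    }
    if (!hasMore) break;
    page += 1;
  }
  return recovered;
}

// Example run against in-memory stubs; the outage window matches the timeline
// above (year taken from the alert timestamp in the log excerpt).
const stubFetch = async (_s: Date, _e: Date, page: number): Promise<Page> => ({
  items: page === 1 ? [{ id: "conv_1", updatedAt: "2025-10-29T13:00:00Z" }] : [],
  hasMore: page === 1,
});

backfillOutageWindow(
  stubFetch,
  async (c) => console.log(`re-enqueued ${c.id}`),
  new Date("2025-10-29T14:04:00+02:00"),
  new Date("2025-10-30T10:25:00+02:00")
).then((n) => console.log(`recovered ${n} conversations`));
```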
| Category | Task | Priority | Status |
|---|---|---|---|
| Monitoring | All alerts must be on 24/7, including the currently switched-off “monitoring” alerts | Critical | Done |
| Monitoring | Implement simpler alerts for catastrophic drops in inbound message volume, coverage, tasks done, and queue items created/done; these must run 24/7 | Critical | Done |
| Monitoring | Change the “hourly failures” alerts to run every 10 minutes for faster response to issues | High | In progress |
| Monitoring | Alert on queue health: too many pending items, excessive lag, large drop in queue items created, large drop in queue items completed, large increase in queue errors (see the sketch after this table) | High | In progress |
| Tooling | Implement backoffice tooling with premade scripts that allow quick remediation of missing data from each ticketing system | Medium | To do |
| QA | Implement e2e testing of real SORs through the queue in the dev environment to catch cases where SOR infrastructure breaks | High | In progress |
| Process | Implement an incident-management playbook that includes immediate notification of affected tenants | High | To do |
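For the queue-health alerting row above, a hedged sketch of the kind of threshold checks intended. The metric names and the concrete thresholds are assumptions, not agreed production values.

```typescript
// Hypothetical queue-health check behind the "Alert on queue health" item.
// Metric names and thresholds are assumed for illustration.
interface QueueSnapshot {
  pending: number;          // items waiting to be processed
  oldestPendingMs: number;  // age of the oldest pending item (lag)
  createdLastHour: number;
  doneLastHour: number;
  errorsLastHour: number;
}

interface QueueBaseline {
  createdLastHour: number;
  doneLastHour: number;
  errorsLastHour: number;
}

function queueAlerts(current: QueueSnapshot, baseline: QueueBaseline): string[] {
  const alerts: string[] = [];
  const dropRatio = (now: number, before: number) =>
    before === 0 ? 0 : (before - now) / before;

  if (current.pending > 1_000) alerts.push(`too many pending items: ${current.pending}`);
  if (current.oldestPendingMs > 10 * 60 * 1000)
    alerts.push(`queue lag too large: ${Math.round(current.oldestPendingMs / 1000)}s`);
  if (dropRatio(current.createdLastHour, baseline.createdLastHour) > 0.5)
    alerts.push("large drop in queue items created");
  if (dropRatio(current.doneLastHour, baseline.doneLastHour) > 0.5)
    alerts.push("large drop in queue items completed");
  if (current.errorsLastHour > 3 * Math.max(1, baseline.errorsLastHour))
    alerts.push(`large increase in queue errors: ${current.errorsLastHour}`);
  return alerts;
}

// Example: this incident's pattern (webhooks ignored) shows up as queue item
// creation dropping to near zero even though nothing is erroring.
console.log(
  queueAlerts(
    { pending: 3, oldestPendingMs: 5_000, createdLastHour: 2, doneLastHour: 2, errorsLastHour: 0 },
    { createdLastHour: 90, doneLastHour: 88, errorsLastHour: 1 }
  )
);
```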