Jin Daily Tech Trivia: What happened to Cloudflare yesterday?
-
The KUL (Kuala Lumpur) and HKG (Hong Kong) data centers entered scheduled maintenance.
-
The team rolled out an updated SQL query on ClickHouse (the open-source analytical database that Cloudflare uses).
-
A duplicated RBAC (role-based access control) setting started producing duplicate data entries.
-
The Bot Management system, which relies on that database, tried to load the new anti-bot/attack signature file but received the duplicated data.
-
Normally there are only about 60 signatures, but the count grew to over 200.
-
There is a hard limit of 200 in the bot defense code for performance optimization.
-
This caused the bot defense system to crash and default to blocking all access.
-
The bot system grabs the new signature setting every 5 minutes from the ClickHouse DB.
-
The Cloudflare team had no idea what went wrong and restarted each system to troubleshoot.
-
After more than 2 hours (at 13:37 UTC / 21:37 MYT), they finally traced it to the bot signature file and froze the last known good file.
-
By 14:30 UTC (22:30 MYT, about 3 hours after the initial issue), they finished manually deploying the good file to all servers.
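
To make the failure chain above concrete, here is a minimal Python sketch. It is not Cloudflare's actual code. The names (`MAX_SIGNATURES`, `load_strict`, `load_with_fallback`) and the 4x duplication factor are made up for illustration; only the numbers 60 and 200 come from the post. It shows how duplicate rows can push a roughly 60-entry file past a hard limit of 200, and how a loader could keep the last known good file instead of crashing.

```python
MAX_SIGNATURES = 200  # hard limit quoted in the post


class SignatureLimitExceeded(Exception):
    """Raised when a signature file has more entries than the hard limit."""


def load_strict(signatures):
    """Reject any file over the limit; an uncaught error here means a crash."""
    if len(signatures) > MAX_SIGNATURES:
        raise SignatureLimitExceeded(f"{len(signatures)} > {MAX_SIGNATURES}")
    return signatures


def load_with_fallback(signatures, last_known_good):
    """Keep serving the last known good file when the new one is invalid."""
    try:
        return load_strict(signatures)
    except SignatureLimitExceeded:
        return last_known_good


# About 60 signatures in normal operation (figure from the post).
good = [f"sig_{i}" for i in range(60)]

# Hypothetical: every row comes back four times, so 240 entries.
duplicated = good * 4

# The fallback loader ignores the oversized file and keeps the good one.
active = load_with_fallback(duplicated, good)

# Order-preserving de-duplication would also bring the file back to 60.
deduplicated = list(dict.fromkeys(duplicated))

print(len(duplicated), len(active), len(deduplicated))
```

Calling `load_strict(duplicated)` directly raises `SignatureLimitExceeded`, which is the crash path. The fallback variant is one possible design choice, in the same spirit as the "freeze the good file" mitigation described above.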
Total downtime: about 6 hours.
Timeline:
11:20 UTC (19:20 MYT): Issue started
13:37 UTC (21:37 MYT): Root cause found
14:30 UTC (22:30 MYT): Mitigation deployed
17:06 UTC (01:06 next day MYT): All services back online
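
For anyone who wants to check the timeline arithmetic, here is a small Python sketch. It assumes MYT is UTC+8 and uses a placeholder date, since the post only says "yesterday". The helper name `utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

MYT = timezone(timedelta(hours=8), "MYT")  # Malaysia Time is UTC+8


def utc(hour, minute):
    """Build a UTC timestamp on a placeholder date (the real date is not given)."""
    return datetime(2000, 1, 1, hour, minute, tzinfo=timezone.utc)


timeline = {
    "Issue started": utc(11, 20),
    "Root cause found": utc(13, 37),
    "Mitigation deployed": utc(14, 30),
    "All services back online": utc(17, 6),
}

start = timeline["Issue started"]

# Time elapsed since the start of the incident for each milestone.
elapsed = {label: moment - start for label, moment in timeline.items()}

for label, moment in timeline.items():
    local = moment.astimezone(MYT)
    print(f"{moment:%H:%M} UTC ({local:%H:%M} MYT) +{elapsed[label]} {label}")
```

With these times the gaps are 2 h 17 min to root cause, 3 h 10 min to mitigation, and 5 h 46 min to full recovery, which matches the rounded figures of about 2, 3, and 6 hours.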
