Tech Trivia

Jin Daily Tech Trivia: What happened to Cloudflare yesterday?

  • The KUL (Kuala Lumpur) and HKG (Hong Kong) data centers entered scheduled maintenance.

  • The team deployed an updated SQL query on Cloudflare's ClickHouse database.

  • A duplicated RBAC (role-based access control) setting started producing duplicate data entries.

  • The Bot Management system, which relies on that database, tried to refresh its anti-bot/attack signature file but received the duplicated data.

  • Normally there are only about 60 signatures, but the count grew to over 200.

  • There is a hard limit of 200 in the bot defense code for performance optimization.

  • Exceeding that limit crashed the bot defense system, which then defaulted to blocking all access.

  • The bot system fetches the latest signature settings from the ClickHouse DB every 5 minutes.

  • At first the Cloudflare team did not know what had gone wrong and restarted systems one by one to troubleshoot.

  • After more than 2 hours (at 13:37 UTC / 21:37 MYT), they finally traced the problem to the bot signature file and froze the last known good file.

  • By 14:30 UTC (22:30 MYT, about 3 hours after the initial issue), they finished manually deploying the good file to all servers.
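The failure mode in the list above (duplicated rows pushing a signature list past a hard limit) can be sketched in a few lines of Python. This is only an illustration, not Cloudflare's actual code: the `SignatureStore` class, the `MAX_SIGNATURES` name, and the `sig-N` values are all hypothetical. The sketch assumes one possible defensive design, where duplicates are dropped first and an oversized list is rejected in favor of the last known good one.

```python
from dataclasses import dataclass, field

MAX_SIGNATURES = 200  # the hard limit from the post; the constant name is hypothetical


@dataclass
class SignatureStore:
    """Holds the last signature list that passed validation (hypothetical helper)."""
    active: list[str] = field(default_factory=list)

    def refresh(self, rows: list[str]) -> bool:
        """Try to adopt a freshly fetched list; return True if it was accepted."""
        # Drop duplicates while keeping the original order.
        unique = list(dict.fromkeys(rows))
        if len(unique) > MAX_SIGNATURES:
            # Oversized input: keep the last known good list instead of crashing.
            return False
        self.active = unique
        return True


store = SignatureStore()
normal = [f"sig-{i}" for i in range(60)]  # roughly the usual count
store.refresh(normal)

# Simulate a faulty query that returns every row four times (240 rows).
duplicated = normal * 4
accepted = store.refresh(duplicated)  # accepted: 240 rows collapse to 60 unique ones

# A list that is genuinely too large is rejected and the old list stays active.
oversized = [f"sig-{i}" for i in range(250)]
rejected = not store.refresh(oversized)
```

In this sketch the periodic refresh can never take the system down: a bad fetch is simply skipped until a valid list arrives.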

Total downtime: about 6 hours (11:20 to 17:06 UTC).

Timeline:

  • 11:20 UTC (19:20 MYT): Issue started

  • 13:37 UTC (21:37 MYT): Root cause found

  • 14:30 UTC (22:30 MYT): Mitigation deployed

  • 17:06 UTC (01:06 next day MYT): All services back online
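To double-check the elapsed times and the MYT conversions in the timeline, here is a small Python sketch. The calendar date is a placeholder (the post only says "yesterday"); only the clock times come from the timeline, and MYT is taken as UTC+8.

```python
from datetime import datetime, timedelta, timezone

MYT = timezone(timedelta(hours=8))  # Malaysia Time is UTC+8


def at(hhmm: str) -> datetime:
    """Build a UTC timestamp on a placeholder date; only the time of day matters here."""
    return datetime.strptime(f"2000-01-01 {hhmm}", "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


events = {
    "Issue started": at("11:20"),
    "Root cause found": at("13:37"),
    "Mitigation deployed": at("14:30"),
    "All services back online": at("17:06"),
}

start = events["Issue started"]
for label, moment in events.items():
    elapsed = moment - start  # time since the issue started
    local = moment.astimezone(MYT).strftime("%H:%M")
    print(f"{moment:%H:%M} UTC ({local} MYT) +{elapsed}: {label}")
```

The last entry comes out at 5 hours 46 minutes after the start, which is where the rounded "6 hours" figure comes from.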
