Delays for search indexing, re-calculations and Podio basic/advanced workflows
Incident Report for Podio Status Page
Postmortem

Podio Incident August 20 - 21, 2024

Summary of Impact

On August 20, 2024, Podio customers experienced degraded performance in item search, field calculations, and webhook events, which delayed and disrupted workflow execution.

On August 21, 2024, Podio customers experienced issues with real-time updates for item activity, notifications, and chat services.

Root Cause

On August 20, 2024, at 13:37 EDT, the Podio Automated Alert System detected a sudden spike in one of our queues responsible for item search, webhook execution, and other tasks. Messages in this queue were not being processed, so the backlog kept growing on top of the existing load. The issue was traced to a failure in the event broker service provided by our third-party vendor, which left the queue system that processes events inoperable. This failure delayed message processing and impacted various services, including item search, calculations, and webhook executions. Because this is a managed service, ShareFile Engineering engaged the vendor and worked closely with them to troubleshoot and resolve the issue. After resolution, a significant queue backlog remained, which took 2 hours to drain completely.
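As an illustration only, the failure mode described above (queue depth climbing while nothing is consumed) and the time needed to drain a backlog can be sketched as follows. This is a minimal sketch, not the Podio Automated Alert System: the names QueueSample, is_stalled and drain_seconds, the growth threshold, and all rates are hypothetical assumptions chosen for the example, not measurements from this incident.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class QueueSample:
    """One monitoring sample (hypothetical shape)."""
    depth: int      # messages currently waiting in the queue
    processed: int  # cumulative count of messages consumed so far


def is_stalled(samples: Sequence[QueueSample], min_growth: int = 1000) -> bool:
    """Return True when depth grew by at least min_growth across the
    window while the consumed counter did not move at all."""
    if len(samples) < 2:
        return False
    first, last = samples[0], samples[-1]
    grew = (last.depth - first.depth) >= min_growth
    nothing_consumed = last.processed == first.processed
    return grew and nothing_consumed


def drain_seconds(backlog: int, consume_rate: float,
                  arrival_rate: float) -> Optional[float]:
    """Seconds needed to empty a backlog, given messages per second
    consumed and arriving. Returns None if consumers cannot outpace
    new arrivals, because the backlog would then never drain."""
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        return None
    return backlog / net_rate
```

With made-up figures, a backlog of 720,000 messages consumed at 300 messages per second while 200 messages per second keep arriving drains in 7,200 seconds. The point of the sketch is that drain time depends on the net rate, not on the consume rate alone, which is one reason recovery can lag well behind the fix itself.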

On August 21, 2024, at 00:53 EDT, the Podio Automated Alert System detected a pattern matching the previous incident: the same issue had recurred, again delaying search, calculations, and webhook executions for end users. To mitigate, ShareFile Engineering created a replacement cluster to handle events properly while the third-party vendor's team continued working to resolve the original problem.

On August 21, 2024, at 13:41 EDT, ShareFile Engineering opened a new incident to troubleshoot real-time updates after receiving feedback through support and community channels about slow updates in the UI. Performance degradation was still affecting real-time updates for item activity streams, chat, and notifications. These systems were still communicating with the older, now non-functional queue, which caused further delays and required page refreshes to see updates.
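The kind of mismatch described above, where some systems keep pointing at a retired queue after traffic has moved, can be caught with a simple configuration check. The following is a hedged sketch under stated assumptions: the function stale_consumers, the service names, and the endpoint labels are all hypothetical and do not reflect Podio's actual configuration or tooling.

```python
from typing import Dict, List


def stale_consumers(service_endpoints: Dict[str, str],
                    active_endpoint: str) -> List[str]:
    """Return the names of services whose configured broker endpoint
    differs from the endpoint that is currently active, sorted by name."""
    return sorted(
        name
        for name, endpoint in service_endpoints.items()
        if endpoint != active_endpoint
    )


# Hypothetical inventory: which broker each service is configured to use.
endpoints = {
    "search": "broker-new",
    "webhooks": "broker-new",
    "chat": "broker-old",
    "notifications": "broker-old",
    "realtime-items": "broker-old",
}

for name in stale_consumers(endpoints, "broker-new"):
    print(f"{name} still points at a non-active broker")
```

Running a check like this right after a cutover would list every consumer that still needs to be repointed, rather than relying on user reports to surface them.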

Mitigation

After identifying the root cause, ShareFile Engineering took the following steps:

  • On August 20, 2024, at 13:37 EDT: Restarted the queuing service multiple times with assistance from the third-party vendor that manages it, though the restarts took an unusually long time.
  • On August 21, 2024, at 00:53 EDT: Created a new queuing service in parallel to the existing one and routed all incoming traffic to the new service.
  • On August 21, 2024, at 13:41 EDT: Updated the systems responsible for chat, notifications, and real-time item updates to communicate with the new queuing service.

All of the above issues were confirmed resolved as of August 21, 2024, 5:30 PM EDT.

Next Steps

ShareFile Engineering continues to collaborate with our third-party vendor to understand why the queuing service stopped processing messages in the first place. Should it happen again, ShareFile Engineering now has a way to quickly identify the root cause, which will help us remediate faster. ShareFile Engineering is continuing to work internally to establish the root cause and to pursue the following actions:

  1. Work closely with our third-party vendor to determine the root cause and to get detail on time to recover for our managed service
  2. Take actions to reduce operational inefficiencies in the time to remediate failed infrastructure related to the event broker service
  3. Determine the root cause of the misconfigured real-time update systems
Posted Aug 23, 2024 - 16:34 EDT

Resolved
This incident has been resolved.
Posted Aug 21, 2024 - 05:50 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 21, 2024 - 00:54 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 21, 2024 - 00:53 EDT
Monitoring
We are seeing better performance based on team efforts to resolve and are actively monitoring.
Posted Aug 20, 2024 - 17:01 EDT
Identified
The issue has been identified and we are actively working with our vendor on a resolution. Thanks for your patience.
Posted Aug 20, 2024 - 16:01 EDT
Investigating
We are investigating reports of an issue impacting performance of Podio search indexing, re-calculations and Podio basic/advanced workflows causing delays for users. We will provide updates here as soon as possible.
Posted Aug 20, 2024 - 13:37 EDT
This incident affected: Web, API, and Advanced Workflow Automation.