QorePortal and LicensePortal may not be available
Incident Report for Quest Data Protection
Postmortem

Overview

QorePortal was unavailable because of a KeyVaults issue that overlapped with Windows updates installed by the Patch Orchestration Application (POA).

What Happened

The Patch Orchestration Application (POA), a wrapper for the Azure Service Fabric Repair Manager service, is configured to install Windows updates every Wednesday. During this update, all Service Fabric (SF) nodes are restarted. After this restart, some QorePortal applications did not start. An investigation found an issue with KeyVaults, which was resolved, and the nodes hosting these applications then had to be restarted. However, some of the nodes were unable to restart and remained stuck in the 'updating' status. In the background, POA was still updating the cluster, and many nodes were in an unhealthy (updating) state. Total CPU usage increased and SF created more nodes one by one, but they all remained stuck in the 'creating' status. When we manually decreased the number of nodes, the status of the affected nodes changed to 'delete', but they stayed frozen. As a result, the healthy nodes had high CPU usage and services could not start properly.

Root Cause

The root cause has not been determined with certainty, but it is related to the KeyVaults issue overlapping with the POA Windows updates.

Resolution

We opened a ticket with Azure support and received the following information:

  1. The cluster is running at the Silver durability and Platinum reliability levels with autoscale rule configurations.
  2. The Patch Orchestration Application marked some of the seed nodes for restart, which triggered a fabric cluster upgrade in an attempt to make one of the available nodes a seed node.
  3. The cluster upgrade from v7 to v9 completed for four upgrade domains but got stuck on the last upgrade domain (UD4), because some of the nodes in UD4 had high CPU usage (peaking at ~100%) due to the SF application.
  4. Fabric rolled back from v9 to v7 successfully; however, because of the POA restart repair task, a new upgrade was kicked off.
  5. Attempts were made to add more nodes, delete nodes, and restart nodes, and all such operations were blocked by the management role on a safety check.
  6. The storage account is in a different resource group than the SF resource group, and trace data was also missing, which made it difficult for Microsoft to investigate and suggest the best possible mitigation.
  7. Considering the time needed to mitigate each issue by trial and error, Microsoft suggested recreating the cluster and redeploying the applications.
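
To illustrate why restarting seed nodes can block further cluster operations, here is a minimal sketch of a majority-quorum check. It is not the actual Service Fabric safety-check logic, which considers more than seed quorum. The seed node counts per reliability tier are an assumption based on the commonly documented values (Platinum corresponds to 9), and the function names are hypothetical. The report does not state how many seed nodes were affected, so the numbers in the usage note are examples only.

```python
# Illustrative sketch only: NOT the real Service Fabric safety check.
# Assumed seed node counts per reliability tier (commonly documented values).
SEED_NODES_BY_RELIABILITY = {"Bronze": 3, "Silver": 5, "Gold": 7, "Platinum": 9}


def seed_quorum(seed_count: int) -> int:
    """Smallest number of seed nodes that forms a majority."""
    return seed_count // 2 + 1


def restart_is_safe(reliability: str, seeds_down: int, seeds_to_restart: int) -> bool:
    """Return True if a majority of seed nodes stays up during the restart.

    seeds_down: seed nodes that are already unavailable.
    seeds_to_restart: additional seed nodes the operation would take down.
    """
    total = SEED_NODES_BY_RELIABILITY[reliability]
    remaining = total - seeds_down - seeds_to_restart
    return remaining >= seed_quorum(total)
```

For example, under these assumptions `restart_is_safe("Platinum", 0, 1)` returns `True`, while `restart_is_safe("Platinum", 3, 2)` returns `False`, because only 4 of 9 seed nodes would remain and a majority requires 5.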

Recommendations:

Following the Azure support recommendations, a new cluster was created and all applications were deployed on it. After the DNS mapping was switched to the new cluster, services started to work and synchronize. QorePortal then became available and worked normally.
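
A DNS switch like this can be verified with a small check that the public hostname resolves only to addresses of the new cluster. The sketch below is a hedged illustration, not the procedure used during the incident; the function names are hypothetical, and the hostname and IP addresses in the usage note are placeholders from documentation ranges.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its set of IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_points_to_new_cluster(
    hostname: str,
    new_cluster_ips: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """Return True if the hostname resolves, and only to new-cluster addresses."""
    resolved = resolver(hostname)
    return bool(resolved) and resolved <= set(new_cluster_ips)
```

A call such as `dns_points_to_new_cluster("portal.example.test", ["203.0.113.10"])` would use the system resolver; the `resolver` parameter exists so the check can be exercised without network access. Cached records may keep returning the old address until their TTL expires, so a `False` result shortly after the switch is not necessarily a failure.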

Impact

QorePortal was temporarily unavailable and a new cluster was created.

How'd We Do?

What Went Well?

We quickly resolved the KeyVaults issue and deployed a new cluster as Azure support recommended.

What Didn't Go So Well?

It took some time to deploy a new cluster with all applications.

Posted Sep 20, 2021 - 05:41 EDT

Resolved
All services look good on NewRelic and Azure.
Posted Sep 17, 2021 - 04:37 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 16, 2021 - 13:49 EDT
Investigating
We are currently investigating this issue.
Posted Sep 16, 2021 - 04:29 EDT
This incident affected: Rapid Recovery License Portal (Web User Interface, Backend services) and QorePortal (Web User Interface, Backend services).