Alerting, Debugging, and Troubleshooting
Monitoring and Troubleshooting
Monitoring and troubleshooting are essential to maintaining performance and reliability in the Centaur® Data Platform. If data flowing through the platform contains errors, the affected FHIR resource is not persisted in the FHIR Server, and details of the specific validation errors are recorded in the rejected container.
- Monitoring: The Data Observability dashboard, a real-time monitoring tool, is a key component of the Centaur® Data Platform. It offers immediate insight into data processing, resource use, and system performance, enabling swift detection and resolution of errors. Its analytics features let users examine memory and resource usage trends, supporting informed decision-making and resource optimization.
- Troubleshooting: The Data Observability dashboard is also central to the Centaur® Data Platform's troubleshooting process. When unknown errors or validation errors occur during data processing, the platform directs the related JSON files to the 'rejected' container in the specified storage account. This container serves as a repository for files containing errors, organized chronologically into folders that indicate where the error occurred. By combining monitoring with systematic file organization, the Centaur® Data Platform supports prompt detection and resolution of issues.
Logging Levels
- INFO: Informational messages that track normal system operation.
- WARNING: Warnings indicating potential issues that may require attention.
- ERROR: Errors encountered during application execution, which may impact functionality.
- DEBUG: Detailed debugging messages used primarily during development or troubleshooting.
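As a rough illustration of how these four levels behave, the sketch below uses Python's standard `logging` module. The logger name `centaur.pipeline` and the log messages are hypothetical and not taken from the platform. With the threshold set to INFO, the DEBUG message is suppressed.

```python
import io
import logging


def build_logger(level: int, stream) -> logging.Logger:
    """Return a logger that writes 'LEVEL: message' lines to the given stream."""
    logger = logging.getLogger("centaur.pipeline")  # hypothetical logger name
    logger.handlers.clear()          # avoid duplicate handlers on repeated calls
    logger.setLevel(level)
    logger.propagate = False         # keep output confined to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return logger


def demo() -> str:
    """Emit one message per level and return what was actually written."""
    buffer = io.StringIO()
    log = build_logger(logging.INFO, buffer)
    log.debug("payload dump")                 # below the INFO threshold, dropped
    log.info("file received")
    log.warning("retrying reference lookup")
    log.error("validation failed")
    return buffer.getvalue()
```

Lowering the threshold to `logging.DEBUG` during troubleshooting would include the detailed messages as well.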
Error Codes
The following error codes may be logged at runtime:
- 171: Received Data is not in allowed ADT Trigger types
- 172: Received HL7™ Data is failed at HL7™ Validator
- 100: Duplicate File
- 101: Converter XML / HL7™ / JSON Catch Error
- 102: Invalid incoming Resource
- 103: Converter JSON Catch Error
- 104: Incoming Source Not Found
- 105: Ignore Process
- 111: Curation Catch Error
- 112: Resource Identifier Not Found
- 113: Ignore Ingestion
- 114: Get Identifier Error when retrieving Unique Identifiers for incoming Resource
- 115: Resource Not Found Reject
- 121: IG Validation Error
- 122: Validation Catch Error
- 123: Resource Not Found Reject
- 151: Resource Identifier Not Found Error
- 152: Retry Rejected Resource
- 153: Multiple Entries Found when trying to replace Reference ID
- 154: Reference Handler Catch Error
- 155: Resource Not Found Reject
- 156: Network Related Error
- 161: FhirIngestion Phase 1 Catch Error
- 162: FhirIngestion Phase 2 Catch Error
- 163: Invalid Operation
- 164: Ignore Resource based on Primary / Secondary Source
- 165: Duplicate Resource
- 166: Resource Identifier Not Found when trying to fetch a resource from FHIR Server
- 167: Source Identifier Not Found when trying to fetch the source of found resource from FHIR Server
- 168: Connection Reset / Network Error
- 169: Multiple Entry in FHIR Server
- 170: Inactive Resource based on Recorded Date
- 171: Recorded Date Attribute Not Found either from Incoming / FHIR Server Resource
- 4XX | 5XX: FHIR Server API Error / Storage Account REST Error / Converter API Error / IG Validation API Error
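When reading logs or building dashboards, a small lookup table can turn a numeric code into its description. The sketch below is a hypothetical helper, not part of the platform. It copies only a handful of descriptions from the list above and treats any code in the 400 to 599 range as an API or REST error.

```python
# Subset of the documented codes; extend with the remaining entries as needed.
ERROR_CODES = {
    100: "Duplicate File",
    102: "Invalid incoming Resource",
    121: "IG Validation Error",
    152: "Retry Rejected Resource",
    165: "Duplicate Resource",
    168: "Connection Reset / Network Error",
}

API_ERROR = (
    "FHIR Server API Error / Storage Account REST Error / "
    "Converter API Error / IG Validation API Error"
)


def describe(code: int) -> str:
    """Return the documented description for a logged error code."""
    if 400 <= code <= 599:           # 4XX | 5XX entries share one description
        return API_ERROR
    return ERROR_CODES.get(code, "Unknown error code")
```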
Logging Destination
All Container Apps logs are captured in an Azure Log Analytics workspace. Log Analytics is used to edit and run log queries against data in the Azure Monitor Logs store. It provides an interface for composing queries that retrieve records, and its features can then be used to sort, filter, and analyze the results.
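A query against the workspace might look like the sketch below, which builds a Kusto (KQL) query string in Python. It assumes the logs land in the `ContainerAppConsoleLogs_CL` table with the `ContainerAppName_s` and `Log_s` columns, which is the usual layout for Azure Container Apps console logs but should be checked against your workspace. It also assumes that error lines contain the text "ERROR". The app name passed in is hypothetical.

```python
import re


def build_error_query(app_name: str, hours: int = 24) -> str:
    """Build a KQL query listing recent error lines for one container app."""
    # Restrict the name to letters, digits, and hyphens so it can be
    # embedded in the query string without escaping.
    if not re.fullmatch(r"[A-Za-z0-9-]+", app_name):
        raise ValueError(f"unexpected characters in app name: {app_name!r}")
    if hours <= 0:
        raise ValueError("hours must be positive")
    return "\n".join([
        "ContainerAppConsoleLogs_CL",                       # assumed table name
        f"| where TimeGenerated > ago({hours}h)",
        f'| where ContainerAppName_s == "{app_name}"',
        '| where Log_s has "ERROR"',                        # assumed log format
        "| project TimeGenerated, ContainerAppName_s, Log_s",
        "| order by TimeGenerated desc",
    ])
```

The resulting string can be pasted into the Log Analytics query editor and adjusted from there.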
Converter, Validation, and Reference Logging
During the Centaur® Data Platform conversion process, if the response status is anything other than 200, the error is handled in the catch block and the file is uploaded to the "cont-fhir-rejected" container under the "ConverterFailure" sub-folder. The error codes listed under Error Codes help identify the different error scenarios, and the Data Observability dashboard provides a detailed view of the specific error codes and their counts.
If IG validation reports a resource as invalid, the file is placed in the "cont-fhir-rejected" container under the "IGValidationFailure" sub-folder.
If a reference is not found while replacing identifiers, the message is placed in a retry queue. The reference handler runs again after 15 minutes, up to a total of six times. If the reference is still not found after the sixth attempt, the file is uploaded to the "cont-fhir-rejected" container under the "RetryRejectedFailure" sub-folder.
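The retry behavior described above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the platform's implementation: `next_action` and its return shape are hypothetical, and `retry_count` is taken to mean the number of retries already attempted.

```python
from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(minutes=15)   # delay between reference handler runs
MAX_RETRIES = 6                          # total retries before rejecting
REJECT_PATH = "cont-fhir-rejected/RetryRejectedFailure"


def next_action(retry_count: int, now: datetime):
    """Decide what happens after a failed reference lookup.

    Returns ("retry", <time of next attempt>) while retries remain,
    otherwise ("reject", <container/sub-folder>).
    """
    if retry_count >= MAX_RETRIES:
        return ("reject", REJECT_PATH)
    return ("retry", now + RETRY_INTERVAL)
```

Under these assumptions, a message whose reference never resolves spends roughly 90 minutes (six intervals of 15 minutes) in the retry queue before it is rejected.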
If the FHIR Server does not find an identifier, the platform assumes that either the data has no unique identifiers or identifierMapping is missing from the configuration files, and it adds the data to the rejected container. During reference handling, if the retry count reaches 6, the FHIR resources are uploaded to the "cont-fhir-rejected" container under the "ReferenceHandlerFailure" sub-folder.
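To make the four rejection sub-folders easier to see side by side, the sketch below builds a blob path for a rejected file. The container and sub-folder names come from the text above. The `RejectReason` enum, the `rejected_blob_path` helper, and the `yyyy/mm/dd` date layout are assumptions made for illustration, since the exact chronological folder structure is not specified here.

```python
from datetime import date
from enum import Enum

REJECTED_CONTAINER = "cont-fhir-rejected"


class RejectReason(Enum):
    """Sub-folders of the rejected container, one per failure type."""
    CONVERTER = "ConverterFailure"
    IG_VALIDATION = "IGValidationFailure"
    RETRY_REJECTED = "RetryRejectedFailure"
    REFERENCE_HANDLER = "ReferenceHandlerFailure"


def rejected_blob_path(reason: RejectReason, file_name: str, day: date) -> str:
    """Return container/sub-folder/date/file for a rejected file.

    The date-based layout is an assumed convention, not a documented one.
    """
    return f"{REJECTED_CONTAINER}/{reason.value}/{day:%Y/%m/%d}/{file_name}"
```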
Alerting and Notifications
Application Alerts
Application alerts can be set for any of the six Centaur® Data Platform processing stages:
- Source Ingestion
- Conversion
- Curation
- Validation
- Reference Handling
- FHIR Ingestion
There are two types of alerting rules available out of the box:
Anomaly Detection
- Error Rate Alert: Activates when the percentage of failed ingestions, conversions, or validations surpasses an acceptable limit. Example: Number of source files ingested vs. number of files rejected
- Volume Anomaly Alert: Activates when the number of records ingested from a source deviates significantly from the expected range. Example: Alert if ingestion volume drops below <xx>% or exceeds <xxx>% of the daily average
Trigger Alerts Based on Error Codes and Thresholds
- Ingestion Failure Alert. Example: Data Source Ingestion from source failed today <date_time>. Error: <Error code details + error code>
- Data Validation Error. Example: Data Validation failed for source. Error: <error code + error code details>
Alerts will contain the following information: Trigger Details, Error code, and Error code description.
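The two anomaly rules and the alert contents can be expressed as simple checks. The sketch below is illustrative only: the function names, the `Alert` class, and the message format are hypothetical, and the thresholds are left as parameters because the acceptable limits are configured per deployment.

```python
from dataclasses import dataclass


def error_rate_exceeded(ingested: int, rejected: int, max_rate_pct: float) -> bool:
    """True when rejected files exceed max_rate_pct percent of ingested files."""
    if ingested <= 0:
        return False                      # nothing ingested, no rate to compute
    return rejected / ingested * 100 > max_rate_pct


def volume_anomaly(todays_count: int, daily_average: float,
                   low_pct: float, high_pct: float) -> bool:
    """True when today's volume, as a percentage of the daily average,
    falls below low_pct or rises above high_pct."""
    if daily_average <= 0:
        return False                      # no baseline to compare against
    ratio_pct = todays_count / daily_average * 100
    return ratio_pct < low_pct or ratio_pct > high_pct


@dataclass
class Alert:
    """The three pieces of information every alert carries."""
    trigger_details: str
    error_code: int
    error_description: str


def format_alert(alert: Alert) -> str:
    """Render an alert as a single notification line."""
    return (f"{alert.trigger_details}. Error: "
            f"{alert.error_code} {alert.error_description}")
```

Keeping the checks as pure functions makes it straightforward to evaluate them per processing stage and per source.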
Contact Points
- Email: For standard notifications
- Webhooks: To integrate with other systems, such as incident management tools like ServiceNow
- Slack or Microsoft Teams: For real-time notifications
- Escalation Policies: Set up escalation policies to ensure that unacknowledged alerts are escalated to higher-level contacts