At Loyalty Juggernaut Inc. ("LJI"), we strive to maintain smooth, business-as-usual operations for our Clients. In the unlikely event of a disruption or unavailability of LJI Service, our Incident Management Team and our GRAVTY® Support Portal, NEWTON, work together on a follow-the-sun model to resolve it promptly.
LJI aims to fulfill these four main goals:
- Restore normal LJI Service operations as quickly as possible.
- Communicate with Clients throughout the incident resolution process, propose a workaround, and keep them updated on the resolution status.
- Once LJI Service is restored, analyze the incident to determine the root cause, then identify and implement permanent fixes to prevent recurrence.
- Share lessons learned across the enterprise.
Preparedness and Monitoring
LJI works proactively to prevent incidents by using best-in-class monitoring tools to monitor the infrastructure 24x7, detect issues before the Client does, and act on them early. LJI Service incorporates the following features to achieve this:
- Cloud hosting allows LJI to detect, react to, and recover from issues quickly and effectively.
- LJI Service uses AWS Key Management Service (KMS) to encrypt sensitive data at rest.
- LJI Service also uses Static Code Analysis tools during development to keep developers aware of security issues.
- LJI Service logs have audit trails for any access to AWS Fargate, AWS CloudTrail, and VPC Flow Logs. These logs can be collected and aggregated centrally for correlation and analysis, and to detect any suspicious, abusive, or unauthorized activity.
- LJI ensures regular Security Audits of GRAVTY® and its APIs by an external agency for Vulnerability Assessment & Penetration Testing (VAPT).
- LJI Service also uses Sentry, Datadog, and other infrastructure measures to monitor and identify any suspicious, abusive, or unauthorized activity.
Identification
To identify incidents correctly and allocate them to the right team, LJI's Incident Management Team follows the process below:
- After identifying the incident, the Team works to pinpoint the incident's location. Once the Team determines the incident's severity, it logs an incident ticket and assigns a priority level to it. Clients can also raise incidents via NEWTON and assign the severity themselves. In case of Severity 1 or urgent issues, the Client should call the Account Manager (or Customer Success Representative) to get support staff engaged on priority. The Client should also file a support ticket, providing details or evidence of the issue. Once the incident is created, NEWTON automatically tracks the ETAs for Response Time and Resolution Time, as per the SLAs agreed between LJI and the Client, and notifies the Incident Management Team.
- In case of Severity 1 issues, LJI provides a dedicated number configured on PagerDuty for the Client to call. PagerDuty intelligently routes the Client's call to the relevant Support Representative (DevOps, Tech Lead). NEWTON automatically logs a Severity 1 ticket when the Client dials the LJI Sev 1 hotline. The team handles Severity 1 issues with the topmost priority and works on them until the issue is fully resolved. The LJI CS Support Representative continuously updates the Client on LJI's analysis of the issue and the status of the resolution throughout the incident resolution process.
- In case of Severity 2, 3, and 4 issues, the Client opens a Support Ticket on NEWTON either by logging in to NEWTON or by simply emailing LJI's designated Support Email ID (which automatically creates a Support Request on NEWTON).
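The intake rules above can be summarized as a small lookup. The following is only an illustrative sketch: the channel labels, the `IntakePlan` type, and the `plan_intake` function are hypothetical names introduced here, and do not represent NEWTON's or PagerDuty's actual interfaces.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class IntakePlan:
    """How a Client is expected to raise an incident of a given severity."""
    channels: Tuple[str, ...]  # hypothetical channel labels, not real system names
    uses_hotline: bool         # True when the Sev 1 hotline is part of the intake


def plan_intake(severity) -> IntakePlan:
    """Map a severity (1-4) to the intake channels described in the text."""
    severity = Severity(severity)  # raises ValueError for anything outside 1-4
    if severity == Severity.SEV1:
        # Sev 1: call the hotline or the Account Manager, and file a ticket as well.
        return IntakePlan(
            channels=("sev1_hotline", "account_manager_call", "newton_ticket"),
            uses_hotline=True,
        )
    # Sev 2-4: open a ticket in the portal or email the Support Email ID.
    return IntakePlan(channels=("newton_portal", "support_email"), uses_hotline=False)
```

For example, `plan_intake(3)` returns the portal and email channels only, while `plan_intake(Severity.SEV1)` includes the hotline.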
GRAVTY® Support Process
Clients can use their dedicated NEWTON access to track their tickets and view aggregate data and analysis of the tickets they have logged. NEWTON sends proactive alerts to the respective LJI Support Representatives and the entire chain of command when a ticket's Response or Resolution time is approaching its SLA limit, so that appropriate action follows.
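The proactive alerting described above can be pictured as a check of each open ticket against its two SLA clocks. This is a minimal sketch under stated assumptions: the 80% warning threshold, the dictionary field names, and both function names are hypothetical, and the actual SLA targets are those agreed between LJI and each Client.

```python
from datetime import datetime, timedelta


def sla_status(opened_at: datetime, target: timedelta, now: datetime,
               warn_fraction: float = 0.8) -> str:
    """Classify one SLA clock as 'ok', 'approaching', or 'breached'.

    warn_fraction is an assumed threshold: alert once 80% of the target has elapsed.
    """
    elapsed = now - opened_at
    if elapsed >= target:
        return "breached"
    if elapsed >= target * warn_fraction:
        return "approaching"
    return "ok"


def tickets_needing_alert(tickets, now: datetime):
    """Return (ticket_id, clock, status) for every SLA clock that is not 'ok'."""
    alerts = []
    for ticket in tickets:
        for clock in ("response", "resolution"):
            if ticket.get(f"{clock}_done"):
                continue  # this clock has already been satisfied
            status = sla_status(ticket["opened_at"], ticket[f"{clock}_target"], now)
            if status != "ok":
                alerts.append((ticket["id"], clock, status))
    return alerts
```

A ticket opened 50 minutes ago with a hypothetical one-hour response target would be reported as `approaching` on its response clock, which is the point at which the Support Representative and the chain of command would be notified.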
Resolution
LJI CS Support triages the issue and looks for a workaround (permanent or interim) to minimize disruption for the Client. If an acceptable workaround exists, LJI Support informs the Client about it, with a live demonstration where required, to ensure that the Client is aware of the workaround and ready to proceed with it. If the workaround eliminates the issue, LJI CS Support may lower the issue's severity while continuing to work on the permanent solution.
Where necessary, LJI Customer Support loops in the relevant teams (DevOps, Web App, Engine, Mobile, Frontend, Customer Success) to assist in the resolution. If the incident is due to a Product Defect, LJI CS Support logs an internal JIRA ticket with the required ETA (based on the SLAs) for engineering to provide a quick fix.
LJI DevOps installs the quick fix once it is available, after informing the affected Clients. Quick fixes apply only to Severity 1 issues and to certain Severity 2 issues that have no workaround. For Severity 3 and 4 issues that require a code fix, LJI delivers the fix in the next Minor Release (scheduled every six weeks).
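The delivery rules for code fixes can be sketched as a simple decision plus a date calculation. This sketch makes two assumptions that the text does not state: every Severity 2 issue without a workaround is treated as eligible for a quick fix (the text says only "certain" ones are), and a Severity 2 issue with a workaround waits for the next Minor Release. The function names and the `last_release` anchor date are hypothetical.

```python
from datetime import date, timedelta

RELEASE_INTERVAL = timedelta(weeks=6)  # Minor Release cadence stated in the text


def delivery_path(severity: int, has_workaround: bool) -> str:
    """Decide how a code fix reaches the Client (assumptions noted inline)."""
    if severity == 1:
        return "quick_fix"
    if severity == 2 and not has_workaround:
        # Assumption: all Sev 2 issues without a workaround qualify.
        return "quick_fix"
    # Sev 3 and 4 per the text; Sev 2 with a workaround by assumption.
    return "next_minor_release"


def next_minor_release(last_release: date, today: date) -> date:
    """Return the first six-week release date on or after today."""
    upcoming = last_release + RELEASE_INTERVAL
    while upcoming < today:
        upcoming += RELEASE_INTERVAL
    return upcoming
```

With a hypothetical last release on 1 January 2024, a Severity 3 fix identified on 15 January would ship on 12 February, the next six-week boundary.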
Closure
After the successful remediation and resolution of the incident, the team updates the incident ticket and informs the Client. Upon successful validation of the fix or workaround, the Client closes the Support Request.
Root Cause Analysis
After the incident is closed, the LJI Support Representative prepares a Root Cause Analysis (RCA) and shares it with the Client on request. The incident response team evaluates the lessons learned from the incident, devises action plans, and marks specific internal practices and processes for improvement or enhancement to prevent similar incidents.