Monitoring and alerting based on logs
Last Updated On Wed Mar 13 2024
Overview
Monitoring and alerting on anomalies and issues occurring within our system has been one of the key challenges we've faced in building a robust and resilient platform.
We have tried a few different approaches to monitor multiple aspects of our system: performance, conversions, drop-off, data integrity, and cron job health. In this post, we will cover monitoring based on the logs produced by our services and applications.
Objectives
Proactive Monitoring
Eliminate issues before they affect users or cause service interruptions.
Quickly spot patterns across interconnected services, events, and issues by using the trace information embedded in log events, and fix problems proactively.
Alerts
Create alerts based on search patterns, thresholds for specific log metrics, or other conditions.
There are two types of rules we are experimenting with:
- Number of errors occurring within X minutes
- Percentage of errors out of total requests within X minutes
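The two rule types above map naturally onto PromQL. Here is a sketch of what they might look like as Prometheus alerting rules; the metric names follow the `winston_logger` counter shown later in this post, while the request-total metric, thresholds, and windows are illustrative assumptions:

```yaml
groups:
  - name: log-based-alerts
    rules:
      # Rule type 1: absolute number of error logs within X minutes
      - alert: HighErrorCount
        expr: sum(increase(winston_logger{level="error"}[5m])) > 20
        for: 1m
      # Rule type 2: errors as a percentage of total requests within X minutes
      # (assumes a request counter such as the one express-prom-bundle records)
      - alert: HighErrorRate
        expr: |
          sum(increase(winston_logger{level="error"}[5m]))
            / sum(increase(http_request_duration_seconds_count[5m])) * 100 > 5
        for: 1m
```

The first rule fires on raw volume, which works well for low-traffic services; the second normalizes by traffic, which avoids false alarms during load spikes.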
Implementation
Overview
The combination of Prometheus and Grafana is becoming a more and more common monitoring stack used by DevOps teams for storing and visualizing time series data. Prometheus acts as the storage backend and Grafana as the interface for analysis and visualization.
On the application side, since our services are written in JavaScript, we built a common library that customizes our logger to emit each entry as a structured object containing all the required metadata.
Overall Architecture
Customized Logger
We use existing libraries to keep count of the number of logs recorded per level (Info, Warning, Error) so that we can act appropriately.
```javascript
const promBundle = require('express-prom-bundle');
const express = require('express');
const prometheus = require('prom-client');

// Request metrics middleware: records path, method, and status code per request
const metricsRequestMiddleware = promBundle({
  includePath: true,
  includeMethod: true,
  includeStatusCode: true,
  autoregister: false, // Do not register /metrics on the main app
  promClient: {
    collectDefaultMetrics: {},
  },
});

// Counter incremented by our logger wrapper on every log call
const logCounter = new prometheus.Counter({
  name: 'winston_logger',
  help: 'Counting logInfo, logWarn, logError method calls',
  labelNames: ['level'],
});

const { promClient, metricsMiddleware } = metricsRequestMiddleware;

// Dedicated Express app exposing the /metrics endpoint
const metricsApp = express();
metricsApp.use(metricsMiddleware);

module.exports = {
  logCounter,
  metricsApp,
  promClient,
  metricsRequestMiddleware,
};
```
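The counter itself only becomes useful once the logger wrapper increments it on every call. Here is a minimal sketch of that pattern; the `makeCounter` helper is a hand-rolled stand-in for prom-client's `Counter` (so the example is self-contained), and `makeLogger` is a hypothetical wrapper, not the exact API of our internal library:

```javascript
// Stand-in for prometheus.Counter: tracks a count per label value.
// In our services this role is played by the `logCounter` defined above.
function makeCounter(labelName) {
  const counts = {};
  return {
    inc(labels) {
      const key = labels[labelName];
      counts[key] = (counts[key] || 0) + 1;
    },
    get(value) {
      return counts[value] || 0;
    },
  };
}

// Hypothetical logger wrapper: every call increments the counter with its
// level before delegating to the underlying logger (e.g. winston or console).
function makeLogger(counter, baseLogger = console) {
  const log = (level, message, meta = {}) => {
    counter.inc({ level });
    // Emit a structured object carrying the metadata our services require
    baseLogger.log(JSON.stringify({ level, message, ...meta, ts: Date.now() }));
  };
  return {
    logInfo: (msg, meta) => log('info', msg, meta),
    logWarn: (msg, meta) => log('warn', msg, meta),
    logError: (msg, meta) => log('error', msg, meta),
  };
}
```

With this shape, every service that logs through the shared library feeds the same Prometheus counter for free, without each team instrumenting their own metrics.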
You can also filter by error logs and watch for when their frequency becomes high. In that case, you can go to our logging platform (ELK at Invygo) and filter logs by type: error to see specifically what happened and which lines are causing issues, then address them accordingly.
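For reference, the "filter by type: error" step corresponds to a simple Elasticsearch filter query. This is a sketch only; it assumes the logs are indexed with a `type` field and an `@timestamp` field, which may differ from your mapping:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "error" } },
        { "range": { "@timestamp": { "gte": "now-15m" } } }
      ]
    }
  }
}
```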
Alert Manager
Alert Manager does the work of detecting anomalies and notifying the Engineering Team based on defined rules. In the example below, we specified that we should be notified on a Slack channel when the count of error logs from Billing-Service exceeds X within Y minutes.
```yaml
- alert: invygo_billing_service_error_logs
  expr: |
    sum(rate(winston_logger{name="invygo-billing-service", level="error"}[2m])) * 120 > 10
  for: 1m
  annotations:
    summary: >
      :red_circle: invygo-billing-service: error_logs={{ $value }} during 2m,
      <https://grafana-prod.invygo.com/d/lwZbhOFMk/invygo-billing-service|Grafana>
```
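The Slack delivery itself is configured on the Alertmanager side. A minimal sketch of such a configuration is shown below; the route, channel name, and webhook URL are all hypothetical placeholders, not our actual values:

```yaml
route:
  receiver: slack-engineering
  group_by: ['alertname']
receivers:
  - name: slack-engineering
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # hypothetical webhook URL
        channel: '#eng-alerts'                                   # hypothetical channel
        text: '{{ .CommonAnnotations.summary }}'
```

Routing all log-based alerts through one receiver keeps the setup simple at first; separate routes per severity can be added later.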
Using Alert Manager correctly is also a challenge when it comes to selecting the right metrics to send. It is important that the information dispatched is clear; otherwise, engineers may start ignoring these alerts, which would be a recipe for disaster later on.
Conclusion
Putting together Grafana, Prometheus, and Alert Manager creates a great foundation for any team to start monitoring their metrics; the logs will tell you everything that's happening.