Autonomous Log Analysis with AI

By Radiostud.io Staff

August 25, 2022

Today’s monitoring practices put too much of the burden on SREs and developers, who spend countless hours staring at “single panes of glass” to make sure the system behaves normally. Using proprietary NLP (Natural Language Processing) techniques, PacketAI is able to detect 5 types of anomalies in log data in real time and present insights to users to accelerate troubleshooting. This article presents some insights on how to carry out autonomous log analysis with AI.

This article was originally published by PacketAI.


Any given component of an infrastructure generates two types of data: metrics and logs.

Metrics mainly describe the outer characteristics of a component and the environmental conditions in which it runs: for example, the response time of an app component, the throughput of a database component, the queue size of a streaming engine, or the CPU usage of a Python data app. These metrics give the observer “black box” information about the component, which is enough for quick, somewhat superficial troubleshooting.

However, in order to do more in-depth troubleshooting and understand the exact root cause, logs are necessary, since they describe the inner workings of the component (much like traces, which can sometimes be categorised as logs and/or metrics).

The problem is that, at scale, the sheer amount of log data makes it unfeasible to use logs for real-time troubleshooting of incidents, and when logs are actually used in troubleshooting, it generally leads to very long MTTRs (Mean Time To Resolve). For this reason, logs are usually looked at well after the fact, for postmortem investigations.

Human Analysis of Logs is Infeasible at Scale

Many years ago, when applications were small monolithic chunks of code generating a few tens of log lines, manual analysis of logs was more than enough to understand the inner workings of the system. Today, however, with TBs of logs per day generated by hundreds or thousands of ephemeral containers, it is absolutely impossible to rely on human investigators to troubleshoot a system in real time, or in acceptable time frames.

Traditionally, engineers use static rules to generate alerts from logs, mostly by employing regexes and simple count rules. For example, a common alerting rule is to count the number of occurrences of a specific category of logs, such as Error and Warning logs. You can do this either by extracting the severity field when it is present in logs or by searching for the words “error” and “warning” in the content of log lines. This approach could work at small scale but is unfeasible at larger scales, for the following reasons:

  • The maintainability issue

Systems change and evolve, and their logs change accordingly, which means rules need to be maintained continuously. At scale, this can be time consuming and prone to human errors.

  • The portability issue

Systems are not all alike, so rules that work on component A will not necessarily work on component B; hence, engineers need to build custom rules for different components and systems. For example, some databases use the word “issue” and others use “problem” in their logs.

  • The exhaustivity issue

It is very hard to build a database of all the interesting regexes and maintain it. For example, what is the exhaustive list of interesting keywords to look for: error, warning, fatal, problem, not connected, impossible to connect…? This issue is even larger when dealing with custom application logs, where developers do not necessarily follow logging best practices and are free to use any wording of their choice.

Note that these limitations are amplified when alerting on logs and metrics is combined: for example, alert when the count of log lines containing the word “error” exceeds 500 in 5 minutes. In this case, the limitations of static thresholds on metrics combine with the limitations of rule-based log alerts, which generally results in a flood of false-positive alerts while still being late to spot the real problems.
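To make these limitations concrete, here is a minimal Python sketch of such a static rule; the keyword list, the 5-minute window and the threshold of 500 come from the example above and are assumptions for illustration, not recommendations:

import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical keyword list -- in practice it is never exhaustive.
NEGATIVE_KEYWORDS = re.compile(r"\b(error|warning|fatal|problem)\b", re.IGNORECASE)

WINDOW = timedelta(minutes=5)  # static window from the example above
THRESHOLD = 500                # static threshold from the example above

matches = deque()  # timestamps of matching log lines inside the window

def process_line(timestamp: datetime, line: str) -> bool:
    """Return True when the static rule fires for this line."""
    if NEGATIVE_KEYWORDS.search(line):
        matches.append(timestamp)
    # Drop matches that have fallen out of the 5-minute window.
    while matches and timestamp - matches[0] > WINDOW:
        matches.popleft()
    return len(matches) > THRESHOLD

Every new component needs its own keyword list, window and threshold, which is exactly where the maintainability, portability and exhaustivity issues described above come from.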


(Auto) Analysing Logs with PacketAI

As explained above, because of their large volumes, logs are (generally) only used for postmortem analysis and auditing, which is a shame given the richness of the information they contain. Thus, the ultimate goal of an autonomous log analyser is to extract insights from logs in real time and provide them to engineers during incident troubleshooting, and, even better, to surface insights about hidden events that might generate impactful incidents in the future.

This is exactly what PacketAI has built: an AIOps platform that extracts insights from metrics and logs and presents them to the user in real time. In the following, we will present how PacketAI’s log engine works and detail some log-related use cases where PacketAI outperforms classic log analysis tools.

At the heart of PacketAI’s log engine is LAD, the Log Anomaly Detection engine, which, as its name suggests, detects anomalies in log data. But before going deep into LAD, let’s lay down some terminology:

  • Log line: a single raw line of log output.
  • Anomaly: we use the term anomaly (sometimes dubbed an outlier) in its most intuitive definition: “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”. In our case, the different mechanism could be a bug, a config change, a human error, a hardware defect, etc.
  • Log header and log content: a raw log line usually contains a header (timestamp, severity, etc.) and a content, which is the core message of the log line.
  • Log template: a raw log content contains a constant part and a variable part; the variable part could be numerical values, IP addresses, port numbers, URLs, etc. The constant part comprises tokens that describe a system operation template (i.e., a log event). For example, here is the set of all possible log templates generated by a RabbitMQ component (where variable parts are either masked or replaced by *):
E1: accepting AMQP connection < PID > ( IPPORT - > IPPORT )
E2: closing AMQP connection < PID > ( IPPORT - > IPPORT )
E3: connection < PID > ( IPPORT - > IPPORT ) : user <*> ' authenticated and granted access to vhost 'PATH '
E4: client unexpectedly closed TCP connection
E5: closing AMQP connection < PID > ( IPPORT - > IPPORT , vhost : 'PATH ' , user : <*> ' )
E6: missed heartbeats from client , timeout : 60s
E7: Supervisor { < PID > , rabbit_channel_sup_sup } had child channel_sup started with rabbit_channel_sup : start_link ( ) at undefined exit with reason <*> in context shutdown_error

By the way, you can transform millions of log lines into templates in one click on PacketAI (beta release soon); the details of how we achieve this will be the subject of another post.
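As a purely illustrative sketch (this is not PacketAI’s template-mining algorithm), variable parts can be masked with a handful of regexes; the masking rules below are toy assumptions:

import re

# Toy masking rules -- a production template miner is far more robust.
MASKS = [
    (re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+"), "IPPORT"),  # ip:port pairs
    (re.compile(r"<\d+\.\d+\.\d+>"), "< PID >"),             # Erlang-style pids
    (re.compile(r"\b\d+\b"), "<*>"),                         # any remaining number
]

def to_template(content: str) -> str:
    """Replace the variable parts of a log content with placeholders."""
    for pattern, placeholder in MASKS:
        content = pattern.sub(placeholder, content)
    return content

print(to_template("closing AMQP connection <0.605.0> (10.0.0.7:5672 -> 10.0.0.9:49153)"))
# closing AMQP connection < PID > (IPPORT -> IPPORT)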

LAD is capable of detecting 5 types of anomalies:

1. Outlier log anomaly

Outlier log lines are simply ones that do not resemble the rest of the log lines. The execution of any software component generates a finite “normal” set of log templates over time, which corresponds to its “normal” execution paths. Branching into an “abnormal” execution path generates one or more outlier log lines, which can be captured as outlier, hence abnormal, log templates.

For example, suppose this is the normal execution sequence of the same RabbitMQ component mentioned above:

E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2

Now, suppose that for some reason, after E2 the following log line is generated:

cannot execute command, aborting task 15009

with the following template:

E8: cannot execute command, aborting task <*>

This log line is an outlier because its template does not belong to the normal template set that corresponds to the normal behaviour of the RabbitMQ component. However, this template is added to the existing ones and will be considered normal if it keeps recurring.
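Here is a minimal sketch of this idea, assuming log lines have already been reduced to templates; the promotion threshold of 50 recurrences is an arbitrary assumption, not PacketAI’s logic:

from collections import Counter

class OutlierDetector:
    """Flag log lines whose template is not in the known 'normal' set."""

    def __init__(self, normal_templates, promote_after=50):
        self.known = set(normal_templates)  # e.g. {E1, ..., E7}
        self.seen = Counter()               # occurrences of unknown templates
        self.promote_after = promote_after  # assumed recurrence threshold

    def observe(self, template: str) -> bool:
        """Return True if this template is an outlier."""
        if template in self.known:
            return False
        self.seen[template] += 1
        # A previously unseen template that keeps recurring eventually
        # becomes part of the normal set, as described above.
        if self.seen[template] >= self.promote_after:
            self.known.add(template)
            del self.seen[template]
            return False
        return True

detector = OutlierDetector({"E1", "E2", "E3", "E4", "E5", "E6", "E7"})
print(detector.observe("E8"))  # True: the "cannot execute command" template is new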

2. Log pattern anomaly

A normal execution of a software component, or of a set of software components, generates a normal log set pattern. A log set pattern is a set of log templates that repeats over time. Take the example of running the same RabbitMQ component mentioned above, where one line corresponds to one cycle (one minute, for example):

E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2
E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2
E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2

Suppose for some reason, the next minute exhibits the following pattern instead:

E5, E6, E5, E1, E6, E1, E1, E2, E7, E6, E7, E1, E3, E1, E6, E2, E6, E6, E6, E2

This set of templates is abnormal and would be detected as a log set pattern anomaly, or simply a log pattern anomaly. Note that only the set of templates is considered; the order in which the templates occur is not taken into account here.
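A minimal sketch of the idea, assuming one window per cycle and comparing only the unordered set of templates (PacketAI’s actual model is more involved than this):

def window_pattern(window):
    """The unordered set of templates seen in one cycle (order is ignored)."""
    return frozenset(window)

# Patterns learned from known-good cycles of the RabbitMQ component.
normal_cycle = ["E5", "E1", "E5", "E1", "E3", "E7", "E4", "E2", "E1", "E1",
                "E3", "E1", "E3", "E1", "E1", "E2", "E6", "E6", "E6", "E2"]
normal_patterns = {window_pattern(normal_cycle)}

def is_pattern_anomaly(window) -> bool:
    """A cycle whose template set was never observed during normal operation."""
    return window_pattern(window) not in normal_patterns

abnormal_cycle = ["E5", "E6", "E5", "E1", "E6", "E1", "E1", "E2", "E7", "E6",
                  "E7", "E1", "E3", "E1", "E6", "E2", "E6", "E6", "E6", "E2"]
print(is_pattern_anomaly(abnormal_cycle))  # True: E4 is missing from this cycle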

3. Log sequence anomaly

The difference from a log pattern anomaly is that a log sequence anomaly takes the log order into account. Consider the following normal log sequences, reflecting a normal execution of the same RabbitMQ component:

E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2
E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2
E5, E1, E5, E1, E3, E7, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E6, E6, E6, E2

Then, for some reason, the following sequence shows up in the logs:

E5, E2, E5, E6, E6, E6, E4, E2, E1, E1, E3, E1, E3, E1, E1, E2, E1, E3, E7, E1

The templates of this sequence are exactly the same as in the normal sequences, and the frequency of each template is exactly the same as in the normal sequences. However, the order in which the templates (and hence the log lines) occur is abnormal. This would be detected as a log sequence anomaly.
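One simple way to capture ordering is to learn the pairs of consecutive templates (bigrams) seen in normal sequences; this is an illustrative simplification, not PacketAI’s sequence model:

def transitions(seq):
    """Consecutive template pairs -- this is where the ordering information lives."""
    return set(zip(seq, seq[1:]))

normal_cycle = ["E5", "E1", "E5", "E1", "E3", "E7", "E4", "E2", "E1", "E1",
                "E3", "E1", "E3", "E1", "E1", "E2", "E6", "E6", "E6", "E2"]
normal_transitions = transitions(normal_cycle * 3)  # learned from repeated normal cycles

def is_sequence_anomaly(window) -> bool:
    """Same templates and frequencies, but transitions never seen before."""
    return bool(transitions(window) - normal_transitions)

suspect = ["E5", "E2", "E5", "E6", "E6", "E6", "E4", "E2", "E1", "E1",
           "E3", "E1", "E3", "E1", "E1", "E2", "E1", "E3", "E7", "E1"]
print(is_sequence_anomaly(suspect))  # True: e.g. the E5 -> E2 transition never occurs normally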

4. Log frequency anomaly

A log frequency anomaly occurs when the number of logs per unit of time is abnormal (too high, too low, or following a different time-series pattern). This kind of anomaly belongs to time-series anomaly detection.

Consider the normal number of log lines generated per minute by a Linux host. When a log frequency anomaly appears, the abnormal point corresponds to a much higher number of logs than usual, leading to an anomaly alert.
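Here is a minimal sketch of this kind of detection, using a rolling z-score over per-minute log counts; the history window and threshold are assumptions, and PacketAI’s time-series model is not shown here:

import statistics

def is_frequency_anomaly(history, current, z_threshold=4.0):
    """Flag the current per-minute log count when it deviates strongly
    from the recent history (rolling z-score)."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return current != mean
    return abs(current - mean) / std > z_threshold

# Roughly 120 lines per minute on a quiet Linux host, then a sudden burst.
history = [118, 121, 119, 117, 122, 120, 118, 123, 119, 121]
print(is_frequency_anomaly(history, 120))   # False: within normal variation
print(is_frequency_anomaly(history, 2400))  # True: roughly 20x the usual volume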

5. Negative log detection

Negative logs are log lines that carry a negative sentiment, for example:

client unexpectedly closed TCP connection

and

missed heartbeats from client , timeout : 60s

Traditional methods of detecting such log lines include manual keyword search, rule-based alerting using regexes, and so on. As mentioned above, this is not feasible at scale, as the list of negative keywords can be huge and varies from component to component. Using advanced ML techniques, PacketAI is agnostic to keywords and automatically spots negative sentiment in logs.
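PacketAI’s model is proprietary, but as a rough illustration of keyword-agnostic sentiment scoring, here is a sketch using an off-the-shelf pretrained sentiment model from the Hugging Face transformers library; the default model and the confidence threshold are assumptions and are not tuned for log data:

# Illustration only -- a generic sentiment model, not PacketAI's log-specific model.
# Requires: pip install transformers torch
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default pretrained English sentiment model

def is_negative_log(content: str, threshold: float = 0.9) -> bool:
    """Flag log content classified as negative with high confidence."""
    result = sentiment(content)[0]
    return result["label"] == "NEGATIVE" and result["score"] >= threshold

print(is_negative_log("client unexpectedly closed TCP connection"))    # expected to be flagged
print(is_negative_log("accepting AMQP connection (IPPORT -> IPPORT)")) # expected not to be flagged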

Conclusion 

In this post, we explained why logs are paramount for observability, how teams traditionally analyse logs, and how PacketAI’s approach to log analysis can help teams accelerate troubleshooting and reduce MTTR. We detailed the terminology around log templates and presented the different anomaly types that PacketAI can detect in logs. In future posts, we will dig deeper into each anomaly type and present the PacketAI features around them.

Radiostud.io Staff

About the author

Showcasing and curating a knowledge base of tech use cases from across the web.
