What you need to know about AIOps
Content provided by IBM and TNW
As our lives become more digitized, the IT infrastructure that supports the applications and services we use has become increasingly complex. Services can now run in the cloud, on-premise, serverless or in hybrid setups, making it possible to accommodate different types of applications, environments and audiences.
However, managing such complex IT architectures is becoming increasingly difficult. There are too many moving parts, making it difficult to optimize IT, predict and prevent failures, and respond to incidents after they happen.
Fortunately, AIOps – the use of AI in IT operations – is a rapidly developing field that can address some of these challenges through automation. Here’s everything you need to know about AIOps and what it can do for your organization.
The challenges of modern IT
“The industry is facing three major trends, and first is complexity,” said Pratik Gupta, CTO at IBM Automation.
More and more organizations are using cloud IT, and in many cases in combination with on-premise servers. This is in addition to a variety of serverless technologies, APIs, microservices, and the like that are being integrated into applications.
“Many organizations use multiple clouds — up to five. You have on-prem environments, you have cloud environments. It’s much more complex than it used to be,” says Gupta.
The second trend is scale.
“During the COVID pandemic, we have experienced ten years of digitization in one year. Organizations are turning to more digital experiences and more applications to get work done. There are many more applications in this hybrid cloud state,” said Gupta.
And the third? Skills.
“Most C-level executives don’t have the time or talent to manually manage IT environments, which, as we know, are becoming extremely complex,” says Gupta.
These trends drive interest in automating the IT environment and getting help from AI.
“AI and automation, also known as intelligent automation, are no longer nice-to-have. It’s a necessity and it basically sets companies apart, and those who use automation and AI will do much better,” says Gupta.
This is where AIOps comes into the picture. AIOps is a suite of tools and services that use AI to automate all IT operations, from monitoring and gathering information to optimizing machines and services, and predicting and resolving incidents.
Observability
“We view applying AIOps as a transformation not only for technology but also for people,” says Gupta. “People need to understand that this is a way to expand their jobs, not replace them.”
In short, AIOps helps IT staff do things that were impossible with their previous tools. The first step in implementing AIOps is collecting quality information about your IT infrastructure and operations. This is important not only to give you a better picture of your IT infrastructure, but also to train and guide AI systems to optimize and monitor it. This first phase of AIOps is called ‘observability’.
“Observability differs from previous application performance monitoring (APM) in that observability is about collecting all the data,” says Gupta. “While old APM legacy tools can collect information purely from a performance management perspective, observability is capturing information to do AIOps.”
An example of an observability tool is IBM Instana Observability, a solution that can capture metrics, traces, and logs from applications running on a variety of computing platforms, from mobile devices to on-premise servers to mainframes and virtual machines running in the cloud. According to Gupta:
One of the things that observability tools like Instana help you with is to find root causes faster, which application or microservice is causing errors, and pinpoint them instantly using very strong heuristics and algorithms and AI.
AI-powered observability can lead to major gains. Consider ExaVault (acquired by Files.com), a company that provides file transfer services to large organizations. ExaVault’s API receives 35,000 requests per minute and more than 50 million calls per day. Availability is very important to ExaVault, but since each customer uses the service and API in different ways, it is very difficult for the company to oversee all activity through traditional monitoring methods.
Using Instana, ExaVault was able to establish observability in its API to monitor and control availability in a way that was impossible with previous APM tools. This allowed them to find and fix problems faster than before. They achieved 99.99 percent availability and reduced the mean time to resolution (MTTR) by approximately 57 percent.
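To put these figures in perspective, here is a minimal Python sketch of the arithmetic behind them. It is not how Instana computes anything; the function names and the incident log are hypothetical, and only the 35,000 requests per minute and the 99.99 percent target come from the article.

```python
from datetime import datetime, timedelta

# Figure from the article: 35,000 API requests per minute.
REQUESTS_PER_MINUTE = 35_000
requests_per_day = REQUESTS_PER_MINUTE * 60 * 24  # 50,400,000


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Longest total downtime that still meets the availability target."""
    return period * (1 - availability_pct / 100)


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) over all incidents."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident log, invented for illustration only.
incidents = [
    (datetime(2023, 1, 5, 10, 0), datetime(2023, 1, 5, 10, 30)),
    (datetime(2023, 2, 9, 12, 0), datetime(2023, 2, 9, 13, 30)),
]

print(requests_per_day)
print(downtime_budget(99.99, timedelta(days=365)))
print(mean_time_to_resolution(incidents))
```

At 99.99 percent availability, the downtime budget over a 365-day year works out to roughly 52.6 minutes, which is why shaving time off each resolution matters so much at this level.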
Optimization
“In today’s complex environments of cloud, on-prem, and hybrid, once an app is deployed, a human cannot mentally monitor and manage how things need to be set up, configured correctly, and ensure they have the right performance, the right server size, memory allocation, and so on,” says Gupta. “These are currently managed through smart guesses.”
Another important aspect of AIOps is the optimization of IT resources. An example is IBM Turbonomic, a tool that analyzes end-to-end environments and creates a single-view topology of the system. Turbonomic can process data from various aspects of the system, including service-level objectives, application configurations, and pricing and contracts. It takes all this information and helps you optimize the components of your IT ecosystem to achieve various goals, such as improving availability or reducing waste and costs. Depending on your requirements, Turbonomic can automatically optimize your IT components or provide you with recommendations.
A Forrester Total Economic Impact study found that applying Turbonomic results, on average, in a return on investment of 471% with a payback time of less than six months. Automation tools like Turbonomic help IT departments avoid infrastructure over-provisioning, which on average results in a 75% reduction in IT spending.
For example, BBC Studios used Turbonomic to manage its network of over 1,000 virtual machines. By applying Turbonomic, the BBC team was able to get a full picture of their environment, helping them better understand the root cause of performance issues and get their environment back to a maximally efficient and performant state. Turbonomic provided them with action recommendations and predictions about the impact of each action.
The team started by reviewing Turbonomic’s recommendations and manually implementing them. Eventually, they automated some of the resizing actions without manual intervention. Through the smart and automated optimization of their IT resources, they were able to reclaim hundreds of gigabytes of memory and dozens of virtual CPUs in one month.
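The kind of resizing recommendation described above can be sketched in a few lines of Python. This is an illustrative heuristic only, not Turbonomic's algorithm: the `VM` class, the 25 percent headroom and the fleet data are all assumptions made up for the example, and the sketch only ever shrinks allocations.

```python
import math
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    allocated_gb: int     # memory currently assigned to the VM
    peak_used_gb: float   # highest usage observed over the review window


def recommend_size(vm: VM, headroom: float = 0.25) -> int:
    """Suggest peak usage plus headroom, rounded up to whole GB.

    Never suggests more than the current allocation (shrink-only sketch).
    """
    target = math.ceil(vm.peak_used_gb * (1 + headroom))
    return min(vm.allocated_gb, target)


def reclaimable_gb(vms: list[VM], headroom: float = 0.25) -> int:
    """Total memory freed if every recommendation were applied."""
    return sum(vm.allocated_gb - recommend_size(vm, headroom) for vm in vms)


# Hypothetical fleet, invented for illustration only.
fleet = [
    VM("web-1", allocated_gb=32, peak_used_gb=8.0),
    VM("db-1", allocated_gb=64, peak_used_gb=50.0),
    VM("batch-1", allocated_gb=16, peak_used_gb=15.0),
]

for vm in fleet:
    print(vm.name, vm.allocated_gb, "->", recommend_size(vm))
print("reclaimable GB:", reclaimable_gb(fleet))
```

Reviewing such recommendations by hand first and automating them later mirrors the path the BBC team took.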
Incident prevention and resolution
One of the challenges of complex IT infrastructures is predicting when and where failures will occur — and taking the right actions to prevent them. Another challenge is to find the cause of malfunctions and to respond to them in a timely manner. Fortunately, this is another area where AIOps can help.
An example is IBM Cloud Pak for Watson AIOps, a solution that collects and analyzes all incidents, metrics, traces, logs and tickets from an IT system in a common AI framework with machine learning models. Cloud Pak for Watson AIOps can help predict blast radius, the effect the failure of one component will have on other parts of the system. Accordingly, it can make recommendations to prevent such incidents. As Gupta explains:
It is a tool that provides a general framework to understand what is happening in the system and to take actions in response to incidents, both predictively and proactively.
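The idea of a blast radius can be illustrated with a simple graph traversal. The Python sketch below assumes a hypothetical dependency map and walks it breadth-first to find every component that depends, directly or indirectly, on the failed one. It is a simplification for illustration, not a description of how Cloud Pak for Watson AIOps models or predicts impact.

```python
from collections import deque


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every component that transitively depends on `failed`."""
    # Invert the map: dependents[x] = components that depend directly on x.
    dependents: dict[str, set[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical topology, invented for illustration only.
topology = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart": ["db", "cache"],
    "search": ["cache"],
}

print(sorted(blast_radius(topology, "db")))
print(sorted(blast_radius(topology, "cache")))
```

In this toy topology, a database failure reaches the payments, cart and checkout services but leaves search untouched.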
Incident prediction is especially useful for organizations responsible for critical infrastructure. For example, Taiwan’s National Center for High-performance Computing (NCHC) operates dozens of supercomputers and provides computing resources for all kinds of operations, including drug research and scientific projects. NCHC used Cloud Pak for Watson AIOps to build an AI-based automation system for predicting incidents and improving resilience.
Cloud Pak for Watson AIOps used structured and unstructured data from NCHC’s computer network to train AI models to automatically and proactively manage issues and incidents. Automation enabled NCHC to achieve a 55% reduction in Mean Time to Detect (MTTD) issues that would impact service. They were also able to detect potential outages 25 hours in advance, giving them crucial time to resolve incidents before they happened.
Beyond IT
The benefits of AIOps can go beyond reducing IT costs and downtime to creating better applications and serving customers better. According to Gupta:
We see a shift in thinking from managing IT as a cost center to managing IT as a revenue facilitator. AIOps not only dynamically optimizes IT infrastructure and results in savings, but it also frees people up to do more business-critical work.
For example, AIOps can help developer teams understand bottlenecks and the consequences of outages in advance. This helps them design their applications and systems with built-in resiliency, rather than reacting ad hoc to failures.
“If you shift left and ask how a development team should build their application to be more resistant to failures, one of the things we’re looking at is how code changes affect the quality of the shipped release,” says Gupta.
By spending less time troubleshooting technical glitches, developers can focus more on making better products that solve customer problems.
“Several studies show that AIOps drives more customers to web applications,” says Gupta. “The reason is that the people in IT are now more focused on doing work that is aligned with the business and generates revenue.”
The field is just starting to take off and there are many advancements in artificial intelligence research that could find their way into AIOps.
“We started with advanced heuristics, added machine learning models, and we’re seeing more and more foundation models in IT and AIOps,” says Gupta.
In the future, we will see much more use of natural language processing and foundation models that influence how IT is managed. We are going to see a huge amount of intelligence and AI being used in the management of IT systems. We see an exciting path ahead with this evolution of the use of AI in IT. We need to stay tuned as the next few years are going to be very exciting in terms of AI’s impact on IT.