What is monitoring?
November 20, 2020
The perfect IT system that delivers its services reliably and without errors does not exist. A functioning IT system is not a state but a process, and it needs to be constantly monitored by people (administrators).
Many kinds of events can cause an IT system to fail. Wearing parts such as hard disks, operating errors, malicious attacks, or neglected maintenance tasks are just a few of the reasons why errors and failures occur. And if your customers find out before you do that a system is no longer working, you need monitoring at the latest.
A monitoring system should perform the following tasks:
- Record the status of all components
- Prepare, sort, and rate the data
- Present the data clearly and in condensed form
- Detect deviations from the normal state
- Trigger alarms
- Log states and changes
- Monitor and record compliance or deviations
Monitoring is more than an alarm in case of error
The bigger an IT system is, the harder it is to keep track of the health of the entire system and all its components. Accordingly, the monitoring system must fulfill more complex tasks than those described above.
Sending an alarm when an error occurs is an important task of a monitoring system, but by far not the only one. Monitoring means collecting a lot of data and automatically drawing the right conclusions. If a single component fails, the conclusion is easy: there is a problem, and somebody should take care of it. Beyond a certain number of systems, however, messages from the monitoring system become part of everyday life. The monitoring system should therefore distinguish harmless errors from serious ones and, depending on the severity, use different media for notification.
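A minimal sketch of severity-dependent notification might look like the following. The severity levels, the channel names, and the function `route_notification` are illustrative assumptions and not taken from any particular monitoring product.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical mapping from severity to notification media.
CHANNELS = {
    Severity.INFO: ["dashboard"],
    Severity.WARNING: ["dashboard", "email"],
    Severity.CRITICAL: ["dashboard", "email", "sms"],
}


def route_notification(severity: Severity) -> list[str]:
    """Return the media that should carry a message of this severity."""
    return list(CHANNELS[severity])
```

With this mapping, a harmless message only shows up on the dashboard, while a critical one additionally triggers an email and a text message.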
In addition to detecting errors, a monitoring system should allow conclusions or concrete statements about the reliability of systems and components. This requires storing historical data. The system should provide both a programmatic interface and a user interface so that the stored data can be evaluated quickly and conveniently.
IT managers and system administrators also want to use a monitoring system to prevent a component or service from failing in the first place. This usually requires evaluating large amounts of data. The performance of components and services and the utilization of the infrastructure must also be continuously measured and displayed graphically. A simple example is the free space on a hard disk. If the monitoring system calculates that the used space grows by X GB per day, it is not difficult to predict when the disk will be full.
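The disk forecast can be sketched as a simple linear extrapolation. The function name `days_until_full` and the example numbers are assumptions made for illustration; real disk growth is rarely perfectly linear.

```python
from typing import Optional


def days_until_full(capacity_gb: float, used_gb: float,
                    growth_gb_per_day: float) -> Optional[float]:
    """Linear forecast of the days left until the disk is full.

    Returns None if usage is flat or shrinking, because then no
    fill date can be predicted.
    """
    if growth_gb_per_day <= 0:
        return None
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_day


# Assumed example: 500 GB disk, 380 GB used, growing by 4 GB per day.
print(days_until_full(500, 380, 4))  # 30.0
```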
Now, if a service uses five servers with a total of 20 disks, you do not want to be woken up on a Sunday night just because a single disk is full. The monitoring system now has a complex task: it has to process the data of 20 hard disks, five servers, and one service, plus the day of the week and the time, to reach a "decision": should an alarm go out or not?
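One way such a decision could be modeled is sketched below. The business-hours definition, the threshold `max_degraded_servers`, and the return values are all assumptions chosen for illustration.

```python
from datetime import datetime


def is_business_hours(now: datetime) -> bool:
    """Monday to Friday, 08:00 to 18:00 (an assumed definition)."""
    return now.weekday() < 5 and 8 <= now.hour < 18


def decide(full_disks_per_server: dict[str, int], now: datetime,
           max_degraded_servers: int = 1) -> str:
    """Return 'none', 'defer', or 'alarm_now' for one service."""
    degraded = [s for s, n in full_disks_per_server.items() if n > 0]
    if not degraded:
        return "none"
    # Too many servers affected: the service itself is at risk.
    if len(degraded) > max_degraded_servers:
        return "alarm_now"
    # Few servers affected: the redundant service keeps running,
    # so the alarm can wait until somebody is at work.
    return "alarm_now" if is_business_hours(now) else "defer"


servers = {"srv1": 1, "srv2": 0, "srv3": 0, "srv4": 0, "srv5": 0}
print(decide(servers, datetime(2020, 11, 22, 23, 0)))  # Sunday night: defer
```

The point of the sketch is that the same raw event (one full disk) leads to different outcomes depending on redundancy and time of day.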
Performance data is not only needed to predict the next outage. A monitoring system also collects a lot of data speculatively, without evaluating it automatically. You need this data to explain unpredictable incidents. A simple example is the number of visitors to a website. If the web server "crashes", you can view the traffic as a graph. If the failure of the web server was preceded by an unusually sharp increase in visitor numbers, this would be a plausible explanation for the failure. The high number of visitors could have caused such a heavy load that the server crashed.
When planning and expanding hardware, it is also important to know how heavily the existing hardware was used in the past.
Customers often request an availability report. Or perhaps you bill customers for resources according to their consumption. This, too, is a task of the monitoring system.
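As a rough illustration, the core figure of an availability report can be derived from stored check results. This sketch assumes evenly spaced checks, each recorded as a boolean; the function name `availability_percent` is hypothetical.

```python
def availability_percent(checks: list[bool]) -> float:
    """Share of successful checks in percent.

    Assumes the checks were taken at evenly spaced intervals, so each
    check stands for the same amount of time.
    """
    if not checks:
        raise ValueError("no checks recorded")
    return 100.0 * sum(checks) / len(checks)


# Assumed example: ten checks, one of them failed.
print(availability_percent([True] * 9 + [False]))  # 90.0
```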
The requirements for an IT monitoring system can be summarized in five categories:
1. Watch the condition of the system
- End-to-end monitoring, where the delivered service is tested for functionality as close as possible to the end user
- Status recording of all services, software and hardware
- Long-term storage of information about the availability of services and components
- Report conditions that require manual intervention in the system
- Tell the responsible employee as much as possible about the cause of an error
- Document response times and troubleshooting
- Gather enough information to allow a detailed root cause analysis
- Collection of information for decisions
4. Quality measurement
- Data collection on the performance and throughput of the system and subcomponents
- Recording of agreed limits and their compliance
- Identification of bottlenecks, overloads and implementation errors
- Monitoring of standardized configurations
- Warn in case of deviations from a standardized procedure
The last point in particular, the monitoring of standardized configurations, is often neglected. However, a configuration that follows the agreed standard is essential for a stable system. Put differently, problems are often caused by changes to the environment. This is where the saying "Never touch a running system", often cited in IT circles, comes from: systems that once ran well often continue to run for years without problems. Correctly configured systems minimize the risk of breakdowns.
Your monitoring system should be able to document the following aspects of the system configuration and alert you to deviations:
- When were changes made to the configuration? For example, if a change to an Apache configuration file and the subsequent failure of the web server fall within the same short time window, it is reasonable to assume that the change is responsible for the failure.
- Is the right (agreed) software in use? Some employees experiment even on critical systems. Do not just monitor that some mail server is running. Monitor that the standard mail server agreed on in your company is running.
- When were updates and patches installed? The monitoring should always document which version and which release of a piece of software was in use.
- Are security updates available for the software and the operating system, and when were these updates downloaded?
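The first point, relating a configuration change to a later failure, can be sketched as a time-window check over recorded changes. The function name `suspicious_changes`, the 30-minute window, and the file paths are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def suspicious_changes(changes, failure_time, window=timedelta(minutes=30)):
    """Return the config changes made within `window` before the failure.

    `changes` is a list of (path, changed_at) tuples as a monitoring
    system might have recorded them.
    """
    return [
        (path, changed_at)
        for path, changed_at in changes
        if timedelta(0) <= failure_time - changed_at <= window
    ]


# Hypothetical change log and failure time.
changes = [
    ("/etc/apache2/apache2.conf", datetime(2020, 11, 20, 14, 50)),
    ("/etc/ssh/sshd_config", datetime(2020, 11, 18, 9, 0)),
]
failure = datetime(2020, 11, 20, 15, 5)
print(suspicious_changes(changes, failure))
```

Only the Apache change falls into the window before the failure, which makes it the first candidate to investigate. A match is a hint, not proof of the cause.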
Do you think monitoring is complicated? No, it is not! It's easy with the right software. Cloudradar is monitoring software as a service that fulfills the aforementioned goals.
Now you could argue that you do not need special software for the tasks mentioned above and that a few scripts will do. If you want to monitor a single web server, a script will get you to an acceptable solution. But when it comes to a network and servers in production use, scripts are not enough. A service like Cloudradar can do more:
- Not only the final product, e.g. the availability of a website, is monitored. All subcomponents such as hardware and software, operating systems, and network infrastructure are monitored as well.
- Monitoring many subcomponents, e.g. free hard disk space, can prevent errors. Routine tasks are no longer forgotten.
- Resource bottlenecks are detected early.
- A uniform setup is guaranteed. Monitoring detects immediately if a colleague did not adhere to the agreed conventions when installing a new server. The monitoring provides a to-do list of what has to be changed.
- Alarms are targeted. Only the relevant data is sent, so the admin knows immediately where to start troubleshooting. (If a router fails, you do not want to receive tons of text messages telling you which websites are now offline because their web servers sit behind the failed router.)
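The router example in the last point can be illustrated with a small dependency check. This is only a sketch of the general idea, not a description of how Cloudradar implements it; the function `root_causes` and the component names are hypothetical.

```python
def root_causes(failed: set[str], parent: dict[str, str]) -> set[str]:
    """Keep only failed components that have no failed upstream component.

    `parent` maps each component to the component it depends on.
    """
    def has_failed_ancestor(node: str) -> bool:
        seen = set()
        current = parent.get(node)
        # Walk up the dependency chain; `seen` guards against cycles.
        while current is not None and current not in seen:
            if current in failed:
                return True
            seen.add(current)
            current = parent.get(current)
        return False

    return {node for node in failed if not has_failed_ancestor(node)}


# Hypothetical topology: two websites on two web servers behind one router.
parent = {
    "web1": "router1",
    "web2": "router1",
    "site-a": "web1",
    "site-b": "web2",
}
failed = {"router1", "web1", "web2", "site-a", "site-b"}
print(root_causes(failed, parent))  # {'router1'}
```

Instead of five messages, the admin receives one alarm for the router, which is where troubleshooting has to start.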