How to prevent and recover from server failures
Aug 20, 2020

Hardware, software, and facility problems can all cause server failures. With the right protocols and preventive maintenance, organizations can reduce the number of failures and, in some cases, avoid them altogether.


Server failure is a common problem that affects organizations of all types and sizes. Downtime can last for several days, during which systems cannot access key business data. This can lead to operational problems, service disruption, and maintenance costs.


A failure can originate in server hardware, software, or the data center facility. An organization that understands the possible causes can often head off problems before they develop into downtime. Because failures can still occur, it should also have contingency plans in place.




What causes the server to fail?


When an alert arrives or a failure is discovered, the first step is to determine how the server failed and why. How quickly the organization acts can mean the difference between minutes and days of downtime. Common causes of server failures include:


• Overheating. If a server runs too hot, its performance may degrade or it may fail completely.


• Hardware problems. Hardware components sometimes break down. The cause may be the failure of an individual component, such as a battery or disk, a cooling system failure, or simply the age of the equipment.


• Software problems. Outdated operating systems may crash under heavy traffic, and unapproved patches may cause errors or data corruption. Software upgrades and updates can also fail and introduce new problems.


• System overload. Peak traffic periods and full server logs can overload the system and cause it to fail.


• Network attacks. Weak network security or an outdated, unsupported operating system can leave the server vulnerable to attacks that may cripple or crash it.


• Natural disasters. Earthquakes, fires, floods, and thunderstorms can seriously damage network systems and interrupt service.
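The causes above can be turned into a first-pass triage check. The following is a minimal Python sketch, not a definitive diagnostic: the `Snapshot` fields, the `likely_causes` function, and every threshold are hypothetical values chosen for illustration and would need tuning for real hardware.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical readings gathered from a server at one point in time."""
    cpu_temp_c: float      # CPU temperature in degrees Celsius
    disk_used_pct: float   # percentage of disk space in use
    load_per_core: float   # load average divided by the number of cores
    disk_errors: int       # disk errors reported by diagnostic tools
    os_supported: bool     # whether the OS still receives vendor updates


def likely_causes(s: Snapshot) -> List[str]:
    """Map readings to candidate cause categories (thresholds are assumptions)."""
    causes = []
    if s.cpu_temp_c >= 85.0:
        causes.append("overheating")
    if s.disk_errors > 0:
        causes.append("hardware problem")
    if not s.os_supported:
        causes.append("software problem")
    if s.disk_used_pct >= 95.0 or s.load_per_core >= 2.0:
        causes.append("system overload")
    return causes


reading = Snapshot(cpu_temp_c=91.0, disk_used_pct=97.0,
                   load_per_core=0.6, disk_errors=0, os_supported=True)
print(likely_causes(reading))  # ['overheating', 'system overload']
```

An empty list only means that none of these simple rules fired; it does not rule out a network attack or a facility problem, which need other evidence.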




How to prevent common server failures


Persistent reboots and sudden slowdowns are warning signs of an impending server failure. The sooner an organization recognizes these signs, the faster it can act. Server monitoring software can help organizations track server health, keep a close watch on critical systems, and receive alerts about potential problems.
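As a rough illustration of how a monitoring tool might flag these two warning signs, here is a small Python sketch. The function names, the window size, the slowdown factor, and the reboot limit are all assumptions made for this example; real monitoring products use their own rules.

```python
from statistics import median
from typing import Sequence


def sudden_slowdown(response_ms: Sequence[float],
                    window: int = 5, factor: float = 3.0) -> bool:
    """True if the latest sample exceeds `factor` times the median of the
    `window` samples that precede it."""
    if len(response_ms) <= window:
        return False  # not enough history to form a baseline
    baseline = median(response_ms[-window - 1:-1])
    return response_ms[-1] > factor * baseline


def persistent_reboots(boot_times: Sequence[float], now: float,
                       period_s: float = 3600.0, limit: int = 3) -> bool:
    """True if at least `limit` boots happened within the last `period_s` seconds."""
    recent = [t for t in boot_times if now - t <= period_s]
    return len(recent) >= limit


# Response times in milliseconds; the last sample is far above the baseline of 100.
print(sudden_slowdown([100, 110, 90, 105, 95, 400]))        # True
# Boot timestamps in seconds; four boots fall inside the one-hour window.
print(persistent_reboots([0, 1000, 2000, 3500], now=3600))  # True
```

Comparing against a median rather than a fixed limit keeps the check usable on servers with very different normal response times.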


In addition to using monitoring tools, you can take the following preventive maintenance steps to keep servers running and healthy.


1. Ensure optimum ambient temperature. Servers need proper ventilation and temperature control to avoid overheating. Check interior and exterior surfaces for dust and adjust the temperature settings as required.


2. Perform routine maintenance. Hardware problems are often the hardest to predict and prevent because they can occur at random. Keep track of the age of each server, run routine disk checks, and update or upgrade systems regularly. When parts or machines become obsolete, replace them completely. Predictive analytics can also help estimate when a part is likely to fail.


3. Install updates regularly. Apply software and operating system updates and patches on a regular schedule. This maintains performance and protects the server from exploitable software vulnerabilities.


4. Maintain strict access control and detailed event logs. Human error is almost impossible to eliminate. Automation can minimize it, but human intervention is still required. To reduce risk, keep strict records of who can access the server room and the management software. The organization should also maintain a detailed event log and review it regularly.


5. Monitor performance trends. With continuous performance monitoring, organizations can better predict the resources required during peak periods and identify poor performance that may indicate an impending failure. These trends may also reveal underlying hardware and software problems or areas of the server room that need additional cooling. Routine housekeeping, such as maintaining log files, emptying the Recycle Bin, deleting files in temporary folders, and defragmenting hard drives, helps maintain performance levels and avoid overloading the system.


6. Develop server contingency plans. Redundancy is key to keeping a server failure from turning into downtime. The contingency plan should specify the standby hardware available, such as multiple power supplies, redundant memory, and backup servers.


7. Design disaster and data recovery plans. In the event of a natural disaster or security breach, disaster recovery and data recovery plans protect the organization from prolonged downtime and catastrophic data loss. Having a backup plan for the worst-case scenario is critical.
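Step 5 mentions using performance trends to predict resource needs. One simple way to do that, sketched below in Python, is to fit a straight line to daily disk usage and estimate how long it will take to reach capacity. The function name `days_until_full` and the sample figures are hypothetical, and a linear trend is only an assumption; real workloads may grow in bursts.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_pct: Sequence[float],
                    capacity_pct: float = 100.0) -> Optional[float]:
    """Estimate days until the disk is full from one usage sample per day.

    Fits a least-squares line to the samples and projects forward from the
    most recent sample. Returns None if there are fewer than two samples or
    if usage is not growing.
    """
    n = len(daily_used_pct)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_used_pct))
    slope = sxy / sxx  # percentage points gained per day
    if slope <= 0:
        return None
    return (capacity_pct - daily_used_pct[-1]) / slope


# Usage grows by 2 percentage points per day, with 42 points of headroom left.
print(days_until_full([50, 52, 54, 56, 58]))  # 21.0
```

The same projection can be applied to other metrics that have a hard ceiling, such as memory use, as long as the growth is roughly steady.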




How to resolve and recover from server failures


Even with preventive maintenance, a server can still fail. When it does, there are steps you can take to recover effectively. Besides restarting the server, you can use visual cues and diagnostic software to narrow down the possible causes.


Once the root cause is determined, you can switch to a backup server and take the necessary steps to repair the failed machine.
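The switch to a backup server can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the `choose_active` function and the server names are hypothetical, and the health check is a hard-coded table standing in for whatever real probe an organization uses.

```python
from typing import Callable, Optional, Sequence


def choose_active(primary: str, backups: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the primary if it is healthy, otherwise the first healthy
    backup, otherwise None."""
    for server in [primary, *backups]:
        if is_healthy(server):
            return server
    return None


# Hypothetical health table: the primary and the first backup are down.
health = {"web-01": False, "web-02": False, "web-03": True}
active = choose_active("web-01", ["web-02", "web-03"],
                       lambda name: health.get(name, False))
print(active)  # web-03
```

Ordering the backups by preference keeps the choice predictable, and a `None` result signals that no server is available and the disaster recovery plan applies.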


Source: computer room 360 Reprinted: server industrial personal computer D1net