Last October, hpc-ch organized a forum dedicated to monitoring and reporting in HPC.
Running an HPC system is like commanding a starship: we sit on the bridge and watch displays, monitors, radars and alerts to understand the general health of the ship and make sure we can continue our journey and arrive safely at our destination. Hundreds of components contribute to the functioning of the whole system; hardware and software components raise alerts at different levels; and early detection and diagnostic tools can help us reduce downtime and ensure robust operations.
We will publish the slides from our members' presentations on this blog.
24/7 monitoring and on call intervention @ CSCS, Sadaf Alam (CSCS)