hpc-ch Forum on monitoring and reporting in HPC

Dear members and guests of the hpc-ch community,

We are pleased to invite you to the upcoming hpc-ch forum on

Monitoring and reporting in HPC

to be held on Thursday, October 23rd, 2014 from 9:30 until 16:45, kindly hosted by the University of Bern.

Please register via the website by October 10th, 2014, and let me know by e-mail if you are willing to contribute a short presentation (20 minutes including Q&A).

Running an HPC system is like commanding a starship: We are sitting on the control bridge and have to look at different displays, monitors, radars and alerts to understand the general health of the spaceship and make sure we can continue our journey and arrive safely at our destination. There are hundreds of components that contribute to the functioning of the whole system; there are alerts at different levels being raised from hardware and software components; there are early detection and diagnostics tools that could help us reduce downtimes and ensure robust operations.

This hpc-ch Forum will give us the opportunity to discuss different aspects of monitoring and reporting in the HPC field.

Key Questions

What elements of our HPC systems does it make sense to monitor? (CPU, RAM, Ethernet & InfiniBand network, I/O interfaces, services, end-to-end cases, storage, middleware, …)
What tools are available to collect, analyze and display this information? What information do we make available to our users and to the public? Can ITIL Service Design concepts offer some guidance for ensuring robust HPC services?
What are the best practices for gathering error logs and warning messages? What are efficient mechanisms for generating alerts? What escalation levels do we support? How can we react during the night, weekends and longer holidays?
How can monitoring improve the security of our systems? (Netflow statistics, …)
What kind of reports do we provide routinely (current load, available resources, …)? Which reports are relevant to whom?
Which tools and techniques are widely used for statistical analysis of system logs? What strategies are used to correlate hardware and application errors? In an RFP and procurement of an HPC system, are there typical requirements or metrics for monitoring and reporting needs?

Location

Universität Bern, Campus von Roll
Fab6 Hörsaalgebäude, Room 104-6
Fabrikstrasse 6
3012 Bern

Chairmanship

Nina Mujkanovic, University of Bern
Michael Rolli, University of Bern
Michele De Lorenzi, hpc-ch / CSCS

