Upcoming downtime of cluster compute and interactive nodes - both general and protected environments
Posted: July 3rd, 2018
Through the months of July and August there will be a series of outages involving the interactive and compute nodes in both the general and protected environments. These outages allow for the application of kernel and driver updates, including patching the compute nodes to address further Meltdown and Spectre vulnerabilities, as well as the migration to SSSD (System Security Services Daemon) for user account authentication, to improve security and scalability. Note that the last item will not change how users access CHPC systems.
We will be doing these updates in groups, according to the following schedule, starting at 8am:
Tuesday, July 17: All interactive nodes (includes the atmos, meteo, wx, frisco, apollo nodes) and resource managers on all general clusters AND compute nodes on tangent, lonepeak, AND redwood
- Reservation will drain tangent, lonepeak, and redwood of running jobs before the 8am starting time
- The update of the resource managers of the other clusters will require the scheduling of jobs to be paused, so there will be a window of time when no jobs will be started, but it will not require the batch queue to be drained.
Tuesday, July 31: Compute nodes on ember (Reservation will drain ember of running jobs before the 8am starting time)
Tuesday, August 21: Compute nodes on ash and notchpeak (Reservation will drain ash and notchpeak of running jobs before the 8am starting time)
Tuesday, August 28: Compute nodes on kingspeak (Reservation will drain kingspeak of running jobs before the 8am starting time)
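As a rough illustration of what the reservations in the schedule above mean for job submission, the sketch below checks whether a job could finish before a cluster's outage begins. It assumes that the scheduler only starts a job if its requested walltime ends before the reservation starts, which is the usual behavior but is not stated in this notice. The function name and the dictionary are hypothetical helpers, not CHPC tools; the dates and the 8am start time come from the schedule.

```python
from datetime import datetime, timedelta

# Outage start times from the schedule above (8am on each date, 2018).
OUTAGE_START = {
    "tangent": datetime(2018, 7, 17, 8, 0),
    "lonepeak": datetime(2018, 7, 17, 8, 0),
    "redwood": datetime(2018, 7, 17, 8, 0),
    "ember": datetime(2018, 7, 31, 8, 0),
    "ash": datetime(2018, 8, 21, 8, 0),
    "notchpeak": datetime(2018, 8, 21, 8, 0),
    "kingspeak": datetime(2018, 8, 28, 8, 0),
}


def can_finish_before_outage(cluster, start, walltime):
    """Hypothetical helper: True if a job starting at `start` with the
    requested `walltime` would end at or before the cluster's outage start.

    Assumption: the scheduler holds any job whose requested walltime
    would overlap the reservation.
    """
    return start + walltime <= OUTAGE_START[cluster]


if __name__ == "__main__":
    # A 24-hour job on ember started at 8am on July 30 ends exactly at the outage start.
    print(can_finish_before_outage("ember", datetime(2018, 7, 30, 8, 0), timedelta(hours=24)))
    # A 72-hour job started at the same time would overlap the outage.
    print(can_finish_before_outage("ember", datetime(2018, 7, 30, 8, 0), timedelta(hours=72)))
```

Under that assumption, requesting a shorter walltime in the days before an outage makes it more likely that a job can start before the reservation rather than wait until the cluster returns to service.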
We anticipate that on July 17th, tangent, lonepeak, and redwood compute nodes will be down most of the day. Work on the interactive nodes and resource managers will be completed first, and they will be returned to service as soon as that work is done. Based on the time the work takes on July 17th, we will provide more specific time windows for the remaining dates, all of which should be partial day outages.
Please note that if any issues due to these updates arise after the changes are made on tangent and lonepeak on July 17, we will address these issues before continuing on to the remaining clusters, potentially rescheduling the dates of the remaining changes. Therefore we ask users to take advantage of the time between July 17 and July 31 to test their applications on the updated lonepeak and tangent clusters.
If there are any questions or concerns, please let us know.