Hot Topics in High-Performance Computing: Networking and Fault Tolerance

Topics: Hot Topics in High Performance Parallel Computing: Networks and Fault Tolerance. Large-scale computer systems such as Petascale or upcoming Exascale machines pose significant challenges for system and software designers. In this course, we will address two very important topics in this design: HPC networking and fault tolerance. The network will soon be the most expensive and critical part of large machines, and fault tolerance is needed to ensure correct operation as the probability that individual components fail increases. This course requires basic knowledge of graph theory and system architecture. The section is open to undergraduate and graduate students, for 3 or 4 credits respectively.


1. Introduction to Parallel Computer Architecture (I) [Lecture 1 – (897.01 kb)]
2. Introduction to Parallel Computer Architecture (II) [Lecture 2 – (989.46 kb)]
3. A Network-centric View on HPC [Lecture 3 – (423.94 kb)]
4. HPC Networking Basics [Lecture 4 – (299.45 kb)]
5. Advanced Network Models (I) [Lecture 5 – (233.22 kb)]
6. Advanced Network Models (II) [Lecture 6 – (461.95 kb)]
7. Network Topology (I) [Lecture 7 – (631.89 kb)]
8. Network Topology (II) [Lecture 8 – (473.95 kb)]
9. Routing [Lecture 9 – (199.01 kb)]
10. Routing Examples, Flow Control, Blue Waters Topology [Lecture 10 – (2066.6 kb)]

