12572/62577 Seminar Distributed Learning Systems
Today's computing systems (e.g., deep learning systems) are becoming ever more complex due to the rapid development of hardware and software technology. It is challenging to design and run computing systems that meet users' performance requirements in a resource-efficient way. Various quantitative methods are applied to capture the dynamics of such complex systems and to predict metrics of interest, e.g., job response times and system anomalies, from the design phase through to runtime. To optimize the performance of computing systems, a deep understanding of those methods and their application to system design is essential. Practical hands-on experience in designing experiments, deriving models, and validating results with benchmark systems will prepare students to tackle the challenges of real-world systems.
Course topics include:
Design of experiments and statistical tests.
Operational laws and queueing methods for modelling computing systems.
Scheduling and load balancing.
Machine learning methods for modelling computing systems.
System dependability and scalability analysis.
Optimization and resource management.
Offline