Dell Technologies HPC Community Event
June 9th at 10:00am CDT
Quieting Noisy Neighbors: Addressing Network Congestion in Multi-Workload Environments
Presented by Paul Calleja, Director of Research Computing, University of Cambridge; Jeff Kirk, Storage Networking Engineer, Dell Technologies; John Lockman, AI Systems Engineer, Dell Technologies; John McCalpin, Research Scientist, TACC; and Matthew Williams, CTO, Rockport Networks
About the Event
HPC clusters deliver performance based on the computational capabilities of the server compute nodes, but achieving scalable performance depends critically on the HPC interconnect between these nodes (and on a software stack that realizes the performance potential of both). For most parallel applications, HPC interconnects need very low latency and high bandwidth to enable performance to scale, but there is more to it than latency and bandwidth per node. Switches, protocols (and their implementations), and topologies in particular play important roles in enabling a single application to scale, and in enabling multiple concurrent applications to scale and perform. Congestion becomes important both for individual applications at scale and, especially, for many concurrent applications on the same system, all demanding interconnect resources for data from other nodes, to and from I/O, and so on. Congestion behaves quite differently on a lightly loaded fabric than on a more fully utilized one. An analogous situation is a freeway at rush hour: as traffic gets denser, overall speeds often drop, and a wreck during rush hour can have far-reaching consequences, including bringing traffic to a crawl. Implementation details around buffers in switches can also make congestion better or worse.
Join us for a panel conversation about the issues that create congestion and rob parallel applications of their performance, and learn about practices and technologies that can help mitigate the effects of congestion and achieve more of the potential performance of HPC applications and workloads.
About the Speakers
Paul Calleja, Director of Research Computing, University of Cambridge
Dr. Calleja is Director of Research Computing at the University of Cambridge where he oversees one of the UK’s leading large-scale National HPC centers supporting a diverse community of UK frontier science, engineering, and medical research programs. Dr. Calleja has a strong academic/industrial HPC co-design background focusing on commodity open standards-based solutions. Recently he has pioneered the convergence of OpenStack and Research Computing use-cases, working with industry partners to develop the “Scientific OpenStack”, a software-defined supercomputing middleware solution making large scale cloud-native supercomputing a reality.
Dr. Calleja also heads up the Cambridge Open Exascale Lab, a prominent UK academic/industrial collaboration aimed at the development and democratization of exascale computing solutions. Dr. Calleja obtained his Ph.D. in computational biophysics at Bath University. After filling a post-doctoral research position at Birkbeck College, he moved into private industry, where he spearheaded early commercialization of HPC cluster solutions before moving back into academia, heading up HPC provision at Imperial College and then Cambridge.
Jeff Kirk, Storage Networking Engineer, Dell Technologies
Jeff has spent his career working at the leading edge of computing and networking. He currently works in the Dell Technologies ISG CTIO Office as an HPC and AI Technology Strategist, where he has helped grow the HPC program with a new vision, strategy, business development efforts, and new partnerships and solutions. Prior to joining Dell EMC, Jeff worked at several cutting-edge semiconductor companies. At AMD he specialized in superscalar RISC and x86 platforms for high performance computation (1999). At Mellanox he worked on some of the first InfiniBand HPC installations, including the Virginia Tech cluster (Big Mac), built from Apple workstations, that reached number three on the TOP500 list (Nov 2004). While at Mellanox he supported Dr. D.K. Panda and the first implementation of MVAPICH at his alma mater, The Ohio State University. Later, at Solarflare, his focus was OS bypass technology and financial markets (2010).
After moving to Dell, Jeff worked in Dell Networking implementing the company's first Fibre Channel over Ethernet (FCoE) systems, and he holds several patents on FCoE (2013). His interest in low-latency, high-performance fabrics has grown as the size of HPC systems has increased. Since transferring to the Server CTO office, he has been thinking hard about how to improve HPC networking and lower latency with innovative new architectures. He believes the future of HPC is better than it has ever been.
John Lockman, AI Systems Engineer, Dell Technologies
A programmer, developer, and evangelist for containerization and orchestration with Kubernetes, John Lockman works in the Dell EMC HPC and AI Innovation Lab. He specializes in nature-inspired programming, deep learning, and artificial intelligence. John brings a passion for building tools to make advanced computing accessible to a larger audience.
John McCalpin, Research Scientist, TACC
John joined TACC in 2009 as a Research Scientist in the High-Performance Computing Group after a twelve-year career in performance analysis and system architecture in the computer industry. His industrial experience includes 3 years at SGI (performance analysis and optimization on the Origin2000 and performance lead on the architecture team for the Altix3000), 6 years at IBM (performance analysis for HPC, processor and system design for Power4/4+ and Power5/5+), and 3 years at AMD (accelerated computing technologies and performance analysis). Prior to his industrial career, John was an oceanographer (Ph.D., Florida State), spending six years as an assistant professor at the University of Delaware engaged in research and teaching on numerical simulation of the large-scale circulation of the oceans.
Matthew Williams, CTO, Rockport Networks
Matthew Williams is CTO of Rockport Networks. He has 25 years of technical leadership and engineering experience and 14 years as CTO of successful network technology companies, and he holds 21 issued US patents. He is an expert strategist, analyst, and visionary who has delivered on transformational product concepts. Matthew is an insightful and energetic communicator who enjoys product evangelization and inspiring global business and technical audiences.
Matthew has a B.Sc. in Electrical Engineering with First Class Honours from Queen’s University, Kingston, Canada and is a registered P.Eng.