SC2000 TECHNICAL PAPERS


This year's technical program continues the tradition of a diverse set of strong technical papers. The Program Committee selected 62 papers from 179 submissions, including six finalists for Gordon Bell prizes. These papers were chosen based on the significance of their contributions to the field of high performance networking and computing. The papers will be presented in 21 sessions, covering a wide range of topics, including cluster computing, networking, computational grids, data grids, QoS/fault tolerance, and a myriad of applications. Also, don't miss the plenary awards session on Thursday, where you can find out the winners of all of the conference awards (including Best Paper and Best Student Paper) and several national awards.

Note: for all papers listed, the first author is the presenter, unless another author in the list has an asterisk after his/her name. There is also an Author Index available.

Sessions (chronological order):
MPI
NUMERICAL ALGORITHMS
SCHEDULING
MPI/OPENMP
POTPOURRI
CLUSTER INFRASTRUCTURE
QOS/FAULT TOLERANCE
BIOMEDICAL APPLICATIONS
APPLICATIONS I
VISUALIZATION
COMPILER OPTIMIZATION
APPLICATIONS II
NETWORKING
HARDWARE BASED TOOLS
GORDON BELL I
PARALLEL PROGRAMMING
SOFTWARE TOOLS
DATA GRID
GORDON BELL II
AWARDS SESSION
SCIENCE APPLICATIONS SUPPORT
GRID MIDDLEWARE
Questions: techpapers@sc2000.org
TECHNICAL PAPERS CHAIR
ROD OLDEHOEFT, LOS ALAMOS NATIONAL LABORATORY
CONFERENCE VICE CHAIR TECHNICAL PROGRAM
JIM McGRAW, LAWRENCE LIVERMORE NATIONAL LABORATORY
TECHNICAL PAPERS COMMITTEE


TUESDAY NOVEMBER 7
MPI

Chair: Harvey Wasserman, Los Alamos National Laboratory
Time: 10:30 AM - Noon
Room: D 271/273

The Implementation of MPI-2 One-Sided Communication for the NEC SX-5
Jesper Larsson Träff, Hubert Ritzdorf, Rolf Hempel, C&C Research Laboratories, NEC Europe Ltd.

We describe the MPI/SX implementation of the MPI-2 standard for one-sided communication (Remote Memory Access) for the NEC SX-5 vector supercomputer. MPI/SX is a non-threaded implementation of the full MPI-2 standard. Essential features of the implementation are presented, including the synchronization mechanisms, the handling of communication windows in global shared and in process local memory, as well as the handling of MPI derived data types. In comparative benchmarks the data transfer operations for one-sided communication and point-to-point message passing show very similar performance, both when data reside in global shared and when in process local memory. Derived data types, which are of particular importance for applications using one-sided communications, impose only a modest overhead and can be used without any significant loss of performance. Thus, the MPI/SX programmer can freely choose either the message passing or the one-sided communication model, whichever is most convenient for the given application.
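For readers unfamiliar with the one-sided model the paper implements, here is a minimal sketch using the standard MPI-2 calls (a generic illustration of the API, not the MPI/SX internals described above):

    #include <mpi.h>
    #include <stdio.h>

    /* Each rank exposes one int as a window; rank 0 writes directly
       into rank 1's memory with no matching receive on rank 1. */
    int main(int argc, char **argv)
    {
        int rank, buf = -1;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Win_create(&buf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open access epoch */
        if (rank == 0) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                 /* close epoch; transfer done */

        if (rank == 1)
            printf("rank 1 received %d\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }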

Single sided MPI implementations for SUN MPI
S. Booth, E. Mourao, EPCC, The University of Edinburgh

This paper describes an implementation of generic MPI-2 single-sided communications for SUN-MPI. Our implementation is layered on top of point-to-point MPI communications and therefore can be adapted to other MPI implementations.

The code is designed to co-exist with other MPI-2 single-sided implementations (for example, direct use of shared memory) providing a generic fall-back implementation for those communication paths where an optimized single-sided implementation is not available. MPI-2 single-sided communications require the transfer of data-type information as well as user data. We describe a type packing and caching mechanism used to optimize the transfer of data-type information.

The performance of this implementation is measured in comparison to equivalent point-to-point operations and the shared memory implementation provided by SUN.

Automatically Tuned Collective Communications
Sathish S. Vadhiyar, Graham E. Fagg, Jack Dongarra, University of Tennessee

The performance of the MPI's collective communications is critical in most MPI-based applications. A general algorithm for a given collective communication operation may not give good performance on all systems due to the differences in architectures, network parameters and the storage capacity of the underlying MPI implementation. In this paper we discuss an approach in which the collective communications are tuned for a given system by conducting a series of experiments on the system. We also discuss a dynamic topology method that uses the tuned static topology shape, but re-orders the logical addresses to compensate for changing run time variations. A series of experiments were conducted comparing our tuned collective communication operations to various native vendor MPI implementations. The use of the tuned collective communications resulted in about 30 percent to 650 percent improvement in performance over the native MPI implementations.
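To illustrate the tuning approach, a minimal sketch in the spirit of the paper: benchmark one tunable parameter of a collective (here, the segment size of a segmented broadcast) directly on the target system and keep the fastest setting. The candidate values and function names below are ours, not the authors':

    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)
    static char buf[N];

    /* Broadcast buf in segments of 'seg' bytes; return elapsed time. */
    static double timed_bcast(int seg, MPI_Comm comm)
    {
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        for (int off = 0; off < N; off += seg) {
            int len = (N - off < seg) ? N - off : seg;
            MPI_Bcast(buf + off, len, MPI_BYTE, 0, comm);
        }
        MPI_Barrier(comm);
        return MPI_Wtime() - t0;
    }

    int main(int argc, char **argv)
    {
        int rank, candidates[] = { 4096, 16384, 65536, N };
        int best = candidates[0];
        double best_t = 1e30;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 4; i++) {
            double t = timed_bcast(candidates[i], MPI_COMM_WORLD);
            if (t < best_t) { best_t = t; best = candidates[i]; }
        }
        if (rank == 0)
            printf("best segment size: %d bytes (%.6f s)\n", best, best_t);
        MPI_Finalize();
        return 0;
    }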

NUMERICAL ALGORITHMS

Chair: Ricky Kendall, Ames Laboratory
Time: 10:30 AM - Noon
Room: D 274

Landing CG on EARTH: A Case Study of Fine-Grained Multithreading on an Evolutionary Path
Kevin B. Theobald, Gagan Agrawal, Rishi Kumar, Gerd Heber, Cornell University, Guang R. Gao, University of Delaware, Paul Stodghill, Keshav Pingali, Cornell University

We report on our work in developing a fine-grained multithreaded solution for the communication-intensive Conjugate Gradient (CG) problem. In our recent work, we have developed a simple, yet very efficient, solution to executing matrix-vector multiply on a multithreaded system. This paper presents an effective mechanism for the reduction-broadcast phase, which is implemented and integrated with the sparse MVM resulting in a scalable implementation of the complete CG application.

Three major observations from our experiments on the EARTH multithreaded testbed are: (1) The scalability of our CG implementation is impressive, e.g., speedup is 90 on 120 processors for the NAS CG class B input. (2) Our dataflow-style reduction-broadcast network based on fine-grain multithreading is twice as fast as a serial reduction scheme on the same system. (3) When the network is slowed down by a factor of 2, no notable degradation of overall CG performance is observed.

Parallel Smoothed Aggregation Multigrid: Aggregation Strategies on Massively Parallel Machines
Ray S. Tuminaro, Sandia National Laboratories, Charles Tong, Lawrence Livermore National Laboratory

Algebraic multigrid methods offer the hope that multigrid convergence can be achieved (for at least some important applications) without a great deal of effort from engineers and scientists wishing to solve linear systems. In this paper we consider parallelization of the smoothed aggregation multigrid method. Smoothed aggregation is one of the most promising algebraic multigrid methods. Therefore, developing parallel variants with both good convergence and efficiency properties is of great importance. However, parallelization is nontrivial due to the somewhat sequential aggregation (or grid coarsening) phase. In this paper, we discuss three different parallel aggregation algorithms and illustrate the advantages and disadvantages of each variant in terms of parallelism and convergence. Numerical results will be shown on the Intel Teraflop computer for some large problems coming from nontrivial codes: quasi-static electric potential simulation and a fluid flow calculation.

Scalable Algorithms for Adaptive Statistical Designs
Robert Oehmke, Janis Hardwick, Quentin F. Stout, University of Michigan

We present a scalable, high-performance solution to multidimensional recurrences that arise in adaptive statistical designs. Adaptive designs are an important class of learning algorithms for a stochastic environment, and we focus on the problem of optimally assigning patients to treatments in clinical trials. While adaptive designs have significant ethical and cost advantages, they are rarely utilized because of the complexity of optimizing and analyzing them. Computational challenges include massive memory requirements, few calculations per memory access, and multiply-nested loops with dynamic indices. We analyze the effects of various parallelization options, and while standard approaches do not work well, with effort an efficient, highly scalable program can be developed. This allows us to solve problems thousands of times more complex than those solved previously, which helps make adaptive designs practical. Further, our work applies to many other problems involving neighbor recurrences, such as generalized string matching.

SCHEDULING

Chair: Dennis Gannon, Indiana University
Time: 1:30 - 3:00 PM
Room: D 267

Randomization, Speculation, and Adaptation in Batch Schedulers
Dejan Perkovic, Peter J. Keleher, University of Maryland

This paper proposes extensions to the backfilling job-scheduling algorithm that significantly improve its performance. We introduce variations that sort the "backfilling order" in priority-based and randomized fashions. We examine the effectiveness of guarantees present in conservative backfilling and find that initial guarantees have limited practical value, while the performance of a "no-guarantee" algorithm can be significantly better when combined with extensions that we introduce. Our study differs from many similar studies in using traces that contain user estimates. We find that actual overestimates are large and significantly different from simple models. We propose the use of speculative backfilling and speculative test runs to counteract these large overestimates. Finally, we explore the impact of dynamic, system-directed adaptation of application parallelism. The cumulative improvements of these techniques decrease the bounded slowdown, our primary metric, to less than 15 percent of conservative backfilling.
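For context, the admission test at the heart of conservative/EASY-style backfilling, which the paper's randomized and speculative variants reorder and relax, can be stated compactly. The sketch and its parameter names are illustrative, not taken from the paper:

    /* A waiting job may jump the queue only if it cannot delay the
       reservation made for the job at the head of the queue.
       shadow_time  - guaranteed start time of the queue-head job
       extra_procs  - processors still free at shadow_time after the
                      head job starts */
    int can_backfill(int job_procs, double job_estimate, double now,
                     int free_now, double shadow_time, int extra_procs)
    {
        if (job_procs > free_now)
            return 0;                    /* not enough idle processors now */
        if (now + job_estimate <= shadow_time)
            return 1;                    /* finishes before the reservation */
        return job_procs <= extra_procs; /* or never touches reserved nodes */
    }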

An Object-Oriented Job Execution Environment
Lance Smith, Rod Fatoohi*, San Jose State University

This is a project for developing a distributed job execution environment for highly iterative jobs. An iterative job is one where the same binary code is run hundreds of times with incremental changes in the input values for each run. An execution environment is a set of resources on a computing platform that can be made available to run the job and hold the output until it is collected. The goal is to design a complete, object-oriented scheduling system that will run a variety of jobs with minimal changes. Areas of code that are unique to one specific type of job are decoupled from the rest. The system allows for fine-grained job control, timely status notification and dynamic registration and deregistration of execution platforms depending on resources available. Several object-oriented technologies are employed: Java, CORBA, UML, and software design patterns. The environment has been tested using a CFD code, INS2D.

Towards an Integrated, Web-executable Parallel Programming Tool Environment
Insung Park, Nirav H. Kapadia, Renato J. Figueiredo, Rudolf Eigenmann*, José A. B. Fortes, Purdue University

We present a new parallel programming tool environment that is (1) accessible and executable ''anytime, anywhere,'' through standard Web browsers and (2) integrated in that it provides tools which adhere to a common underlying methodology for parallel programming and performance tuning. The environment is based on a new network computing infrastructure developed at Purdue University.

We evaluate our environment qualitatively by comparing our tool access method with conventional schemes of software download and installation. We also quantitatively evaluate the efficiency of interactive tool access in our environment. We do this by measuring the response times of various functions of the Ursa Minor tool and comparing them with those of a Java Applet-based "anytime, anywhere" tool access method. We found that our environment offers significant advantages in terms of tool accessibility, integration, and efficiency.

MPI/OPENMP

Chair: Jeff Hollingsworth, University of Maryland
Time: 1:30 - 3:00 PM
Room: D 271/273

Performance of Hybrid Message-Passing and Shared-Memory Parallelism for Discrete Element Modeling
D. S. Henty, Edinburgh Parallel Computing Centre

The current trend in HPC hardware is towards clusters of shared-memory (SMP) compute nodes. For application developers the major question is how best to program these SMP clusters. To address this we study an algorithm from Discrete Element Modeling, parallelized using both the message-passing and shared-memory models simultaneously ("hybrid'' parallelization). The natural load-balancing methods are different in the two parallel models, the shared-memory method being in principle more efficient for very load-imbalanced problems. It is therefore possible that hybrid parallelism will be beneficial on SMP clusters. We benchmark MPI and OpenMP implementations of the algorithm on MPP, SMP and cluster architectures, and evaluate the effectiveness of hybrid parallelism. Although we observe cases where OpenMP is more efficient than MPI on a single SMP node, we conclude that our current OpenMP implementation is not yet efficient enough for hybrid parallelism to outperform pure message-passing on an SMP cluster.
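As a concrete picture of the hybrid model under study, a minimal MPI+OpenMP skeleton: OpenMP threads share the work within each SMP node while MPI handles communication between nodes. This is a generic sketch, not the paper's Discrete Element code:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        double local = 0.0, global = 0.0;

        /* Request thread support so OpenMP regions can coexist with MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-memory parallelism inside the node ... */
        #pragma omp parallel for reduction(+:local)
        for (int i = 1; i <= 1000000; i++)
            local += 1.0 / (double)i;

        /* ... message passing between nodes (master thread only). */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global);
        MPI_Finalize();
        return 0;
    }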

A Comparison of Three Programming Models for Adaptive Applications on the Origin2000
Hongzhang Shan, Jaswinder P. Singh, Princeton University, Leonid Oliker, Lawrence Berkeley National Laboratory, Rupak Biswas, NASA Ames Research Center

Adaptive applications have computational workloads and communication patterns which change unpredictably at runtime, requiring load balancing to achieve scalable performance on parallel machines. Efficient parallel implementation of such adaptive applications is therefore a challenging task. In this paper, we compare the performance of and the programming effort required for two major classes of adaptive applications under three leading parallel programming models on an SGI Origin 2000 system, a machine which supports all three models efficiently. Results indicate that the three models deliver comparable performance. However, the implementations differ significantly beyond merely using explicit messages versus implicit loads/stores, even though the basic parallel algorithms are similar. Compared with the message-passing (using MPI) and SHMEM programming models, the cache-coherent shared address space (CC-SAS) model provides substantial ease of programming at both the conceptual and program orchestration levels, often accompanied by performance gains. However, CC-SAS currently has portability limitations and may suffer from poor spatial locality of physically distributed shared data on large numbers of processors.

MPI versus MPI+OpenMP on IBM SP for the NAS Benchmarks
Franck Cappello, Daniel Etiemble, LRI

The hybrid memory model of clusters of multiprocessors raises two issues: programming model and performance. Many parallel programs have been written using the MPI standard. To evaluate the pertinence of hybrid models for existing MPI codes, we compare a unified model (MPI) and a hybrid one (OpenMP fine-grain parallelization after profiling) for the NAS 2.3 benchmarks on two IBM SP systems. The superiority of one model depends on 1) the level of shared-memory model parallelization, 2) the communication patterns and 3) the memory access patterns. The relative speeds of the main architecture components (CPU, memory, and network) are of tremendous importance for selecting one model. With the hybrid model used here, our results show that a unified MPI approach is better for most of the benchmarks. The hybrid approach becomes better only when fast processors make the communication performance significant and the level of parallelization is sufficient.

POTPOURRI

Chair: Bob Lucas, NERSC
Time: 1:30 - 3:00 PM
Room: D 274

A Wrapper Generator for Wrapping High Performance Legacy Codes as Java/CORBA Components
M. Li, O. F. Rana, M. S. Shields, Cardiff University, UK, D. W. Walker, Oak Ridge National Laboratory

This paper describes a Wrapper Generator for wrapping high performance legacy codes as Java/CORBA components for use in a distributed component-based problem-solving environment. Using the Wrapper Generator we have automatically wrapped an MPI-based legacy code as a single CORBA object, and implemented a problem-solving environment for molecular dynamics simulations. Performance comparisons between runs of the CORBA object and the original legacy code on a cluster of workstations and on a parallel computer are also presented.

A Scalable SNMP-Based Distributed Monitoring System For Heterogeneous Network Computing
Rajesh Subramanyan, José Miguel-Alonso, José A. B. Fortes, Purdue University

Traditional centralized monitoring systems do not scale to present-day large, complex, network-computing systems. Based on recent SNMP standards for distributed management, this paper addresses the scalability problem through distribution of monitoring tasks, applicable to tools such as SIMONE (an SNMP-based monitoring prototype implemented by the authors).

Distribution is achieved by introducing one or more levels of a dual entity called the Intermediate Level Manager (ILM) between a manager and the agents. The ILM accepts monitoring tasks described in the form of scripts and delegated by the next higher entity. The solution is flexible and can be integrated into an SNMP tool without altering other system components.

A testbed of up to 1024 monitoring elements is used to assess scalability. Noticeable improvements in the round-trip delay (from seconds to less than one tenth of a second) were observed when more than 200 monitoring elements were present and as few as two ILMs were used.

ESP: A System Utilization Benchmark
Adrian T. Wong, Leonid Oliker, William T.C. Kramer, Teresa L. Kaltz, David H. Bailey*, Lawrence Berkeley National Laboratory

This article describes a new benchmark, called the Effective System Performance (ESP) test, which is designed to measure system-level performance, including such factors as job scheduling efficiency, handling of large jobs and shutdown-reboot times. In particular, this test can be used to study the effects of various scheduling policies and parameters. We present here some results that we have obtained so far on the Cray T3E and IBM SP systems, together with insights obtained from simulations.

CLUSTER INFRASTRUCTURE

Chair: Sam Uselton, Lawrence Livermore National Laboratory
Time: 3:30 - 5:00 PM
Room: D 271/273

PM2: A High Performance Communication Middleware for Heterogeneous Network Environments
Toshiyuki Takahashi, Shinji Sumimoto, Atsushi Hori, Hiroshi Harada, Yutaka Ishikawa, Real World Computing Partnership

This paper introduces a high performance communication middle layer, called PM2, for heterogeneous network environments. PM2 currently supports Myrinet, Ethernet, and SMP. Binary code written using PM2, or using a communication library such as MPICH-SCore on top of PM2, may run on any combination of those networks without recompilation. According to a set of NAS Parallel Benchmark results, MPICH-SCore performance is better than that of dedicated communication libraries such as MPICH-BIP/SMP and MPICH-GM when running some benchmark programs.

Performance and Interoperability Issues in Incorporating Cluster Management Systems Within a Wide-Area Network-Computing Environment
Sumalatha Adabala, Nirav H. Kapadia, José A. B. Fortes, Purdue University

This paper describes the performance and interoperability issues that arise in the process of integrating cluster management systems into a wide-area network-computing environment, and provides solutions in the context of the Purdue University Network Computing Hubs (PUNCH). The described solution provides users with a single point of access to resources spread across administrative domains, and an intelligent translation process makes it possible for users to submit jobs to different types of cluster management systems in a transparent manner. The approach does not require any modifications to the cluster management software. However, call-back and caching capabilities that would improve performance and make such systems more interoperable with wide-area computing systems are discussed.

Architectural and Performance Evaluation of GigaNet and Myrinet Interconnects on Clusters of Small-Scale SMP Servers
Jenwei Hsieh, Tau Leng, Victor Mashayekhi, Reza Rooholamini, Dell Computer Corporation

GigaNet and Myrinet are two of the leading interconnects for clusters of commodity computer systems. Both provide memory-protected user-level network interface access, and deliver low-latency and high-bandwidth communication to applications. GigaNet is a connection-oriented interconnect based on a hardware implementation of Virtual Interface (VI) Architecture and Asynchronous Transfer Mode (ATM) technologies. Myrinet is a connection-less interconnect which leverages packet switching technologies from experimental Massively Parallel Processors (MPP) networks. This paper investigates their architectural differences and evaluates their performance on two commodity clusters based on two generations of Symmetric Multiple Processors (SMP) servers. The performance measurements reported here suggest that the implementation of Message Passing Interface (MPI) significantly affects the cluster performance. Although MPICH-GM over Myrinet demonstrates lower latency with small messages, the polling-driven implementation of MPICH-GM often leads to tight synchronization between communication processes and higher CPU overhead.

QOS/FAULT TOLERANCE

Chair: Wu-chun Feng, Los Alamos National Laboratory
Time: 3:30 - 5:00 PM
Room: D 274

MPICH-GQ: Quality-of-Service for Message Passing Programs
Alain J. Roy, University of Chicago, Ian Foster, Argonne National Laboratory and University of Chicago, William Gropp, Argonne National Laboratory, Nicholas Karonis, Northern Illinois University, Volker Sander, Forschungszentrum Juelich GmbH, Brian Toonen, Argonne National Laboratory

Parallel programmers typically assume that all resources required for a program's execution are dedicated to that purpose. However, in local and wide area networks, contention for shared networks, CPUs, and I/O systems can result in significant variations in availability, with consequent adverse effects on overall performance. We describe a new message-passing architecture, MPICH-GQ, that uses quality of service (QoS) mechanisms to manage contention and hence improve performance of message passing interface (MPI) applications. MPICH-GQ combines new QoS specification, traffic shaping, QoS reservation, and QoS implementation techniques to deliver QoS capabilities to the high-bandwidth bursty flows, complex structures, and reliable protocols used in high-performance applications: characteristics very different from the low-bandwidth, constant bit-rate media flows and unreliable protocols for which QoS mechanisms were designed. Results obtained on a differentiated services testbed demonstrate our ability to maintain application performance in the face of heavy network contention.

Scalable Fault-Tolerant Distributed Shared Memory
Florin Sultan, Thu Nguyen, Liviu Iftode, Rutgers University

This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a home-based lazy release consistency (HLRC) DSM system with independent checkpointing and logging to volatile memory, targeting shared-memory computing on very large LAN-based clusters. In these environments, where global coordination may be expensive, independent checkpointing becomes critical to scalability. However, independent checkpointing is only practical if we can control the size of the log and checkpoints in the absence of global coordination. In this paper we describe the design of our fault-tolerant DSM system and present our solutions to the problems of checkpoint and log management. We also present experimental results showing that our fault tolerance support is light-weight, adding only low messaging, logging and checkpointing overheads, and that our management algorithms can be expected to effectively bound the size of the checkpoints and logs for real applications.

Realizing Fault Resilience in Web-Server Cluster
Chu-Sing Yang, Mon-Yen Luo, National Sun Yat-Sen University

Today, it is absolutely critical for a successful Internet service to be up 100 percent of the time. Server clustering is the most promising approach to meeting this requirement. However, existing Web server-clustering solutions can merely provide the high availability derived from their redundant nature, offering no guarantee of fault resilience for the service. In this paper, we address this problem by implementing an innovative mechanism that enables a Web request to be smoothly migrated and recovered on another working node in the presence of server failure. We will show that request migration and recovery can be achieved efficiently and transparently to the user. The achieved capability of fault resilience is important and essential for a variety of critical services (e.g., E-commerce) that are increasingly widely used. Our approach takes an important step toward providing a highly reliable Web service.

BIOMEDICAL APPLICATIONS

Chair: Padma Raghavan, University of Tennessee
Time: 3:30 - 5:00 PM
Room: D 267

Data Access Performance in a Large and Dynamic Pharmaceutical Drug Candidate Database
Zina Ben-Miled, Yang Liu, Indiana University-Purdue University Indianapolis, Dave Powers, Eli Lilly, Omran Bukhres, Indiana University-Purdue University Indianapolis, Michael Bem, Robert Jones, Robert Oppelt, Samuel Milosevich, Eli Lilly

An explosion in the amount of data generated through chemical and biological experimentation has been observed in recent years. This rapid proliferation of vast amounts of data has led to a set of cheminformatics and bioinformatics applications that manipulate dynamic, heterogeneous and massive data. An example of such applications in the pharmaceutical industry is the computational process involved in the early discovery of lead drug candidates for a given target disease. This computational process includes repeated sequential and random accesses to a drug candidate database.

Using the above pharmaceutical application, an experimental study was conducted which shows that for optimal performance, the degree of parallelism exploited in the application should be adjusted according to the drug candidate database instance size and the machine size. Additionally, different degrees of parallelism should be used depending on whether the access to the drug candidate database is random or sequential.

Real-Time Biomechanical Simulation of Volumetric Brain Deformation for Image Guided Neurosurgery
Simon K. Warfield, Harvard Medical School, Matthieu Ferrant, Xavier Gallez, Université Catholique de Louvain, Arya Nabavi, Ferenc A. Jolesz, Harvard Medical School, Ron Kikinis, Harvard Medical School

We aimed to study the performance of a parallel implementation of an intraoperative nonrigid registration algorithm that accurately simulates the biomechanical properties of the brain and its deformations during surgery. The algorithm was designed to allow for improved surgical navigation and quantitative monitoring of treatment progress in order to improve the surgical outcome and to reduce the time required in the operating room. We have applied the algorithm to two neurosurgery cases with promising results.

High performance computing is a key enabling technology that allows the biomechanical simulation to be executed quickly enough for the algorithm to be practical. Our parallel implementation was evaluated on a symmetric multi-processor and two clusters and exhibited similar performance characteristics on each. The implementation was sufficiently fast to be used in the operating room during a neurosurgery procedure. It allowed a three-dimensional volumetric deformation to be simulated in less than ten seconds.

Computer Simulations of Cardiac Electrophysiology
John B. Pormann, Craig S. Henriquez, John A. Board Jr., Donald J. Rose, David M. Harrild, Duke University, Alexandra P. Henriquez, North Carolina Supercomputing Center

CardioWave is a modular system for simulating wavefront conduction in the heart. These simulations may be used to investigate the factors that generate and sustain life-threatening arrhythmias such as ventricular fibrillation. The user selects a set of modules which most closely reflects the simulation they are interested in and the simulator is built automatically. Thus, we do not present one monolithic simulator, but rather a simulator-generator which allows the researcher to make the trade-offs of complexity versus performance. The results presented here are from simulations run on an IBM SP parallel computer and a cluster of workstations. The performance numbers show excellent scalability up through 128 processors. With the larger memory of the parallel machines, we have been able to perform highly realistic simulations of the human atria. These simulations include realistic, 3-D geometries with inhomogeneity and anisotropy as well as highly complex membrane dynamics.


WEDNESDAY NOVEMBER 8
APPLICATIONS I

Chair: Michael Berry, University of Tennessee
Time: 10:30 AM - Noon
Room: D 274

Parallel Algorithms for Radiation Transport on Unstructured Grids
Steve Plimpton, Bruce Hendrickson, Shawn Burns, Sandia National Labs, Will McLendon III, Texas A&M University

The method of discrete ordinates is commonly used to solve the Boltzmann radiation transport equation for applications ranging from simulations of fires to weapons effects. The equations are most efficiently solved by sweeping the radiation flux across the computational grid. For unstructured grids this poses several interesting challenges, particularly when implemented on distributed-memory parallel machines where the grid geometry is spread across processors. We describe an asynchronous, parallel, message-passing algorithm that performs sweeps simultaneously from many directions across unstructured grids. We identify key factors that limit the algorithm's parallel scalability and discuss two enhancements we have made to the basic algorithm: one to prioritize the work within a processor's subdomain and the other to better decompose the unstructured grid across processors. Performance results are given for the basic and enhanced algorithms implemented within a radiation solver running on hundreds of processors of Sandia's Intel Tflops machine and DEC-Alpha CPlant cluster.

A Parallel Dynamic-Mesh Lagrangian Method for Simulation of Flows with Dynamic Interfaces
James F. Antaki, University of Pittsburgh Medical Center, Guy E. Blelloch, Omar Ghattas, Ivan Malcevic*, Gary L. Miller, Noel J. Walkington, Carnegie Mellon University

Many important phenomena in science and engineering, including our motivating problem of microstructural blood flow, can be modeled as flows with dynamic interfaces. The major challenge faced in simulating such flows is resolving the interfacial motion. Lagrangian methods are ideally suited for such problems, since interfaces are naturally represented and propagated. However, the material description of motion results in dynamic meshes, which become hopelessly distorted unless they are regularly regenerated. Lagrangian methods are particularly challenging on parallel computers, because scalable dynamic mesh methods remain elusive. Here, we present a parallel dynamic mesh Lagrangian method for flows with dynamic interfaces. We take an aggressive approach to dynamic meshing by triangulating the propagating grid points at every timestep using a scalable parallel Delaunay algorithm. Contrary to conventional wisdom, we show that the costs of the dynamic mesh components (triangulation, coarsening, refinement, and partitioning) can be made small relative to the flow solver.

Self-Consistent Langevin Simulation of Coulomb Collisions in Charged-Particle Beams
Ji Qiang, Robert D. Ryne, Salman Habib, Los Alamos National Laboratory

In many plasma physics and charged-particle beam dynamics problems, Coulomb collisions are modeled by a Fokker-Planck equation. In order to incorporate these collisions, we present a three-dimensional parallel Langevin simulation method using a Particle-In-Cell (PIC) approach implemented on high-performance parallel computers. We perform, for the first time, a fully self-consistent simulation, in which the friction and diffusion coefficients are computed from first principles. We employ a two-dimensional domain decomposition approach within a message passing programming paradigm along with dynamic load balancing. Object oriented programming is used to encapsulate details of the communication syntax as well as to enhance reusability and extensibility. Performance tests on the SGI Origin 2000, IBM SP RS/6000 and the Cray T3E-900 have demonstrated good scalability. As a test example, we demonstrate the collisional relaxation to a final thermal equilibrium of a beam with an initially anisotropic velocity distribution.

VISUALIZATION

Chair: Patricia Crossno, Sandia National Laboratories
Time: 10:30 AM - Noon
Room: D 271/273

Using High-Speed WANs and Network Data Caches to Enable Remote and Distributed Visualization
Wes Bethel, Brian Tierney, Jason Lee, Dan Gunter, Stephen Lau, Lawrence Berkeley National Laboratory

Visapult is a prototype application and framework for remote visualization of large scientific datasets. We approach the technical challenges of tera-scale visualization with a unique architecture which employs high speed WANs and network data caches for data staging and transmission. This architecture allows for the use of available cache and compute resources at arbitrary locations on the network. High data throughput rates and network utilization are achieved by parallelizing I/O at each stage in the application, and by pipelining the visualization process. On the desktop, the graphics interactivity is effectively decoupled from the latency inherent in network applications. We present a detailed performance analysis of the application, and improvements resulting from field-test analysis conducted as part of the DOE Combustion Corridor project.

High Performance Visualization of Time-Varying Volume Data over a Wide-Area Network
Kwan-Liu Ma, David M. Camp, University of California, Davis

This paper presents an end-to-end, low-cost solution for visualizing time-varying volume data rendered on a parallel computer located at a remote site. Pipelining and careful grouping of processors are used to hide I/O time and to maximize processor utilization. Compression is used to significantly cut down the cost of transferring output images from the parallel computer to a display device through a wide-area network. This complete rendering pipeline makes possible highly efficient rendering and remote viewing of high-resolution time-varying data sets in the absence of high-speed network and parallel I/O support. To study the performance of this rendering pipeline and to demonstrate high-performance remote visualization, tests were conducted on a PC cluster in Japan as well as an SGI Origin 2000 operated at the NASA Ames Research Center with the display located at UC Davis.

Distributed Rendering for Scalable Displays
Greg Humphreys, Ian Buck, Matthew Eldridge, Pat Hanrahan, Stanford University

We describe a novel distributed graphics system that allows an application to render to a large tiled display. Our system, called WireGL, uses a cluster of off-the-shelf PCs connected with a high-speed network. WireGL allows an unmodified existing application to achieve scalable output resolution on such a display. This paper presents an efficient sorting algorithm which minimizes the network traffic for a scalable display. We will demonstrate that for most applications, our system provides scalable output resolution with minimal performance impact.

COMPILER OPTIMIZATION

Chair: Guang Gao, University of Delaware
Time: 10:30 AM - Noon
Room: D 267

Tiling Imperfectly-Nested Loop Nests
Nawaaz Ahmed, Nikolay Mateev, Keshav Pingali, Cornell University

Tiling is one of the more important transformations for enhancing locality of reference in programs. Intuitively, tiling a set of loops achieves the effect of interleaving iterations of these loops. Tiling of perfectly-nested loop nests (which are loop nests in which all assignment statements are contained in the innermost loop) is well understood. In practice, many loop nests are imperfectly nested, so existing compilers use heuristics to try to find a sequence of transformations that convert such loop nests into perfectly-nested ones, but these heuristics do not always succeed. In this paper, we propose a novel approach to tiling imperfectly-nested loop nests. The key idea is to embed the iteration space of every statement in the imperfectly-nested loop nest into a special space called the product space which is tiled to produce the final code. We evaluate the effectiveness of this approach for dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. No other single approach in the literature can tile all these codes automatically.
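To fix intuition, here is the textbook form of the transformation on a perfectly-nested loop; the paper's contribution is achieving the same effect when assignment statements are not all in the innermost loop:

    /* n x n arrays, row-major. */
    #define T 64                      /* tile size, tuned to cache capacity */

    void transpose_add(int n, double C[n][n],
                       const double A[n][n], const double B[n][n])
    {
        /* Untiled version: B is traversed column-wise for every row,
           so each outer iteration evicts the lines the next one needs:
           for (i...) for (j...) C[i][j] = A[i][j] + B[j][i];            */

        /* Tiled version: each T x T block of B is reused while it is
           still resident in cache. */
        for (int ii = 0; ii < n; ii += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int j = jj; j < jj + T && j < n; j++)
                        C[i][j] = A[i][j] + B[j][i];
    }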

Tiling Optimizations for 3D Scientific Computations
Gabriel Rivera, Chau-Wen Tseng, University of Maryland

Compiler transformations can significantly improve data locality for many scientific programs. In this paper, we show that iterative solvers for partial differential equations (PDEs) in three dimensions require new compiler optimizations not needed for 2D codes, since reuse along the third dimension cannot fit in cache for larger problem sizes. Tiling is a program transformation compilers can apply to capture this reuse, but successful application of tiling requires selection of non-conflicting tiles and/or padding array dimensions to eliminate conflicts. We present new algorithms and cost models for selecting tiling shapes and array pads. We explain why tiling is rarely needed for 2D PDE solvers, but can be helpful for 3D stencil codes. Experimental results show tiling 3D codes can reduce miss rates and achieve performance improvements of 17-121 percent for key scientific kernels, including a 27 percent average improvement for the key computational loop nest in the SPEC/NAS benchmark mgrid.

Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering
Y. Charlie Hu, Alan Cox, Willy Zwaenepoel, Rice University

We demonstrate that data reordering can substantially improve the performance of fine-grained irregular shared-memory benchmarks, on both hardware and software shared-memory systems. In particular, we evaluate two distinct data reordering techniques that seek to co-locate in memory objects in close proximity in the physical system modeled by the computation. The effects of these techniques are increased spatial locality and reduced false sharing.

We evaluate the effectiveness of the data reordering techniques on a set of five irregular applications from SPLASH-2 and Chaos. We implement both techniques in a small library, allowing us to enable them in an application by adding less than 10 lines of code. Our results on one hardware and two software shared-memory systems show that, with data reordering during initialization, the performance of these applications is improved by 12 percent to 99 percent on the Origin 2000, 30 percent to 366 percent on TreadMarks, and 14 percent to 269 percent on HLRC.
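A minimal version of the underlying idea: permute objects once at initialization so that spatial neighbors become memory neighbors. The 1-D sort key below is our simplification; the paper evaluates two specific reordering techniques:

    #include <stdlib.h>

    /* Objects carry a spatial coordinate; sorting the array by that
       coordinate co-locates physically nearby objects in memory,
       increasing spatial locality and reducing false sharing. */
    typedef struct { double x, y, z; double data[4]; } Particle;

    static int by_x(const void *a, const void *b)
    {
        double xa = ((const Particle *)a)->x;
        double xb = ((const Particle *)b)->x;
        return (xa > xb) - (xa < xb);
    }

    void reorder(Particle *p, size_t n)
    {
        qsort(p, n, sizeof(Particle), by_x);  /* once, at initialization */
    }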

APPLICATIONS II

Chair: David Walker, Oak Ridge National Laboratory
Time: 1:30 - 3:00 PM
Room: D 274

Performance Modeling and Tuning of an Unstructured Mesh CFD Application
William D. Gropp, Dinesh K. Kaushik*, Argonne National Laboratory, David E. Keyes, Old Dominion University, Barry F. Smith, Argonne National Laboratory

This paper describes performance tuning experiences with a three-dimensional unstructured grid Euler flow code from NASA, which we have reimplemented in the PETSc framework and ported to several large-scale machines, including the ASCI Red and Blue Pacific machines, the SGI Origin, the Cray T3E and Beowulf clusters. The code achieves a respectable level of performance for sparse problems, typical of scientific and engineering codes based on partial differential equations, and scales well up to thousands of processors. Since the gap between CPU speed and memory access rate is widening, the code is analyzed from a memory-centric perspective (in contrast to traditional flop-orientation) to understand its sequential and parallel performance. Performance tuning is approached on three fronts: data layouts to enhance locality of reference, algorithmic parameters and parallel programming model. This effort was guided partly by some simple performance models developed for the sparse matrix-vector product operation.

Parallel Phylogenetic Inference
Quinn Snell, Michael Whiting, Mark Clement, David McLaughlin, Brigham Young University

Recent advances in DNA sequencing technology have created large data sets upon which phylogenetic inference can be performed. However, current research is limited by the prohibitive time necessary to perform tree search on even a reasonably-sized data set. Some parallel algorithms have been developed but the biological research community does not use them because they do not trust the results from newly developed parallel software. This paper presents a new phylogenetic algorithm that allows existing, trusted phylogenetic software packages to be executed in parallel using the DOGMA parallel processing system. The results presented here indicate that data sets that currently take as much as 11 months to search using current algorithms, can be searched in as little as two hours using as few as eight processors. This reduction in the time necessary to complete a phylogenetic search allows new research questions to be explored in many of the biological sciences.

Parallel Unsteady Turbo-Pump Simulations For Liquid Rocket Engines
Cetin C. Kiris, ELORET/NASA-Ames, Dochan Kwak, NASA-Ames, William Chan, ELORET/NASA-Ames

This paper reports the progress being made towards a complete turbo-pump simulation capability for liquid rocket engines. The Space Shuttle Main Engine (SSME) turbo-pump impeller is used as a test case for the performance evaluation of the MPI, hybrid MPI/OpenMP, and MLP versions of the INS3D code. A computational model of a turbo-pump has then been developed for the shuttle upgrade program. Relative motion of the grid system for rotor-stator interaction was obtained by employing overset grid techniques. Unsteady computations for the SSME turbo-pump, which contains 101 zones with 31 million grid points, are carried out on Origin 2000 systems at NASA Ames Research Center. The approach taken for these simulations and the performance of the parallel versions of the code are presented.

NETWORKING

Chair: John Mellor-Crummey, Rice University
Time: 1:30 - 3:00 PM
Room: D 267

The Failure of TCP in High-Performance Computational Grids
W. Feng, Los Alamos National Laboratory, P. Tinnakornsrisuphap, University of Wisconsin

Distributed computational grids depend on TCP to ensure reliable end-to-end communication between nodes across the wide-area network (WAN). Unfortunately, TCP performance can be abysmal even when buffers on the end hosts are manually optimized. Recent studies blame the self-similar nature of aggregate network traffic for TCP's poor performance, because such traffic is not readily amenable to statistical multiplexing in the Internet, and hence in computational grids.

In this paper, we identify a source of self-similarity previously ignored, a source that is readily controllable: TCP itself. Via an experimental study, we examine the effects of the TCP stack on network traffic using different implementations of TCP. We show that even when aggregate application traffic ought to smooth out as more applications' traffic is multiplexed, TCP induces burstiness into the aggregate traffic load, thus adversely impacting network performance. Furthermore, our results indicate that TCP performance will worsen as WAN speeds continue to increase.

PSockets: The Case for Application-level Network Striping for Data Intensive Applications using High Speed Wide Area Networks
H. Sivakumar, University of Illinois at Chicago, S. Bailey, Infoblox Inc., R. L. Grossman, University of Illinois at Chicago

Transmission Control Protocol (TCP) is used by various applications to achieve reliable data transfer. TCP was originally designed for unreliable networks. With the emergence of high-speed wide area networks various improvements have been applied to TCP to reduce latency and achieve improved bandwidth. The improvement is achieved by having system administrators tune the network and can take a considerable amount of time. This paper introduces PSockets (Parallel Sockets), a library that achieves an equivalent performance without manual tuning. The basic idea behind PSockets is to exploit network striping. By network striping we mean striping partitioned data across several open sockets. We describe experimental studies using PSockets over the Abilene network. We show in particular that network striping using PSockets is effective for high performance data intensive computing applications using geographically distributed data.
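The striping idea in miniature: given n already-connected TCP sockets, send a buffer in fixed-size stripes, round-robin across the sockets, and let the receiver reassemble in the same order. This is a minimal sketch of the concept, not the PSockets API; connection setup and error recovery are omitted:

    #include <stddef.h>
    #include <unistd.h>

    int striped_send(int *socks, int n, const char *buf,
                     size_t len, size_t stripe)
    {
        size_t off = 0;
        int s = 0;
        while (off < len) {
            size_t chunk = (len - off < stripe) ? len - off : stripe;
            size_t sent = 0;
            while (sent < chunk) {          /* handle partial writes */
                ssize_t k = write(socks[s], buf + off + sent, chunk - sent);
                if (k < 0)
                    return -1;
                sent += (size_t)k;
            }
            off += chunk;
            s = (s + 1) % n;                /* next socket in the stripe set */
        }
        return 0;
    }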

Efficient Wire Formats for High Performance Computing
Fabian E. Bustamante, Greg Eisenhauer, Karsten Schwan, Patrick Widener, Georgia Institute of Technology

High performance computing is being increasingly utilized in non-traditional circumstances where it must interoperate with other applications. For example, online visualization is being used to monitor the progress of applications, and real-world sensors are used as inputs to simulations. Whenever these situations arise, there is a question of what communications infrastructure should be used to link the different components. Traditional HPC-style communications systems such as MPI offer relatively high performance, but are poorly suited for developing these less tightly-coupled cooperating applications. Object-based systems and meta-data formats like XML offer substantial plug-and-play flexibility, but with substantially lower performance. We observe that the flexibility and baseline performance of all these systems is strongly determined by their "wire format," or how they represent data for transmission in a heterogeneous environment. We examine the performance implications of different wire formats and present an alternative with significant advantages in terms of both performance and flexibility.

HARDWARE BASED TOOLS

Chair: Al Malony, University of Oregon
Time: 1:30 - 3:00 PM
Room: D 271/273

Using Hardware Performance Monitors to Isolate Memory Bottlenecks
Bryan R. Buck, Jeffrey K. Hollingsworth, The University of Maryland, College Park

In this paper, we present and evaluate two techniques that use different styles of hardware support to provide data structure specific processor cache information. In one approach, hardware performance counter overflow interrupts are used to sample cache misses. In the other, cache misses within regions of memory are counted to perform an n-way search for the areas in which the most misses are occurring. We present a simulation-based study and comparison of the two techniques. We find that both techniques can provide accurate information, and describe the relative advantages and disadvantages of each.

Hardware Prediction for Data Coherency of Scientific Codes on DSM
J.T. Acquaviva, CEA DAM/PRISM, French Atomic Energy Commission, W. Jalby, PRISM, Versailles University

This paper proposes a hardware mechanism for reducing coherency overhead occurring in scientific computations within DSM systems. A first phase detects, in the address space, regular patterns (called streams) of coherency events (such as requests for exclusive or shared access, or invalidations).

Once a stream is detected, the regularity of data access can be exploited within a loop (spatial locality) but also between loops (temporal locality). We present a hardware mechanism capable of detecting and exploiting these regular patterns efficiently.

Expected benefits as well as hardware complexity are discussed, and the limited drawbacks and potential overheads are examined.

For a benchmark suite of typical scientific applications, the results are very promising, both in terms of detected coherency streams and the effectiveness of our optimizations.

A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters
S. Browne, University of Tennessee, J. Dongarra, University of Tennessee and Oak Ridge National Laboratory, N. Garner, K. London, P. Mucci*, University of Tennessee

The purpose of the PAPI project is to specify a standard API for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count "events", which are occurrences of specific signals and states related to the processor's function. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. This correlation has a variety of uses in performance analysis and tuning. The PAPI project has proposed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware. The PAPI library has been or is in the process of being implemented on all major HPC platforms. The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data. PAPI support is also being incorporated into a number of third-party tools.
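Typical use of the PAPI C interface looks like the following; the two preset events chosen here are illustrative, and event availability varies by platform:

    #include <papi.h>
    #include <stdio.h>

    int main(void)
    {
        int events[2] = { PAPI_L1_DCM, PAPI_FP_OPS };  /* L1 D-misses, FP ops */
        long long counts[2];
        int eventset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_events(eventset, events, 2);

        PAPI_start(eventset);
        /* ... code region to be measured ... */
        PAPI_stop(eventset, counts);

        printf("L1 D-cache misses: %lld, FP ops: %lld\n",
               counts[0], counts[1]);
        return 0;
    }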

GORDON BELL I

Chair: Rusty Lusk, Argonne National Laboratory
Time: 3:30 - 5:00 PM
Room: D 271/273

A 1.349 Tflops Simulation of Black Holes in a Galactic Center on GRAPE-6
Junichiro Makino, Toshiyuki Fukushige, Masaki Koga, University of Tokyo

As an entry for the 2000 Gordon Bell performance prize, we report the performance achieved on a prototype GRAPE-6 system. GRAPE-6 is a special-purpose computer for astrophysical N-body calculations. The present configuration has 96 custom pipeline chips, each containing six pipeline processors for the calculation of gravitational interactions between particles. Its theoretical peak performance is 2.889 Tflops. The complete GRAPE-6 system will consist of 3072 pipeline chips and will achieve a peak speed of 100 Tflops. The actual performance obtained on the present 96-chip system was 1.349 Tflops, for a simulation of massive black holes embedded in the core of a galaxy with 786,432 stars. For a short benchmark run with 1,400,000 particles, the average speed was 1.640 Tflops.

98¢/Mflops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster
Douglas A. Aberdeen, Jonathan Baxter, Robert Edwards, Australian National University

Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks (ULSNN) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training ran with an average performance of 163.3 Gflops/s (single precision). With a machine cost of $150,913, this yields a price/performance ratio of 92.4¢/Mflops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 Gflops/s or $2.16/Mflops/s (double precision).
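The quoted price/performance figures follow directly from the machine cost: $150,913 / 163,300 Mflops/s ≈ $0.924 per Mflops/s, i.e. 92.4¢/Mflops/s (single precision), and $150,913 / 70,000 Mflops/s ≈ $2.16/Mflops/s (double precision).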

Scalable Molecular Dynamics for Large Biomolecular Systems
Robert K. Brunner, James C. Phillips, Laxmikant V. Kale, University of Illinois at Urbana-Champaign

We present an optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD. With an object-based hybrid force and spatial decomposition scheme, and an aggressive measurement-based predictive load-balancing framework, we have attained speeds and speedups that are much higher than any reported in literature so far.

The paper first summarizes the broad methodology we are pursuing, and the basic parallelization scheme we used. It then describes the optimizations that were instrumental in increasing performance, and presents performance results on benchmark simulations.

PARALLEL PROGRAMMING

Chair: Barbara Chapman, University of Houston
Time: 3:30 - 5:00 PM
Room: D 274

A Comparative Study of the NAS MG Benchmark across Parallel Languages and Architectures
Bradford L. Chamberlain, Steven J. Deitz, Lawrence Snyder, University of Washington

Hierarchical algorithms such as multigrid applications form an important cornerstone for scientific computing. In this study, we take a first step toward evaluating parallel language support for hierarchical applications by comparing implementations of the NAS MG benchmark in several parallel programming languages: Co-Array Fortran, High Performance Fortran, Single Assignment C, and ZPL. We evaluate each language in terms of its portability, its performance, and its ability to express the algorithm clearly and concisely. Experimental platforms include the Cray T3E, IBM SP, SGI Origin, Sun Enterprise 5500 and a high-performance Linux cluster. Our findings indicate that while it is possible to achieve good portability, performance, and expressiveness, most languages currently fall short in at least one of these areas. We find a strong correlation between expressiveness and a language's support for a global view of computation, and we identify key factors for achieving portable performance in multigrid applications.

Is Data Distribution Necessary in OpenMP?
Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, University of Patras, Constantine D. Polychronopoulos, University of Illinois at Urbana-Champaign, Jesus Labarta, Eduard Ayguade, Technical University of Catalunya

This paper investigates the performance implications of data placement in OpenMP programs running on modern ccNUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of state-of-the-art ccNUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution of pages, incur only modest performance losses. We also show that performance leaks stemming from suboptimal page placement schemes can be remedied with a smart user-level page migration engine. The main body of the paper describes how the OpenMP runtime environment can use page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results support the effectiveness of these mechanisms and provide a proof of concept that there is no need to introduce data distribution directives in OpenMP, thus preserving the portability of the programming model.

Extending OpenMP for NUMA Machines
John Bircsak, Peter Craig, RaeLyn Crowell, Zarka Cvetanovic, Jonathan Harris*, C. Alexander Nelson, Carl D. Offner, Compaq Computer

This paper describes extensions to OpenMP which implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP, designed for shared-memory architectures, does not by itself address these issues.

The extensions to OpenMP Fortran presented here are taken mainly from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations and concludes with some preliminary results.
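
As a point of comparison, unextended OpenMP can approximate data placement on first-touch NUMA systems only indirectly, by initializing data with the same parallel schedule that later computes on it. A minimal C sketch of that idiom follows (the directives proposed in the paper make such placement explicit instead):

    /* First-touch placement in plain OpenMP: pages are allocated on the
       node of the thread that first writes them, so initializing with
       the compute loop's schedule co-locates data and computation. */
    #include <omp.h>
    #include <stdlib.h>

    #define N 1000000

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        #pragma omp parallel for schedule(static)   /* first touch */
        for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; }

        #pragma omp parallel for schedule(static)   /* same schedule */
        for (int i = 0; i < N; i++)
            a[i] += 2.0 * b[i];

        free(a);
        free(b);
        return 0;
    }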


THURSDAY NOVEMBER 9
SOFTWARE TOOLS

Chair: Al Geist, Oak Ridge National Laboratory
Time: 10:30 AM - Noon
Room: D 274

A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software with Templates
Kathleen A. Lindlan, Janice Cuny, Allen D. Malony, Sameer Shende, CIS, University of Oregon, Bernd Mohr, ZAM, Forschungszentrum Juelich, Reid Rivenburgh, CRAG, Los Alamos National Laboratory, Craig Rasmussen, ACL, Los Alamos National Laboratory

The developers of high-performance scientific applications often work in complex computing environments that place heavy demands on program analysis tools. The developers need tools that interoperate, are portable across machine architectures, and provide source-level feedback. In this paper, we describe a tool framework, the Program Database Toolkit (PDT), that supports the development of program analysis tools meeting these requirements. PDT uses compile-time information to create a complete database of high-level program information that is structured for well-defined and uniform access by tools and applications. PDT's current applications make heavy use of advanced features of C++, in particular, templates. We describe the toolkit, focusing on its most important contribution, its handling of templates, as well as its use in existing applications.

From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems
C. Eric Wu, Anthony Bolmarcich, Marc Snir, IBM T.J. Watson Research Center, David Wootton, Farid Parpia, IBM RS6000 Division, Anthony Chan, Ewing Lusk, William Gropp, Argonne National Laboratory

In this paper we describe a trace analysis framework, from trace generation to visualization. It includes a unified tracing facility on IBM SP systems, a self-defining interval file format, an API for framework extensions, utilities for merging and statistics generation, and a visualization tool with preview and multiple time-space diagrams. The trace environment is highly scalable and combines MPI events with system activities in the same set of trace files, one per SMP node. Because the amount of trace data may be very large, we developed utilities to convert and merge individual trace files into a self-defining interval trace file with multiple frame directories. The interval format allows multiple time-space diagrams, such as thread-activity and processor-activity views, to be built from the same interval file. The Jumpshot visualization tool is modified to display these views, and a statistics utility, with its own graphics viewer, is built on the API.
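
The abstract does not spell out the interval record layout; purely as an illustration, a self-defining interval record might carry enough context to support several time-space views from one file:

    /* Illustrative (not the paper's actual) interval record: each entry
       identifies where and when an activity occurred, so thread-activity
       and processor-activity views can both be derived from it. */
    #include <stdint.h>

    typedef struct {
        uint64_t begin_usec;  /* interval start timestamp */
        uint64_t end_usec;    /* interval end timestamp   */
        uint32_t node;        /* SMP node that logged it  */
        uint32_t thread;      /* thread within the node   */
        uint32_t kind;        /* MPI event or system activity code */
        uint32_t frame;       /* frame directory index    */
    } interval_record;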

Dynamic Software Testing of MPI Applications with Umpire
Jeffrey S. Vetter, Bronis R. de Supinski, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

As evidenced by the popularity of MPI (Message Passing Interface), message passing is an effective programming technique for managing coarse-grained concurrency on distributed computers. Unfortunately, debugging message-passing applications can be difficult. Software complexity, data races, and scheduling dependencies can make programming errors challenging to locate with manual, interactive debugging techniques. This article describes Umpire, a new tool for detecting programming errors at runtime in message passing applications. Umpire monitors the MPI operations of an application by interposing itself between the application and the MPI runtime system using the MPI profiling layer. Umpire then checks the application's MPI behavior for specific errors. Our initial collection of programming errors includes deadlock detection, mismatched collective operations, and resource exhaustion. We present an evaluation on a variety of applications that demonstrates the effectiveness of this approach.
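
The MPI profiling (PMPI) layer that Umpire relies on is part of the MPI standard: every MPI_* entry point has a PMPI_* alias, so a tool can wrap calls without modifying the application or the MPI library. A minimal sketch of the interposition technique, using the modern const-qualified prototype (Umpire's actual checking logic is not shown):

    /* Intercept MPI_Send via the standard profiling interface, record
       the call, then forward to the real implementation via PMPI_Send. */
    #include <mpi.h>
    #include <stdio.h>

    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        fprintf(stderr, "MPI_Send: %d items to rank %d, tag %d\n",
                count, dest, tag);                /* record the event */
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }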

DATA GRID

Chair: Abdullah Almojel, King Fahd University
Time: 10:30 AM - Noon
Room: D 267

Computing and Data Grids for Science and Engineering
William E. Johnston, NASA Ames and Lawrence Berkeley National Laboratory, Dennis Gannon, NASA Ames and University of Indiana, Bill Nitzberg, Veridian Systems, PBS Products, Leigh Ann Tanner, Bill Thigpen, Alex Woo, NASA Ames

We use the term "Grid" to refer to a software system that provides uniform and location-independent access to geographically and organizationally dispersed, heterogeneous resources that are persistent and supported. While, in general, Grids will provide the infrastructure to support a wide range of services in the scientific environment (e.g., collaboration and remote instrument control), in this paper we focus on services for high performance computing and data handling. We describe the services and architecture of NASA's Information Power Grid ("IPG"), an early example of a large-scale Grid, and some of the issues that have come up in its implementation.

The MicroGrid: a Scientific Tool for Modeling Computational Grids
H. J. Song, X. Liu*, D. Jakobsen, R. Bhagwan, X. Zhang, University of California, San Diego, K. Taura, University of California, San Diego, and University of Tokyo, A. Chien, University of California, San Diego

The complexity and dynamic nature of the Internet (and the emerging Computational Grid) demand that middleware and applications adapt to the changes in configuration and availability of resources. However, to the best of our knowledge there are no simulation tools which support systematic exploration of dynamic Grid software (or Grid resource) behavior.

We describe our vision and initial efforts to build tools to meet these needs. Our MicroGrid simulation tools enable Globus applications to be run in arbitrary virtual grid resource environments, enabling broad experimentation. We describe the design of these tools, and their validation on micro-benchmarks, the NAS parallel benchmarks, and an entire Grid application. These validation experiments show that the MicroGrid can match actual experiments within a few percent (2 percent to 4 percent).

GORDON BELL II

Chair: Randy Bramley, Indiana University
Time: 10:30 AM - Noon
Room: D 271/273

1.34 Tflops Molecular Dynamics Simulation for NaCl with a Special-Purpose Computer: MDM
Tetsu Narumi, Ryutaro Susukita, Takahiro Koishi, RIKEN, Kenji Yasuoka, Keio University, Hideaki Furusawa, Atsushi Kawai, Toshikazu Ebisuzaki, RIKEN

We performed a molecular dynamics (MD) simulation of 9 million pairs of NaCl ions with the Ewald summation and obtained a calculation speed of 1.34 Tflops. In this calculation we used a special-purpose computer, MDM, which we are developing for calculating the Coulomb and van der Waals forces. The MDM enabled us to perform large-scale MD simulations without truncating the Coulomb force. It is composed of WINE-2, MDGRAPE-2, and a host computer. WINE-2 accelerates the calculation of the wavenumber-space part of the Coulomb force, while MDGRAPE-2 accelerates the calculation of the real-space part of the Coulomb and van der Waals forces. The host computer performs the remaining calculations. We performed the MD simulation with an early version of the MDM system: 45 Tflops of WINE-2 and 1 Tflops of MDGRAPE-2. The peak performance of the final MDM system will reach 75 Tflops in total by the end of the year 2000.
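
The division of labor between the two boards follows the standard Ewald decomposition of the periodic Coulomb energy into a short-ranged real-space sum, a smooth wavenumber-space sum, and a self-term; in conventional notation (the board assignment is as described in the abstract):

    E = \underbrace{\frac{1}{2}\sum_{i \neq j} q_i q_j
          \frac{\operatorname{erfc}(\alpha r_{ij})}{r_{ij}}}_{\text{real space (MDGRAPE-2)}}
      + \underbrace{\frac{2\pi}{V}\sum_{\mathbf{k} \neq 0}
          \frac{e^{-k^2/4\alpha^2}}{k^2}
          \Big|\sum_j q_j e^{i\mathbf{k}\cdot\mathbf{r}_j}\Big|^2}_{\text{wavenumber space (WINE-2)}}
      - \frac{\alpha}{\sqrt{\pi}} \sum_i q_i^2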

High-Cost CFD on a Low-Cost Cluster
Thomas Hauser, Timothy I. Mattox, Raymond P. LeBeau, Henry G. Dietz, P. George Huang, University of Kentucky

Direct numerical simulation of the Navier-Stokes equations (DNS) is an important technique for the future of computational fluid dynamics (CFD) in engineering applications. However, DNS requires massive computing resources. This paper presents a new approach for implementing high-cost DNS CFD using low-cost cluster hardware.

After describing the DNS CFD code DNSTool, the paper focuses on the techniques and tools that we have developed to customize the performance of a cluster implementation of this application. This tuning of system performance involves both recoding of the application and careful engineering of the cluster design. Using the cluster KLAT2 (Kentucky Linux Athlon Testbed 2), DNSTool cannot match the $0.64 per Mflops that KLAT2 achieves on single-precision ScaLAPACK, but it is still very efficient: DNSTool on KLAT2 achieves price/performance of $2.75 per Mflops in double precision and $1.86 per Mflops in single precision. Further, the code and tools are all, or will soon be, freely available as full source code.

High Performance Reactive Fluid Flow Simulations Using Adaptive Mesh Refinement on Thousands of Processors
A.C. Calder, University of Chicago, B.C. Curtis, Lawrence Livermore National Laboratory, L.J. Dursi*, B. Fryxell, University of Chicago, G. Henry, Intel Corporation, P. MacNeice, K. Olson, NASA Goddard, P. Ricker, R. Rosner, F.X. Timmes, University of Chicago, H.M. Tufo, University of Chicago/Argonne National Laboratory, J.W. Truran, M. Zingale, University of Chicago

We present simulations and performance results of nuclear burning fronts in supernovae on the largest domain and at the finest spatial resolution studied to date. These simulations were performed on the Intel ASCI-Red machine at Sandia National Laboratories using FLASH, a code developed at the Center for Astrophysical Thermonuclear Flashes at the University of Chicago. FLASH is a modular, adaptive mesh, parallel simulation code capable of handling compressible, reactive fluid flows in astrophysical environments. FLASH is written primarily in Fortran 90, uses the Message-Passing Interface library for inter-processor communication and portability, and employs the PARAMESH package to manage a block-structured adaptive mesh that places blocks only where resolution is required and tracks rapidly changing flow features, such as detonation fronts, with ease. We describe the key algorithms and their implementation as well as the optimizations required to achieve sustained performance of 238 GFLOPS on 6420 processors of ASCI-Red in 64 bit arithmetic.

AWARDS SESSION

Chair: Sally Haerer, Oregon State University
Time: 1:30 - 3:00 PM
Room: Ballroom C

A variety of achievements will be recognized and honored at the Awards Session during SC2000. The second annual IEEE Computer Society Seymour Cray Computer Engineering Award will be presented in recognition of innovative contributions to high performance computing systems that best exemplify Seymour Cray's creative spirit. The IEEE Computer Society Sidney Fernbach Memorial Award will be presented for an outstanding contribution in the application of high performance computers using innovative approaches.

In addition to these society awards, the SC2000 Conference will present several other awards. The Gordon Bell Awards were established to reward practical uses of parallel processing and will be given for the best performance improvements in an application within several categories relating to hardware and software advancements. A newly created award for Best Network Application will be presented, highlighting the most innovative, bandwidth-intensive application demonstration at SC2000. Awards will also be given for the best technical paper of the conference, the best student technical paper (with a student as principal author), the HPC Games winners, and the best Research Gem.

This session will feature a talk by Prof. Geoffrey M. Voelker, University of California, San Diego, a Computing Research Association Digital Fellow. His talk is titled "On the Scale and Performance of Cooperative Web Proxy Caching."

SCIENCE APPLICATIONS SUPPORT

Chair: Bernd Mohr, Research Centre Juelich
Time: 3:30 - 5:00 PM
Room: D 274

Integrating Parallel File I/O and Database Support for High-Performance Scientific Data Management
Jaechun No, Rajeev Thakur, Argonne National Laboratory, Alok Choudhary, Northwestern University

Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of these data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), which aims to combine the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. SDM takes advantage of various I/O optimizations available in MPI-IO, such as collective I/O and noncontiguous requests, in a manner that is transparent to the user. As a result, users can write and retrieve data with the performance of parallel file I/O, without having to bother with the details of actually performing file I/O.

In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.
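
SDM's own API is not detailed in the abstract; the sketch below only illustrates the kind of collective MPI-IO write that the abstract says SDM issues transparently on the user's behalf (the function name and data layout are our assumptions):

    /* Each rank writes its contiguous block of doubles; the _all
       (collective) variant lets the MPI-IO layer merge requests, one of
       the optimizations SDM is described as exploiting internally. */
    #include <mpi.h>

    void write_block(const char *path, const double *data, int n, MPI_Comm comm)
    {
        MPI_File fh;
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at_all(fh, offset, data, n, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }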

A Framework for Sparse Matrix Code Synthesis from High-level Specifications
Nawaaz Ahmed, Nikolay Mateev*, Keshav Pingali, Paul Stodghill, Cornell University

We present compiler technology for synthesizing sparse matrix code from (i) dense matrix code, and (ii) a description of the index structure of a sparse matrix. Our approach is to embed statement instances into a Cartesian product of statement iteration and data spaces, and to produce efficient sparse code by identifying common enumerations for multiple references to sparse matrices. The approach works for imperfectly-nested codes with dependences, and produces sparse code competitive with hand-written library code for the Basic Linear Algebra Subroutines (BLAS).
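
As a concrete illustration of the synthesis problem (this example is ours, not drawn from the paper), the dense matrix-vector product loop nest, specialized to a compressed-sparse-row (CSR) index structure, becomes the familiar enumeration of stored nonzeros:

    /* Dense specification:
           for (i = 0; i < n; i++)
               for (j = 0; j < n; j++)
                   y[i] += A[i][j] * x[j];
       Synthesized CSR version: enumerate only the nonzeros of row i. */
    void spmv_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                y[i] += val[k] * x[colind[k]];
        }
    }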

A Unified Algorithm for Load-balancing Adaptive Scientific Simulations
Kirk Schloegel, George Karypis, Vipin Kumar, Army HPC Research Center; University of Minnesota

Adaptive scientific simulations require that periodic repartitioning occur dynamically throughout the course of the computation. The repartitionings should be computed so as to minimize both the inter-processor communication incurred during the iterative mesh-based computation and the data redistribution costs required to balance the load. Recently developed schemes for computing repartitionings give the user only limited control of the tradeoffs among these objectives. This paper describes a new Unified Repartitioning Algorithm that can trade one objective off against the other, depending on a user-defined parameter describing their relative costs. We show that the Unified Repartitioning Algorithm reduces the overheads associated with repartitioning as well as or better than other repartitioning schemes for a variety of problems, regardless of the relative costs of performing inter-processor communication and data redistribution. Our experimental results show that this scheme is extremely fast and scales to large problems.
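
The abstract does not state the objective function explicitly; one plausible reading of the tradeoff, in our own notation, is a combined cost of the form

    \min \;\; |E_{\mathrm{cut}}| \; + \; \alpha \, |V_{\mathrm{move}}|

where |E_cut| measures the communication induced by the partition, |V_move| the data redistribution volume, and alpha is the user-defined relative-cost parameter.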

GRID MIDDLEWARE

Chair: Wolfgang Gentzsch, Sun Microsystems
Time: 3:30 - 5:00 PM
Room: D 271/273

The AppLeS Parameter Sweep Template: User-Level Middleware for the Grid
Henri Casanova, Graziano Obertelli, Francine Berman, University of California, San Diego, Rich Wolski, University of Tennessee, Knoxville

The Computational Grid is a promising platform for the efficient execution of parameter sweep applications over large parameter spaces. To achieve performance on the Grid, such applications must be scheduled so that shared data files are strategically placed to maximize re-use, and so that the application execution can adapt to the deliverable performance potential of target heterogeneous, distributed, and shared resources. Parameter sweep applications are an important class of computations and would greatly benefit from Grid middleware that embeds a scheduler for performance and targets Grid resources transparently.

In this paper we describe a user-level Grid middleware project, the AppLeS Parameter Sweep Template (APST), that uses application-level scheduling techniques and various Grid technologies to allow the efficient deployment of parameter sweep applications over the Grid. We discuss several possible scheduling algorithms and detail our software design. We then describe our current implementation of APST using systems like Globus, NetSolve and the Network Weather Service, and present experimental results.

Requirements for and Evaluation of RMI Protocols for Scientific Computing
Madhusudhan Govindaraju, Aleksander Slominski, Venkatesh Choppella, Randall Bramley, Dennis Gannon, Indiana University

Distributed software component architectures provide a promising approach to the problem of building large-scale scientific Grid applications. Communication in these component architectures is based on Remote Method Invocation (RMI) protocols that allow one software component to invoke the functionality of another. Examples include Java remote method invocation (Java RMI) and the new Simple Object Access Protocol (SOAP). SOAP has the advantage that many programming languages and component frameworks can support it. This paper describes experiments showing that SOAP by itself is not efficient enough for large-scale scientific applications. However, when it is embedded in a multi-protocol RMI framework, SOAP can be used effectively as a universal control protocol that can be swapped out for faster, more specialized protocols when high data-transfer speeds are needed.
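
The multi-protocol idea can be pictured as per-invocation protocol selection. In the hypothetical C sketch below, the protocol names, stubs, and size threshold are all illustrative, not the paper's framework:

    /* Keep SOAP as the universal control channel; switch to a faster
       specialized protocol when the payload is large. */
    #include <stddef.h>

    typedef int (*send_fn)(const void *buf, size_t len);

    static int soap_send(const void *buf, size_t len)   /* portable  */
    { (void)buf; (void)len; return 0; }
    static int binary_send(const void *buf, size_t len) /* fast path */
    { (void)buf; (void)len; return 0; }

    /* Choose a wire protocol per invocation based on payload size. */
    static send_fn select_protocol(size_t payload_bytes)
    {
        const size_t THRESHOLD = 64 * 1024;  /* illustrative cutoff */
        return payload_bytes < THRESHOLD ? soap_send : binary_send;
    }

    int rmi_invoke(const void *args, size_t len)
    {
        return select_protocol(len)(args, len);
    }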

Expressing and Enforcing Distributed Resource Sharing Agreements
Tao Zhao, Vijay Karamcheti, New York University

Advances in computing and networking technology, together with an explosion in information sources, have resulted in a growing number of distributed systems being constructed out of resources contributed by multiple sources. Use of such resources is typically governed by sharing agreements between owning principals, which limit both who can access a resource and in what quantity. Despite their increasing importance, existing resource management infrastructures offer only limited support for the expression and enforcement of sharing agreements, typically restricting themselves to identifying compatible resources. In this paper, we present a novel approach building on the concepts of tickets and currencies to express resource sharing agreements in an abstract, dynamic, and uniform fashion. We also formulate the allocation problem of enforcing these agreements as a linear-programming model, automatically factoring in the transitive availability of resources via chained agreements. A case study modeling resource sharing among ISP-level web proxies shows the benefits of enforcing transitive agreements: worst-case waiting times of clients accessing these proxies improve by up to two orders of magnitude.
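
The abstract leaves the linear program unspecified; a generic form of such an allocation model, in our own notation, might be

    \begin{aligned}
    \max \;\; & \textstyle\sum_{i,j} w_j \, x_{ij} \\
    \text{s.t.} \;\; & \textstyle\sum_j x_{ij} \le c_i
        && \text{(capacity of resource owner } i\text{)} \\
    & x_{ij} \le a_{ij}
        && \text{(sharing agreement between } i \text{ and } j\text{)} \\
    & x_{ij} \ge 0
    \end{aligned}

where x_{ij} is the quantity of owner i's resource allocated to principal j, w_j weights principal j's demand, and chained agreements add further coupling constraints that make transitive availability explicit.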