5.3. Hardware Issues for Scalability

This section discusses key hardware aspects that affect DMP performance.

5.3.1. Multicore Processors

Though multicore processors have extended parallel processing to virtually all computer platforms, some multicore processors have insufficient memory bandwidth to support all of the cores functioning at peak processing speed simultaneously. This can cause the performance efficiency of the Mechanical APDL program to degrade significantly when all of the cores within a processor are used during solution. For some processors, we recommend using at most half of the cores available in the processor. For example, on a cluster with two quad-core processors per node, we might recommend using at most four cores per node.

Given the wide range of CPU products available, the number of cores that yields peak efficiency varies widely and is difficult to predict. However, it is important to understand that having more cores or faster clock speeds does not always translate into proportionate speedups. Memory bandwidth is an important hardware issue for multicore processors and has a significant effect on the scalability of the program. Finally, some processors run at faster clock speeds when only one core is used than when all cores are used, which can further hurt scalability as more cores are used.
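
To see where memory bandwidth becomes the limiting factor on a particular machine, a simple test is to time a memory-bound operation at several core counts and watch where the aggregate rate stops growing. The Python sketch below is only an illustration (it is not part of Mechanical APDL, and the array size and core counts are arbitrary assumptions); on bandwidth-limited processors, the reported aggregate rate typically flattens well before all cores are in use.

    # memory_bandwidth_scaling.py -- rough sketch; not an official benchmark.
    # Times a memory-bound vector addition at several core counts to show
    # where aggregate throughput stops scaling (memory bus saturation).
    import time
    from multiprocessing import Pool

    import numpy as np

    N = 20_000_000  # doubles per worker (~0.5 GB per worker); chosen to exceed cache sizes

    def vector_add(_):
        b = np.random.rand(N)
        c = np.random.rand(N)
        out = np.empty(N)
        t0 = time.perf_counter()
        np.add(b, c, out=out)                            # memory bound: reads b and c, writes out
        return 3 * N * 8 / (time.perf_counter() - t0)    # approximate bytes moved per second

    if __name__ == "__main__":
        for cores in (1, 2, 4, 8):                       # assumed core counts; edit for your CPU
            with Pool(cores) as pool:
                rates = pool.map(vector_add, range(cores))
            print(f"{cores} cores: ~{sum(rates) / 1e9:.1f} GB/s aggregate")

If the aggregate rate at eight cores is little better than at four, the extra cores are unlikely to help a memory-bound solution on that machine.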

5.3.2. Interconnects

One of the most important factors in achieving good scalability when running across multiple machines is a good interconnect. Various forms of interconnects exist to connect the nodes of a cluster, and each type passes data at a different speed in terms of latency and bandwidth. See Hardware Terms and Definitions for more information on this terminology. A good interconnect is one that transfers data between cores on different machines as quickly as data moves between cores within a single machine.

The interconnect is essentially the path for one machine to access memory on another machine. When a processing core performs computations, it must access the data for those computations from some form of memory. That data can come either from the local RAM on that machine or across the interconnect from another node in the cluster. When the interconnect is slow, DMP performance is degraded because the core must wait for the data; this waiting time increases the overall solution time since the core cannot continue computing until the data arrives. The more nodes used in a cluster, the more the speed of the interconnect matters. For example, a run using a total of eight cores on two nodes (that is, four cores on each of two machines) is less sensitive to interconnect performance than a run using eight nodes with a single core on each node.
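
A rough way to reason about this is to model each message as latency plus message size divided by bandwidth and to count how many process pairs must communicate across the interconnect. The sketch below is purely illustrative; the message size, latency, and bandwidth figures are assumed values, and real DMP communication patterns are more complex than simple pairwise exchanges.

    # interconnect_model.py -- back-of-the-envelope sketch; all figures are assumptions.
    # Counts how much pairwise communication crosses the interconnect for different
    # ways of spreading eight processes across cluster nodes.
    from itertools import combinations

    MSG_BYTES  = 8 * 1024 * 1024   # assumed data exchanged per process pair
    LATENCY_S  = 2e-6              # assumed interconnect latency (seconds)
    BW_BYTES_S = 1.0e9             # assumed interconnect bandwidth (~1000 MB/s)

    def interconnect_cost(nodes, cores_per_node):
        ranks = [(n, c) for n in range(nodes) for c in range(cores_per_node)]
        inter = sum(1 for a, b in combinations(ranks, 2) if a[0] != b[0])
        # Only pairs on different nodes pay the interconnect cost in this simple model.
        return inter, inter * (LATENCY_S + MSG_BYTES / BW_BYTES_S)

    for nodes, cores in ((2, 4), (4, 2), (8, 1)):  # eight cores total in each layout
        pairs, seconds = interconnect_cost(nodes, cores)
        print(f"{nodes} nodes x {cores} cores: {pairs} inter-node pairs, "
              f"~{seconds * 1000:.0f} ms spent on the interconnect")

In this simple model, the eight-node layout sends all 28 process pairs across the interconnect, compared with 16 of 28 for the two-node layout, which is why interconnect quality matters more as the node count grows.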

Typically, a DMP solution achieves its best scalability when the communication bandwidth is above 1000 MB/s. This interconnect bandwidth value is printed near the end of the DMP output file and can be used to compare the interconnects of various hardware systems.
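
If you want to check this value across a set of runs, the reported bandwidth can be scanned out of the output file with a short script. The sketch below is a hedged example only: the exact wording and units of the bandwidth line vary by release, so the regular expression is an assumption that may need adjusting for your output files.

    # check_interconnect_bw.py -- hedged sketch; usage: python check_interconnect_bw.py job.out
    # The exact wording of the communication-speed line varies by release, so the
    # regular expression below is an assumption; adjust it to match your output file.
    import re
    import sys

    THRESHOLD_MB_S = 1000.0   # guideline bandwidth discussed above

    pattern = re.compile(r"communication speed.*?([\d.]+)\s*MB", re.IGNORECASE)

    with open(sys.argv[1], errors="replace") as f:
        rates = [float(m.group(1)) for line in f if (m := pattern.search(line))]

    if not rates:
        print("No communication-speed lines matched; check the pattern.")
    else:
        worst = min(rates)
        verdict = "OK" if worst >= THRESHOLD_MB_S else "below the 1000 MB/s guideline"
        print(f"Slowest reported interconnect bandwidth: {worst:.0f} MB/s ({verdict})")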

5.3.3. I/O Configurations

5.3.3.1. Single Machine

The I/O system used on a single machine (workstation, server, laptop, etc.) can be very important to the overall scalability of the Mechanical APDL program. When running a job using parallel processing, the I/O system can be a sequential bottleneck that drags down the overall performance of the system.

Certain jobs perform more I/O than others. The sparse solver (both the shared- and distributed-memory versions) running in the out-of-core memory mode, along with the Block Lanczos eigensolver, commonly performs the most I/O in the program. Also, the distributed-memory version of the program can issue more I/O requests than the shared-memory version because each distributed process writes its own set of files. For jobs that perform a large amount of I/O, a slow file system hurts scalability because the elapsed time spent doing I/O does not decrease as more CPU cores are used; if this I/O time is a significant portion of the overall runtime, scalability suffers significantly.

Users should be aware of this potential bottleneck when running their jobs. For jobs that perform a large amount of I/O, a RAID0 array consisting of multiple hard drives is highly recommended. Solid state drives (SSDs) are also worth considering when trying to minimize I/O bottlenecks: while more expensive than conventional hard drives, they have dramatically lower seek times and can deliver impressive I/O transfer speeds when used in a RAID0 configuration.
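
When it is unclear whether the file system is the bottleneck, a quick sequential-write test in the intended scratch directory gives a rough baseline for comparing, say, a single hard drive against a RAID0 or SSD volume. The sketch below is not a rigorous benchmark; the 2 GB file size is an arbitrary assumption, and operating system write caching can inflate the result.

    # scratch_write_check.py -- crude sequential-write check; not a rigorous benchmark.
    # Usage: python scratch_write_check.py /path/to/scratch
    import os
    import sys
    import time

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    path   = os.path.join(target, "io_probe.tmp")
    chunk  = b"\0" * (8 * 1024 * 1024)        # 8 MB blocks
    total  = 2 * 1024 * 1024 * 1024           # 2 GB total (assumed size; adjust freely)

    t0 = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total // len(chunk)):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())                  # force data to disk before stopping the clock
    elapsed = time.perf_counter() - t0
    os.remove(path)

    print(f"Sequential write to {target}: ~{total / elapsed / 1e6:.0f} MB/s")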

5.3.3.2. Clusters

There are various ways to configure I/O systems on clusters. However, these configurations can generally be grouped into two categories: shared disk resources and independent disk resources. Each has its advantages and disadvantages, and each has an important effect on DMP scalability.

For clusters, the administrator(s) of the cluster will often set up a shared disk resource (SAN, NAS, etc.) where each node of the cluster can access the same disk storage location. This location may contain all of the necessary files for running a DMP solution across the cluster. While this is convenient for users of the cluster and for applications whose I/O configuration is set up to work in such an environment, it can severely limit DMP scalability, particularly when using the distributed sparse solver. The reason is twofold. First, this type of I/O configuration often uses the same interconnect to transfer I/O data to the shared disk resource as is used by the DMP analysis to transfer computational data between machines. This places extra demands on the interconnect, since a DMP solution often transfers a lot of data across it and often requires a large amount of I/O. Second, each distributed process creates and accesses its own set of files: each process writes its own .esav file, .full file, .mode file, results file, solver files, and so on. When the distributed sparse solver runs in the out-of-core memory mode, this can amount to a huge volume of I/O and can create a bottleneck on the interconnect used by the shared disk resource, thus limiting DMP scalability.

Alternatively, clusters might employ local hard drives on each node; in other words, each node has its own independent disk(s) shared by all of the cores within the node. Typically, this configuration is ideal assuming that (1) a limited number of cores are accessing the disk(s) or (2) multiple local disks are used in a RAID0 configuration. For example, if eight cores are used on a single node, then there are eight processes all trying to write their own set of I/O data to the same hard drive. The time to create and access this I/O data can be a big bottleneck to DMP scalability. When using local disks on each node of a cluster, or when using a single box server, you can improve performance by either limiting the number of cores used per machine or investing in an improved configuration consisting of multiple disks and a good RAID0 controller.
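
A back-of-the-envelope estimate can help decide how many cores per node a given local disk can reasonably feed. In the sketch below, every figure (file traffic per process and sustained disk rates) is an assumption chosen only to illustrate the arithmetic; substitute numbers for your own hardware and model size.

    # local_disk_budget.py -- rough estimate; every figure below is an assumption.
    # Approximates how long N solver processes sharing one local disk spend on I/O.
    FILE_GB_PER_PROCESS = 20.0    # assumed out-of-core file traffic per process
    SINGLE_DISK_MB_S    = 150.0   # assumed sustained rate of one local hard drive
    RAID0_MB_S          = 600.0   # assumed sustained rate of a multi-disk RAID0 array

    def io_minutes(processes, disk_mb_s):
        total_mb = processes * FILE_GB_PER_PROCESS * 1024
        return total_mb / disk_mb_s / 60.0    # all processes share the same device

    for cores in (2, 4, 8):
        print(f"{cores} cores/node: single disk ~{io_minutes(cores, SINGLE_DISK_MB_S):.0f} min, "
              f"RAID0 ~{io_minutes(cores, RAID0_MB_S):.0f} min of pure I/O time")

Under these assumed numbers, eight processes on a single drive spend four times longer on I/O than they would on the RAID0 array, and that time cannot be hidden by adding more cores.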

It is important to note that there are some very good network-attached storage solutions for clusters that employ separate high speed interconnects between processing nodes and a central disk resource. Often, the central disk resource has multiple disks that can be accessed independently by the cluster nodes. These I/O configurations can offer both the convenience of a shared disk resource visible to all nodes, as well as high speed I/O performance that scales nearly as well as independent local disks on each node. The best choice for an HPC cluster solution may be a combination of network-attached storage and local disks on each node.

5.3.4. GPUs

The GPU accelerator capability supports only the highest-end GPU cards. The reason is that these high-end cards can offer some acceleration relative to the latest CPU cores, while older or less expensive graphics cards typically cannot. When measuring the scalability of the GPU accelerator capability, it is important to consider not only the GPU being used, but also the CPU cores. These products (both CPUs and GPUs) are constantly evolving, and new products emerge on a regular basis. The number of available CPU cores, along with the total peak computational rate when using those cores, can affect the scalability. Older, slower CPUs often have fewer cores and show better GPU speedups, because the peak speed of the GPU is greater relative to the peak speed of the older CPU cores. Likewise, as newer CPUs come to market (often with more cores), GPU speedups may degrade because the peak speed of the GPU is smaller relative to the peak speed of the new CPU cores.
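
A simple Amdahl-style calculation illustrates why the same GPU appears less impressive next to a faster CPU. The sketch below is illustrative only: the peak rates and the fraction of solver work assumed to be GPU-accelerated are placeholder assumptions, not measured values for any particular product.

    # gpu_speedup_estimate.py -- illustrative only; all rates and fractions are assumptions.
    def amdahl(fraction_accelerated, accel_factor):
        """Overall speedup when only part of the solution is accelerated."""
        return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / accel_factor)

    CPU_GFLOPS_OLD = 200.0    # assumed peak of an older CPU (all cores)
    CPU_GFLOPS_NEW = 800.0    # assumed peak of a newer CPU with more cores
    GPU_GFLOPS     = 2000.0   # assumed peak of a high-end GPU
    FRACTION       = 0.7      # assumed share of solver work the GPU can accelerate

    for label, cpu_gflops in (("older CPU", CPU_GFLOPS_OLD), ("newer CPU", CPU_GFLOPS_NEW)):
        speedup = amdahl(FRACTION, GPU_GFLOPS / cpu_gflops)
        print(f"{label}: estimated overall speedup ~{speedup:.2f}x")

With these assumed numbers, the same GPU yields roughly a 2.7x overall speedup next to the older CPU but only about 1.7x next to the newer one.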