This section discusses terms and definitions that are commonly used to describe current hardware capabilities.
CPUs and Cores
The advent of multicore processors has introduced some ambiguity about the definition of a CPU. Historically, the CPU was the central processing unit of a computer. However, with multicore CPUs, each core is really an independently functioning processor. Each multicore CPU contains two or more cores and is, therefore, a parallel computer whose cores compete for the memory and I/O resources of a single motherboard.
The ambiguity between CPUs and cores often arises when describing parallel algorithms or parallel runs. In the context of an algorithm, CPU almost always refers to a single task on a single processor. In this document we use the term core, rather than CPU, to identify each independent processing unit on which a task runs; CPU is reserved for describing the socket configuration. For example, a typical configuration today contains two CPU sockets on a single motherboard, with 4 or 8 cores per socket. Such a configuration could support a distributed-memory parallel (DMP) processing run of up to 16 cores. We describe this as a 16-core run, not a 16-processor run.
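As a point of reference, a program can query how many processing units the operating system reports. The following minimal sketch (a Linux-oriented illustration using the glibc extension _SC_NPROCESSORS_ONLN; it is not part of Mechanical APDL) prints the number of logical processors currently online, which on hyper-threaded systems may exceed the number of physical cores:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Number of logical processors currently online (physical cores,
           or hardware threads when hyper-threading is enabled). */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("Online logical processors: %ld\n", n);
        return 0;
    }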
GPUs
While graphics processing units (GPUs) have been around for many years, only recently have they begun to be used for general purpose computations. GPUs offer a highly parallel design, often containing hundreds of compute units, and have their own dedicated physical memory. Certain high-end graphics cards, those with the largest number of compute units and the most memory, can be used to accelerate the computations performed during a simulation. In this document, we use the term GPU to refer to these high-end cards that can serve as accelerators to speed up certain portions of a simulation.
Threads and MPI Processes
Two modes of parallel processing are supported and used throughout Mechanical APDL simulations. Details of parallel processing are described in a later chapter, but the two modes introduce another ambiguity in terminology. In the shared memory implementation, one instance of the Mechanical APDL executable (that is, one Mechanical APDL process) spawns multiple threads for parallel regions. In the distributed memory implementation, multiple instances of the Mechanical APDL executable run as separate MPI tasks, or processes.
In this document, “threads” refers to the shared memory tasks that the program uses when running in parallel under a single process. “MPI processes” refers to the distributed memory tasks in a DMP run. MPI processes serve the same function as threads, but they are separate processes running simultaneously that communicate through MPI software.
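The distinction can be seen in a minimal hybrid sketch (assuming an MPI library such as Open MPI or Intel MPI and an OpenMP-capable C compiler; this is an illustration only, not code from Mechanical APDL). Each MPI process is a separate instance of the executable, and within each process OpenMP spawns threads that share that process's memory:

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);   /* each MPI process is a separate executable instance */

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Within one process, OpenMP spawns threads that share that process's memory. */
        #pragma omp parallel
        {
            printf("MPI process %d of %d, thread %d of %d\n",
                   rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Built with, for example, mpicc -fopenmp and launched with mpirun -np 4, this sketch runs four separate processes, each printing one line per thread.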
Memory Vocabulary
Two common terms used to describe computer memory are physical memory and virtual memory. Physical memory is essentially the total amount of RAM (Random Access Memory) available. Virtual memory is an extension of physical memory that is actually reserved on disk storage. It allows applications to extend the address space available to them, at the cost of speed, since accessing physical memory is much faster than accessing virtual (disk) memory. The appropriate use of virtual memory is described in later chapters.
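A simple way to see the physical memory installed on a machine is to query the operating system. The sketch below (a Linux-oriented illustration using the glibc extensions _SC_PHYS_PAGES and _SC_PAGE_SIZE; not part of Mechanical APDL) computes total RAM as the number of physical pages times the page size:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Total physical memory (RAM) = number of physical pages * page size. */
        long pages = sysconf(_SC_PHYS_PAGES);
        long page_size = sysconf(_SC_PAGE_SIZE);
        double gb = (double)pages * (double)page_size / (1024.0 * 1024.0 * 1024.0);
        printf("Physical memory: %.1f GB\n", gb);
        return 0;
    }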
I/O Vocabulary
I/O performance is an important component of computer systems for Mechanical APDL users. Advances in desktop systems have made high performance I/O available and affordable to all users. A key term used to describe multidisk, high performance systems is RAID (Redundant Array of Independent Disks). RAID arrays are common in computing environments, but they have many different uses and can be the source of yet another ambiguity.
On many systems, RAID configurations are used to provide duplicate file systems that maintain a mirror image of every file (hence, the word redundant). This configuration, normally called RAID1, does not increase I/O performance and often increases the time required to complete I/O requests. An alternate configuration, called RAID0, uses multiple physical disks in a single striped file system to increase read and write performance by splitting I/O requests simultaneously across the multiple drives. This is the RAID configuration recommended for optimal I/O performance. Other RAID configurations use parity to provide redundant storage along with striping (RAID5), or combine mirroring with striping (RAID10). RAID5 and RAID10 are often used on much larger I/O systems.
Interconnects
Distributed memory parallel processing relies on message passing hardware and software to communicate between MPI processes. On a shared memory system, the hardware components are minimal; only a software layer is required to implement message passing to and from shared memory. For multi-machine or multi-node clusters with separate physical memory, several hardware and software components are required. Usually, each compute node of a cluster contains an adapter card that supports one of several standard interconnects (for example, GigE, Myrinet, InfiniBand). The cards are connected to high-speed switches using cables. Each interconnect system requires a supporting software library, often referred to as the fabric layer. The importance of interconnect hardware in clusters is described in later chapters; it is a key component of cluster performance and cost, particularly on large systems.
Within the major categories of GigE, Myrinet, and InfiniBand, new advances can create incompatibilities with application codes. It is important to make sure that a system using a given interconnect with a given software fabric is compatible with Mechanical APDL. Details of the requirements for hardware interconnects are found in the Parallel Processing Guide.
The performance terms used to describe interconnect speed are latency and bandwidth. Latency is the measured time to send a message of zero length from one MPI process to another (that is, the overhead). It is generally expressed as a time, usually in microseconds. Bandwidth is the rate (MB/sec) at which larger messages can be passed from one MPI process to another (that is, the throughput). Both latency and bandwidth are important considerations in distributed-memory parallel processing.
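A common way to estimate both quantities is a ping-pong test between two MPI processes: zero-length messages approximate latency, and large messages approximate bandwidth. The sketch below (assuming an MPI library; a simplified illustration, not a rigorous benchmark or part of Mechanical APDL) times round trips and reports one-way latency and bandwidth:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define REPS 1000

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Zero-length messages approximate latency; 1 MB messages approximate bandwidth. */
        const int sizes[] = { 0, 1 << 20 };
        char *buf = calloc(1, 1 << 20);

        for (int s = 0; s < 2; s++) {
            int n = sizes[s];
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            /* One-way time per message: elapsed time over 2 * REPS one-way transfers. */
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);
            if (rank == 0) {
                if (n == 0)
                    printf("Latency:   %.1f microseconds\n", t * 1.0e6);
                else
                    printf("Bandwidth: %.1f MB/sec\n", (n / t) / 1.0e6);
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Run with two MPI processes (for example, mpirun -np 2), placed on different nodes so that the interconnect, rather than shared memory, is being measured.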
Many switch and interconnect vendors describe bandwidth using Gb or Mb units. Gb stands for Gigabits, and Mb stands for Megabits. Do not confuse these units with GB (GigaBytes) and MB (MegaBytes). Since a byte is 8 bits, it is important to keep the units straight when making comparisons; for example, a 10 Gb/sec interconnect can deliver a peak of only 1.25 GB/sec. Throughout this guide we consistently use GB and MB units for both I/O and communication rates.