Appendix A. Glossary

This appendix contains a glossary of terms used in the Performance Guide.

Cache: High speed memory that is located on a CPU or core. Cache memory can be accessed much faster than main memory, but it is limited in size due to high cost and the limited amount of space available on multicore processors. Algorithms that can use data from the cache repeatedly usually perform at a much higher compute rate than algorithms that access larger data structures that cause cache misses.
Clock cycle: The time between two adjacent pulses of the oscillator that sets the tempo of the computer processor. The cycle time is the reciprocal of the clock speed, or frequency. A 1 GHz (gigahertz) clock speed has a clock cycle time of 1 nanosecond (1 billionth of a second).
Clock speed: The system frequency of a processor. In modern processors the frequency is typically measured in GHz (gigahertz - 1 billion clocks per second). A 3 GHz processor producing 2 adds and 2 multiplies per clock cycle can achieve 12 Gflops.
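The arithmetic in the clock cycle and clock speed entries can be checked with a minimal Python sketch. The function names are illustrative only and are not part of any product; the compute rate is a theoretical peak, not a measured value.

```python
def cycle_time_ns(clock_ghz):
    """Clock cycle time in nanoseconds: the reciprocal of the clock speed in GHz."""
    return 1.0 / clock_ghz


def peak_gflops(clock_ghz, flops_per_cycle):
    """Theoretical peak compute rate: clocks per second times operations per clock."""
    return clock_ghz * flops_per_cycle


# A 1 GHz clock has a 1 ns cycle time.
print(cycle_time_ns(1.0))
# 3 GHz with 2 adds + 2 multiplies per clock cycle gives a 12 Gflops peak.
print(peak_gflops(3.0, 2 + 2))
```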
Cluster system: A system of independent processing units, called blades or nodes, each having one or more independent processors and independent memory, usually configured as separate rack-mounted chassis units or as independent CPU boards. A cluster system uses an interconnect to communicate between the independent nodes through a communication middleware application.
Core: A core is essentially an independently functioning processor that is part of a single multicore CPU. A dual-core processor contains two cores, and a quad-core processor contains four cores. Each core can run an individual application or run a process in parallel with other cores in a parallel application. Cores in the same multicore CPU share the socket into which the CPU is plugged on the motherboard.
CPU time: As reported in the solver output, CPU time generally refers to the time that a processor spends on the user's application; it excludes system and I/O wait time and other idle time. For parallel systems, CPU time means different things on different systems. Some systems report CPU time summed across all threads, while others do not. It is best to focus on "elapsed" or "wall" time for parallel applications.
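The difference between CPU time and elapsed time can be observed with a small Python sketch. It is an illustration only and does not reproduce how the solver output is computed. The helper name `measure` is hypothetical, and Python's `time.process_time()` includes system CPU time, which the CPU time described above excludes.

```python
import time


def measure(task):
    """Run task() and return (cpu_seconds, wall_seconds) for this process."""
    cpu_start = time.process_time()   # user + system CPU time of this process
    wall_start = time.perf_counter()  # elapsed (wall clock) time
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


# Sleeping stands in for I/O wait: wall time advances, CPU time barely moves.
cpu, wall = measure(lambda: time.sleep(0.2))
print(f"CPU time: {cpu:.3f} s, wall time: {wall:.3f} s")
```

For a job that spends most of its time waiting, the two numbers diverge widely, which is one reason elapsed time is the more useful figure.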
Database space: The block of memory that Mechanical APDL uses to store the database (model geometry, material properties, boundary conditions, and a portion of the results).
DIMM: A module (double in-line memory module) containing one or several random access memory (RAM) chips on a small circuit board with pins that connect to the computer motherboard.
Distributed-memory parallel (DMP) processing: This term refers to running across multiple cores on a single machine (for example, a desktop workstation or a single compute node of a cluster) or across multiple machines (for example, a cluster). Distributed-memory parallelism is invoked, and each process communicates data needed to perform the necessary parallel computations through the use of MPI (Message Passing Interface) software. With distributed-memory parallel processing, all computations in the solution phase are performed in parallel (including the stiffness matrix generation, linear equation solving, and results calculations). Pre- and postprocessing do not make use of the distributed-memory parallel processing, but these steps can exploit shared-memory parallelism. For details, see Using Distributed-Memory Parallel (DMP) Processing in the Parallel Processing Guide.
Distributed memory parallel (DMP) system: A system in which the physical memory for each process is separate from that of all other processes. A communication middleware application is required to exchange data between the processors.
Gflops: A measure of processor compute rate in terms of billions of floating point operations per second. 1 Gflop equals 1 billion floating point operations in one second.
Gigabit (abbreviated Gb): A unit of measurement often used by switch and interconnect vendors. One gigabit = 1024x1024x1024 bits. Since a byte is 8 bits, it is important to keep units straight when making comparisons. Throughout this guide we use GB (gigabytes) rather than Gb (gigabits) when comparing both I/O rates and communication rates.
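Because vendor figures are often quoted in gigabits, a comparison against rates in this guide first needs a division by 8. A minimal Python sketch follows; the function name and the 10 Gb/sec figure are hypothetical examples, not data from this guide.

```python
BITS_PER_BYTE = 8


def gigabits_to_gigabytes(rate_gb):
    """Convert a rate in Gb/sec (gigabits) to GB/sec (gigabytes)."""
    return rate_gb / BITS_PER_BYTE


# A hypothetical 10 Gb/sec interconnect moves at most 1.25 GB/sec.
print(gigabits_to_gigabytes(10))
```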
Gigabyte (abbreviated GB): A unit of computer memory or data storage capacity equal to 1,073,741,824 (2³⁰) bytes. One gigabyte is equal to 1,024 megabytes (or 1,024 x 1,024 x 1,024 bytes).
Graphics processing unit (GPU): A graphics processing unit (GPU) is a specialized microprocessor that offloads and accelerates 3D or 2D graphics rendering from the microprocessor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard. Integrated GPUs are usually far less powerful than those on a dedicated video card.
Head compute node: In a DMP run, the machine or node on which the master process runs (that is, the machine on which the job is launched). The head compute node should not be confused with the host node in a Windows cluster environment. The host node typically schedules multiple applications and jobs on a cluster, but does not always actually run the application.
High performance computing (HPC): The use of parallel processing software and advanced hardware (for example, large memory, multiple CPUs) to run applications efficiently, reliably, and quickly.
Hyperthreading: A hardware form of parallel processing (simultaneous multithreading) in which the operating system sees extra virtual processors that share time on a smaller set of physical processors or cores. This form of parallel processing does not increase the number of physical cores working on an application and is best suited for multicore systems running more lightweight tasks than there are physical cores available.
In-core mode: A memory allocation strategy in the shared memory and distributed memory sparse solvers that will attempt to obtain enough memory to compute and store the entire factorized matrix in memory. The purpose of this strategy is to avoid doing disk I/O to the matrix factor file.
Interconnect: A hardware switch and cable configuration that connects multiple cores (CPUs) or machines together.
Interconnect Bandwidth: The rate (MB/sec) at which larger-sized messages can be passed from one MPI process to another.
Interconnect Latency: The measured time to send a message of zero length from one MPI process to another.
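Latency and bandwidth are often combined into a simple linear estimate of message time: a fixed startup cost plus the message size divided by the bandwidth. The Python sketch below assumes that model, which is a common approximation and not a measurement method defined in this guide. The function name and the interconnect figures are hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # megabyte as defined in this glossary


def transfer_time_s(message_bytes, latency_s, bandwidth_mb_per_s):
    """Estimated time to pass one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / (bandwidth_mb_per_s * BYTES_PER_MB)


# Hypothetical interconnect: 2 microsecond latency, 1000 MB/sec bandwidth.
small = transfer_time_s(8, 2e-6, 1000.0)                   # dominated by latency
large = transfer_time_s(100 * BYTES_PER_MB, 2e-6, 1000.0)  # dominated by bandwidth
print(small, large)
```

Under this model, many small messages are limited by latency, while a few large messages are limited by bandwidth.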
Master process: The first process started in a DMP run (also called the rank 0 process). This process reads the user input file, decomposes the problem, and sends the FEA data to each remaining MPI process in a DMP run to begin the solution. It also handles any pre- and postprocessing operations.
Megabit (abbreviated Mb): A unit of measurement often used by switch and interconnect vendors. One megabit = 1024x1024 bits. Since a byte is 8 bits, it is important to keep units straight when making comparisons. Throughout this guide we use MB (megabytes) rather than Mb (megabits) when comparing both I/O rates and communication rates.
Megabyte (abbreviated MB): A unit of computer memory or data storage capacity equal to 1,048,576 (2²⁰) bytes (also written as 1,024 x 1,024 bytes).
Memory bandwidth: The amount of data that the computer can carry between main memory and the processor in a given time period (usually measured in MB/second).
Mflops: A measure of processor compute rate in terms of millions of floating point operations per second; 1 Mflop equals 1 million floating point operations in one second.
Multicore processor: An integrated circuit in which each processor contains multiple (two or more) independent processing units (cores).
MPI software: Message passing interface software used to exchange data among processors.
NFS: The Network File System (NFS) is a client/server application that lets a computer user view and optionally store and update files on a remote computer as if they were on the user's own computer. On a cluster system, an NFS system may be visible to all nodes, and all nodes may read and write to the same disk partition.
Node: When used in reference to hardware, a node is one machine (or unit) in a cluster of machines used for distributed memory parallel processing. Each node contains its own processors, memory, and usually I/O.
Non-Uniform Memory Architecture (NUMA): A memory architecture for multi-processor/core systems that includes multiple paths between memory and CPUs/cores, with fastest memory access for those CPUs closest to the memory. The physical memory is globally addressable, but physically distributed among the CPUs. NUMA memory architectures are generally preferred over a bus memory architecture for higher CPU/core counts because they offer better scaling of memory bandwidth.
OpenMP: A programming standard that allows parallel programming on SMP architectures. OpenMP consists of a software library that is usually part of the compilers used to build an application, along with a defined set of programming directives that define a standard method of parallelizing application codes for SMP systems.
Out-of-core mode: A memory allocation strategy in the shared memory and distributed memory sparse solvers that uses disk storage to reduce the memory requirements of the sparse solver. The very large matrix factor file is stored on disk rather than stored in memory.
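The trade-off between the in-core and out-of-core modes can be pictured with a deliberately simplified Python sketch. It is not the logic the sparse solvers use; the function name, the inputs, and the single comparison are assumptions made purely for illustration.

```python
def choose_factor_mode(factor_gb, available_gb):
    """Hypothetical rule: keep the matrix factor in memory only if it fits entirely."""
    if factor_gb <= available_gb:
        return "in-core"      # factor stays in memory, no factor-file disk I/O
    return "out-of-core"      # factor is written to disk to reduce memory use


# Illustrative sizes only.
print(choose_factor_mode(factor_gb=24.0, available_gb=64.0))
print(choose_factor_mode(factor_gb=120.0, available_gb=64.0))
```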
Parallel processing: Running an application using multiple cores, or processing units. Parallel processing requires the dividing of tasks in an application into independent work that can be done in parallel.
Physical memory: The memory hardware (normally RAM) installed on a computer. Memory is usually packaged in DIMMs (double in-line memory modules), which plug into memory slots on a CPU motherboard.
Processor: The computer hardware that responds to and processes the basic instructions that drive a computer.
Processor speed: The speed of a CPU (core) measured in MHz or GHz. See "clock speed" and "clock cycle."
RAID: A RAID (redundant array of independent disks) is multiple disk drives configured to function as one logical drive. RAID configurations are used to make redundant copies of data or to improve I/O performance by striping large files across multiple physical drives.
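Striping can be sketched as a round-robin mapping of consecutive pieces (stripes) of a file onto the physical drives. The Python example below assumes the simplest layout, with no redundancy or parity; real RAID levels and controllers differ, and the function name is hypothetical.

```python
def stripe_location(stripe_index, num_drives):
    """Map a logical stripe to (drive number, stripe position on that drive)."""
    return stripe_index % num_drives, stripe_index // num_drives


# Six consecutive stripes of one large file spread over three drives.
for i in range(6):
    print(i, stripe_location(i, 3))
```

Because neighboring stripes land on different drives, a large sequential read or write can keep all drives busy at once.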
SAS drive: A serial-attached SCSI (SAS) drive uses a method of accessing computer peripheral devices that employs a serial (one bit at a time) means of digital data transfer over thin cables. SAS is a newer version of the SCSI interface and is found in some HPC systems.
SATA drive: Also known as Serial ATA, SATA is an evolution of the Parallel ATA physical storage interface. Serial ATA is a serial link; a single cable with a minimum of four wires creates a point-to-point connection between devices.
SCSI drive: The Small Computer System Interface (SCSI) is a set of ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk drives, printers, etc.
Scalability: A measure of the ability of an application to effectively use parallel processing. Usually, scalability is measured by comparing the time to run an application on p cores versus the time to run the same application using just one core.
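The comparison described above is usually expressed as a speedup ratio. The Python sketch below computes it, along with parallel efficiency (speedup divided by core count), which is a commonly used derived measure that this glossary does not define. The function names and timings are illustrative.

```python
def speedup(time_one_core, time_p_cores):
    """Ratio of the one-core run time to the run time on p cores."""
    return time_one_core / time_p_cores


def parallel_efficiency(time_one_core, time_p_cores, p):
    """Speedup divided by the number of cores; 1.0 means ideal scaling."""
    return speedup(time_one_core, time_p_cores) / p


# Hypothetical timings: 100 s on one core, 25 s on 8 cores.
print(speedup(100.0, 25.0))                 # 4.0
print(parallel_efficiency(100.0, 25.0, 8))  # 0.5
```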
Scratch space: The block of memory used by the program for all internal calculations: element matrix formulation, equation solution, and so on.
Shared-memory parallel (SMP) processing: This term refers to running across multiple cores on a single machine (for example, a desktop workstation or a single compute node of a cluster). Shared-memory parallelism is invoked, which allows each core involved to share data (or memory) as needed to perform the necessary parallel computations. When run within a shared-memory architecture, most computations in the solution phase and many pre- and postprocessing operations are performed in parallel. For more information, see Using Shared-Memory Parallel (SMP) Processing in the Parallel Processing Guide.
Shared memory parallel (SMP) system: A system that shares a single global memory image that may be distributed physically across multiple nodes or processors, but is globally addressable.
SIMM: A module (single inline memory module) containing one or several random access memory (RAM) chips on a small circuit board with pins that connect to the computer motherboard.
Socket configuration: A set of plug-in connectors on a motherboard that accepts CPUs. Each multicore CPU on a motherboard plugs into a separate socket. Thus, a dual-socket motherboard accepts two dual-core or quad-core CPUs for a total of 4 or 8 cores. On a single motherboard, the cores available are mapped to specific sockets and numbered within the CPU.
Solid state drive (SSD): A solid-state drive (SSD) is a data storage device that uses solid-state memory to store data. Unlike traditional hard disk drives (HDDs), SSDs use microchips and contain no moving parts. Compared to traditional HDDs, SSDs are typically less susceptible to physical shock, quieter, and have lower access time and latency. SSDs use the same interface as hard disk drives, so SSDs can easily replace HDDs in most applications.
Terabyte (abbreviated TB): A unit of computer memory or data storage capacity equal to 1,099,511,627,776 (2⁴⁰) bytes. One terabyte is equal to 1,024 gigabytes.
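The megabyte, gigabyte, and terabyte definitions in this glossary are all powers of 1,024, which a short Python sketch makes explicit. The constant and function names are illustrative.

```python
MB = 1024 ** 2  # 1,048,576 bytes
GB = 1024 ** 3  # 1,073,741,824 bytes
TB = 1024 ** 4  # 1,099,511,627,776 bytes


def bytes_to_gb(num_bytes):
    """Convert a byte count to gigabytes using the binary definition above."""
    return num_bytes / GB


print(bytes_to_gb(TB))  # 1024.0, one terabyte is 1,024 gigabytes
```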
Virtual memory: A portion of the computer's hard disk used by the system to supplement physical memory. The disk space used for system virtual memory is called swap space, and the file is called the swap file.
Wall clock time: Total elapsed time it takes to complete a set of operations. Wall clock time includes processing time, as well as time spent waiting for I/O to complete. It is equivalent to what a user experiences in real-time waiting for the application to run.
Worker process: A DMP process other than the master process.