Running HPC Diagnostics
The Ansys EM HPC diagnostics tool simplifies HPC troubleshooting by automating the diagnosis of routine issues. The tool runs on the cluster as a scheduler-managed job. Using its HTML-based diagnostics report, the cluster administrator or Ansys support staff can either resolve the issue or guide the user through further troubleshooting steps. In some cases, Ansys support staff may ask you to rerun the diagnostics with additional diagnostic tests. You may also extend the diagnostic scripts to suit your HPC environment.
This note describes how to use the diagnostics tool.
- Supported schedulers
- Running the diagnostics job
- Standard diagnostic job
- Using diagnostics scripts on Linux clusters
- Using Windows HPC job file
- Diagnostic report
- Site-specific diagnostics job
- Environment variables
- ANSYSEM_DIAG_PROD_DIR contents
- ANSYSEM_DIAG_RESULTS_DIR contents
- How does the diagnostic tool work
Supported schedulers
The tool supports diagnosis of issues on Linux and Windows clusters managed by the following schedulers:
- LSF
- SGE
- PBS/Torque
- Windows HPC
For the above schedulers (see High Performance Computing (HPC) Integration), the tool includes basic diagnostic scripts. In addition, if password-less ssh has been enabled, it supports generic Linux clusters using ssh. Note that the diagnostics tool currently does not support PBSPro or LSF/Windows.
Running the diagnostics job
The diagnostics run as a scheduler-managed job. Once the job finishes, locate the resulting HTML file and provide it to the cluster administrator or to Ansys support staff. If there are any job or test failures, also provide the networking*.json files from the Hosts subdirectory.
Standard diagnostic job
To run the basic diagnostics, submit a diagnostic job to the scheduler using one of the provided job submission scripts. Each basic diagnostic job is a 12-core job with 4 cores per host, so it spans 3 hosts. On Linux, running the script submits a scheduler job that runs the diagnostic tool on the cluster. On Windows, you submit the job using a job file.
Basic scripts for each supported scheduler are available in the diagnostics subdirectory of the schedulers directory.
Linux:
.../Linux64/schedulers/diagnostics
Windows:
...\Win64\schedulers\diagnostics
Using diagnostics scripts on Linux clusters
The job submission scripts are scheduler specific. The following basic scripts are provided in the diagnostics directory (.../Linux64/schedulers/diagnostics):
| Scheduler | Basic job submission script | Comment |
| --- | --- | --- |
| LSF | test_lsf | Supports both lsrun and blaunch |
| SGE | test_sge | Supports both qrsh and rsh |
| PBS/Torque | test_torque | Requires changing the PATH and PBS_BINARY_PATH environment variables |
| Generic Linux cluster | test_ssh | Supports only ssh. Requires password-less ssh. Requires creating a file with the names of hosts and saving it in ${HOME}/ansysem_hostfile |
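As a minimal sketch of preparing the host file that test_ssh expects, the Python snippet below writes host names to ${HOME}/ansysem_hostfile. The one-name-per-line layout, the helper name write_hostfile, and the host names in the usage line are assumptions; this note does not specify the file format, so confirm it with your cluster administrator or Ansys support staff.

```python
from pathlib import Path


def write_hostfile(hosts, path=None):
    """Write host names to the file read by test_ssh.

    Assumption: one host name per line (the format is not documented here).
    """
    # Default location named in the table above: ${HOME}/ansysem_hostfile
    target = Path(path) if path is not None else Path.home() / "ansysem_hostfile"
    # Drop blanks and duplicates while keeping the original order.
    unique_hosts = list(dict.fromkeys(h.strip() for h in hosts if h.strip()))
    if not unique_hosts:
        raise ValueError("at least one host name is required")
    target.write_text("\n".join(unique_hosts) + "\n")
    return target
```

For example, write_hostfile(["node01", "node02", "node03"]) would create the file in the default location with three hypothetical hosts.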
Using Windows HPC job file
A sample job file, winhpctest.xml, is available in the diagnostics directory (...\Win64\schedulers\diagnostics).
To submit this diagnostic job, change the job description to suit your environment as follows:
- Select a directory for saving the diagnostic results. This directory must be accessible at the same path from all the hosts of the cluster.
- Locate the directory for Ansys EM installation. This directory also must be accessible at the same path from all the hosts of the cluster.
- Locate winhpctest.xml in the diagnostics subdirectory of the schedulers directory in the Ansys EM installation.
- Start Windows HPC job manager, and choose the "New job from XML File…" action.
- Select the winhpctest.xml job file.
- Set both of the following environment variables to the directories from the first two steps: ANSYSEM_DIAG_PROD_DIR (the Ansys EM installation directory) and ANSYSEM_DIAG_RESULTS_DIR (the directory for the diagnostic results).
- Submit the job.
After making the above changes, you can also save the resulting XML file using "Submit Job XML File…". You can then submit the job from the command line with the job command as follows:
job submit /jobfile:<XML file name>
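If you submit the saved job file from a script, the command line above can be wrapped as in the following sketch. The helper names build_submit_command and submit_diag_job are illustrative and not part of the tool, and the sketch assumes the file name follows /jobfile: directly and that the job command is on the PATH of the submitting host.

```python
import subprocess


def build_submit_command(job_xml):
    """Return the argument list for 'job submit /jobfile:<file>'."""
    # Assumption: the file name follows the colon with no space in between.
    return ["job", "submit", f"/jobfile:{job_xml}"]


def submit_diag_job(job_xml):
    """Submit the edited job file and return the scheduler's console output.

    Must be run on a Windows host where the 'job' command is available.
    """
    completed = subprocess.run(
        build_submit_command(job_xml),
        check=True,           # raise if the scheduler rejects the job
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids quoting problems when the path to the XML file contains spaces.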
Diagnostic report
The diagnostic report is an HTML file that is placed, along with other related diagnostic results, in the following directory:
Linux:
${HOME}/Ansoft/HPCDiag/Results/JOBID
Windows:
%ANSYSEM_DIAG_DIR%\Results\JOBID
Report file:
.../HTML/report.html
where JOBID is the job ID assigned by the scheduler. On Windows, the user must specify the ANSYSEM_DIAG_DIR directory.
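The following sketch shows one way to collect the files to send to support once a job has finished. It builds the report path from the layout above and gathers any networking*.json files. The function name diag_result_paths is hypothetical, and the sketch assumes the Hosts subdirectory sits directly under the job's results directory, which this note does not state explicitly.

```python
from pathlib import Path


def diag_result_paths(job_id, results_root=None):
    """Return (report, network_files) for one diagnostics job."""
    # Linux default from this note: ${HOME}/Ansoft/HPCDiag.
    # On Windows, pass the shared diagnostics directory explicitly.
    root = Path(results_root) if results_root else Path.home() / "Ansoft" / "HPCDiag"
    job_dir = root / "Results" / str(job_id)
    report = job_dir / "HTML" / "report.html"
    # Assumption: Hosts is a direct subdirectory of the job directory.
    network_files = sorted((job_dir / "Hosts").glob("networking*.json"))
    return report, network_files
```

The function only computes paths and lists existing files; check report.exists() before forwarding the report.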
Site-specific diagnostics job
To run a diagnostic job with job submission parameters of your choice, for example a different LSF queue or a different SGE parallel environment, create your own job submission script starting from one of the basic diagnostic scripts:
- Locate the relevant basic diagnostic script in the diagnostics subdirectory of the schedulers directory in the Ansys EM installation.
- Copy the diagnostic script into a directory that is accessible from a submit host for the cluster.
- Edit the script file so that the ANSYSEM_DIAG_PROD_DIR environment variable points to the installation directory (see the environment variable descriptions below).
- Modify the job submission parameters as needed.
- Optionally, copy any site-specific diagnostic tests provided by Ansys support staff into the ../Custom subfolder of the ANSYSEM_DIAG_RESULTS_DIR directory.
- Run the diagnostics script from a submit host for the cluster.
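The copy and environment steps above can be sketched as follows. This is an illustration under assumptions, not the tool's own procedure: the function name prepare_site_script is hypothetical, and instead of editing the script it exports ANSYSEM_DIAG_PROD_DIR through the process environment. Changes to the job submission parameters still have to be made in the copied script by hand.

```python
import os
import shutil
from pathlib import Path


def prepare_site_script(basic_script, work_dir, prod_dir, base_env=None):
    """Copy a basic diagnostic script and build the environment for running it."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps the permission bits, so the copy stays executable.
    site_script = Path(shutil.copy2(basic_script, work_dir / Path(basic_script).name))
    env = dict(os.environ if base_env is None else base_env)
    # This note asks for the variable to be exported when the script is copied.
    env["ANSYSEM_DIAG_PROD_DIR"] = str(prod_dir)
    return site_script, env
```

After editing the copy, it could be started from a submit host with subprocess.run([str(site_script)], env=env, check=True).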
Environment variables
The following environment variables apply to both Linux and Windows environments.
| Environment variable | ANSYSEM_DIAG_PROD_DIR |
| --- | --- |
| Description | Location of the Ansys EM installation. This must be available at the same path from all the hosts of the cluster. |
| Windows example | \\filer\AnsysEM\v242\Win64 |
| Linux example | /shared/AnsysEM/v242/Linux64 |
| Comments | Windows: Required. Linux: Optional. Export this environment variable if you make a copy of the diagnostic script. |
| Environment variable | ANSYSEM_DIAG_RESULTS_DIR |
| --- | --- |
| Description | Location of the diagnostic report and other results on a shared drive. This must be available at the same path from all the hosts of the cluster. |
| Windows example | \\filer\Home\User\Ansoft\HPCDiag |
| Linux example | /shared/home/user/Ansoft/HPCDiag |
| Comments | Windows: Required. Linux: Optional. Export this environment variable if the home directory for the user is not accessible from the cluster. |
ANSYSEM_DIAG_CUSTOM_DIR
| Environment variable | ANSYSEM_DIAG_CUSTOM_DIR |
| --- | --- |
| Description | Location of the configuration of product tests and other custom site-specific tests. This location must be on a shared drive that is available at the same path from all the hosts of the cluster. |
| Windows example | \\filer\Home\User\Ansoft\HPCDiag\Custom |
| Linux example | /shared/home/user/Ansoft/HPCDiag/Custom |
| Comments | Windows: Optional. You may want to specify it if the path %ANSYSEM_DIAG_RESULTS_DIR%\..\Custom is not suitable. Linux: Optional. Export this environment variable if the home directory for the user is not accessible from the cluster. |
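To make the required and optional rules above concrete, the sketch below resolves the three directories from an environment mapping. It is a hedged illustration: the function name resolve_diag_dirs and the returned dictionary keys are hypothetical, and applying the ../Custom fallback on Linux as well as on Windows is an assumption based on the site-specific job steps rather than a documented default.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

REQUIRED_ON_WINDOWS = ("ANSYSEM_DIAG_PROD_DIR", "ANSYSEM_DIAG_RESULTS_DIR")


def resolve_diag_dirs(env, on_windows, home=None):
    """Resolve the diagnostics directories from an environment mapping."""
    path_cls = PureWindowsPath if on_windows else PurePosixPath
    if on_windows:
        # Both variables are marked "Required" on Windows.
        missing = [name for name in REQUIRED_ON_WINDOWS if not env.get(name)]
        if missing:
            raise ValueError("missing required variables: " + ", ".join(missing))
    prod = env.get("ANSYSEM_DIAG_PROD_DIR")
    results = env.get("ANSYSEM_DIAG_RESULTS_DIR")
    if not results:
        # Linux default: Ansoft/HPCDiag under the user's home directory.
        home = home or os.path.expanduser("~")
        results = str(path_cls(home) / "Ansoft" / "HPCDiag")
    # Fallback mirrors the documented path <results>/../Custom (left unresolved).
    custom = env.get("ANSYSEM_DIAG_CUSTOM_DIR") or str(path_cls(results) / ".." / "Custom")
    return {"prod": prod, "results": results, "custom": custom}
```

Calling resolve_diag_dirs(os.environ, on_windows=(os.name == "nt")) before submitting a job gives an early error on Windows when a required variable is missing.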
How does the diagnostic tool work
The diagnostics run as a scheduler-managed job. Running the diagnostic script submits a scheduler job that runs the diagnostic tool on the hosts allocated to the job. Once the diagnostic job starts, the tool executes a set of diagnostic tests. These tests run on each host allocated to the job and collect diagnostic information relevant to running HPC jobs. The tool then combines this information into an HTML report. It saves the HTML diagnostic report and other results on a shared drive, which must be available at the same path from all the hosts of the cluster. On Linux, the default location is the Ansoft/HPCDiag subdirectory under the user's home directory. On Windows, the user must specify this location using the ANSYSEM_DIAG_RESULTS_DIR environment variable.