Running HPC Diagnostics

The Ansys EM HPC diagnostics tool simplifies HPC troubleshooting by automating the diagnosis of routine issues. The tool runs on the cluster as a scheduler-managed job. Using its HTML-based diagnostics report, the cluster administrator or Ansys support staff can either resolve the issue or guide the user through further troubleshooting steps. In some cases, Ansys support staff may ask you to rerun the diagnostics with additional diagnostic tests. You can also extend the diagnostic scripts to suit your HPC environment.

This note describes how to use the diagnostics tool.

Supported schedulers

The tool supports diagnosis of issues on Linux and Windows clusters managed by the following schedulers:

For the above schedulers (see High Performance Computing (HPC) Integration), the tool includes basic diagnostic scripts. Further, if password-ssh has been enabled, it also supports generic Linux clusters using ssh. Note that the diagnostics tool does not currently support PBSPro or LSF/Windows.

Running the diagnostics job

The diagnostics are run as a scheduler-managed job. Once the job finishes, locate the resulting HTML file and provide it to the cluster administrator or to Ansys support staff. If there are any job or test failures, also provide the networking*.json files from the Hosts subdirectory.
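Gathering the files to send can be sketched in Python as follows. This is a minimal sketch, not part of the diagnostics tool: the function name collect_support_files is hypothetical, and because the exact layout of the results directory is not assumed, it searches recursively for report.html in a directory named HTML and for networking*.json files in a directory named Hosts.

```python
from pathlib import Path


def collect_support_files(job_results_dir):
    """Return (reports, networking_files) found under one job's results directory.

    Hypothetical helper: the directory layout is not assumed, so both
    searches are recursive and filter on the parent directory name.
    """
    root = Path(job_results_dir)
    # The HTML report is named report.html and sits in a directory named HTML.
    reports = sorted(
        p for p in root.rglob("report.html") if p.parent.name == "HTML"
    )
    # Only networking*.json files located in a directory named Hosts are kept.
    networking = sorted(
        p for p in root.rglob("networking*.json") if p.parent.name == "Hosts"
    )
    return reports, networking
```

Call collect_support_files with the results directory of the finished job and attach the returned paths when you contact the cluster administrator or Ansys support staff.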

Basic diagnostic job

To run the basic diagnostics, submit a diagnostic job to the scheduler using a provided job submission script. Each basic diagnostic job is a 12-core job with 4 cores per host (that is, 3 hosts). On Linux, running this script submits a scheduler job to run the diagnostic tool on the cluster. On Windows, you need to submit a job using a job file.

Basic scripts for each supported scheduler are available in the diagnostics subdirectory of the schedulers directory.

Linux:

.../Linux64/schedulers/diagnostics

Windows:

...\Win64\schedulers\diagnostics

Using diagnostics scripts on Linux clusters

The following basic scripts are provided in the diagnostics directory (.../Linux64/schedulers/diagnostics):

These job submission scripts are scheduler-specific.



Using Windows HPC job file

A sample job file winhpctest.xml is available in the diagnostics directory:

...\Win64\schedulers\diagnostics

To submit this diagnostic job, first change the job description to suit your environment as follows:

  1. Select a directory for saving the diagnostic results. This directory must be accessible at the same path from all the hosts of the cluster.
  2. Locate the directory for Ansys EM installation. This directory also must be accessible at the same path from all the hosts of the cluster.
  3. Locate winhpctest.xml in the diagnostics subdirectory of the schedulers directory in the Ansys EM installation.
  4. Start the Windows HPC Job Manager and choose the "New job from XML File…" action.
  5. Select the winhpctest.xml job file.
  6. Set the following two environment variables to the directories located in the first two steps:

    ANSYSEM_DIAG_PROD_DIR (the Ansys EM installation directory from step 2)

    ANSYSEM_DIAG_RESULTS_DIR (the results directory from step 1)

Now submit the job.

Note:

After making the above changes, you can also save the resulting XML file using "Submit Job XML File…". You can then submit the job using the job command as follows:

job submit /jobfile:<XML file name>
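If you prefer to script the submission, the command above can be wrapped in Python. This is a minimal sketch: the function names are hypothetical, and it assumes the Windows HPC job command is on the PATH of the machine where it runs.

```python
import subprocess


def build_submit_command(job_xml_path):
    """Build the argument list for: job submit /jobfile:<XML file name>."""
    # The file name follows the colon directly, with no space in between.
    return ["job", "submit", f"/jobfile:{job_xml_path}"]


def submit_job(job_xml_path):
    """Run the job command (assumed to be on the PATH)."""
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    return subprocess.run(build_submit_command(job_xml_path), check=True)
```

For example, submit_job(r"C:\shared\diag\winhpctest_site.xml") submits a saved job file; the path shown here is hypothetical.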

Diagnostic report

The diagnostic report is an HTML file that is placed, along with other related diagnostic results, in the following directory:

Linux:

${HOME}/Ansoft/HPCDiag/Results/JOBID

Windows:

%ANSYSEM_DIAG_DIR%\Results\JOBID

Report file:

.../HTML/report.html

where JOBID is the job ID assigned by the scheduler. On Windows, the user must specify the ANSYSEM_DIAG_DIR directory.
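The directory rules above can be expressed as a small Python helper. This is a sketch under stated assumptions: the function name job_results_dir is hypothetical, and the Windows variable name is taken as written in the path above. It returns only the per-job results directory and does not guess the elided part of the report file path.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath


def job_results_dir(job_id, windows=False, env=None):
    """Return the results directory for a job, following the paths listed above."""
    env = os.environ if env is None else env
    if windows:
        # On Windows there is no default; the user must set the variable.
        base = env.get("ANSYSEM_DIAG_DIR")
        if not base:
            raise ValueError("ANSYSEM_DIAG_DIR must be set on Windows")
        return PureWindowsPath(base) / "Results" / str(job_id)
    # On Linux the results go under Ansoft/HPCDiag in the user's home directory.
    return PurePosixPath(env["HOME"]) / "Ansoft" / "HPCDiag" / "Results" / str(job_id)
```

Passing env explicitly, as in job_results_dir("1234", env={"HOME": "/home/alice"}), makes it easy to check the result without touching the real environment.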

Site-specific diagnostics job

To run a diagnostic job with job submission parameters of your choice (for example, a different LSF queue or a different SGE parallel environment), create your own job submission script, starting from one of the basic diagnostic scripts, with the following steps:

  1. Locate the relevant basic diagnostic script in the diagnostics subdirectory of the schedulers directory in the Ansys EM installation.
  2. Copy the diagnostics script into a directory that is accessible from a submit host for the cluster.
  3. Edit the script file to set the ANSYSEM_DIAG_PROD_DIR environment variable to the installation directory (see below).
  4. Modify the job submission parameters as needed.
  5. Optionally, copy any site-specific diagnostic tests provided by Ansys support staff into the ../Custom subfolder of the ANSYSEM_DIAG_RESULTS_DIR directory.
  6. Run the diagnostics script from a submit host for the cluster.
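Steps 2 and 3 can be automated with a short Python helper. This is a minimal sketch resting on an assumption that may not hold for your script: it expects the basic script to assign the variable on a single line of the form ANSYSEM_DIAG_PROD_DIR=value, optionally prefixed by export. The function name make_site_script is hypothetical; check the actual script and adjust the pattern if it uses a different syntax.

```python
import re
import shutil
from pathlib import Path

# Assumption: one-line shell-style assignment, optionally prefixed by "export".
_ASSIGNMENT = re.compile(
    r"^([ \t]*(?:export[ \t]+)?ANSYSEM_DIAG_PROD_DIR=).*$", re.MULTILINE
)


def make_site_script(basic_script, dest_dir, install_dir):
    """Copy a basic diagnostic script and point it at install_dir."""
    src = Path(basic_script)
    dest = Path(dest_dir) / src.name
    text = src.read_text()
    # A function replacement avoids backslash-escape surprises in install_dir.
    new_text, count = _ASSIGNMENT.subn(
        lambda m: f'{m.group(1)}"{install_dir}"', text
    )
    if count == 0:
        raise ValueError("no ANSYSEM_DIAG_PROD_DIR assignment found in " + src.name)
    dest.write_text(new_text)
    # Carry over the permission bits so the copy stays executable.
    shutil.copymode(src, dest)
    return dest
```

The helper raises an error instead of silently writing an unchanged copy, so a script with an unexpected assignment syntax is noticed before the job is submitted. Job submission parameters (step 4) still need to be edited by hand.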

Environment variables

The following environment variables apply to both Linux and Windows environments.

ANSYSEM_DIAG_PROD_DIR



ANSYSEM_DIAG_RESULTS_DIR



ANSYSEM_DIAG_CUSTOM_DIR



How the diagnostic tool works

The diagnostics are run as a scheduler-managed job. Running the diagnostic script submits a scheduler job that runs the diagnostic tool on the hosts allocated to the job. Once the diagnostic job starts, the tool executes a set of diagnostic tests. These tests run on each host allocated to the job and collect diagnostic information relevant to running HPC jobs. The tool combines the diagnostic information to produce an HTML report. The tool saves the HTML diagnostic report and other results on a shared drive, which must be available at the same path from all hosts of the cluster. On Linux, the default is the Ansoft/HPCDiag subdirectory under the user's home directory. On Windows, the user must specify this location using the ANSYSEM_DIAG_RESULTS_DIR environment variable.