2.3. Troubleshooting

Troubleshooting instructions for the following common issues are covered in this section:

2.3.1. Memory Related Errors

To diagnose memory-related errors such as "insufficient memory", check the memory usage reported in memory_diagnostics.csv just before the crash. If the Forte simulation runs on a shared-memory personal computer, check whether the total RAM usage exceeds the system's limit. If it runs on a Linux cluster where each node has its own RAM, check the RAM usage of each rank (process) of the Forte simulation and whether it exceeds the RAM limit of its node.
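The check described above can be sketched in Python. This is only an illustration under stated assumptions: the column names rank and memory_mb, the sample data, and the placement of ranks on nodes in consecutive blocks are all hypothetical, and the actual layout of memory_diagnostics.csv may differ. Summing the per-rank peaks on a node gives a conservative upper bound, because the peaks need not occur at the same time.

```python
import csv
import io
from collections import defaultdict


def peak_memory_per_rank(csv_text, rank_col="rank", mem_col="memory_mb"):
    """Return {rank: peak memory in MB} from CSV text.

    The column names are hypothetical; adjust them to match the actual
    header of memory_diagnostics.csv.
    """
    peaks = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rank = int(row[rank_col])
        peaks[rank] = max(peaks[rank], float(row[mem_col]))
    return dict(peaks)


def nodes_over_limit(peaks, ranks_per_node, node_limit_mb):
    """Return {node: summed peak MB} for nodes whose ranks exceed the limit.

    Assumes ranks are placed on nodes in consecutive blocks of
    ranks_per_node, which may not match the actual job layout.
    """
    per_node = defaultdict(float)
    for rank, peak in peaks.items():
        per_node[rank // ranks_per_node] += peak
    return {node: used for node, used in per_node.items() if used > node_limit_mb}


# Made-up sample data standing in for the real diagnostics file.
sample = """time,rank,memory_mb
0.0,0,1000
0.0,1,1200
1.0,0,3000
1.0,1,1500
1.0,2,500
"""

peaks = peak_memory_per_rank(sample)
print(peaks)
print(nodes_over_limit(peaks, ranks_per_node=2, node_limit_mb=4000))
```

With the sample data, ranks 0 and 1 share node 0 and together exceed the assumed 4000 MB limit, while node 1 (rank 2 only) stays well below it.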

Generally speaking, a larger mesh, more species, and more reactions all result in higher RAM usage. Reducing any of these therefore reduces the memory usage of a Forte simulation.

2.3.2. Memory Error Fixed by Simulation Restart

When a simulation runs for a long time and then crashes because its memory usage exceeds a limit, you may find that the memory usage is much lower after restarting the simulation near the crash point, where the mesh size is similar.

At the point of restart, the memory usage is the true memory usage for that specific mesh size.

By comparison, in a regular run that takes a long time to reach this restart point, the memory usage is the true memory usage for that mesh size plus the memory that has accumulated over the simulation up to this point and sits idle because of memory fragmentation.

When Forte deallocates a piece of dynamically allocated memory, the freed memory does not go back to the operating system as free RAM. It is released only to Forte, which can reuse it for a later allocation. So even with a perfect memory allocation scheme, Forte's "true" RAM usage would never go down. However, if Forte cannot reuse a piece of deallocated memory for any reason, that piece becomes idle. These idle pieces accumulate over the duration of the simulation.
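The effect can be illustrated with a toy allocator. The sketch below is not Forte's allocator: ToyHeap, its first-fit policy, and the block sizes are all hypothetical and chosen only to show how freed blocks that are too small to reuse leave a long run with a larger footprint than a fresh start holding the same live data.

```python
class ToyHeap:
    """Toy first-fit allocator; freed memory is never returned to the "OS"."""

    def __init__(self):
        self.size = 0         # total memory obtained from the "OS"
        self.free = []        # holes available for reuse: (offset, length)
        self.live = {}        # handle -> (offset, length)
        self.next_handle = 0

    def alloc(self, n):
        for i, (off, length) in enumerate(self.free):
            if length >= n:
                # Reuse the first hole that is large enough.
                if length == n:
                    self.free.pop(i)
                else:
                    self.free[i] = (off + n, length - n)
                break
        else:
            # No hole fits, so the heap grows.
            off = self.size
            self.size += n
        handle = self.next_handle
        self.next_handle += 1
        self.live[handle] = (off, n)
        return handle

    def release(self, handle):
        # Adjacent holes are not merged; in the demo below every hole is
        # separated by a live block, so merging would not help anyway.
        self.free.append(self.live.pop(handle))

    def in_use(self):
        return sum(length for _, length in self.live.values())


def long_run():
    heap = ToyHeap()
    temporary = []
    for _ in range(100):
        temporary.append(heap.alloc(10))  # short-lived block
        heap.alloc(1)                     # long-lived block between the holes
    for handle in temporary:
        heap.release(handle)              # leaves 100 holes of size 10
    for _ in range(50):
        heap.alloc(20)                    # too large for any hole
    return heap


def restart():
    heap = ToyHeap()
    for _ in range(100):
        heap.alloc(1)
    for _ in range(50):
        heap.alloc(20)
    return heap


old, new = long_run(), restart()
print("long run: footprint", old.size, "in use", old.in_use())
print("restart:  footprint", new.size, "in use", new.in_use())
```

Both heaps hold the same 1100 units of live data, but the long run has a footprint of 2100 units because 1000 units sit idle in holes that are too small to reuse, while the restarted heap has a footprint of 1100.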

Unfortunately, there is no fix for the root cause of crashes caused by memory fragmentation. The workaround is to restart the simulation from a restart file created before the crash.