
test: 2.3x faster test suite via direct MPI execution and reduced timesteps#1331

Draft
sbryngelson wants to merge 17 commits into MFlowCode:master from sbryngelson:threading

Conversation

sbryngelson (Member) commented Mar 20, 2026

Summary

  • Direct MPI execution in test runner: Instead of spawning ./mfc.sh run per test (which re-bootstraps the Python toolchain, re-parses the case file, generates a shell script, runs syscheck, etc.), the test runner now generates .inp files directly from the in-memory parameter dict and calls the MPI launcher on binaries directly.
  • Template-based MPI config: Each .mako template now defines an `mpi_config` dict (binary, flags, env) that both the template and the test runner read from, keeping MPI configuration in one place per system with no hardcoded duplication.
  • Halved timesteps: Reduced Nt from 50 to 25 for all tests, and capped example test timesteps at 25. Reduced IBM STL 3D grid from 99x49x49 to 29x29x29.
  • Regenerated all golden files for the new parameters.

Benchmarks (Phoenix, -j 10, full suite, 551 tests)

|                      | Master  | This PR | Speedup |
|----------------------|---------|---------|---------|
| Wall time            | 29m 26s | 12m 54s | 2.3x    |
| Avg per-test         | 29.88s  | 12.62s  | 2.4x    |
| Simple test overhead | ~12s    | ~2s     | 6x      |
| CPU time             | 188 min | 75 min  | 2.5x    |

Key changes

  • toolchain/mfc/case.py: Added _get_inp_content() (quiet version of get_inp() for thread-safe use)
  • toolchain/mfc/test/case.py: Replaced ./mfc.sh run subprocess with direct .inp generation + mpirun/srun calls. Reads MPI config from templates via _extract_mpi_config().
  • toolchain/templates/*.mako: Each template defines mpi_config dict. Simple templates use new helpers.mpi_cmd(). Complex templates (frontier, santis, tuo) reference config fields directly.
  • toolchain/templates/include/helpers.mako: Added mpi_cmd() helper.
  • toolchain/mfc/test/cases.py: Reduced IBM STL grid, added grid reduction functor.
  • toolchain/mfc/test/case.py: Nt = 50 -> Nt = 25.
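The `_extract_mpi_config()` mechanism can be sketched as follows: anchor on the `mpi_config =` assignment in the template text, match braces to find the end of the dict literal, and parse it with `ast.literal_eval`. This is an assumption-laden sketch of the approach, not the toolchain's exact code; the sample template is made up.

```python
import ast

TEMPLATE = """\
<%! mpi_config = {'binary': 'srun', 'flags': ['--ntasks-per-node', '4'],
                  'env': {'OMP_NUM_THREADS': '1'}} %>
... rest of the .mako template ...
"""

def extract_mpi_config(text):
    """Pull the mpi_config dict literal out of a template body.
    Anchors on the assignment (not usage sites), balances braces to
    handle nested dicts, and tolerates malformed templates."""
    start = text.find("mpi_config =")
    if start == -1:
        return None
    brace = text.find("{", start)
    if brace == -1:
        return None
    depth, end = 0, brace
    for i, ch in enumerate(text[brace:], brace):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                end = i + 1
                break
    try:
        return ast.literal_eval(text[brace:end])
    except (ValueError, SyntaxError):
        return None
```

Because the template itself also reads the same dict, the launcher configuration cannot drift between the batch path and the test path.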

Notes

  • The ./mfc.sh run command is completely unchanged -- this only affects the test path.
  • Modules must be loaded before ./mfc.sh test (CI already does this via source ./mfc.sh load).
  • Adding a new system still requires editing only one .mako file -- the test runner extracts mpi_config automatically.
  • Depends on fix: guard QBMM boundary extrapolation with present(pb_in) #1332 for macOS debug CI (pre-existing QBMM boundary bug).

Test plan

  • Full 1D suite: 156/156 passed
  • Full test suite: 551/551 passed
  • Golden files regenerated and verified
  • CI validation on Phoenix
  • CI validation on Frontier

sbryngelson and others added 7 commits March 20, 2026 12:58
Instead of spawning `./mfc.sh run` per test (which re-bootstraps the
Python toolchain, re-parses the case file, generates a shell script,
runs syscheck, etc.), the test runner now:
- Generates .inp files directly from the in-memory parameter dict
- Calls the MPI launcher on pre_process/simulation binaries directly

Each .mako template now defines a `mpi_config` dict (binary, flags,
env) that both the template and the test runner read from, keeping
MPI configuration in one place per system.
Full 1D suite (156 tests, -j 10): 3m33s -> 1m03s (3.4x faster)
Full test suite (551 tests, -j 10): 29m26s -> 21m27s (27% faster)
Per-test overhead: ~12s -> ~2s for simple tests

- Reduce Nt from 50 to 25 (halves all test timesteps)
- Reduce example test timestep cap from 50 to 25
- Cap IBM STL 3D grid from 99x49x49 to 29x29x29
- Regenerate all golden files
Full suite (551 tests, -j 10): 21m27s -> 12m54s

The coverage cache builder had its own hardcoded MPI launcher detection
that ignored the -c <computer> passthrough. On systems where flux is
required (e.g. Tuolumne), it would silently use mpirun instead.
Now uses _get_mpi_config() and _mpi_cmd() from case.py, consistent
with TestCase.run().

- _extract_mpi_config: use 'mpi_config =' pattern to avoid matching
  usage sites; wrap ast.literal_eval in try/except for malformed
  templates; use find() instead of index() to avoid ValueError
- TestCase.run(): add try/except for TimeoutExpired, SubprocessError,
  OSError around subprocess.run(); add 3600s timeout
- _get_mpi_config: defensive 'ARG("--") or []' for None safety
- test.py: add missing returncode check for --test-all post_process
  path (pre-existing bug)
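The subprocess hardening described in this commit can be sketched like so (function name and return shape are illustrative): wrap `subprocess.run()` so a hung or unlaunchable binary produces a clean test failure instead of an unhandled exception.

```python
import subprocess

def run_stage(cmd, workdir, timeout=3600):
    """Run one test stage (pre_process/simulation) defensively.
    Returns (ok, output); hypothetical helper, not the toolchain's API."""
    try:
        proc = subprocess.run(cmd, cwd=workdir, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return (False, f"timed out after {timeout}s")
    except (subprocess.SubprocessError, OSError) as exc:
        # Covers launch failures, e.g. a missing binary.
        return (False, f"failed to launch: {exc}")
    if proc.returncode != 0:
        return (False, proc.stderr)
    return (True, proc.stdout)
```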

The helpers.mako mpi_cmd hardcoded 'flux run' but tuo.mako's
mpi_config flags already include 'run'. Let flags carry the
subcommand, matching how _mpi_cmd in Python handles it.

post_process calls s_populate_variables_buffers without the optional
pb_in/mv_in arguments, but the QBMM guard (qbmm .and. .not. polytropic)
could still evaluate to true, causing access to an absent optional
argument. This is undefined behavior caught by gfortran -fcheck=bounds.

Add present(pb_in) to all 6 call sites of s_qbmm_extrapolation.

Post_process computes the number of save files from t_stop/t_save.
With t_step_stop=25, the simulation produces fewer saves than the
original t_stop implies, causing 'lustre_N.dat is missing' errors.
Clamp t_stop = t_save so post_process only reads saves that exist.
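The save-count arithmetic behind this clamp can be sketched directly (hypothetical helper name):

```python
def expected_save_indices(t_stop: float, t_save: float) -> list:
    """Save-file indices post_process will look for: 0, 1, ..., t_stop/t_save."""
    return list(range(int(t_stop / t_save) + 1))

# Before the clamp: a case with t_stop = 10 * t_save expects 11 save
# files, but a run truncated at t_step_stop = 25 may produce fewer,
# so post_process reports missing .dat files. After clamping
# t_stop = t_save, only indices 0 and 1 are expected, and those
# always exist.
```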

The test_all path runs post_process as a best-effort smoke test.
On master, post_process failures (e.g. missing restart files for
adaptive-dt examples) are silently ignored. Match this behavior
to avoid surfacing pre-existing post_process issues.

Adaptive-dt examples use t_stop/t_save instead of t_step_stop.
With large adaptive steps, save indices can be non-consecutive,
causing post_process to abort on missing restart files.

Clamp t_stop = t_save so only save indices 0 and 1 are produced
(always consecutive). Regenerate golden files for affected tests.

Also re-add returncode check for --test-all post_process pass,
now that the underlying issue is fixed.

Missed this adaptive-dt example in the previous golden file
regeneration.
@sbryngelson sbryngelson marked this pull request as ready for review March 21, 2026 03:05

codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.73%. Comparing base (6120810) to head (31e47ee).
⚠️ Report is 1 commit behind head on master.

| Files with missing lines         | Patch % | Lines                        |
|----------------------------------|---------|------------------------------|
| src/common/m_boundary_common.fpp | 0.00%   | 0 Missing and 4 partials ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #1331      +/-   ##
==========================================
- Coverage   45.01%   44.73%   -0.28%     
==========================================
  Files          70       70              
  Lines       20562    20590      +28     
  Branches     1962     1962              
==========================================
- Hits         9255     9210      -45     
- Misses      10179    10254      +75     
+ Partials     1128     1126      -2      
```


@sbryngelson sbryngelson marked this pull request as draft March 21, 2026 14:04
Spencer Bryngelson added 3 commits March 21, 2026 14:38
Two root causes for 5 failing 3D Cylindrical tests on Frontier CCE:

1. Post-process segfault (non-viscous tests 301B9153, 128954AD):
   s_axis() declared pb_in/mv_in as non-optional, but they are passed
   as absent optionals from s_populate_variables_buffers in post_process.
   CCE dereferences the null descriptor; gfortran tolerates it. Added
   'optional' attribute and present() guards, matching peer routines
   like s_symmetry and s_periodic.

2. Simulation FPE (viscous tests 07C33719, 939D6718, 09623DE3):
   The direct MPI execution path in the test runner bypasses the batch
   template which sets 'ulimit -s unlimited'. CCE-compiled viscous
   cylindrical binaries overflow the default stack. Set the soft stack
   limit to the hard limit at module load via resource.setrlimit.
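
The stack-limit fix mirrors `ulimit -s unlimited` from inside Python, since the direct MPI path never sources the batch template. A minimal sketch, assuming a Unix host (the function name is illustrative):

```python
import resource

def unlimit_stack():
    """Raise the soft stack limit to the hard limit, so CCE-compiled
    binaries launched by the test runner don't overflow the default
    (often 8 MiB) soft stack."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to hard.
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_STACK)
```

Doing this at module load means every child the test runner spawns inherits the raised limit.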