test: 2.3x faster test suite via direct MPI execution and reduced timesteps#1331
Draft
sbryngelson wants to merge 17 commits intoMFlowCode:masterfrom
Draft
test: 2.3x faster test suite via direct MPI execution and reduced timesteps#1331sbryngelson wants to merge 17 commits intoMFlowCode:masterfrom
sbryngelson wants to merge 17 commits intoMFlowCode:masterfrom
Conversation
Instead of spawning `./mfc.sh run` per test (which re-bootstraps the Python toolchain, re-parses the case file, generates a shell script, runs syscheck, etc.), the test runner now: - Generates .inp files directly from the in-memory parameter dict - Calls the MPI launcher on pre_process/simulation binaries directly Each .mako template now defines a `mpi_config` dict (binary, flags, env) that both the template and the test runner read from, keeping MPI configuration in one place per system. Full 1D suite (156 tests, -j 10): 3m33s -> 1m03s (3.4x faster) Full test suite (551 tests, -j 10): 29m26s -> 21m27s (27% faster) Per-test overhead: ~12s -> ~2s for simple tests
- Reduce Nt from 50 to 25 (halves all test timesteps) - Reduce example test timestep cap from 50 to 25 - Cap IBM STL 3D grid from 99x49x49 to 29x29x29 - Regenerate all golden files Full suite (551 tests, -j 10): 21m27s -> 12m54s
The coverage cache builder had its own hardcoded MPI launcher detection that ignored the -c <computer> passthrough. On systems where flux is required (e.g. Tuolumne), it would silently use mpirun instead. Now uses _get_mpi_config() and _mpi_cmd() from case.py, consistent with TestCase.run().
- _extract_mpi_config: use 'mpi_config =' pattern to avoid matching
usage sites; wrap ast.literal_eval in try/except for malformed
templates; use find() instead of index() to avoid ValueError
- TestCase.run(): add try/except for TimeoutExpired, SubprocessError,
OSError around subprocess.run(); add 3600s timeout
- _get_mpi_config: defensive 'ARG("--") or []' for None safety
- test.py: add missing returncode check for --test-all post_process
path (pre-existing bug)
The helpers.mako mpi_cmd hardcoded 'flux run' but tuo.mako's mpi_config flags already include 'run'. Let flags carry the subcommand, matching how _mpi_cmd in Python handles it.
post_process calls s_populate_variables_buffers without the optional pb_in/mv_in arguments, but the QBMM guard (qbmm .and. .not. polytropic) could still evaluate to true, causing access to an absent optional argument. This is undefined behavior caught by gfortran -fcheck=bounds. Add present(pb_in) to all 6 call sites of s_qbmm_extrapolation.
Post_process computes the number of save files from t_stop/t_save. With t_step_stop=25, the simulation produces fewer saves than the original t_stop implies, causing 'lustre_N.dat is missing' errors. Clamp t_stop = t_save so post_process only reads saves that exist.
The test_all path runs post_process as a best-effort smoke test. On master, post_process failures (e.g. missing restart files for adaptive-dt examples) are silently ignored. Match this behavior to avoid surfacing pre-existing post_process issues.
Adaptive-dt examples use t_stop/t_save instead of t_step_stop. With large adaptive steps, save indices can be non-consecutive, causing post_process to abort on missing restart files. Clamp t_stop = t_save so only save indices 0 and 1 are produced (always consecutive). Regenerate golden files for affected tests. Also re-add returncode check for --test-all post_process pass, now that the underlying issue is fixed.
Missed this adaptive-dt example in the previous golden file regeneration.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1331 +/- ##
==========================================
- Coverage 45.01% 44.73% -0.28%
==========================================
Files 70 70
Lines 20562 20590 +28
Branches 1962 1962
==========================================
- Hits 9255 9210 -45
- Misses 10179 10254 +75
+ Partials 1128 1126 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
added 3 commits
March 21, 2026 14:38
Two root causes for 5 failing 3D Cylindrical tests on Frontier CCE: 1. Post-process segfault (non-viscous tests 301B9153, 128954AD): s_axis() declared pb_in/mv_in as non-optional, but they are passed as absent optionals from s_populate_variables_buffers in post_process. CCE dereferences the null descriptor; gfortran tolerates it. Added 'optional' attribute and present() guards, matching peer routines like s_symmetry and s_periodic. 2. Simulation FPE (viscous tests 07C33719, 939D6718, 09623DE3): The direct MPI execution path in the test runner bypasses the batch template which sets 'ulimit -s unlimited'. CCE-compiled viscous cylindrical binaries overflow the default stack. Set the soft stack limit to the hard limit at module load via resource.setrlimit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
./mfc.sh runper test (which re-bootstraps the Python toolchain, re-parses the case file, generates a shell script, runs syscheck, etc.), the test runner now generates.inpfiles directly from the in-memory parameter dict and calls the MPI launcher on binaries directly..makotemplate now defines ampi_configdict (binary, flags, env) that both the template and the test runner read from, keeping MPI configuration in one place per system. No hardcoded duplication.Ntfrom 50 to 25 for all tests, and capped example test timesteps at 25. Reduced IBM STL 3D grid from 99x49x49 to 29x29x29.Benchmarks (Phoenix, -j 10, full suite, 551 tests)
Key changes
toolchain/mfc/case.py: Added_get_inp_content()(quiet version ofget_inp()for thread-safe use)toolchain/mfc/test/case.py: Replaced./mfc.sh runsubprocess with direct.inpgeneration +mpirun/sruncalls. Reads MPI config from templates via_extract_mpi_config().toolchain/templates/*.mako: Each template definesmpi_configdict. Simple templates use newhelpers.mpi_cmd(). Complex templates (frontier, santis, tuo) reference config fields directly.toolchain/templates/include/helpers.mako: Addedmpi_cmd()helper.toolchain/mfc/test/cases.py: Reduced IBM STL grid, added grid reduction functor.toolchain/mfc/test/case.py:Nt = 50->Nt = 25.Notes
./mfc.sh runcommand is completely unchanged -- this only affects the test path../mfc.sh test(CI already does this viasource ./mfc.sh load)..makofile -- the test runner extractsmpi_configautomatically.Test plan