
Add a separate benchmark comparison pipeline for Python and .NET #15

Merged
sharpninja merged 2 commits into main from copilot/add-benchmarking-separate-pipeline
Mar 21, 2026

Conversation

Copilot AI commented Mar 20, 2026

Description

Adds a dedicated benchmark workflow that runs both implementations against the shared smoke-test fixtures and publishes a single comparison report. The pipeline measures indexing plus all configured query paths so reviewers can see a like-for-like Python vs .NET result set in one place.

Related Issues

Addresses the benchmarking request for:

  • a separate pipeline for Python and .NET benchmarks
  • an apple-to-apples comparison report
  • coverage of indexing and querying flows from the existing test data

Proposed Changes

  • Benchmark workflow

    • Added .github/workflows/benchmark-comparison.yml
    • Runs Python and .NET benchmarks as a standalone pipeline
    • Publishes JSON outputs plus a markdown comparison report artifact
    • Uses the existing smoke fixture configs as the shared benchmark contract
  • Shared benchmark harness

    • Added scripts/benchmark_smoke.py
    • Discovers benchmark cases from tests/fixtures/*/config.json
    • Executes:
      • indexing for each fixture
      • every configured query for each fixture
    • Emits per-implementation JSON results and a consolidated markdown report
  • Comparison report

    • Produces a single benchmark-comparison.md with:
      • per-operation timing
      • per-implementation status
      • side-by-side Python/.NET comparison
      • Python workflow metrics from stats.json when available
      • explicit missing-output reporting, which currently surfaces .NET indexing gaps instead of masking them
  • Fixture-driven scope

    • Reuses the existing smoke fixtures rather than introducing benchmark-only datasets
    • Keeps the benchmark aligned with the same indexing/query coverage already encoded in fixture configs
  • Focused test coverage

    • Added tests/unit/test_benchmark_smoke.py
    • Covers fixture loading, command generation, and report rendering
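
The fixture-discovery step described above could look roughly like this. This is a minimal sketch, not the actual `scripts/benchmark_smoke.py` code: the dataclass fields and the config keys (`queries`, `workflows`, `expected_artifacts`) are assumptions inferred from the unit tests quoted in the review below, and it already includes the `isinstance` filter the review suggests so scalar flags like `"skip_assert": true` don't break artifact collection.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class QueryCase:
    method: str
    query: str


@dataclass
class FixtureCase:
    name: str
    queries: list
    expected_artifacts: tuple


def load_fixture_cases(repo_root: Path, names) -> list:
    """Discover benchmark cases from tests/fixtures/<name>/config.json.

    Hypothetical sketch: the config schema here is assumed, not verified
    against the real fixture files.
    """
    cases = []
    for name in sorted(names):
        config_path = repo_root / "tests" / "fixtures" / name / "config.json"
        config = json.loads(config_path.read_text())
        queries = [QueryCase(q["method"], q["query"]) for q in config.get("queries", [])]
        # Only dict-shaped workflow entries carry artifacts; scalar flags
        # such as "skip_assert": true are skipped rather than crashing.
        expected_artifacts = tuple(
            artifact
            for workflow in config.get("workflows", {}).values()
            if isinstance(workflow, dict)
            for artifact in workflow.get("expected_artifacts", [])
        )
        cases.append(FixtureCase(name, queries, expected_artifacts))
    return cases


# Demo against a throwaway fixture tree (invented content for illustration).
root = Path(tempfile.mkdtemp())
fixture_dir = root / "tests" / "fixtures" / "text"
fixture_dir.mkdir(parents=True)
(fixture_dir / "config.json").write_text(json.dumps({
    "queries": [{"method": "local", "query": "who leads the project?"}],
    "workflows": {
        "skip_assert": True,
        "create_communities": {"expected_artifacts": ["community_reports.csv"]},
    },
}))
cases = load_fixture_cases(root, ["text"])
print(cases[0].name, cases[0].expected_artifacts)
```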

Example of the generated benchmark flow:

uv run python scripts/benchmark_smoke.py run \
  --implementation python \
  --repo-root "$PWD" \
  --output benchmark-python.json

uv run python scripts/benchmark_smoke.py run \
  --implementation dotnet \
  --repo-root "$PWD" \
  --output benchmark-dotnet.json

uv run python scripts/benchmark_smoke.py compare \
  --python-results benchmark-python.json \
  --dotnet-results benchmark-dotnet.json \
  --output benchmark-comparison.md

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

The report is intentionally status-aware, not timing-only. For the current .NET CLI, query timings are recorded, while indexing is reported as missing its expected outputs, so the comparison stays honest and actionable instead of implying feature parity that does not yet exist.
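
The status-aware rendering described above might look something like this. This is a hypothetical sketch, not the actual report code: the row format, status values, and result shape are invented for illustration; the point is only that a missing result is surfaced explicitly rather than dropped from the table.

```python
def render_row(operation: str, results: dict) -> str:
    """Render one markdown table row; results maps implementation -> result dict.

    Hypothetical sketch of status-aware reporting: an implementation with no
    recorded result is shown as "missing output" instead of being omitted.
    """
    cells = [operation]
    for impl in ("python", "dotnet"):
        result = results.get(impl)
        if result is None:
            cells.append("missing output")
        elif result.get("status") == "ok":
            cells.append(f"{result['seconds']:.2f}s")
        else:
            cells.append(result.get("status", "error"))
    return "| " + " | ".join(cells) + " |"


# Example: Python indexing succeeded, .NET indexing produced no outputs.
row = render_row("index:text", {"python": {"status": "ok", "seconds": 12.3}})
print(row)  # | index:text | 12.30s | missing output |
```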


Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Agent-Logs-Url: https://github.com/sharpninja/graphrag/sessions/041ee5aa-dc22-44d5-b2e8-5608209d11ac
Copilot AI changed the title [WIP] Add benchmarking for python and dotnet implementations Add a separate benchmark comparison pipeline for Python and .NET Mar 20, 2026
Copilot AI requested a review from sharpninja March 20, 2026 15:46
@sharpninja sharpninja marked this pull request as ready for review March 21, 2026 14:36
Copilot AI review requested due to automatic review settings March 21, 2026 14:36

@sharpninja sharpninja merged commit 47366ce into main Mar 21, 2026
14 of 25 checks passed
Copilot AI left a comment

Pull request overview

Adds a standalone CI workflow and shared Python harness to benchmark Python vs .NET implementations using the existing smoke-test fixtures, producing a single comparison report artifact.

Changes:

  • Added Benchmark Comparison GitHub Actions workflow to run Python + .NET benchmarks and publish JSON + markdown report artifacts.
  • Added scripts/benchmark_smoke.py to discover fixtures, run indexing + queries, and render a consolidated comparison report.
  • Added unit tests covering fixture loading, command generation, and report rendering.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/workflows/benchmark-comparison.yml | New workflow that builds/runs both implementations and uploads benchmark artifacts. |
| scripts/benchmark_smoke.py | New benchmark harness for fixture discovery, execution, JSON output, and markdown comparison rendering. |
| tests/unit/test_benchmark_smoke.py | Unit tests for key benchmark harness behaviors. |
| .semversioner/next-release/patch-20260320154154809493.json | Semversioner entry documenting the patch-level change. |

    )
    expected_artifacts = tuple(
        artifact
        for workflow in workflow_config.values()

Copilot AI Mar 21, 2026

load_fixture_cases() assumes every workflow_config value is a dict and calls workflow.get(...), but tests/fixtures/azure/config.json contains non-dict entries (e.g., "skip_assert": true). This will raise AttributeError when the benchmark workflow loads all fixtures. Filter to dict values (or explicitly skip known non-workflow keys like azure/skip_assert) before reading expected_artifacts.

Suggested change:
-        for workflow in workflow_config.values()
+        for workflow in workflow_config.values()
+        if isinstance(workflow, dict)
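
The failure mode this comment describes can be reproduced in isolation. This is a minimal, self-contained repro; the config shape is invented for illustration and only mimics a scalar flag sitting next to dict-shaped workflow entries:

```python
workflow_config = {
    "skip_assert": True,  # scalar flag, not a workflow dict
    "run_index": {"expected_artifacts": ["a.csv"]},
}

# Unfiltered iteration calls .get() on the bool and raises AttributeError.
caught = None
try:
    tuple(a for w in workflow_config.values() for a in w.get("expected_artifacts", []))
except AttributeError as exc:
    caught = exc
print("unfiltered raised:", caught)

# Filtering to dict values, as the suggestion does, skips the flag safely.
safe = tuple(
    a
    for w in workflow_config.values()
    if isinstance(w, dict)
    for a in w.get("expected_artifacts", [])
)
print("filtered:", safe)  # ('a.csv',)
```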

Copilot uses AI. Check for mistakes.
Comment on lines +22 to +31
def test_load_fixture_cases_uses_smoke_configs():
    module = load_module()
    repo_root = Path(__file__).resolve().parents[2]

    fixtures = module.load_fixture_cases(repo_root, ["text", "min-csv"])

    assert [fixture.name for fixture in fixtures] == ["min-csv", "text"]
    assert fixtures[0].queries[0].method == "local"
    assert fixtures[1].queries[-1].method == "basic"
    assert "community_reports.csv" in fixtures[1].expected_artifacts

Copilot AI Mar 21, 2026

Unit tests only exercise text/min-csv fixtures, so the non-dict workflow_config values in the azure fixture (e.g., skip_assert) aren’t covered and the current loader bug would slip through. Add a test case that loads the azure fixture (or a minimal config with skip_assert) to ensure fixture parsing works for all fixture shapes.
