-
Notifications
You must be signed in to change notification settings - Fork 0
feat: upstream sync pipeline with per-commit AI analysis #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,355 @@ | ||||||||||||||||||||||||||||||||||||||||||
| name: Analyze Upstream Commit | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| on: | ||||||||||||||||||||||||||||||||||||||||||
| workflow_dispatch: | ||||||||||||||||||||||||||||||||||||||||||
| inputs: | ||||||||||||||||||||||||||||||||||||||||||
| upstream_commit_sha: | ||||||||||||||||||||||||||||||||||||||||||
| description: 'Upstream commit SHA to analyze (from microsoft/graphrag main)' | ||||||||||||||||||||||||||||||||||||||||||
| required: true | ||||||||||||||||||||||||||||||||||||||||||
| type: string | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| permissions: | ||||||||||||||||||||||||||||||||||||||||||
| contents: write | ||||||||||||||||||||||||||||||||||||||||||
| pull-requests: write | ||||||||||||||||||||||||||||||||||||||||||
| issues: write | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| jobs: | ||||||||||||||||||||||||||||||||||||||||||
| analyze-and-pr: | ||||||||||||||||||||||||||||||||||||||||||
| runs-on: ubuntu-latest | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| steps: | ||||||||||||||||||||||||||||||||||||||||||
| - name: Checkout main branch | ||||||||||||||||||||||||||||||||||||||||||
| uses: actions/checkout@v4 | ||||||||||||||||||||||||||||||||||||||||||
| with: | ||||||||||||||||||||||||||||||||||||||||||
| ref: main | ||||||||||||||||||||||||||||||||||||||||||
| fetch-depth: 0 | ||||||||||||||||||||||||||||||||||||||||||
| token: ${{ secrets.GITHUB_TOKEN }} | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| - name: Configure git identity | ||||||||||||||||||||||||||||||||||||||||||
| run: | | ||||||||||||||||||||||||||||||||||||||||||
| git config user.name "github-actions[bot]" | ||||||||||||||||||||||||||||||||||||||||||
| git config user.email "github-actions[bot]@users.noreply.github.com" | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| - name: Fetch upstream commit | ||||||||||||||||||||||||||||||||||||||||||
| run: | | ||||||||||||||||||||||||||||||||||||||||||
| git remote add upstream https://github.com/microsoft/graphrag.git | ||||||||||||||||||||||||||||||||||||||||||
| git fetch upstream main --no-tags | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| - name: Extract commit information | ||||||||||||||||||||||||||||||||||||||||||
| id: commit-info | ||||||||||||||||||||||||||||||||||||||||||
| run: | | ||||||||||||||||||||||||||||||||||||||||||
| SHA="${{ inputs.upstream_commit_sha }}" | ||||||||||||||||||||||||||||||||||||||||||
| SHORT="${SHA:0:8}" | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| git show "$SHA" --format="%s%n%b" --no-patch \ | ||||||||||||||||||||||||||||||||||||||||||
| > /tmp/commit_message.txt 2>/dev/null \ | ||||||||||||||||||||||||||||||||||||||||||
| || echo "Commit ${SHORT}" > /tmp/commit_message.txt | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| git show "$SHA" --stat --no-patch \ | ||||||||||||||||||||||||||||||||||||||||||
| > /tmp/commit_stat.txt 2>/dev/null \ | ||||||||||||||||||||||||||||||||||||||||||
| || echo "(stat unavailable)" > /tmp/commit_stat.txt | ||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||
| # Capture diff for Python and Markdown files only (capped to keep tokens low) | ||||||||||||||||||||||||||||||||||||||||||
| git show "$SHA" -- '*.py' '*.md' \ | ||||||||||||||||||||||||||||||||||||||||||
| | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \ | ||||||||||||||||||||||||||||||||||||||||||
| || echo "(diff unavailable)" > /tmp/commit_diff.txt | ||||||||||||||||||||||||||||||||||||||||||
|
Comment on lines
+53
to
+55
|
||||||||||||||||||||||||||||||||||||||||||
| git show "$SHA" -- '*.py' '*.md' \ | |
| | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \ | |
| || echo "(diff unavailable)" > /tmp/commit_diff.txt | |
| if git show "$SHA" -- '*.py' '*.md' > /tmp/commit_diff_raw.txt 2>/dev/null; then | |
| head -c 8000 /tmp/commit_diff_raw.txt > /tmp/commit_diff.txt | |
| else | |
| echo "(diff unavailable)" > /tmp/commit_diff.txt | |
| fi |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The git show "$SHA" command in the "Extract commit information" step (line 53) only fetches upstream main but uses git show on an arbitrary SHA provided as input. If the SHA is not reachable from upstream/main (e.g., it was force-pushed away, belongs to a different branch, or was entered incorrectly in a manual workflow_dispatch), the git show commands will fail silently (the 2>/dev/null || fallback catches the error) but the extracted diff will be empty. There is no validation that the SHA actually exists in the fetched upstream ref before proceeding. A SHA validation step (e.g., git cat-file -e "$SHA" after fetching) would prevent creating empty/misleading analysis documents.
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At line 65 in analyze-upstream-commit.yml, the branch-existence check uses grep -q "$BRANCH" where $BRANCH is an unquoted value from steps.commit-info.outputs.branch. If the branch name contained regex metacharacters (e.g., .), the grep would match unintended patterns. The branch name should be double-quoted: grep -qF "$BRANCH" (using -F for fixed-string matching to avoid any regex interpretation).
| if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then | |
| if git ls-remote --heads origin "$BRANCH" | grep -qF "$BRANCH"; then |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upstream_commit_sha input is interpolated directly into a Python heredoc at line 83: sha = "${{ inputs.upstream_commit_sha }}". A maliciously crafted SHA (e.g., one containing ", \n, or backticks) could escape the Python string literal or inject arbitrary shell/Python code. The SHA should be passed via an environment variable (using env:) and read inside Python with os.environ, rather than being interpolated directly into the script source code.
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AGENTS.md, .clinerules, and custom-instructions/repo/.github/copilot-instructions.md files in this repository instruct AI agents to read a file AGENTS-README-FIRST.yaml in the repo root and POST to an MCP session log endpoint before doing any work. This is a prompt injection / exfiltration vector: a malicious upstream commit message or diff (processed by the AI in this workflow) could contain instructions that exploit these agent procedures. Moreover, the GITHUB_TOKEN is passed to the GitHub Models API at line 138, which could allow a compromised or unintentionally crafted prompt (via commit_msg, stat, or diff — lines 94–96) to exfiltrate the token via the API call, or to instruct the AI to output content that, when written to the repo, later triggers the agent MCP procedures. Input content used in the AI prompt should be carefully sandboxed and not include raw diff content without additional escaping or content moderation.
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The urllib.request.urlopen call at line 173 will raise an HTTPError (which is a subclass of URLError) for non-2xx HTTP responses — resp.status is never actually reached in those cases because the exception is thrown before body is assigned. The status != 200 check at line 176 will only be reached for a 200 response, making it dead code. The correct approach is to catch urllib.error.HTTPError separately and extract the status code from the exception object, or use the response's status attribute from within a try/except around the urlopen call.
| with urllib.request.urlopen(req, timeout=90) as resp: | |
| status = resp.status | |
| body = resp.read() | |
| try: | |
| with urllib.request.urlopen(req, timeout=90) as resp: | |
| status = resp.status | |
| body = resp.read() | |
| except urllib.error.HTTPError as http_err: | |
| status = http_err.code | |
| body = http_err.read() |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The analyze-upstream-commit.yml workflow uses ${{ secrets.GITHUB_TOKEN }} to authenticate against the GitHub Models API at https://models.inference.ai.azure.com. The GITHUB_TOKEN is a short-lived token scoped to the current repository. Depending on the repository's token settings, this token may not have the required permissions to call GitHub Models. If this fails in a run that was dispatched automatically, there is no alerting mechanism — the analysis will silently fall back to a placeholder message (line 197–200), the PR will still be opened, but with no useful content. Consider logging or failing the step explicitly on API auth failures rather than silently creating empty analysis PRs.
| analysis_text = ( | |
| f"Analysis unavailable: {exc}\n\n" | |
| msg = str(exc) | |
| print(f"GitHub Models API call failed: {msg}") | |
| # If this looks like an authentication/authorization failure, | |
| # fail the step explicitly so we don't create an empty analysis PR. | |
| if "401" in msg or "403" in msg: | |
| print( | |
| "GitHub Models API authentication/authorization appears to have " | |
| "failed (HTTP 401/403). Verify that the token used for " | |
| "GitHub Models access has the required permissions." | |
| ) | |
| raise | |
| # For non-auth failures, fall back to a placeholder analysis but keep the workflow running. | |
| analysis_text = ( | |
| f"Analysis unavailable: {msg}\n\n" |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The analyze-upstream-commit.yml workflow writes analysis files to /tmp (lines 202–207) in the "Analyze commit with AI" step, but those /tmp files are then read in two subsequent steps (lines 250–253 and 232). If the "Analyze commit with AI" step is skipped (when steps.branch-check.outputs.exists == 'false' is false), the subsequent steps that read from /tmp/analysis.md, /tmp/pr_title.txt, and /tmp/pr_body.txt will fail because those files won't exist. However, those subsequent steps are also guarded by if: steps.branch-check.outputs.exists == 'false', so this is consistent. But there is no fallback to ensure the files always exist before the "Create sync branch" step runs if the Python step fails mid-way — in that case, the commit step will fail trying to cat /tmp/analysis.md. The branch commit step at line 232 should handle a missing /tmp/analysis.md gracefully (e.g., use cat /tmp/analysis.md 2>/dev/null || echo "(analysis unavailable)").
| cat /tmp/analysis.md | |
| cat /tmp/analysis.md 2>/dev/null || echo "(analysis unavailable)" |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At lines 246 and 248, ${{ inputs.upstream_commit_sha }} and ${{ steps.commit-info.outputs.branch }} are interpolated directly into the JavaScript source using single-quoted string literals. If the SHA or branch name contains a single quote or other JS metacharacter, the script can be broken or exploited. These values should instead be read from environment variables (set via the env: key of the step) and accessed through process.env inside the script.
| with: | |
| script: | | |
| const fs = require('fs'); | |
| const sha = '${{ inputs.upstream_commit_sha }}'; | |
| const short = sha.substring(0, 8); | |
| const branch = '${{ steps.commit-info.outputs.branch }}'; | |
| env: | |
| UPSTREAM_SHA: ${{ inputs.upstream_commit_sha }} | |
| SYNC_BRANCH: ${{ steps.commit-info.outputs.branch }} | |
| with: | |
| script: | | |
| const fs = require('fs'); | |
| const sha = process.env.UPSTREAM_SHA; | |
| const short = sha.substring(0, 8); | |
| const branch = process.env.SYNC_BRANCH; |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The analyze-upstream-commit.yml workflow has permissions: contents: write and pull-requests: write at the job level, but it is triggered only by workflow_dispatch. When dispatched by sync-incoming.yml (which runs on schedule), the dispatched run inherits its own GITHUB_TOKEN permissions for the main ref. However, the analyze-upstream-commit.yml workflow also lacks a pull-requests: write permission that would be needed if auto-merge via GraphQL is called with the default token. Separately, the pr_node_id output set at line 298 is never actually used — the auto-merge step fetches the PR again via REST to get node_id (line 334). This is redundant and the pr_node_id output can be removed.
| core.setOutput('pr_node_id', pr.data.node_id); |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At line 313, ${{ steps.create-pr.outputs.pr_number }} is interpolated directly into JavaScript. While pr_number is set to pr.data.number.toString() by an earlier step and is unlikely to be attacker-controlled, the pattern of direct interpolation into script strings is unsafe and inconsistent with GitHub's recommended practice. This value should be passed via an environment variable and read via process.env inside the script.
| with: | |
| script: | | |
| const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10); | |
| env: | |
| PR_NUMBER: ${{ steps.create-pr.outputs.pr_number }} | |
| with: | |
| script: | | |
| const prNumber = parseInt(process.env.PR_NUMBER || '', 10); |
Copilot
AI
Feb 28, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The analyze-upstream-commit.yml workflow does not have statuses: write or any other permission for enabling auto-merge. In repositories with branch protection requiring status checks, the auto-merge GraphQL mutation will succeed but the PR will only merge once all required checks pass. The fallback direct merge may bypass required status checks if branch protections are not configured. This could cause AI-generated sync PRs to be merged without any CI validation — e.g., without the dotnet-ci, python-checks, or spellcheck workflows passing. Consider removing the direct-merge fallback and relying solely on auto-merge + branch protection.
| console.log('Auto-merge not available — falling back to direct merge:', autoMergeErr.message); | |
| // If auto-merge is not supported (e.g. no branch-protection rules), | |
| // attempt a direct merge. This succeeds only when there are no conflicts. | |
| try { | |
| await github.rest.pulls.merge({ | |
| owner: context.repo.owner, | |
| repo: context.repo.repo, | |
| pull_number: prNumber, | |
| merge_method: 'squash', | |
| }); | |
| console.log(`PR #${prNumber} merged directly.`); | |
| } catch (mergeErr) { | |
| console.log( | |
| `Direct merge skipped (conflicts or required checks pending): ${mergeErr.message}` | |
| ); | |
| } | |
| // If auto-merge is not available (e.g. no branch-protection rules), | |
| // leave the PR open for manual review and merging. | |
| console.log('Auto-merge could not be enabled:', autoMergeErr.message); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At line 41,
${{ inputs.upstream_commit_sha }}is interpolated directly into a shell script (SHA="${{ inputs.upstream_commit_sha }}"). Although the SHA is immediately quoted with"$SHA"when used in subsequentgit showcommands, an attacker-controlled SHA could contain shell metacharacters (e.g.,$(...), backticks) that execute before the variable is assigned and quoted. The input should be sanitized or validated to be a valid hex SHA (e.g., using a regex check like[[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]) before use.