Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
355 changes: 355 additions & 0 deletions .github/workflows/analyze-upstream-commit.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,355 @@
name: Analyze Upstream Commit

on:
workflow_dispatch:
inputs:
upstream_commit_sha:
description: 'Upstream commit SHA to analyze (from microsoft/graphrag main)'
required: true
type: string

permissions:
contents: write
pull-requests: write
issues: write

jobs:
analyze-and-pr:
runs-on: ubuntu-latest

steps:
- name: Checkout main branch
uses: actions/checkout@v4
with:
ref: main
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}

- name: Configure git identity
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"

- name: Fetch upstream commit
run: |
git remote add upstream https://github.com/microsoft/graphrag.git
git fetch upstream main --no-tags

- name: Extract commit information
id: commit-info
run: |
SHA="${{ inputs.upstream_commit_sha }}"
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 41, ${{ inputs.upstream_commit_sha }} is interpolated directly into a shell script (SHA="${{ inputs.upstream_commit_sha }}"). Although the SHA is immediately quoted with "$SHA" when used in subsequent git show commands, an attacker-controlled SHA could contain shell metacharacters (e.g., $(...), backticks) that execute before the variable is assigned and quoted. The input should be sanitized or validated to be a valid hex SHA (e.g., using a regex check like [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]) before use.

Suggested change
SHA="${{ inputs.upstream_commit_sha }}"
SHA="${{ inputs.upstream_commit_sha }}"
if ! [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]; then
echo "Invalid upstream_commit_sha: $SHA" >&2
exit 1
fi

Copilot uses AI. Check for mistakes.
SHORT="${SHA:0:8}"

git show "$SHA" --format="%s%n%b" --no-patch \
> /tmp/commit_message.txt 2>/dev/null \
|| echo "Commit ${SHORT}" > /tmp/commit_message.txt

git show "$SHA" --stat --no-patch \
> /tmp/commit_stat.txt 2>/dev/null \
|| echo "(stat unavailable)" > /tmp/commit_stat.txt

# Capture diff for Python and Markdown files only (capped to keep tokens low)
git show "$SHA" -- '*.py' '*.md' \
| head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
|| echo "(diff unavailable)" > /tmp/commit_diff.txt
Comment on lines +53 to +55
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 53, git show "$SHA" -- '*.py' '*.md' is piped to head -c 8000, but the || fallback is applied to the entire pipeline. In bash, the exit code of the pipeline is the exit code of the last command (head), not git show. This means if git show fails, the pipeline may still "succeed" (because head returns 0), and the fallback echo "(diff unavailable)" will NOT be written. The correct approach would be to use set -o pipefail at the top of the shell script, or to check git show separately before piping.

Suggested change
git show "$SHA" -- '*.py' '*.md' \
| head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
|| echo "(diff unavailable)" > /tmp/commit_diff.txt
if git show "$SHA" -- '*.py' '*.md' > /tmp/commit_diff_raw.txt 2>/dev/null; then
head -c 8000 /tmp/commit_diff_raw.txt > /tmp/commit_diff.txt
else
echo "(diff unavailable)" > /tmp/commit_diff.txt
fi

Copilot uses AI. Check for mistakes.

echo "sha=${SHA}" >> "$GITHUB_OUTPUT"
echo "short=${SHORT}" >> "$GITHUB_OUTPUT"
echo "branch=sync/upstream-${SHORT}" >> "$GITHUB_OUTPUT"
Comment on lines +44 to +59
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The git show "$SHA" command in the "Extract commit information" step (line 53) only fetches upstream main but uses git show on an arbitrary SHA provided as input. If the SHA is not reachable from upstream/main (e.g., it was force-pushed away, belongs to a different branch, or was entered incorrectly in a manual workflow_dispatch), the git show commands will fail silently (the 2>/dev/null || fallback catches the error) but the extracted diff will be empty. There is no validation that the SHA actually exists in the fetched upstream ref before proceeding. A SHA validation step (e.g., git cat-file -e "$SHA" after fetching) would prevent creating empty/misleading analysis documents.

Copilot uses AI. Check for mistakes.

- name: Check whether sync branch already exists
id: branch-check
run: |
BRANCH="${{ steps.commit-info.outputs.branch }}"
if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 65 in analyze-upstream-commit.yml, the branch-existence check uses grep -q "$BRANCH" where $BRANCH is an unquoted value from steps.commit-info.outputs.branch. If the branch name contained regex metacharacters (e.g., .), the grep would match unintended patterns. The branch name should be double-quoted: grep -qF "$BRANCH" (using -F for fixed-string matching to avoid any regex interpretation).

Suggested change
if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then
if git ls-remote --heads origin "$BRANCH" | grep -qF "$BRANCH"; then

Copilot uses AI. Check for mistakes.
echo "exists=true" >> "$GITHUB_OUTPUT"
else
echo "exists=false" >> "$GITHUB_OUTPUT"
fi

- name: Analyze commit with AI and generate PR content
if: steps.branch-check.outputs.exists == 'false'
id: analysis
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
python3 << 'PYEOF'
import json
import os
import textwrap
import urllib.request

sha = "${{ inputs.upstream_commit_sha }}"
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream_commit_sha input is interpolated directly into a Python heredoc at line 83: sha = "${{ inputs.upstream_commit_sha }}". A maliciously crafted SHA (e.g., one containing ", \n, or backticks) could escape the Python string literal or inject arbitrary shell/Python code. The SHA should be passed via an environment variable (using env:) and read inside Python with os.environ, rather than being interpolated directly into the script source code.

Copilot uses AI. Check for mistakes.
short = sha[:8]

def read_capped(path, max_bytes=3000):
try:
with open(path) as fh:
return fh.read(max_bytes)
except Exception as read_exc:
print(f"Warning: could not read {path}: {read_exc}")
return ""

commit_msg = read_capped("/tmp/commit_message.txt", 800)
stat = read_capped("/tmp/commit_stat.txt", 2000)
diff = read_capped("/tmp/commit_diff.txt", 4000)

prompt = textwrap.dedent(f"""
You are analyzing an upstream commit from the microsoft/graphrag Python repository.
This fork (sharpninja/graphrag) adds a .NET/C# implementation in `dotnet/` and
extended documentation that mirrors the Python library behavior.

Upstream commit: {short}

Commit message:
{commit_msg}

Changed files (stat):
{stat}

Diff preview (Python/Markdown files only):
{diff}

Analyze what changes are required in the fork's `dotnet/` and `docs/` directories
to keep the .NET implementation and documentation synchronized with this upstream change.

Reply with EXACTLY this format (keep all section headers):

## Summary
<one-paragraph description of what this upstream commit does>

## .NET Changes Required
<bullet list of specific changes needed in dotnet/, or "None required" if not applicable>

## Documentation Changes Required
<bullet list of documentation changes needed, or "None required" if not applicable>

## Priority
HIGH | MEDIUM | LOW — with one-sentence justification

## PR Title
<concise imperative title, e.g. "sync: update X to match upstream Y behavior">

## PR Body
<markdown body (2-4 sentences) for the pull request>
""").strip()
Comment on lines +94 to +136
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AGENTS.md, .clinerules, and custom-instructions/repo/.github/copilot-instructions.md files in this repository instruct AI agents to read a file AGENTS-README-FIRST.yaml in the repo root and POST to an MCP session log endpoint before doing any work. This is a prompt injection / exfiltration vector: a malicious upstream commit message or diff (processed by the AI in this workflow) could contain instructions that exploit these agent procedures. Moreover, the GITHUB_TOKEN is passed to the GitHub Models API at line 138, which could allow a compromised or unintentionally crafted prompt (via commit_msg, stat, or diff — lines 94–96) to exfiltrate the token via the API call, or to instruct the AI to output content that, when written to the repo, later triggers the agent MCP procedures. Input content used in the AI prompt should be carefully sandboxed and not include raw diff content without additional escaping or content moderation.

Copilot uses AI. Check for mistakes.

token = os.environ["GITHUB_TOKEN"]
url = "https://models.inference.ai.azure.com/chat/completions"

payload = {
"model": "gpt-4o-mini",
"messages": [
{
"role": "system",
"content": (
"You are an expert .NET architect helping keep a C# fork "
"in sync with an upstream Python library."
),
},
{"role": "user", "content": prompt},
],
"max_tokens": 1200,
"temperature": 0.2,
}

analysis_text = ""
pr_title = f"sync: apply upstream changes from commit {short}"
pr_body = (
f"Synchronize the `.NET` implementation and documentation with "
f"upstream microsoft/graphrag commit `{short}`."
)

try:
req = urllib.request.Request(
url,
data=json.dumps(payload).encode(),
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {token}",
},
)
with urllib.request.urlopen(req, timeout=90) as resp:
status = resp.status
body = resp.read()
Comment on lines +173 to +175
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The urllib.request.urlopen call at line 173 will raise an HTTPError (which is a subclass of URLError) for non-2xx HTTP responses — resp.status is never actually reached in those cases because the exception is thrown before body is assigned. The status != 200 check at line 176 will only be reached for a 200 response, making it dead code. The correct approach is to catch urllib.error.HTTPError separately and extract the status code from the exception object, or use the response's status attribute from within a try/except around the urlopen call.

Suggested change
with urllib.request.urlopen(req, timeout=90) as resp:
status = resp.status
body = resp.read()
try:
with urllib.request.urlopen(req, timeout=90) as resp:
status = resp.status
body = resp.read()
except urllib.error.HTTPError as http_err:
status = http_err.code
body = http_err.read()

Copilot uses AI. Check for mistakes.
if status != 200:
raise RuntimeError(f"GitHub Models API returned HTTP {status}: {body[:200]}")
data = json.loads(body)
analysis_text = data["choices"][0]["message"]["content"]

# Extract PR Title
if "## PR Title" in analysis_text:
after = analysis_text.split("## PR Title", 1)[1].strip()
title_candidate = after.splitlines()[0].lstrip("#").strip()
if title_candidate:
pr_title = title_candidate[:120]

# Extract PR Body
if "## PR Body" in analysis_text:
body_part = analysis_text.split("## PR Body", 1)[1].strip()
if "##" in body_part:
body_part = body_part.split("##")[0].strip()
if body_part:
pr_body = body_part[:2000]

except Exception as exc:
analysis_text = (
f"Analysis unavailable: {exc}\n\n"
Comment on lines +197 to +198
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The analyze-upstream-commit.yml workflow uses ${{ secrets.GITHUB_TOKEN }} to authenticate against the GitHub Models API at https://models.inference.ai.azure.com. The GITHUB_TOKEN is a short-lived token scoped to the current repository. Depending on the repository's token settings, this token may not have the required permissions to call GitHub Models. If this fails in a run that was dispatched automatically, there is no alerting mechanism — the analysis will silently fall back to a placeholder message (line 197–200), the PR will still be opened, but with no useful content. Consider logging or failing the step explicitly on API auth failures rather than silently creating empty analysis PRs.

Suggested change
analysis_text = (
f"Analysis unavailable: {exc}\n\n"
msg = str(exc)
print(f"GitHub Models API call failed: {msg}")
# If this looks like an authentication/authorization failure,
# fail the step explicitly so we don't create an empty analysis PR.
if "401" in msg or "403" in msg:
print(
"GitHub Models API authentication/authorization appears to have "
"failed (HTTP 401/403). Verify that the token used for "
"GitHub Models access has the required permissions."
)
raise
# For non-auth failures, fall back to a placeholder analysis but keep the workflow running.
analysis_text = (
f"Analysis unavailable: {msg}\n\n"

Copilot uses AI. Check for mistakes.
f"Manual review of upstream commit `{short}` is required."
)

with open("/tmp/analysis.md", "w") as fh:
fh.write(analysis_text)
with open("/tmp/pr_title.txt", "w") as fh:
fh.write(pr_title)
with open("/tmp/pr_body.txt", "w") as fh:
fh.write(pr_body)

print("Analysis complete.")
PYEOF

- name: Create sync branch and commit analysis document
if: steps.branch-check.outputs.exists == 'false'
run: |
SHORT="${{ steps.commit-info.outputs.short }}"
BRANCH="${{ steps.commit-info.outputs.branch }}"

git checkout -b "$BRANCH"
mkdir -p docs/upstream-sync

ANALYSIS_FILE="docs/upstream-sync/upstream-${SHORT}.md"

{
echo "# Upstream Sync Analysis: \`${SHORT}\`"
echo ""
echo "**Upstream Commit:** \`${{ inputs.upstream_commit_sha }}\` "
echo "**Upstream Repository:** [microsoft/graphrag](https://github.com/microsoft/graphrag/commit/${{ inputs.upstream_commit_sha }}) "
echo "**Analyzed:** $(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo ""
echo "---"
echo ""
cat /tmp/analysis.md
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The analyze-upstream-commit.yml workflow writes analysis files to /tmp (lines 202–207) in the "Analyze commit with AI" step, but those /tmp files are then read in two subsequent steps (lines 250–253 and 232). If the "Analyze commit with AI" step is skipped (when steps.branch-check.outputs.exists == 'false' is false), the subsequent steps that read from /tmp/analysis.md, /tmp/pr_title.txt, and /tmp/pr_body.txt will fail because those files won't exist. However, those subsequent steps are also guarded by if: steps.branch-check.outputs.exists == 'false', so this is consistent. But there is no fallback to ensure the files always exist before the "Create sync branch" step runs if the Python step fails mid-way — in that case, the commit step will fail trying to cat /tmp/analysis.md. The branch commit step at line 232 should handle a missing /tmp/analysis.md gracefully (e.g., use cat /tmp/analysis.md 2>/dev/null || echo "(analysis unavailable)").

Suggested change
cat /tmp/analysis.md
cat /tmp/analysis.md 2>/dev/null || echo "(analysis unavailable)"

Copilot uses AI. Check for mistakes.
} > "$ANALYSIS_FILE"

git add "$ANALYSIS_FILE"
git commit -m "docs: upstream sync analysis for commit ${SHORT}"
git push origin "$BRANCH"

- name: Create pull request
if: steps.branch-check.outputs.exists == 'false'
id: create-pr
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const sha = '${{ inputs.upstream_commit_sha }}';
const short = sha.substring(0, 8);
const branch = '${{ steps.commit-info.outputs.branch }}';
Comment on lines +243 to +248
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At lines 246 and 248, ${{ inputs.upstream_commit_sha }} and ${{ steps.commit-info.outputs.branch }} are interpolated directly into the JavaScript source using single-quoted string literals. If the SHA or branch name contains a single quote or other JS metacharacter, the script can be broken or exploited. These values should instead be read from environment variables (set via the env: key of the step) and accessed through process.env inside the script.

Suggested change
with:
script: |
const fs = require('fs');
const sha = '${{ inputs.upstream_commit_sha }}';
const short = sha.substring(0, 8);
const branch = '${{ steps.commit-info.outputs.branch }}';
env:
UPSTREAM_SHA: ${{ inputs.upstream_commit_sha }}
SYNC_BRANCH: ${{ steps.commit-info.outputs.branch }}
with:
script: |
const fs = require('fs');
const sha = process.env.UPSTREAM_SHA;
const short = sha.substring(0, 8);
const branch = process.env.SYNC_BRANCH;

Copilot uses AI. Check for mistakes.

const prTitle = fs.readFileSync('/tmp/pr_title.txt', 'utf8').trim()
|| `sync: apply upstream changes from commit ${short}`;
const prBodyFromAI = fs.readFileSync('/tmp/pr_body.txt', 'utf8').trim();
const analysis = fs.readFileSync('/tmp/analysis.md', 'utf8');

const prBody = [
`## Upstream Sync: [\`${short}\`](https://github.com/microsoft/graphrag/commit/${sha})`,
'',
prBodyFromAI,
'',
'---',
'',
'## Agent Analysis',
'',
analysis.substring(0, 5000),
'',
'---',
'*Automatically created by the [Analyze Upstream Commit](../../actions/workflows/analyze-upstream-commit.yml) workflow.*',
].join('\n');

// Ensure the upstream-sync label exists
try {
await github.rest.issues.getLabel({
owner: context.repo.owner,
repo: context.repo.repo,
name: 'upstream-sync',
});
} catch {
await github.rest.issues.createLabel({
owner: context.repo.owner,
repo: context.repo.repo,
name: 'upstream-sync',
color: '0e8a16',
description: 'Tracks upstream synchronization changes from microsoft/graphrag',
});
}

const pr = await github.rest.pulls.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: prTitle,
body: prBody,
head: branch,
base: 'main',
draft: false,
});

core.setOutput('pr_number', pr.data.number.toString());
core.setOutput('pr_node_id', pr.data.node_id);
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The analyze-upstream-commit.yml workflow has permissions: contents: write and pull-requests: write at the job level, but it is triggered only by workflow_dispatch. When dispatched by sync-incoming.yml (which runs on schedule), the dispatched run inherits its own GITHUB_TOKEN permissions for the main ref. However, the analyze-upstream-commit.yml workflow also lacks a pull-requests: write permission that would be needed if auto-merge via GraphQL is called with the default token. Separately, the pr_node_id output set at line 298 is never actually used — the auto-merge step fetches the PR again via REST to get node_id (line 334). This is redundant and the pr_node_id output can be removed.

Suggested change
core.setOutput('pr_node_id', pr.data.node_id);

Copilot uses AI. Check for mistakes.
console.log(`Created PR #${pr.data.number}: ${pr.data.html_url}`);

await github.rest.issues.addLabels({
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: pr.data.number,
labels: ['upstream-sync'],
}).catch(e => console.log('Label warning:', e.status, e.message));

- name: Enable auto-merge on pull request
if: steps.branch-check.outputs.exists == 'false' && steps.create-pr.outputs.pr_number != ''
uses: actions/github-script@v7
with:
script: |
const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);
Comment on lines +311 to +313
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 313, ${{ steps.create-pr.outputs.pr_number }} is interpolated directly into JavaScript. While pr_number is set to pr.data.number.toString() by an earlier step and is unlikely to be attacker-controlled, the pattern of direct interpolation into script strings is unsafe and inconsistent with GitHub's recommended practice. This value should be passed via an environment variable and read via process.env inside the script.

Suggested change
with:
script: |
const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);
env:
PR_NUMBER: ${{ steps.create-pr.outputs.pr_number }}
with:
script: |
const prNumber = parseInt(process.env.PR_NUMBER || '', 10);

Copilot uses AI. Check for mistakes.
if (!prNumber) return;

try {
// Prefer GraphQL enablePullRequestAutoMerge so the PR merges automatically
// once all required status checks pass and there are no conflicts.
const { data: pr } = await github.rest.pulls.get({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: prNumber,
});

await github.graphql(`
mutation EnableAutoMerge($pullRequestId: ID!) {
enablePullRequestAutoMerge(input: {
pullRequestId: $pullRequestId
mergeMethod: SQUASH
}) {
pullRequest { autoMergeRequest { enabledAt } }
}
}
`, { pullRequestId: pr.node_id });

console.log(`Auto-merge enabled for PR #${prNumber}`);
} catch (autoMergeErr) {
console.log('Auto-merge not available — falling back to direct merge:', autoMergeErr.message);

// If auto-merge is not supported (e.g. no branch-protection rules),
// attempt a direct merge. This succeeds only when there are no conflicts.
try {
await github.rest.pulls.merge({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: prNumber,
merge_method: 'squash',
});
console.log(`PR #${prNumber} merged directly.`);
} catch (mergeErr) {
console.log(
`Direct merge skipped (conflicts or required checks pending): ${mergeErr.message}`
);
}
Comment on lines +338 to +354
Copy link

Copilot AI Feb 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The analyze-upstream-commit.yml workflow does not have statuses: write or any other permission for enabling auto-merge. In repositories with branch protection requiring status checks, the auto-merge GraphQL mutation will succeed but the PR will only merge once all required checks pass. The fallback direct merge may bypass required status checks if branch protections are not configured. This could cause AI-generated sync PRs to be merged without any CI validation — e.g., without the dotnet-ci, python-checks, or spellcheck workflows passing. Consider removing the direct-merge fallback and relying solely on auto-merge + branch protection.

Suggested change
console.log('Auto-merge not available — falling back to direct merge:', autoMergeErr.message);
// If auto-merge is not supported (e.g. no branch-protection rules),
// attempt a direct merge. This succeeds only when there are no conflicts.
try {
await github.rest.pulls.merge({
owner: context.repo.owner,
repo: context.repo.repo,
pull_number: prNumber,
merge_method: 'squash',
});
console.log(`PR #${prNumber} merged directly.`);
} catch (mergeErr) {
console.log(
`Direct merge skipped (conflicts or required checks pending): ${mergeErr.message}`
);
}
// If auto-merge is not available (e.g. no branch-protection rules),
// leave the PR open for manual review and merging.
console.log('Auto-merge could not be enabled:', autoMergeErr.message);

Copilot uses AI. Check for mistakes.
}
Loading
Loading