Skip to content

[RLC-10] Rebase Custom Changes to rlc-10/6.12.0-124.45.1.el10_1#1004

Open
PlaidCat wants to merge 26 commits intorlc-10/6.12.0-124.45.1.el10_1from
{jmaple}_rlc-10/6.12.0-124.45.1.el10_1
Open

[RLC-10] Rebase Custom Changes to rlc-10/6.12.0-124.45.1.el10_1#1004
PlaidCat wants to merge 26 commits intorlc-10/6.12.0-124.45.1.el10_1from
{jmaple}_rlc-10/6.12.0-124.45.1.el10_1

Conversation

@PlaidCat
Copy link
Collaborator

https://ciqinc.atlassian.net/browse/KERNEL-756

Update process (This kernel CentOS base for 6.12.0-124.45.1.el10_1)

  • Rolling Release Rebase Process
  • Create rlc-10/6.12.0-124.45.1.el10_1 branch from rocky10_1
  • Cherry-pick all code from previous branch rlc-10/6.12.0-124.40.1.el10_1 into new branch (skipping unneeded code)
    • Fix conflicts as they arise
  • Build and Test

Rebase Log

Already on 'rlc-10/6.12.0-124.40.1.el10_1'
Already on 'jmaple_rlc-10/6.12.0-124.45.1.el10_1'
[rolling release update] Rolling Product:  rlc-10
[rolling release update] Checking out branch:  rlc-10/6.12.0-124.40.1.el10_1
[rolling release update] Gathering all the RESF kernel Tags
[rolling release update] Found 14 RESF kernel tags
[rolling release update] Checking out branch:  rocky10_1
[rolling release update] Gathering all the RESF kernel Tags
[rolling release update] Found 15 RESF kernel tags
[rolling release update] Common tag sha:  b'16ddd0825bd1'
"16ddd0825bd1a7237364f06fae356660b8837e07 Rebuild rocky10_1 with kernel-6.12.0-124.40.1.el10_1"
[rolling release update] Checking for FIPS protected changes between the common tag and HEAD
[rolling release update] Checking for FIPS protected changes
[rolling release update] Getting SHAS 16ddd0825bd1..HEAD
[rolling release update] Number of commits to check:  55
[rolling release update] Checking modifications of shas
[rolling release update] Checked 5 of 55 commits
[rolling release update] Checked 10 of 55 commits
[rolling release update] Checked 15 of 55 commits
[rolling release update] Checked 20 of 55 commits
[rolling release update] Checked 25 of 55 commits
[rolling release update] Checked 30 of 55 commits
[rolling release update] Checked 35 of 55 commits
[rolling release update] Checked 40 of 55 commits
[rolling release update] Checked 45 of 55 commits
[rolling release update] Checked 50 of 55 commits
[rolling release update] Checked 55 of 55 commits
[rolling release update] 0 of 55 commits have FIPS protected changes
[rolling release update] Checking out old rolling branch:  rlc-10/6.12.0-124.40.1.el10_1
[rolling release update] Finding the CIQ Kernel and Associated Upstream commits between the last resf tag and HEAD
[rolling release update] Getting SHAS 16ddd0825bd1..HEAD
[rolling release update] Last RESF tag sha:  b'16ddd0825bd1'
[rolling release update] Total commits in old branch: 26
[rolling release update] Checking out new base branch:  rocky10_1
[rolling release update] Finding the kernel version for the new rolling release
[rolling release update] New Branch to create: rlc-10/6.12.0-124.45.1.el10_1
[rolling release update] Creating new branch: rlc-10/6.12.0-124.45.1.el10_1
[rolling release update] Creating new branch for PR:  jmaple_rlc-10/6.12.0-124.45.1.el10_1
[rolling release update] Creating Map of all new commits from last rolling release fork
[rolling release update] Total commits in new branch: 54
[rolling release update] Checking if any of the commits from the old rolling release are already present in the new base branch
[rolling release update] Found 0 duplicate commits to remove
[rolling release update] Applying 26 remaining commits to the new branch
  [1/26] c9c7cb9604a9 tools: hv: Enable debug logs for hv_kvp_daemon
  [2/26] 35880816e479 RDMA/mana_ib: Add device statistics support
  [3/26] 618761ae9ee6 PCI/MSI: Export pci_msix_prepare_desc() for dynamic MSI-X allocations
  [4/26] 912141ce1036 PCI: hv: Allow dynamic MSI-X vector allocation
  [5/26] 3b6518a6584c net: mana: explain irq_setup() algorithm
  [6/26] 0040782e4895 net: mana: Allow irq_setup() to skip cpus for affinity
  [7/26] fdd3a51e6c87 net: mana: Allocate MSI-X vectors dynamically
  [8/26] 87f59f15f2a4 net: mana: Add support for net_shaper_ops
  [9/26] a6a687f694c3 net: mana: Add speed support in mana_get_link_ksettings
  [10/26] a9c61ac47f8c net: mana: Handle unsupported HWC commands
  [11/26] 03a2f21c06d1 net: mana: Fix build errors when CONFIG_NET_SHAPER is disabled
  [12/26] c63173a25b91 RDMA/mana_ib: add additional port counters
  [13/26] 0450c13c92f4 RDMA/mana_ib: Drain send wrs of GSI QP
  [14/26] b63664e6c8f1 net: hv_netvsc: fix loss of early receive events from host during channel open.
  [15/26] 80ea00d9c520 net: mana: Reduce waiting time if HWC not responding
  [16/26] 0c70bf6f0172 RDMA/mana_ib: Extend modify QP
  [17/26] ad4e9448861e scsi: storvsc: Prefer returning channel with the same CPU as on the I/O issuing CPU
  [18/26] 9e86095fda9f net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.
  [19/26] e4f84b30ecc2 dcache: export shrink_dentry_list() and add new helper d_dispose_if_unused()
  [20/26] 54bae277e28b idpf: fix a race in txq wakeup
  [21/26] f51c7be2c535 idpf: add support for Tx refillqs in flow scheduling mode
  [22/26] c6e0b6b0326e idpf: improve when to set RE bit logic
  [23/26] ca94a60dbe10 idpf: simplify and fix splitq Tx packet rollback error path
  [24/26] cbfcfd9b9f7e idpf: replace flow scheduling buffer ring with buffer pool
  [25/26] 561a71676f43 idpf: stop Tx if there are insufficient buffer resources
  [26/26] 5f9a41096abb idpf: remove obsolete stashing code
[rolling release update] Successfully applied all 26 commits

BUILD

$ egrep -B 5 -A 5 "\[TIMER\]|^Starting Build" $(ls -t kbuild* | head -n1)
/mnt/code/kernel-src-tree-build
Running make mrproper...
  CLEAN   scripts/basic
  CLEAN   scripts/kconfig
  CLEAN   include/config include/generated
[TIMER]{MRPROPER}: 7s
x86_64 architecture detected, copying config
'configs/kernel-x86_64-rhel.config' -> '.config'
Setting Local Version for build
CONFIG_LOCALVERSION="-rocky10_1_rebuild-20b01e9350fe"
Making olddefconfig
--
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#
Starting Build
  GEN     arch/x86/include/generated/asm/orc_hash.h
  WRAP    arch/x86/include/generated/uapi/asm/bpf_perf_event.h
  WRAP    arch/x86/include/generated/uapi/asm/errno.h
  WRAP    arch/x86/include/generated/uapi/asm/fcntl.h
  WRAP    arch/x86/include/generated/uapi/asm/ioctl.h
--
  LD [M]  virt/lib/irqbypass.ko
  BTF [M] net/hsr/hsr.ko
  BTF [M] net/qrtr/qrtr-mhi.ko
  BTF [M] net/qrtr/qrtr.ko
  BTF [M] virt/lib/irqbypass.ko
[TIMER]{BUILD}: 1991s
Making Modules
  SYMLINK /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/build
  INSTALL /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/modules.order
  INSTALL /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/modules.builtin
  INSTALL /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/modules.builtin.modinfo
--
  STRIP   /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/kernel/net/qrtr/qrtr-mhi.ko
  STRIP   /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/kernel/virt/lib/irqbypass.ko
  SIGN    /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/kernel/virt/lib/irqbypass.ko
  SIGN    /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+/kernel/net/qrtr/qrtr-mhi.ko
  DEPMOD  /lib/modules/6.12.0-rocky10_1_rebuild-20b01e9350fe+
[TIMER]{MODULES}: 11s
Making Install
  INSTALL /boot
[TIMER]{INSTALL}: 16s
Checking kABI
kABI check passed
Setting Default Kernel to /boot/vmlinuz-6.12.0-rocky10_1_rebuild-20b01e9350fe+ and Index to 2
Hopefully Grub2.0 took everything ... rebooting after time metrices
[TIMER]{MRPROPER}: 7s
[TIMER]{BUILD}: 1991s
[TIMER]{MODULES}: 11s
[TIMER]{INSTALL}: 16s
[TIMER]{TOTAL} 2030s
Rebooting in 10 seconds

KSelfTest

$ ./kernel-tools/kernel_auto_rebuild/get_kselftest_diff.sh
selftest-6.12.0-io_uring_tests-c88e283d000a+-1.log: 476 passed
selftest-6.12.0-jmaple_rlc-10_6.12.0-124.40.1.el10_1-08ab2f7161e0+-1.log: 478 passed
selftest-6.12.0-jmaple_rlc-10_6.12.0-124.45.1.el10_1-9b5680ba2fb6+-1.log: 479 passed
selftest-6.12.0-jmaple_rlc-10_6.12.0-124.45.1.el10_1-3ec19c8b1979+-1.log: 479 passed

Before: selftest-6.12.0-jmaple_rlc-10_6.12.0-124.45.1.el10_1-9b5680ba2fb6+-1.log
After: selftest-6.12.0-jmaple_rlc-10_6.12.0-124.45.1.el10_1-3ec19c8b1979+-1.log
Diff:
-ok 7 selftests: timers: raw_skew # SKIP
+ok 7 selftests: timers: raw_skew

PlaidCat and others added 26 commits March 23, 2026 14:22
jira LE-3207
feature tools_hv
commit-author Shradha Gupta <shradhagupta@linux.microsoft.com>
commit a9c0b33

Allow the KVP daemon to log the KVP updates triggered in the VM
with a new debug flag(-d).
When the daemon is started with this flag, it logs updates and debug
information in syslog with loglevel LOG_DEBUG. This information comes
in handy for debugging issues where the key-value pairs for certain
pools show mismatch/incorrect values.
The distro-vendors can further consume these changes and modify the
respective service files to redirect the logs to specific files as
needed.

	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Naman Jain <namjain@linux.microsoft.com>
	Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1744715978-8185-1-git-send-email-shradhagupta@linux.microsoft.com
	Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <1744715978-8185-1-git-send-email-shradhagupta@linux.microsoft.com>
(cherry picked from commit a9c0b33)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira LE-4478
commit-author Shiraz Saleem <shirazsaleem@microsoft.com>
commit baa640d

Add support for mana device level statistics.

Co-developed-by: Solom Tamawy <solom.tamawy@microsoft.com>
	Signed-off-by: Solom Tamawy <solom.tamawy@microsoft.com>
	Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
	Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1749559717-3424-1-git-send-email-kotaranov@linux.microsoft.com
	Reviewed-by: Long Li <longli@microsoft.com>
	Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit baa640d)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4467
commit-author Shradha Gupta <shradhagupta@linux.microsoft.com>
commit 5da8a8b

For supporting dynamic MSI-X vector allocation by PCI controllers, enabling
the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN is not enough, msix_prepare_msi_desc()
to prepare the MSI descriptor is also needed.

Export pci_msix_prepare_desc() to allow PCI controllers to support dynamic
MSI-X vector allocation.

	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
	Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
	Acked-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 5da8a8b)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4467
commit-author Shradha Gupta <shradhagupta@linux.microsoft.com>
commit ad518f2

Allow dynamic MSI-X vector allocation for pci_hyperv PCI controller
by adding support for the flag MSI_FLAG_PCI_MSIX_ALLOC_DYN and using
pci_msix_prepare_desc() to prepare the MSI-X descriptors.

Feature support added for both x86 and ARM64

	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
	Acked-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit ad518f2)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4467
commit-author Yury Norov <yury.norov@gmail.com>
commit 4607617

Commit 91bfe21 ("net: mana: add a function to spread IRQs per CPUs")
added the irq_setup() function that distributes IRQs on CPUs according
to a tricky heuristic. The corresponding commit message explains the
heuristic.

Duplicate it in the source code to make available for readers without
digging git in history. Also, add more detailed explanation about how
the heuristics is implemented.

	Signed-off-by: Yury Norov <yury.norov@gmail.com>
	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
(cherry picked from commit 4607617)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4467
commit-author Shradha Gupta <shradhagupta@linux.microsoft.com>
commit 845c62c

In order to prepare the MANA driver to allocate the MSI-X IRQs
dynamically, we need to enhance irq_setup() to allow skipping
affinitizing IRQs to the first CPU sibling group.

This would be for cases when the number of IRQs is less than or equal
to the number of online CPUs. In such cases for dynamically added IRQs
the first CPU sibling group would already be affinitized with HWC IRQ.

	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
(cherry picked from commit 845c62c)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4467
commit-author Shradha Gupta <shradhagupta@linux.microsoft.com>
commit 7553911
upstream-diff There were conflicts seen when applying this patch
due to following commit present in our tree before this patch.
590bcf1 ("net: mana: Add handler for hardware servicing events")

Currently, the MANA driver allocates MSI-X vectors statically based on
MANA_MAX_NUM_QUEUES and num_online_cpus() values and in some cases ends
up allocating more vectors than it needs. This is because, by this time
we do not have a HW channel and do not know how many IRQs should be
allocated.

To avoid this, we allocate 1 MSI-X vector during the creation of HWC and
after getting the value supported by hardware, dynamically add the
remaining MSI-X vectors.

	Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
(cherry picked from commit 7553911)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4473
commit-author Erni Sri Satya Vennela <ernis@linux.microsoft.com>
commit 75cabb4

Introduce support for net_shaper_ops in the MANA driver,
enabling configuration of rate limiting on the MANA NIC.

To apply rate limiting, the driver issues a HWC command via
mana_set_bw_clamp() and updates the corresponding shaper object
in the net_shaper cache. If an error occurs during this process,
the driver restores the previous speed by querying the current link
configuration using mana_query_link_cfg().

The minimum supported bandwidth is 100 Mbps, and only values that are
exact multiples of 100 Mbps are allowed. Any other values are rejected.

To remove a shaper, the driver resets the bandwidth to the maximum
supported by the SKU using mana_set_bw_clamp() and clears the
associated cache entry. If an error occurs during this process,
the shaper details are retained.

On the hardware that does not support these APIs, the net-shaper
calls to set speed would fail.

Set the speed:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do set --json '{"ifindex":'$IFINDEX',
		   "handle":{"scope": "netdev", "id":'$ID' },
		   "bw-max": 200000000 }'

Get the shaper details:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do get --json '{"ifindex":'$IFINDEX',
		      "handle":{"scope": "netdev", "id":'$ID' }}'

> {'bw-max': 200000000,
> 'handle': {'scope': 'netdev'},
> 'ifindex': $IFINDEX,
> 'metric': 'bps'}

Delete the shaper object:
./tools/net/ynl/pyynl/cli.py \
 --spec Documentation/netlink/specs/net_shaper.yaml \
 --do delete --json '{"ifindex":'$IFINDEX',
		      "handle":{"scope": "netdev","id":'$ID' }}'

	Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
	Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-3-git-send-email-ernis@linux.microsoft.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>

(cherry picked from commit 75cabb4)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4473
commit-author Erni Sri Satya Vennela <ernis@linux.microsoft.com>
commit a6d5edf

Allow mana ethtool get_link_ksettings operation to report
the maximum speed supported by the SKU in mbps.

The driver retrieves this information by issuing a
HWC command to the hardware via mana_query_link_cfg(),
which retrieves the SKU's maximum supported speed.

These APIs when invoked on hardware that are older/do
not support these APIs, the speed would be reported as UNKNOWN.

Before:
$ethtool enP30832s1
> Settings for enP30832s1:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

After:
$ethtool enP30832s1
> Settings for enP30832s1:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 16000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

	Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
	Reviewed-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-4-git-send-email-ernis@linux.microsoft.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>

(cherry picked from commit a6d5edf)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4473
commit-author Erni Sri Satya Vennela <ernis@linux.microsoft.com>
commit ca8ac48
upstream-diff There were conflicts seen when applying this
patch due to the following patch being in our tree before
this one.
7a3c235 ("net: mana: Handle Reset Request from MANA NIC")

If any of the HWC commands are not recognized by the
underlying hardware, the hardware returns the response
header status of -1. Log the information using
netdev_info_once to avoid multiple error logs in dmesg.

	Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Reviewed-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
	Reviewed-by: Saurabh Singh Sengar <ssengar@linux.microsoft.com>
	Reviewed-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/1750144656-2021-5-git-send-email-ernis@linux.microsoft.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>

(cherry picked from commit ca8ac48)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4473
commit-author Erni Sri Satya Vennela <ernis@linux.microsoft.com>
commit 11cd020

Fix build errors when CONFIG_NET_SHAPER is disabled, including:

drivers/net/ethernet/microsoft/mana/mana_en.c:804:10: error:
'const struct net_device_ops' has no member named 'net_shaper_ops'

     804 |         .net_shaper_ops         = &mana_shaper_ops,

drivers/net/ethernet/microsoft/mana/mana_en.c:804:35: error:
initialization of 'int (*)(struct net_device *, struct neigh_parms *)'
from incompatible pointer type 'const struct net_shaper_ops *'
[-Werror=incompatible-pointer-types]

     804 |         .net_shaper_ops         = &mana_shaper_ops,

	Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Fixes: 75cabb4 ("net: mana: Add support for net_shaper_ops")
	Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506230625.bfUlqb8o-lkp@intel.com/
	Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1750851355-8067-1-git-send-email-ernis@linux.microsoft.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 11cd020)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4527
commit-author Zhiyue Qiu <zhiyueqiu@microsoft.com>
commit 084f35b

Add packet and request port counters to mana_ib.

	Signed-off-by: Zhiyue Qiu <zhiyueqiu@microsoft.com>
	Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1752143395-5324-1-git-send-email-kotaranov@linux.microsoft.com
	Reviewed-by: Long Li <longli@microsoft.com>
	Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 084f35b)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4524
commit-author Konstantin Taranov <kotaranov@microsoft.com>
commit 44d69d3

Drain send WRs of the GSI QP on device removal.

In rare servicing scenarios, the hardware may delete the
state of the GSI QP, preventing it from generating CQEs
for pending send WRs. Since WRs submitted to the GSI QP
hold CM resources, the device cannot be removed until
those WRs are completed. This patch marks all pending
send WRs as failed, allowing the GSI QP to release the CM
resources and enabling safe device removal.

	Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1753779618-23629-1-git-send-email-kotaranov@linux.microsoft.com
	Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 44d69d3)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
…nnel open.

jira LE-4494
commit-author Dipayaan Roy <dipayanroy@linux.microsoft.com>
commit 9448ccd

The hv_netvsc driver currently enables NAPI after opening the primary and
subchannels. This ordering creates a race: if the Hyper-V host places data
in the host -> guest ring buffer and signals the channel before
napi_enable() has been called, the channel callback will run but
napi_schedule_prep() will return false. As a result, the NAPI poller never
gets scheduled, the data in the ring buffer is not consumed, and the
receive queue may remain permanently stuck until another interrupt happens
to arrive.

Fix this by enabling NAPI and registering it with the RX/TX queues before
vmbus channel is opened. This guarantees that any early host signal after
open will correctly trigger NAPI scheduling and the ring buffer will be
drained.

Fixes: 76bb5db ("netvsc: fix use after free on module removal")
	Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250825115627.GA32189@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit 9448ccd)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4497
commit-author Haiyang Zhang <haiyangz@microsoft.com>
commit c4deabb

If HW Channel (HWC) is not responding, reduce the waiting time, so further
steps will fail quickly.
This will prevent getting stuck for a long time (30 minutes or more), for
example, during unloading while HWC is not responding.

	Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1757537841-5063-1-git-send-email-haiyangz@linux.microsoft.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit c4deabb)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
jira LE-4521
commit-author Shiraz Saleem <shirazsaleem@microsoft.com>
commit 2bd7dd3

Extend modify QP to support further attributes: local_ack_timeout, UD qkey,
rate_limit, qp_access_flags, flow_label, max_rd_atomic.

	Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
	Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1757923172-4475-1-git-send-email-kotaranov@linux.microsoft.com
	Signed-off-by: Leon Romanovsky <leon@kernel.org>
(cherry picked from commit 2bd7dd3)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
…/O issuing CPU

jira LE-4537
commit-author Long Li <longli@microsoft.com>
commit b69ffea

When selecting an outgoing channel for I/O, storvsc tries to select a
channel with a returning CPU that is not the same as issuing CPU. This
worked well in the past, however it doesn't work well when the Hyper-V
exposes a large number of channels (up to the number of all CPUs). Use a
different CPU for returning channel is not efficient on Hyper-V.

Change this behavior by preferring to the channel with the same CPU as
the current I/O issuing CPU whenever possible.

Tests have shown improvements in newer Hyper-V/Azure environment, and no
regression with older Hyper-V/Azure environments.

	Tested-by: Raheel Abdul Faizy <rabdulfaizy@microsoft.com>
	Signed-off-by: Long Li <longli@microsoft.com>
Message-Id: <1759381530-7414-1-git-send-email-longli@linux.microsoft.com>
	Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b69ffea)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
…es to improve memory efficiency.

jira LE-4490
commit-author Dipayaan Roy <dipayanroy@linux.microsoft.com>
commit 730ff06
upstream-diff This patch was causing build failures due to missing
commit 0f92140 ("memory-provider: dmabuf devmem memory provider")
To fix it, we have removed pprm.queue_idx parameter which seems
redundant in this case.

This patch enhances RX buffer handling in the mana driver by allocating
pages from a page pool and slicing them into MTU-sized fragments, rather
than dedicating a full page per packet. This approach is especially
beneficial on systems with large base page sizes like 64KB.

Key improvements:

- Proper integration of page pool for RX buffer allocations.
- MTU-sized buffer slicing to improve memory utilization.
- Reduce overall per Rx queue memory footprint.
- Automatic fallback to full-page buffers when:
   * Jumbo frames are enabled (MTU > PAGE_SIZE / 2).
   * The XDP path is active, to avoid complexities with fragment reuse.

Testing on VMs with 64KB pages shows around 200% throughput improvement.
Memory efficiency is significantly improved due to reduced wastage in page
allocations. Example: We are now able to fit 35 rx buffers in a single 64kb
page for MTU size of 1500, instead of 1 rx buffer per page previously.

Tested:

- iperf3, iperf2, and nttcp benchmarks.
- Jumbo frames with MTU 9000.
- Native XDP programs (XDP_PASS, XDP_DROP, XDP_TX, XDP_REDIRECT) for
  testing the XDP path in driver.
- Memory leak detection (kmemleak).
- Driver load/unload, reboot, and stress scenarios.

	Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
	Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com>
	Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
	Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20250814140410.GA22089@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>

(cherry picked from commit 730ff06)
	Signed-off-by: Shreeya Patel <spatel@ciq.com>
…nused()

jira SECO-468
commit-author Luis Henriques <luis@igalia.com>
commit 395b955

Add and export a new helper d_dispose_if_unused() which is simply a wrapper
around to_shrink_list(), to add an entry to a dispose list if it's not used
anymore.

Also export shrink_dentry_list() to kill all dentries in a dispose list.

	Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
	Signed-off-by: Luis Henriques <luis@igalia.com>
	Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 395b955)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Brian Vazquez <brianvv@google.com>
commit 7292af0

Add a helper function to correctly handle the lockless
synchronization when the sender needs to block. The paradigm is

        if (no_resources()) {
                stop_queue();
                barrier();
                if (!no_resources())
                        restart_queue();
        }

netif_subqueue_maybe_stop already handles the paradigm correctly, but
the code split the check for resources in three parts, the first one
(descriptors) followed the protocol, but the other two (completions and
tx_buf) were only doing the first part and so race prone.

Luckily netif_subqueue_maybe_stop macro already allows you to use a
function to evaluate the start/stop conditions so the fix only requires
the right helper function to evaluate all the conditions at once.

The patch removes idpf_tx_maybe_stop_common since it's no longer needed
and instead adjusts separately the conditions for singleq and splitq.

Note that idpf_tx_buf_hw_update doesn't need to check for resources
since that will be covered in idpf_tx_splitq_frame.

To reproduce:

Reduce the threshold for pending completions to increase the chances of
hitting this pause by changing your kernel:

drivers/net/ethernet/intel/idpf/idpf_txrx.h

-#define IDPF_TX_COMPLQ_OVERFLOW_THRESH(txcq)   ((txcq)->desc_count >> 1)
+#define IDPF_TX_COMPLQ_OVERFLOW_THRESH(txcq)   ((txcq)->desc_count >> 4)

Use pktgen to force the host to push small pkts very aggressively:

./pktgen_sample02_multiqueue.sh -i eth1 -s 100 -6 -d $IP -m $MAC \
  -p 10000-10000 -t 16 -n 0 -v -x -c 64

Fixes: 6818c4d ("idpf: add splitq start_xmit")
	Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Signed-off-by: Josh Hay <joshua.a.hay@intel.com>
	Signed-off-by: Brian Vazquez <brianvv@google.com>
	Signed-off-by: Luigi Rizzo <lrizzo@google.com>
	Reviewed-by: Simon Horman <horms@kernel.org>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit 7292af0)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit cb83b55
upstream-diff |
	adjusted the number of bytes expected in
	libeth_cacheline_set_assert for struct idpf_tx_queue due to missing
	of some elements in the struct introduced in commit
	1a49cf8 ("idpf: add Tx timestamp flows").

In certain production environments, it is possible for completion tags
to collide, meaning N packets with the same completion tag are in flight
at the same time. In this environment, any given Tx queue is effectively
used to send both slower traffic and higher throughput traffic
simultaneously. This is the result of a customer's specific
configuration in the device pipeline, the details of which Intel cannot
provide. This configuration results in a small number of out-of-order
completions, i.e., a small number of packets in flight. The existing
guardrails in the driver only protect against a large number of packets
in flight. The slower flow completions are delayed which causes the
out-of-order completions. The fast flow will continue sending traffic
and generating tags. Because tags are generated on the fly, the fast
flow eventually uses the same tag for a packet that is still in flight
from the slower flow. The driver has no idea which packet it should
clean when it processes the completion with that tag, but it will look
for the packet on the buffer ring before the hash table.  If the slower
flow packet completion is processed first, it will end up cleaning the
fast flow packet on the ring prematurely. This leaves the descriptor
ring in a bad state resulting in a crash or Tx timeout.

In summary, generating a tag when a packet is sent can lead to the same
tag being associated with multiple packets. This can lead to resource
leaks, crashes, and/or Tx timeouts.

Before we can replace the tag generation, we need a new mechanism for
the send path to know what tag to use next. The driver will allocate and
initialize a refillq for each TxQ with all of the possible free tag
values. During send, the driver grabs the next free tag from the refillq
from next_to_clean. While cleaning the packet, the clean routine posts
the tag back to the refillq's next_to_use to indicate that it is now
free to use.

This mechanism works exactly the same way as the existing Rx refill
queues, which post the cleaned buffer IDs back to the buffer queue to be
reposted to HW. Since we're using the refillqs for both Rx and Tx now,
genericize some of the existing refillq support.

Note: the refillqs will not be used yet. This is only demonstrating how
they will be used to pass free tags back to the send path.

	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit cb83b55)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit f2d18e1

Track the gap between next_to_use and the last RE index. Set RE again
if the gap is large enough to ensure RE bit is set frequently. This is
critical before removing the stashing mechanisms because the
opportunistic descriptor ring cleaning from the out-of-order completions
will go away. Previously the descriptors would be "cleaned" by both the
descriptor (RE) completion and the out-of-order completions. Without the
latter, we must ensure the RE bit is set more frequently. Otherwise,
it's theoretically possible for the descriptor ring next_to_clean to
never advance.  The previous implementation was dependent on the start
of a packet falling on a 64th index in the descriptor ring, which is not
guaranteed with large packets.

	Signed-off-by: Luigi Rizzo <lrizzo@google.com>
	Signed-off-by: Brian Vazquez <brianvv@google.com>
	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit f2d18e1)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit b61dfa9
upstream-diff |
	adjusted context in 2 places:
	- when removing func idpf_tx_dma_map_error due to different memset
	call that uses the hardcoded struct type;
	- in func idpf_tx_splitq_frame due to missing expected
	union idpf_flex_tx_ctx_desc *ctx_desc;
	both differences were introduced in commit
	1a49cf8 ("idpf: add Tx timestamp flows").

Move (and rename) the existing rollback logic to singleq.c since that
will be the only consumer. Create a simplified splitq specific rollback
function to loop through and unmap tx_bufs based on the completion tag.
This is critical before replacing the Tx buffer ring with the buffer
pool since the previous rollback indexing will not work to unmap the
chained buffers from the pool.

Cache the next_to_use index before any portion of the packet is put on
the descriptor ring. In case of an error, the rollback will bump tail to
the correct next_to_use value. Because the splitq path now supports
different types of context descriptors (and potentially multiple in the
future), this will take care of rolling back any and all context
descriptors encoded on the ring for the erroneous packet. The previous
rollback logic was broken for PTP packets since it would not account for
the PTP context descriptor.

Fixes: 1a49cf8 ("idpf: add Tx timestamp flows")
	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit b61dfa9)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit 5f417d5
upstream-diff |
	adjusted context in:
	- ifpf_tx_splitq_frame and idpf_tx_clean_bufs;
	- libeth_cacheline_set_assert for struct idpf_tx_queue due to missing
	of some elements in the struct;
	all cases are due to missing commit
	1a49cf8 ("idpf: add Tx timestamp flows").

Replace the TxQ buffer ring with one large pool/array of buffers (only
for flow scheduling). This eliminates the tag generation and makes it
impossible for a tag to be associated with more than one packet.

The completion tag passed to HW through the descriptor is the index into
the array. That same completion tag is posted back to the driver in the
completion descriptor, and used to index into the array to quickly
retrieve the buffer during cleaning.  In this way, the tags are treated
as a fix sized resource. If all tags are in use, no more packets can be
sent on that particular queue (until some are freed up). The tag pool
size is 64K since the completion tag width is 16 bits.

For each packet, the driver pulls a free tag from the refillq to get the
next free buffer index. When cleaning is complete, the tag is posted
back to the refillq. A multi-frag packet spans multiple buffers in the
driver, therefore it uses multiple buffer indexes/tags from the pool.
Each frag pulls from the refillq to get the next free buffer index.
These are tracked in a next_buf field that replaces the completion tag
field in the buffer struct. This chains the buffers together so that the
packet can be cleaned from the starting completion tag taken from the
completion descriptor, then from the next_buf field for each subsequent
buffer.

In case of a dma_mapping_error occurs or the refillq runs out of free
buf_ids, the packet will execute the rollback error path. This unmaps
any buffers previously mapped for the packet. Since several free
buf_ids could have already been pulled from the refillq, we need to
restore its original state as well. Otherwise, the buf_ids/tags
will be leaked and not used again until the queue is reallocated.

Descriptor completions only advance the descriptor ring index to "clean"
the descriptors. The packet completions only clean the buffers
associated with the given packet completion tag and do not update the
descriptor ring index.

When operating in queue based scheduling mode, the array still acts as a
ring and will only have TxQ descriptor count entries. The tx_bufs are
still associated 1:1 with the descriptor ring entries and we can use the
conventional indexing mechanisms.

Fixes: c2d548c ("idpf: add TX splitq napi poll support")
	Signed-off-by: Luigi Rizzo <lrizzo@google.com>
	Signed-off-by: Brian Vazquez <brianvv@google.com>
	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit 5f417d5)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit 0c3f135
upstream-diff |
	adjusted conflict in idpf_tx_splitq_frame func due to missing
	1a49cf8 ("idpf: add Tx timestamp flows").

The Tx refillq logic will cause packets to be silently dropped if there
are not enough buffer resources available to send a packet in flow
scheduling mode. Instead, determine how many buffers are needed along
with number of descriptors. Make sure there are enough of both resources
to send the packet, and stop the queue if not.

Fixes: 7292af0 ("idpf: fix a race in txq wakeup")
	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit 0c3f135)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
jira KERNEL-170
commit-author Joshua Hay <joshua.a.hay@intel.com>
commit 6c4e684
upstream-diff |
	- adjusted context due to missing idpf_tx_read_tstamp func;
	- adjusted the number of bytes expected in
	libeth_cacheline_set_assert for struct idpf_tx_queue due to missing
	of some elements in the struct;
	both are due to missing commit
	1a49cf8 ("idpf: add Tx timestamp flows").

With the new Tx buffer management scheme, there is no need for all of
the stashing mechanisms, the hash table, the reserve buffer stack, etc.
Remove all of that.

	Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
	Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
	Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
	Tested-by: Samuel Salin <Samuel.salin@intel.com>
	Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
(cherry picked from commit 6c4e684)
	Signed-off-by: Roxana Nicolescu <rnicolescu@ciq.com>
@PlaidCat PlaidCat self-assigned this Mar 23, 2026
@PlaidCat PlaidCat requested review from a team March 23, 2026 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants