
Controlling device peer-to-peer access from user space

March 7, 2019

This article was contributed by Marta Rybczyńska

The recent addition of support for direct (peer-to-peer) operations between PCIe devices in the kernel has opened the door for different use cases. The initial work concentrated on in-kernel support and the NVMe subsystem; it also added support for memory regions that can be used for such transfers. Jérôme Glisse recently proposed two extensions that would allow the mapping of those regions into user space and mapping device files between two devices. The resulting discussion surprisingly led to consideration of the future of core kernel structures dealing with memory management.

Some PCIe devices can perform direct data transfers to other devices without involving the CPU; support for these peer-to-peer transactions was added to the kernel for the 4.20 release. The rationale behind the functionality is that, if the data is passed between two devices without modification, there is no need to involve the CPU, which can perform other tasks instead. The peer-to-peer feature was developed to allow Remote Direct Memory Access (RDMA) network interface cards to pass data directly to NVMe drives in the NVMe fabrics subsystem. Using peer-to-peer transfers lowers the memory bandwidth needed (it avoids one copy operation in the standard path from device to system memory, then to another device) and CPU usage (the devices set up the DMAs on their own). While not considered directly in the initial work, graphics processing units (GPUs) and RDMA interfaces have been able to use that functionality in out-of-tree modules for years.

The merged work concentrated on support at the PCIe layer. It included setting up special memory regions, along with the devices that export and use those regions. It also provides a way to determine whether the PCIe topology allows peer-to-peer transfers between two given devices.
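As a rough illustration (this is not code from the patches under discussion), a driver exporting part of one of its BARs through the p2pdma interfaces merged for 4.20 might look something like the sketch below. The BAR number and sizes are invented, and exact function signatures vary between kernel versions.

    /*
     * Sketch of exporting part of BAR 4 as peer-to-peer memory using the
     * p2pdma interfaces merged for 4.20.  The BAR number and sizes are
     * invented; exact signatures vary between kernel versions.
     */
    #include <linux/pci.h>
    #include <linux/pci-p2pdma.h>
    #include <linux/sizes.h>

    static int example_export_p2pmem(struct pci_dev *pdev)
    {
            void *buf;
            int ret;

            /* Register 1MB of BAR 4 as peer-to-peer DMA memory. */
            ret = pci_p2pdma_add_resource(pdev, 4, SZ_1M, 0);
            if (ret)
                    return ret;

            /* Carve out a chunk of that pool for this driver's own use. */
            buf = pci_alloc_p2pmem(pdev, SZ_64K);
            if (!buf)
                    return -ENOMEM;

            /* Let other subsystems (NVMe fabrics, for example) find the rest. */
            pci_p2pmem_publish(pdev, true);

            pci_free_p2pmem(pdev, buf, SZ_64K);
            return 0;
    }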

The new work by Glisse adds support for peer-to-peer memory regions managed from user space. An important aspect is adding support for heterogeneous memory management (HMM), of which Glisse is the author, to access such memory, but the functionality also covers devices without HMM. The current prototype implementation connects a network card (Mellanox mlx5) to a GPU (AMD); Glisse expects to use it in AMD ROCm RDMA. He mentioned a number of possible use cases, including one device controlling another device's command queue. An example of this situation is a network card accessing a block device command queue so that it can submit storage transactions without the CPU's involvement. Another is direct memory access between two devices, where the memory is not even mapped to the CPU. In this case, the computation on one device runs without interaction with the other one. Examples include a network device monitoring the progress of a device and avoiding use of the system memory, or a network device streaming results from an accelerator.

The patch set in its current state does not include any drivers that actually use the feature. This made it hard for other developers to understand how the code is expected to be used, and led to comments that users are needed before the code can be merged. Similar comments appeared elsewhere in the discussion. Glisse provided some code examples from other branches, but it seems that examples of drivers using the feature will have to appear in future versions of the patch set before it can be seriously considered for merging. Part of the reluctance is related to the fact that many developers remember the difficulties after HMM was initially merged and would rather avoid a repeat.

Interconnect topologies

Currently, PCIe peer-to-peer transfers are only allowed when two devices are attached to the same bridge. Glisse added helper functions to simplify the checks in drivers and support any other interconnect in the process. In the discussion that followed, Greg Kroah-Hartman commented that there is no peer-to-peer concept in the device model; the implementation only covers PCIe for now, so he thought that the changes were premature. Glisse agreed to concentrate on PCIe for now. It seems likely that the device model will get richer when other interconnects start using the peer-to-peer functionality.
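For the PCIe-only case, the merged p2pdma code already exposes a helper for this kind of check. A minimal sketch, assuming the 4.20-era API (signatures may differ in other kernel versions), is shown below; the transfer is only considered usable if the client devices sit under a common bridge with the provider.

    /*
     * Ask the p2pdma core whether a provider and two client devices are
     * close enough in the PCIe topology (under a common bridge) for
     * peer-to-peer transfers; a negative distance means they are not.
     */
    #include <linux/kernel.h>
    #include <linux/pci-p2pdma.h>

    static bool example_p2p_possible(struct pci_dev *provider,
                                     struct device *nic, struct device *nvme)
    {
            struct device *clients[] = { nic, nvme };

            return pci_p2pdma_distance_many(provider, clients,
                                            ARRAY_SIZE(clients), true) >= 0;
    }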

Extending vm_operations_struct

One of the patches added two new callbacks to struct vm_operations_struct, a core kernel structure collecting callbacks for virtual memory area (VMA) operations. The two proposed callbacks are p2p_map() and p2p_unmap(); respectively, they map and unmap a peer-to-peer region. Rather than mapping the region into a process's address space, though, these functions instruct the device to make the regions available for peer-to-peer access. A common use case would be for a user-space process to map a device memory region with mmap(), then pass that region to a second device to set up the peer-to-peer connection. That second device would then call the new methods to manage that mapping.
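The article names the callbacks but not their exact signatures, so the excerpt below is purely a speculative sketch: the argument lists are invented to show where the callbacks would sit in struct vm_operations_struct and roughly what information they would need.

    /*
     * Speculative excerpt of struct vm_operations_struct with the proposed
     * callbacks added.  The argument lists are invented for illustration;
     * the actual patch may define them differently.
     */
    struct vm_operations_struct {
            /* ... existing callbacks such as open(), close(), fault() ... */

            /*
             * Ask the exporting driver behind @vma to make the region
             * available for peer-to-peer DMA by @importer, filling @sgt
             * with the bus addresses to use.
             */
            int (*p2p_map)(struct vm_area_struct *vma, struct device *importer,
                           struct sg_table *sgt);

            /* Tear down a mapping previously set up by p2p_map(). */
            void (*p2p_unmap)(struct vm_area_struct *vma, struct device *importer,
                              struct sg_table *sgt);
    };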

Logan Gunthorpe asked about the possible use of the existing fault() callback, which would map peer-to-peer memory if appropriate. Glisse answered that the two new callbacks should be called by the mmap() callbacks of the exporting device driver. Instead of mapping into a typical process address space (as fault() would do), the goal is to map memory between two device drivers. This task may involve complex conditions, such as allowing a peer device to access some of the memory but not all of it.

The resulting memory region works as a connection between the two sides of the peer-to-peer setup. It will not necessarily be available to user space (or even the CPU itself). Glisse explained that, in the GPU case, it is easier to add a specific callback: the exporting device can check if the peer-to-peer operation is possible and then allow this operation, or use main memory instead. Setting up a fault handler, instead, would require numerous additional flags and a fake page table to be filled and then interpreted by the other device, according to Glisse.
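Continuing the speculative sketch from above (again with invented signatures), the importing side might simply probe for the callback and fall back to the ordinary pinning path when it is absent:

    /*
     * Hypothetical importing side, following the invented signatures above:
     * use the exporter's p2p_map() callback when it exists, otherwise fall
     * back to the usual path for ordinary memory.
     */
    static int example_import_buffer(struct vm_area_struct *vma,
                                     struct device *importer,
                                     struct sg_table *sgt)
    {
            if (vma->vm_ops && vma->vm_ops->p2p_map)
                    return vma->vm_ops->p2p_map(vma, importer, sgt);

            /* Not device memory: a real driver would pin it with get_user_pages(). */
            return -EOPNOTSUPP;
    }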

Views of device memory

The discussion highlighted some differences in the handling of special memory in the GPU and RDMA subsystems. It started with Glisse explaining the GPU point of view: the driver often updates the device page tables that map memory within the GPU itself, meaning that the CPU's mappings of GPU memory are invalid most of the time. There is a need for a method to tell the exporting driver that another device wants to map one of its memory regions. If possible, the GPU driver should avoid remapping the affected zone while it is mapped. The exporting device must be able to decide whether it wants to authorize that mapping. Glisse noted that he also wants to use the API in the case of interconnected GPUs; there, the CPU page tables are invalid and the physical addresses are the only meaningful ones, but the kernel has no information about what those addresses are. Glisse also gave an overview of the GPU usage of HMM and how things work without HMM.

One significant subthread of the discussion had to do with whether device peer-to-peer memory should be represented in the CPU's memory map with page structures. For hardware where HMM is in use, device memory behaves (mostly) like ordinary memory and, thus, has those structures. Jason Gunthorpe commented, though, that in the case where HMM is not applicable (many RDMA settings, for example), there are no page structures for this memory; he would like things to stay that way. Attempts to use page structures for this memory, he said, always run into trouble.

Christoph Hellwig replied that not having page structures can create even more trouble; some functionalities, like get_user_pages() or direct I/O to the underlying device memory, just do not work without them. Deeper in the discussion, he listed three uses of struct page in the kernel: to keep the physical address, to keep a reference of memory so that it does not go away while it is still needed, and to set the dirty flag on the PTE after writing to that memory. Any solution that avoids struct page would have to solve those problems first, he said.

get_user_pages(), which maps user-space memory into the kernel's address space, is frequently called by drivers to access that memory. Whether it works or not will have a significant impact on how peer-to-peer memory is used. Some developers would like to see this memory act like ordinary memory, with associated page structures, that would be mappable with get_user_pages(). However, peer-to-peer memory is I/O memory that requires special handling, Jason Gunthorpe noted, so it can never look entirely like ordinary system memory. Glisse would rather avoid supporting get_user_pages() for that memory altogether. In the case of GPUs, it is not needed, he noted. Things turn out to be more complicated for the other potential user, the RDMA layer, though. The developers discussed other options like a special version of get_user_pages() for DMA-only mappings.
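To make the stakes concrete, here is the classic driver pattern built on get_user_pages(); it is a generic sketch rather than code from any of the subsystems discussed, and the get_user_pages() signature shown is the 4.20-era one, which has since changed. It exercises all three struct page roles that Hellwig listed: each page supplies a physical address for DMA, holds a reference while the I/O is in flight, and can be marked dirty afterward.

    /*
     * The classic pattern built on get_user_pages() (4.20-era signature).
     * The pages supply physical addresses for DMA, hold a reference while
     * the I/O is in flight, and are marked dirty once the device has
     * written to them.
     */
    #include <linux/mm.h>

    static int example_pin_for_dma(unsigned long uaddr, unsigned long nr_pages,
                                   struct page **pages)
    {
            long pinned, i;

            /* Take a reference on each page so it cannot go away under us. */
            pinned = get_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL);
            if (pinned <= 0)
                    return pinned ? pinned : -EFAULT;

            /*
             * A real driver would now build a scatterlist from the pages
             * (each one provides a physical address) and run the DMA.
             */

            for (i = 0; i < pinned; i++) {
                    set_page_dirty_lock(pages[i]);  /* the device wrote to it */
                    put_page(pages[i]);             /* drop our reference */
            }
            return 0;
    }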

Jason Gunthorpe commented on the current state of the peer-to-peer implementation, which uses a type of scatter-gather list (SGL) that references DMA memory only; there is no mapping for the CPU. This is different from the standard in-kernel SGLs, which hold both the CPU and DMA addresses. The NVMe and RDMA layers received fixes to support the special SGLs, but he fears that some RDMA drivers may still break because they won't understand those specific SGL semantics. Making get_user_pages() work for all those cases will be a big job and there are conflicting requirements, he said. He also suggested promoting the peer-to-peer, DMA-only scatter-gather lists to general-use kernel structures.
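A sketch of what makes those SGLs special, using the pci_p2pdma_map_sg() helper from the 4.20 p2pdma code (signatures may differ in other kernel versions): after mapping, only the bus addresses in the entries are meaningful, and a driver that assumes it can reach the memory through sg_page() will misbehave.

    /*
     * Mapping a peer-to-peer SGL with the 4.20-era helper.  After this,
     * only sg_dma_address()/sg_dma_len() are meaningful; there is no
     * CPU-side page to fall back on.
     */
    #include <linux/device.h>
    #include <linux/dma-mapping.h>
    #include <linux/pci-p2pdma.h>
    #include <linux/scatterlist.h>

    static int example_map_p2p_sgl(struct device *dma_dev,
                                   struct scatterlist *sgl, int nents)
    {
            struct scatterlist *sg;
            int mapped, i;

            mapped = pci_p2pdma_map_sg(dma_dev, sgl, nents, DMA_BIDIRECTIONAL);
            if (!mapped)
                    return -EIO;

            for_each_sg(sgl, sg, mapped, i)
                    dev_dbg(dma_dev, "p2p segment: %pad + %u\n",
                            &sg_dma_address(sg), sg_dma_len(sg));

            return 0;
    }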

Next steps

The developers have not yet found a solution to all of the problems mentioned. There are arguments both for keeping and for avoiding struct page. Using page structures would allow the use of O_DIRECT and other APIs, but would require a great deal of additional work to fix the issues that surface along the way. On the other hand, avoiding struct page would lead to a type of special memory zone that needs to be handled in a particular way. Whichever way it goes, the decision has potentially important consequences.

The choice has not yet been made, and it seems that more discussion will be needed before there is a solution the developers can agree on. In addition, future submissions of this patch set will probably need to include examples of the API's usage so that the developers can understand it better. It seems likely that peer-to-peer memory will be available from user space at some point, but there is still important work to be done first.



Controlling device peer-to-peer access from user space

Posted Mar 19, 2019 16:11 UTC (Tue) by ScottMinster (subscriber, #67541) (1 response)

> He mentioned a number of possible use cases, including one device controlling another device's command queue. An example of this situation is a network card accessing a block device command queue so that it can submit storage transactions without the CPU's involvement.

This sounds like it could really enhance those Thunderclap vulnerabilities (https://lwn.net/Articles/782381/). A network adapter that could send read (or write) commands to the storage device without any mediation from the main system seems dangerous. While things would be fine with a well-behaved device, could a rogue device read and transmit an entire drive with nothing on the system aware of it? Or carry out some other nefarious behavior by writing to the drive?

What sort of security precautions are there to mitigate a rogue device, especially one plugged into an external port?

Controlling device peer-to-peer access from user space

Posted Oct 21, 2020 17:50 UTC (Wed) by imMute (guest, #96323)

It's the same vulnerability. To sum it up: PCIe devices can initiate read and write commands. Typically, those commands target system RAM (this is how DMA works), but devices can just as easily target memory or I/O space in other devices.
The solution is the same: IOMMUs as firewalls between devices you want to segregate.

> could a rogue device read and transmit an entire drive with nothing on the system aware of it?
Yes. It's exactly the same hole as reading and transmitting system RAM without the CPU noticing (except that it's typically more involved to access disk data than it is to access RAM).


Copyright © 2019, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds