Rafael Rafael - 6 months ago 117
Linux Question

Memcpy performance on /dev/mem outside kernel ram

I'm using a SoC with a custom Linux on it. I have reserved the upper 512MB of the 1GB total RAM by specifying the kernel boot parameter mem=512M.
I can access the upper memory from a userspace program by opening /dev/mem and mmap-ing the upper 512MB, which is not used by the kernel.
Now I want to copy big chunks of memory inside this area with memcpy(), but the performance is only about 50MB/s. When I allocate buffers through the kernel and memcpy between them, I can reach about 500MB/s.
I'm quite sure this is because the cache is disabled for my special memory area, but I don't know how to tell the kernel to use the cache here.

Does anybody have an idea how to solve this?

Answer

Note: A lot of this is prefaced by my top comments, so I'll try to avoid repeating them verbatim.

First, about buffers for DMA, kernel access, and userspace access: they can be allocated by any mechanism that is suitable.

As mentioned, using mem=512M and /dev/mem with mmap in userspace, the mem driver may not set optimal caching policy. Also, the mem=512M is more typically used to tell the kernel to just never use the memory (e.g. we want to test with less system memory) and we're not going to use the upper 512M for anything.

A better way may be to leave off mem=512M and use CMA as you mentioned. Another way may be to bind the driver into the kernel and have it reserve the full memory block during system startup [possibly using CMA].

The memory area might be chosen via kernel command line parameters [from grub.cfg] such as mydev.area= and mydev.size=. That is useful for the "bound" driver that must know these values during the "early" phases of system startup.
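In a real built-in driver, the kernel would hand these values to the driver and a helper such as memparse() would deal with the K/M/G suffixes. As a rough userspace sketch of that parsing only (the function names parse_size() and cmdline_get() are made up for illustration, and mydev.area/mydev.size are just the example names from above):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse a number with an optional K/M/G suffix, similar in spirit to the
 * kernel's memparse(). Returns 0 on success, -1 on malformed input. */
static int parse_size(const char *s, uint64_t *out)
{
    char *end;
    uint64_t v;

    assert(s != NULL && out != NULL);
    v = strtoull(s, &end, 0);          /* base 0 accepts decimal and 0x... */
    if (end == s)
        return -1;

    switch (*end) {
    case 'G': case 'g':
        v <<= 10;
        /* fall through */
    case 'M': case 'm':
        v <<= 10;
        /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;
        break;
    default:
        break;
    }

    /* The value must end at the end of the string or at a separator. */
    if (*end != '\0' && *end != ' ')
        return -1;
    *out = v;
    return 0;
}

/* Look up "key=value" in a space-separated command line (hypothetical helper). */
static int cmdline_get(const char *cmdline, const char *key, uint64_t *out)
{
    size_t klen = strlen(key);
    const char *p = cmdline;

    while ((p = strstr(p, key)) != NULL) {
        int at_start = (p == cmdline) || (p[-1] == ' ');
        if (at_start && p[klen] == '=')
            return parse_size(p + klen + 1, out);
        p += klen;
    }
    return -1;
}
```

With a command line like "mydev.area=0x20000000 mydev.size=512M", this yields 0x20000000 for the area and 512 * 1024 * 1024 for the size.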

So, now we have the "big" area. Now, we need to have a way for the device to get access and the application to get it mapped. The kernel driver can do this. When the device is opened, an ioctl can set up the mappings, with correct kernel policy.

So, depending on the allocation mechanism, the ioctl can be given address/length by the application, or it can pass them back to the application [suitably mapped].
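On the application side, getting the area mapped could look something like the sketch below. The device path is whatever node your driver creates (something like /dev/mydev, which is hypothetical here), and map_area() is a made-up helper name. The caching policy itself is not chosen here: it is decided by the driver's mmap handler, e.g. through the page protection it passes to remap_pfn_range().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a device node into the application.
 * Returns the mapped address, or NULL on failure. */
static void *map_area(const char *path, size_t len)
{
    int fd;
    void *p;

    assert(path != NULL && len > 0);

    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    /* MAP_SHARED so that the app and the driver/device see the same memory. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */

    return p == MAP_FAILED ? NULL : p;
}
```

If the driver hands back the length (or an offset) through an ioctl, the app would make that call before map_area() and pass the result in.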

When I had to do this, I created a struct that described a memory area/buffer. It can be the whole area or the large area can be subdivided as needed. Rather than using a variable length, dynamic scheme equivalent to malloc [like what you were writing], I've found that fixed size subpools work better. In the kernel, this is called a "slab" allocator.

The struct had an "id" number for the given area. It also had three addresses: address app could use, address kernel driver could use, and address that would be given to H/W device. Also, in the case of multiple devices, it might have an id for which particular device it is currently associated with.
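Roughly, such a descriptor could look like the sketch below. This is not the original struct, just an illustration of the fields described above; all names (buf_desc, buf_desc_init, and the field names) are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One buffer inside the big area, seen from all three sides. */
struct buf_desc {
    uint32_t  id;         /* id number of this area/buffer */
    int32_t   dev_id;     /* device it is currently associated with, -1 if none */
    size_t    len;        /* length of the buffer in bytes */
    void     *app_addr;   /* address the application can use (from mmap) */
    uint64_t  kern_addr;  /* address the kernel driver can use (opaque to the app) */
    uint64_t  bus_addr;   /* address that is given to the H/W device */
};

/* Fill in a descriptor for the buffer at byte offset `off` within the area,
 * given the three base addresses of the area. */
static void buf_desc_init(struct buf_desc *d, uint32_t id, int32_t dev_id,
                          void *app_base, uint64_t kern_base, uint64_t bus_base,
                          size_t off, size_t len)
{
    assert(d != NULL && app_base != NULL);

    d->id = id;
    d->dev_id = dev_id;
    d->len = len;
    d->app_addr = (char *)app_base + off;
    d->kern_addr = kern_base + off;
    d->bus_addr = bus_base + off;
}
```

The point of carrying all three addresses is that nobody has to translate: the app, the driver, and the device each pick the field that is valid for them.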

So, you take the large area and subdivide like this. 5 devices. Dev0 needs 10 1K buffers, Dev1 needs 10 20K buffers, Dev3 needs 10 2K buffers, ...
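A minimal sketch of that carve-up, using the buffer counts and sizes from the example above. The names (subpool, layout_subpools, buf_offset) are made up, and aligning each subpool to a page boundary is my assumption, not a requirement stated anywhere above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One fixed-size subpool: `count` buffers of `buf_size` bytes for one device. */
struct subpool {
    int      dev_id;
    uint32_t count;
    size_t   buf_size;
    size_t   offset;     /* start within the big area, set by layout_subpools() */
};

/* Round x up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t x, size_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Place the subpools back to back in an area of `area_size` bytes.
 * Returns the number of bytes used, or 0 if they do not fit. */
static size_t layout_subpools(struct subpool *pools, size_t npools,
                              size_t area_size, size_t align)
{
    size_t off = 0;

    assert(align != 0 && (align & (align - 1)) == 0);

    for (size_t i = 0; i < npools; i++) {
        size_t bytes = (size_t)pools[i].count * pools[i].buf_size;

        off = align_up(off, align);
        if (off > area_size || bytes > area_size - off)
            return 0;
        pools[i].offset = off;
        off += bytes;
    }
    return off;
}

/* Byte offset of buffer `idx` of a subpool within the big area. */
static size_t buf_offset(const struct subpool *p, uint32_t idx)
{
    assert(idx < p->count);
    return p->offset + (size_t)idx * p->buf_size;
}
```

Because every buffer in a subpool has the same size, finding buffer N is a multiply and an add, and there is no fragmentation to worry about.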

The application and kernel driver would keep lists of these descriptor structs. The application would start DMA with another ioctl that would take a descriptor id number. Repeat this for all devices.

The application could then issue an ioctl that waits for completion. The driver fills in the descriptor of the just completed operation. The app processes the data and loops. It does this "in-place"--See below.
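From the application's point of view, that loop might be sketched as follows. The ioctl numbers, the struct mydev_done layout, and the function names are all hypothetical; a real driver would define its own in a shared header.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical completion record filled in by the driver. */
struct mydev_done {
    uint32_t id;    /* descriptor id of the buffer that just completed */
    uint32_t len;   /* number of valid bytes in it */
};

/* Hypothetical ioctl numbers for this sketch. */
#define MYDEV_IOC_MAGIC 'M'
#define MYDEV_IOC_START _IOW(MYDEV_IOC_MAGIC, 1, uint32_t)
#define MYDEV_IOC_WAIT  _IOR(MYDEV_IOC_MAGIC, 2, struct mydev_done)

typedef void (*process_fn)(uint32_t id, uint32_t len, void *ctx);

/* Start DMA on every buffer, then wait/process/restart `iterations` times.
 * Returns 0 on success or a negative errno value. */
static int run_buffers(int fd, uint32_t nbufs, unsigned long iterations,
                       process_fn process, void *ctx)
{
    /* Queue each descriptor once so the device always has a target. */
    for (uint32_t id = 0; id < nbufs; id++) {
        if (ioctl(fd, MYDEV_IOC_START, &id) < 0)
            return -errno;
    }

    for (unsigned long n = 0; n < iterations; n++) {
        struct mydev_done done;

        /* Block until the driver reports a completed buffer. */
        if (ioctl(fd, MYDEV_IOC_WAIT, &done) < 0)
            return -errno;

        /* Work on the data in place; no memcpy out of the buffer. */
        process(done.id, done.len, ctx);

        /* Hand the buffer back to the device. */
        if (ioctl(fd, MYDEV_IOC_START, &done.id) < 0)
            return -errno;
    }
    return 0;
}
```

The process callback would look up the descriptor by id and use its application address directly.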

You're concerned about memcpy speed being slow. As we've discussed, that may be due to the way you were using mmap on /dev/mem.
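To compare mappings, it helps to measure with the same code each time. Here is a small sketch of a throughput measurement (memcpy_mb_per_sec is a made-up name); you would call it once with pointers into the /dev/mem mapping and once with ordinary buffers. The empty asm statement is a GCC/Clang idiom to keep the compiler from dropping the repeated copies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Copy `len` bytes from src to dst `reps` times and return the
 * throughput in MB/s (10^6 bytes per second), or 0.0 if the elapsed
 * time was too short to measure. */
static double memcpy_mb_per_sec(void *dst, const void *src, size_t len, int reps)
{
    struct timespec t0, t1;
    double secs;

    assert(dst != NULL && src != NULL && reps > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, len);
        /* Compiler barrier: the copied data counts as "used". */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0;
    return (double)len * (double)reps / secs / 1e6;
}
```

Use buffers much larger than the CPU caches, otherwise the cached case mostly measures cache bandwidth rather than memory bandwidth.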

But, if you're DMAing from a device into memory, the CPU cache may become stale, so you have to account for that. A real device driver has plenty of in-kernel support routines to handle this.

Here's a big one: Why do you need to do a memcpy at all? If things are set up properly, the application can operate directly on the data without needing to copy it. That is, the DMA operation puts the data in exactly the place the app needs it.

At a guess, right now, you've got your memcpy "racing" against the device. That is, you've got to copy off the data fast, so you can start the next DMA without losing any data.

The "big" area should be subdivided [as mentioned above] and the kernel driver should know about the sections. So, the driver starts DMA to id 0. When that completes, it immediately [in the ISR] starts DMA to id 1. When that completes, it goes on to the next one in its subpool. This can be done in a similar manner for each device. The application could poll for completion with an ioctl.

That way, the driver can keep all devices running at maximum speed and the application can have plenty of time to process a given buffer. And, once again, it doesn't need to copy it.

Another thing to talk about. Are the DMA registers on your devices double buffered or not? I'm assuming that your devices don't support sophisticated scatter/gather lists and are relatively simple.

In my particular case, in rev 1 of the H/W the DMA registers were not double buffered.

So, after starting DMA on buffer 0, the driver had to wait until the completion interrupt for buffer 0 before setting the DMA registers up for the next transfer to buffer 1. Thus, the driver had to "race" to do the setup for the next DMA [and had a very short window of time to do so]. After starting buffer 0, if the driver had changed the DMA registers on the device, it would have disrupted the already active request.

We fixed this in rev 2 with double buffering. When the driver set up the DMA regs, it would hit the "start" port. All the DMA ports were immediately latched by the device. At this point, the driver was free to do the full setup for buffer 1 and the device would automatically switch to it [without driver intervention] when buffer 0 was complete. The driver would get an interrupt, but could take almost the entire transfer time to set up the next request.

So, with rev 1 style system, a uio approach could not have worked--it would be way too slow. With rev 2, uio might be possible, but I'm not a fan, even if it's possible.

Note: In my case, we did not use read(2) or write(2) [i.e. the device read/write callbacks] at all. Everything was handled through special ioctl calls that took various structs like the one mentioned above. At a point early on, we did use read/write in a manner similar to the way uio uses them. But, we found the mapping to be artificial and limiting [and troublesome], so we converted over to the "only ioctl" approach.

More to the point, what are the requirements? Amount of data transferred per second. Number of devices that do what? Are they all input or are there output ones as well?

In my case [which did R/T processing of broadcast quality hidef H.264 video], we were able to do processing in the driver and application space as well as the custom FPGA logic. But, we used a full [non-uio] driver approach, even though, architecturally it looked like uio in places.

We had stringent requirements for reliability, R/T predictability, guaranteed latency. We had to process 60 video frames / second. If we ran over, even by a fraction, our customers started screaming. uio could not have done this for us.

So, you started this with a simple approach. But, I might take a step back and look at requirements, device capabilities/restrictions, alternate ways to get contiguous buffers, R/T throughput and latency, and reassess things. Does your current solution really address all the needs? Currently, you're already running into hot spots [data races between app and device] and/or limitations. Or would you be better off with a native driver that gives you more flexibility (i.e. there might be an as yet unknown requirement that will force the native driver anyway)?

Xilinx probably provides a suitable skeleton full driver in their SDK that you could hack up pretty quickly.