PCIE region not aligned, and not consistent - c

I am developing a PCIE device driver for openwrt, and I met a data bus error when trying to access the io-memory in timer interrupt, which I mentioned in my last question. After lots of research I think I might have found the reason, but I am unable to solve it. Below are my troubles.
Last week I found out that the pcie region size might have changed during system startup. The region size of bar0 is 4096 in my driver (return from pci_resource_len) and the region size is 4097 in lspci -vv, which breaks the page size of linux kernel. By reading the source code of pciutil, I find that lspci command fetch the pcie information from /sys/devices/pci0000:00/0000:00:00.0/resouce file. So I remove all my custom components and run the original openwrt on my router. By cat /sys/devices/pci0000:00/0000:00:00.0/resouce, the first line of the result (bar0) is
0x0000000010008000 0x0000000010009000 0x0000000000040200
Moreover, I also check the content of /proc/iomem, and the content related to PCIE is
10000000-13ffffff : mem_base
10000000-13ffffff : PCI memory space
10000000-10007fff : 0000:00:00.0
10008000-10008fff : 0000:00:00.0
It is super weird that the region size of bar0 indicated by the two files above is different! According to the mechanism of PCIE, the region size should always be the power of 2. How come the region size becomes 4097?

After spending weeks reading source code of linux kernel, I find out that this is a bug of linux kernel 4.4.14.
The content of /sys/devices/pci0000:00/0000:00:00.0/resouce is generated through function resource_show in file drivers/pci/pci-sysfs.c. The related code is
for (i = 0; i < max; i++) {
struct resource *res = &pci_dev->resource[i];
pci_resource_to_user(pci_dev, i, res, &start, &end);
str += sprintf(str, "0x%016llx 0x%016llx 0x%016llx\n",
(unsigned long long)start,
(unsigned long long)end,
(unsigned long long)res->flags);
}
The function pci_resource_to_user actually invoked is located in arch/mips/include/asm/pci.h
static inline void pci_resource_to_user(const struct pci_dev *dev, int bar,
const struct resource *rsrc, resource_size_t *start,
resource_size_t *end)
{
phys_addr_t size = resource_size(rsrc);
*start = fixup_bigphys_addr(rsrc->start, size);
*end = rsrc->start + size;
}
The calculation of *end is wrong and should be replace by
*end = rsrc->start + size - (size ? 1 : 0)

Related

Get maximum available register for Linux PCI device

I am currently debugging a Linux kernel driver.
I want to sweep a PCI device's mmio registers to scan for certain information.
This is the function I wrote so far.
void _sweep_registers(struct pci_dev *dev)
{
int i;
int activecontrolstatus;
int activestatus;
for (i = 0; i < AMD_P2C_MSG_INTSTS; i++) {
activecontrolstatus = readl(privdata->mmio + i);
activestatus = activecontrolstatus >> 4;
dev_info(&dev->dev, "activecontrolstatus = %d / activestatus = %d",
activecontrolstatus, activestatus);
}
}
Currently I am reading mmio until what's specified in AMD_P2C_MSG_INTSTS (which is 0x10694).
But how far can I actually go?
I have zero knowledge of Linux kernel development and only rudimentary knowledge of C.
Background
My goal is to find information about which sensors of the AMD Sensor Fusion Hub are marked as active.
They should be under the register 0x1068C, but are not on my system (it's 0x0, but at least an accelerometer is available, so the bitmask should at least match 0x1).
I want to see, whether they are stored somewhere else.

AF_XDP: Relationship between `FRAME_SIZE` and actual size of packet

My AF-XDP userspace program is based on this tutorial: https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP
I am currently trying to parse ~360.000 RTP-packets per second (checking for continuous sequence numbers) but I loose around 25 per second (this means that for 25 packets the statement previous_rtp_sqnz_nmbr + 1 == current_rtp_sqnz_nmbr doesn't hold true).
So I tried to increase the number of allocated packets NUM_FRAMES from 228.000 to 328.000. With default FRAME_SIZE of XSK_UMEM__DEFAULT_FRAME_SIZE = 4096 this results in 1281Mbyte being allocated (no problem because I have 32GB of RAM) but for whatever reason, this function call:
static struct xsk_umem_info *configure_xsk_umem(void *buffer, uint64_t size)
{
printf("Try to allocate %lu\n", size);
struct xsk_umem_info *umem;
int ret;
umem = calloc(1, sizeof(*umem));
if (!umem)
return NULL;
ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
NULL);
if (ret) {
errno = -ret;
return NULL;
}
umem->buffer = buffer;
return umem;
}
fails with
Try to allocate 1343488000
ERROR: Can't create umem "Cannot allocate memory"
I don't know why? But because I know that my RTP-packets are not larger than 1500bytes, I set FRAME_SIZE 3072 so I am now at around 960Mbyte (which works without an error).
However, I am now loosing half of the received packets (this means that for 180.000 packets the previous sequence number doesn't line up with the current sequence number).
Because of this I ask the question: What is the relationship between FRAME_SIZE and the actual size of a packet? Because obviously it can not be the same.
Edit: I am using 5.4.0-4-amd64 #1 SMP Debian 5.4.19-1 (2020-02-13) x86_64 GNU/Linux and just copied the libbpf-repository from here into my code-base: https://github.com/libbpf/libbpf
So I don't know whether the error mentioned here: https://github.com/xdp-project/xdp-tutorial/issues/76 is still valid?

How to count the anonymous pages and shared pages for a process in Linux using kernel module

I am trying to find the resident set size of a c program running on Linux os (ubuntu 14.04). I get the PID of the running C program and pass it to a custom kernel module. The kernel module figures out the *task and extracts the *mm pointer. Then I loop through all the VM areas and in each VM area I again loop through each page aligned virtual addresses and request a page_walk(virtual addresses) to get the pte structure of type pte_t. Then I used the pte_preset() function to check the existence of the actual physical page in the RAM.
The issues I am facing are as follows:
The rss value does not match with the value shown in htop or top. Although the value I have calculated does increase proportionally as the test C program accesses more memory (using some array accessing).
I have found that the rss value of htop application gives the same result as given by the get_mm_struct() function call provided by the Linux kernel itself.
static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
return get_mm_counter(mm, MM_FILEPAGES) +
get_mm_counter(mm, MM_ANONPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES);
}
My query is how to count or detect these anonymous pages and shared pages? What are bits that need to be checked?
Thank You !
The correct way to do this is to realize that the count is in an array. Try:
static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
int k;
unsigned long count = 0;
for(k = 0; k < NR_MM_COUNTERS; k++) {
long len = atomic_long_read(&mm->rss_stat.count[k]);
if(len < 0)
len = 0;
count += len;
}
}
Walking Physical Pages
You need to set up mm_walk struct with your call backs for pte and pmd (driven by whether or not HUGETABLES are used in the kernel) to walk through the physical pages.
For example:
show_smap uses this:
struct mm_walk smaps_walk = {
.pmd_entry = smaps_pte_range,
#ifdef CONFIG_HUGETLB_PAGE
.hugetlb_entry = smaps_hugetlb_range,
#endif
.mm = vma->vm_mm,
};
after setting up the call backs.

Block ram disk fails to read/write with offset

I'm creating a very very simple block RAM disk based on sbull.
So far it works fine if I read/write blocks of data using dd, but whenever I try mounting a filesystem on it (and sometimes creating a file system) my driver crashes.
After long weeks of debugging, I finally found out what is wrong, even though I can't really find a way to solve the problem. Hence my question here :)
Whenever a user space application creates a request to the device WITH AN OFFSET, the driver won't work! Let me show you the source code in order to clarify:
First of all, I'm handling requests using mk_request (not using a request_queue):
static void escsi_mk_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct escsi_dev *esd = bdev->bd_disk->private_data;
int rw;
struct bio_vec *bvec;
sector_t sector;
int i;
int err = -EIO;
printk("request received nr. sectors = %lu\n",bio_sectors(bio));
sector = bio->bi_sector;
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto out;
if (unlikely(bio->bi_rw & REQ_DISCARD)) {
err = 0;
goto out;
}
rw = bio_rw(bio);
if (rw == READA)
rw = READ;
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
err = esd_do_bvec(esd, bvec->bv_page, len, bvec->bv_offset, rw, sector);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
out:
bio_endio(bio, err);
}
The esd_do_bvec function:
static int esd_do_bvec(struct escsi_dev *esd, struct page *page,
unsigned int len, unsigned int off, int rw,
sector_t sector)
{
void *mem;
int err = 0;
unsigned int offset;
int i;
offset = off + sector * 512;
printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
mem = kmap_atomic(page);
if (rw == READ) {
memcpy(mem,esd->data+offset,len);
} else {
memcpy(esd->data+offset,mem,len);
}
kunmap_atomic(mem);
out:
return err;
}
OK, so basically when I read or write data using dd, the variable "off" in esd_do_bvec() is always 0, regardless of where and how many bytes I want to write. The file system obviously always performs I/O in 4KB chunks and will write a full block even when only one byte needs to be replaced.
I am sure that reads and writes are working correctly when there's no offset because I created a file that is the same size as my block RAM disk and dumped the entire file into my device using dd, then got the output of the device (also using dd), and the input and output files are exactly the same. I also wrote the same file into a brd (Linux kernel original block RAM disk driver) and the outputs are the same comparing my device and the brd device.
BUT -- in some specific situations I try to mount or create a new file system on my device and somehow it gets I/O requests with an offset, and at that point my driver fails. I assume that I'm not handling the offset properly. For example, when I try "mount -t ext2 /dev/esda":
linux-xjwl:/home/phil/escsi # mount /dev/esda -t ext2 /mnt/esda1/
mount: wrong fs type, bad option, bad superblock on /dev/esda,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
linux-xjwl:/home/phil/escsi # dmesg|tail -n 10
[ 2239.275901] ESD RW=0, len=4096, off=0, offset=16384, sector=32
[ 2239.275947] request received nr. sectors = 8
[ 2239.275959] ESD RW=0, len=4096, off=0, offset=4096, sector=8
[ 2239.276516] request received nr. sectors = 8
[ 2239.276537] ESD RW=0, len=4096, off=0, offset=2097152, sector=4096
[ 2239.276606] request received nr. sectors = 8
[ 2239.276626] ESD RW=0, len=4096, off=0, offset=28672, sector=56
[ 2239.277535] request received nr. sectors = 2
[ 2239.277535] ESD RW=0, len=1024, off=1024, offset=2048, sector=2
[ 2239.277535] EXT4-fs (esda): VFS: Can't find ext4 filesystem
(p.s.: the output shows "EXT4" but I am running with "-t ext2")
I have checked the contents of sector n. 2 in my device and it does contain the ext2 metadata (since I ran mkfs.ext2 prior to trying to mount, of course). So I believe there's a problem with the offset. So far I can't really debug my driver because I wasn't able to come up with a request which would cause an I/O request with an offset (e.g., if I try writing a single byte into my device, Linux will read the whole block and rewrite it with only one different byte).
Hope it's not a too simple question for you.
Thanks in advance,
Phil
Please see the answer provided by Peter below.
If you're wondering what the esd_do_bvec() function looks like now, here it comes:
static int esd_do_bvec(struct escsi_dev *esd, char *buf,
unsigned int len, int rw, sector_t sector)
{
int err = 0;
unsigned int offset;
// Please notice that we STILL have an offset to deal with, but
// this offset comes in sectors and needs to be converted to a
// a byte offset.
offset = sector << SECTOR_SHIFT; // or multiply by 512
//printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
if (rw == READ) {
memcpy(buf,esd->data+offset,len);
} else {
memcpy(esd->data+offset,buf,len);
}
return err;
}
The offset per segment does not refer to an offset from the block device location, but rather an offset into the page. To cause this to be nonzero, you'll probably need to write your own C program that runs read() and write(). Allocate a page-aligned buffer, then read/write to/from different locations in that buffer, and those should show up as offsets in the bvec.
That said, LWN warns of managing this page offset manually, and recommends instead the macro bio_kmap_irq(), which is called on the bio_for_each_segment() variable bio, and takes care of the atomic kmap AND manages the offset entry as well. Source: http://lwn.net/Articles/26404/
Your code will look something like:
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
unsigned long flags;
char *buf = bio_kmap_irq(bio, &flags);
err = esd_do_bvec(esd, buf, len, rw, sector);
bio_kunmap_irq(buf, &flags);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
Of course this changes the signature of esd_do_bvec to accept the memory buffer directly rather than page/offset.

Finding the address range of the data segment

As a programming exercise, I am writing a mark-and-sweep garbage collector in C. I wish to scan the data segment (globals, etc.) for pointers to allocated memory, but I don't know how to get the range of the addresses of this segment. How could I do this?
If you're working on Windows, then there are Windows API that would help you.
//store the base address the loaded Module
dllImageBase = (char*)hModule; //suppose hModule is the handle to the loaded Module (.exe or .dll)
//get the address of NT Header
IMAGE_NT_HEADERS *pNtHdr = ImageNtHeader(hModule);
//after Nt headers comes the table of section, so get the addess of section table
IMAGE_SECTION_HEADER *pSectionHdr = (IMAGE_SECTION_HEADER *) (pNtHdr + 1);
ImageSectionInfo *pSectionInfo = NULL;
//iterate through the list of all sections, and check the section name in the if conditon. etc
for ( int i = 0 ; i < pNtHdr->FileHeader.NumberOfSections ; i++ )
{
char *name = (char*) pSectionHdr->Name;
if ( memcmp(name, ".data", 5) == 0 )
{
pSectionInfo = new ImageSectionInfo(".data");
pSectionInfo->SectionAddress = dllImageBase + pSectionHdr->VirtualAddress;
**//range of the data segment - something you're looking for**
pSectionInfo->SectionSize = pSectionHdr->Misc.VirtualSize;
break;
}
pSectionHdr++;
}
Define ImageSectionInfo as,
struct ImageSectionInfo
{
char SectionName[IMAGE_SIZEOF_SHORT_NAME];//the macro is defined WinNT.h
char *SectionAddress;
int SectionSize;
ImageSectionInfo(const char* name)
{
strcpy(SectioName, name);
}
};
Here's a complete, minimal WIN32 console program you can run in Visual Studio that demonstrates the use of the Windows API:
#include <stdio.h>
#include <Windows.h>
#include <DbgHelp.h>
#pragma comment( lib, "dbghelp.lib" )
void print_PE_section_info(HANDLE hModule) // hModule is the handle to a loaded Module (.exe or .dll)
{
// get the location of the module's IMAGE_NT_HEADERS structure
IMAGE_NT_HEADERS *pNtHdr = ImageNtHeader(hModule);
// section table immediately follows the IMAGE_NT_HEADERS
IMAGE_SECTION_HEADER *pSectionHdr = (IMAGE_SECTION_HEADER *)(pNtHdr + 1);
const char* imageBase = (const char*)hModule;
char scnName[sizeof(pSectionHdr->Name) + 1];
scnName[sizeof(scnName) - 1] = '\0'; // enforce nul-termination for scn names that are the whole length of pSectionHdr->Name[]
for (int scn = 0; scn < pNtHdr->FileHeader.NumberOfSections; ++scn)
{
// Note: pSectionHdr->Name[] is 8 bytes long. If the scn name is 8 bytes long, ->Name[] will
// not be nul-terminated. For this reason, copy it to a local buffer that's nul-terminated
// to be sure we only print the real scn name, and no extra garbage beyond it.
strncpy(scnName, (const char*)pSectionHdr->Name, sizeof(pSectionHdr->Name));
printf(" Section %3d: %p...%p %-10s (%u bytes)\n",
scn,
imageBase + pSectionHdr->VirtualAddress,
imageBase + pSectionHdr->VirtualAddress + pSectionHdr->Misc.VirtualSize - 1,
scnName,
pSectionHdr->Misc.VirtualSize);
++pSectionHdr;
}
}
// For demo purpopses, create an extra constant data section whose name is exactly 8 bytes long (the max)
#pragma const_seg(".t_const") // begin allocating const data in a new section whose name is 8 bytes long (the max)
const char const_string1[] = "This string is allocated in a special const data segment named \".t_const\".";
#pragma const_seg() // resume allocating const data in the normal .rdata section
int main(int argc, const char* argv[])
{
print_PE_section_info(GetModuleHandle(NULL)); // print section info for "this process's .exe file" (NULL)
}
This page may be helpful if you're interested in additional uses of the DbgHelp library.
You can read the PE image format here, to know it in details. Once you understand the PE format, you'll be able to work with the above code, and can even modify it to meet your need.
PE Format
Peering Inside the PE: A Tour of the Win32 Portable Executable File Format
An In-Depth Look into the Win32 Portable Executable File Format, Part 1
An In-Depth Look into the Win32 Portable Executable File Format, Part 2
Windows API and Structures
IMAGE_SECTION_HEADER Structure
ImageNtHeader Function
IMAGE_NT_HEADERS Structure
I think this would help you to great extent, and the rest you can research yourself :-)
By the way, you can also see this thread, as all of these are somehow related to this:
Scenario: Global variables in DLL which is used by Multi-threaded Application
The bounds for text (program code) and data for linux (and other unixes):
#include <stdio.h>
#include <stdlib.h>
/* these are in no header file, and on some
systems they have a _ prepended
These symbols have to be typed to keep the compiler happy
Also check out brk() and sbrk() for information
about heap */
extern char etext, edata, end;
int
main(int argc, char **argv)
{
printf("First address beyond:\n");
printf(" program text segment(etext) %10p\n", &etext);
printf(" initialized data segment(edata) %10p\n", &edata);
printf(" uninitialized data segment (end) %10p\n", &end);
return EXIT_SUCCESS;
}
Where those symbols come from: Where are the symbols etext ,edata and end defined?
Since you'll probably have to make your garbage collector the environment in which the program runs, you can get it from the elf file directly.
Load the file that the executable came from and parse the PE headers, for Win32. I've no idea about on other OSes. Remember that if your program consists of multiple files (e.g. DLLs) you may have multiple data segments.
For iOS you can use this solution. It shows how to find the text segment range but you can easily change it to find any segment you like.

Resources