submitting my own built bio stuck - c

When I build a bio myself and submit it with generic_make_request, the kernel log gets flooded with the messages shown below.
The pseudocode is as follows:
struct page *page = kmalloc(sizeof(struct page), GFP_KERNEL);
struct bio *bio = bio_alloc(GFP_KERNEL);
set_bio(bio);
add_bio_page(bio, page);
submit_bio(bio);
The log is then flooded with messages like:
nommu_map_sg overflow xxxxxxxxxxx+4096 of device mask ffffffff
When I change the allocation of the page to
struct page *page = alloc_page(GFP_KERNEL);
the kernel just hangs, and I can see high CPU consumption in the VM I use.

I can't point to the exact error, but your code lacks a few elements:
the bi_bdev field of struct bio should point to a block device
the bi_end_io field should point to an I/O completion routine

Related

Is there any way to detect packet direction in the PRE_ROUTING hook point

I'm trying to create a firewall in C as a Linux kernel module. As part of the firewall, I've implemented a hook function which performs packet inspection at the PRE_ROUTING hook point.
In the hook function I need to deduce the packet direction based on its source and destination networking devices.
Whenever I try to extract the source and destination devices, in the packet inspection function, a kernel panic occurs and the OS crashes, and I have no idea why (I've followed linux/netfilter.h strictly). I would more than appreciate any help!
The relevant part of the hook function is as below:
unsigned int inspect_packet(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
char *src_device;
char *dst_device;
src_device = state->in->name;
dst_device = state->out->name;
/* Deduce the packets direction by the networking devices direction */
if (src_device[5] == IN_DEVICE_NUM && dst_device[5] == OUT_DEVICE_NUM)
{
/* some code */
}
}
As you can see, I used (as in the header files) the state->in and state->out fields in order to extract the source and destination device of the packet.
Note: The kernel panic certainly occurs from the code above, the rest of the code is irrelevant.
Solution:
As found in the comments above, the mistake was assuming that the destination device of the packet is already assigned when the hook function is called. Because the hook is registered at PRE_ROUTING, the routing decision has not been made yet, so state->out is still NULL and dereferencing it causes the panic. To solve the problem, we can deduce the packet direction from the source device alone.
Here is the fixed version of the code:
unsigned int inspect_packet(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
{
char *src_device;
src_device = state->in->name;
/* Deduce the packets direction just by the source networking device */
if (src_device[5] == IN_DEVICE_NUM)
{
/* some code */
}
}
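The same idea can be tried out in user space. The sketch below is only a model: struct net_device and struct nf_hook_state are simplified stand-ins for the kernel structures, the device names "eth1" and "eth2" are made up, and it compares the whole name with strcmp instead of indexing src_device[5] as the code above does. It shows the defensive pattern of checking a device pointer before dereferencing it and never touching state->out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified user-space stand-ins for the kernel structures; only the
 * fields used here are modelled. */
struct net_device {
    char name[16];
};

struct nf_hook_state {
    const struct net_device *in;  /* ingress device */
    const struct net_device *out; /* NULL at PRE_ROUTING: not routed yet */
};

enum pkt_direction { DIR_UNKNOWN, DIR_FROM_INSIDE, DIR_FROM_OUTSIDE };

/* Hypothetical device names, assumed for illustration only. */
#define INSIDE_DEV  "eth1"
#define OUTSIDE_DEV "eth2"

/* Classify a packet from its ingress device only. state->out is never
 * read, and state->in is checked before it is dereferenced. */
enum pkt_direction classify(const struct nf_hook_state *state)
{
    assert(state != NULL);

    if (state->in == NULL)
        return DIR_UNKNOWN;
    if (strcmp(state->in->name, INSIDE_DEV) == 0)
        return DIR_FROM_INSIDE;
    if (strcmp(state->in->name, OUTSIDE_DEV) == 0)
        return DIR_FROM_OUTSIDE;
    return DIR_UNKNOWN;
}
```

In a real module the hook would also have to return a verdict such as NF_ACCEPT; the model leaves that out.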

How to properly utilize masks to send index information to perf event output?

According to the documentation for bpf_perf_event_output found here: http://man7.org/linux/man-pages/man7/bpf-helpers.7.html
"The flags are used to indicate the index in map for which the value must be put, masked with BPF_F_INDEX_MASK."
In the following code:
SEC("xdp_sniffer")
int xdp_sniffer_prog(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
if (data < data_end) {
/* If we have reached here, that means this
* is a useful packet for us. Pass on-the-wire
* size and our cookie via metadata.
*/
__u64 flags = BPF_F_INDEX_MASK;
__u16 sample_size;
int ret;
struct S metadata;
metadata.cookie = 0xdead;
metadata.pkt_len = (__u16)(data_end - data);
/* To minimize writes to disk, only
* pass necessary information to userspace;
* that is just the header info.
*/
sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
flags |= (__u64)sample_size << 32;
ret = bpf_perf_event_output(ctx, &my_map, flags,
&metadata, sizeof(metadata));
if (ret)
bpf_printk("perf_event_output failed: %d\n", ret);
}
return XDP_PASS;
}
It works as you would expect and stores the information for the given CPU number.
However, suppose I want all packets to be sent to index 1.
I swap
__u64 flags = BPF_F_INDEX_MASK;
for
__u64 flags = 0x1ULL;
The code compiles correctly and throws no errors, but no packets get saved at all anymore. What am I doing wrong if I want all of the packets to be sent to index 1?
Partial answer: I see no reason why the packets would not be sent to the perf buffer, but I suspect the error is in the user space code (not provided). It could be that you do not “open” the perf event for all CPUs when trying to read from the buffer. Have a look at the man page for perf_event_open(2): check that the combination of values for pid and cpu allows you to read data written for CPU 1.
As a side note, this:
__u64 flags = BPF_F_INDEX_MASK;
is misleading. The mask is meant to mask the index, not to set its value. BPF_F_CURRENT_CPU should be used instead; BPF_F_INDEX_MASK only happens to work because the two enum values are equal (both 0xffffffff).
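To make the layout of the flags argument concrete, here is a small user-space sketch. The three constants are copied from include/uapi/linux/bpf.h so that it builds outside the kernel tree; perf_output_flags is a hypothetical helper name, not part of any BPF API. The map index (or BPF_F_CURRENT_CPU) goes in the low 32 bits, and the number of context bytes to append goes in the bits covered by BPF_F_CTXLEN_MASK, which is what the `sample_size << 32` line in the question does.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/bpf.h. */
#define BPF_F_INDEX_MASK  0xffffffffULL
#define BPF_F_CURRENT_CPU BPF_F_INDEX_MASK
#define BPF_F_CTXLEN_MASK (0xfffffULL << 32)

/* Hypothetical helper: build the flags value for bpf_perf_event_output.
 * index   - map index to write to, or BPF_F_CURRENT_CPU
 * ctx_len - number of packet bytes to append to the sample */
uint64_t perf_output_flags(uint64_t index, uint64_t ctx_len)
{
    /* The length must fit in the 20 bits reserved for it. */
    assert(ctx_len <= (BPF_F_CTXLEN_MASK >> 32));

    uint64_t flags = index & BPF_F_INDEX_MASK; /* low 32 bits: index */
    flags |= ctx_len << 32;                    /* upper bits: context length */
    return flags;
}
```

With this helper, perf_output_flags(BPF_F_CURRENT_CPU, sample_size) expresses what the original code intends, and perf_output_flags(1, sample_size) targets index 1 explicitly.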

Using physical address as sk_buff data fragment

Is it possible to map a physical address as a data fragment in an sk_buff?
I am working on a Zynq UltraScale+ platform (FPGA + ARM SoC). I have a memory buffer mapped to a physical address. The goal is to send that data over UDP efficiently, by which I mean zero-copy. What I am trying to do is develop a Linux driver that maps that physical address into kernel memory and appends it to an sk_buff as a fragment.
I started with:
#define PACKET_LEN 1024
struct page *pag;
struct net_device *dev;
struct sk_buff *skb = NULL;
skb = alloc_skb(LL_RESERVED_SPACE(dev) + PACKET_LEN + ip_header_l +
udp_header_l, GFP_ATOMIC);
udp = skb_push(skb, udp_header_l);
//Fill up udp header
...
ip = skb_push(skb, ip_header_l);
//fill up ip header
...
dev_hard_header(skb, dev, ETH_P_IP, addr, myaddr, dev->addr_len);
skb->dev = dev;
//map page with data as fragment
skb_fill_page_desc(skb, 0, pag, 0, PACKET_LEN);
//send data
dev_queue_xmit(skb);
And as long as page is created by:
pagebuff = vmalloc(PACKET_LEN);
pag = vmalloc_to_page(pagebuff);
It all works fine. The packet gets sent, using two DMA transactions (scatter-gather).
Going towards my goal I replaced vmalloced page with:
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
membase = devm_ioremap_resource(&pdev->dev, res);
pag = virt_to_page(membase);
The physical address is 0xb0000000 and is mapped to virtual address 0xffffff800ad30000; the page is at 0xffffffbf0025e280.
After dev_queue_xmit packet goes to network queue and ends up being mapped for DMA.
Problem arises when swiotlb_map_page uses 0x00ad30000 as phys_addr, which is different than original 0xb0000000.
virt_to_phys is used in swiotlb_map_page to calculate physical address and it basically takes lower 32 bits as phys address. Is there a different way to map memory region so it can be used as sk_buff fragment?
As a temporary fix I created fake page like this:
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
pag = alloc_page(0); //create fake page
memset(pag, 0, sizeof(struct page));
pag->private = res->start;
And patched ethernet driver to use page private data as mapping address:
mapping = skb_frag_page(frag)->private;
if (mapping) {
// printk("macb mapping override to %p\n",mapping);
}
else {
mapping = skb_frag_dma_map(&bp->pdev->dev, frag, offset, size, DMA_TO_DEVICE);
if (dma_mapping_error(&bp->pdev->dev, mapping))
goto dma_error;
}
With such a hack it all works. Data is filled with contents of 0xb0000000. Although it works fine I really doubt it is the right way to do it. Nevertheless it shows there is no hardware limitation to do it. Does anyone know how to map that memory correctly?
P.S. I also tried to map physical address to fixed virtual address in such manner that swiotlb_map_page would calculate correct address (and virt_to_phys did), but it ended with "Unable to handle kernel paging request at virtual address" error.
membase = phys_to_virt(res->start);
i = ioremap_page_range(membase, membase + resource_size(res),
res->start, PAGE_KERNEL);
//I tried both
pag = phys_to_page(res->start);
pag = virt_to_page(membase);
Maybe I am looking for page at wrong address or maybe it is nonexistent.
Can anyone point me in the right direction? Is there a way to accomplish the goal without such a nasty hack?

Raw LWIP Send TCP Transmission to Static IP

I've got the TCP Echo example working well on my hardware, and yesterday figured out how to get a UDP broadcast working. After further thought, I've realized that what I really need is to be able to set up a TCP connection to a static IP, the idea being that my hardware can connect to a server of some sort and then use that connection for all its transactions. The difference is that whereas the echo example sets up a passive connection that binds with the incoming source (as I understand it), I want to initiate the connection deliberately to a known IP.
Based on what I found on Wikia (here and here), I've attempted as a base case to implement a function that can send a packet to a defined IP. I'm simply trying to send a packet to my PC, and I'm looking for it in Wireshark.
void echo_tx_tcp()
{
err_t wr_err = ERR_OK;
struct tcp_pcb *l_tcp_pcb;
l_tcp_pcb = tcp_new();
ip_addr_t dest_ip =
{ ((u32_t)0x0C0C0C2BUL) };
wr_err = tcp_bind(l_tcp_pcb, &dest_ip, 12);
wr_err = tcp_connect(l_tcp_pcb, &dest_ip, 12, echo_accept);
tcp_sent(l_tcp_pcb, echo_sent);
struct pbuf *p = pbuf_alloc(PBUF_TRANSPORT, 1024, PBUF_RAM);
unsigned char buffer_send[1024] = "My Name Is TCP";
p->payload = buffer_send;
p->len = 1024;
p->tot_len = 1024;
wr_err = tcp_write(l_tcp_pcb, p->payload, p->len, 1);
wr_err = tcp_output(l_tcp_pcb);
if(wr_err == ERR_OK)
{
p->len++;
}
return;
}
The last if statement exists only so that I can inspect the wr_err value with a debugger. The error code comes back OK, but the packet is not seen in Wireshark. My setup is my hardware and my PC connected to a router on an isolated network. The local IP address of the PC is 12.12.12.43.
Am I missing a step here?
The tcp_write() function will fail and return ERR_MEM if:
The length of the data exceeds the current send buffer size.
The length of the queue of outgoing segments is larger than the upper limit defined in lwipopts.h.
The number of bytes available in the output queue can be retrieved with the tcp_sndbuf() function.
Potential solution(s):
Try again but send less data.
Monitor the amount of space available in the send buffer and only send (more) data when there is space available in the send buffer.
Suggestions:
tcp_sndbuf() can be used to find out how much send buffer space is available.
tcp_sent() can be used to register a callback function that will be called when send buffer space becomes available.
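The "only send when there is space" advice can be sketched as a small user-space model. Nothing here calls lwIP: struct send_state and next_chunk are hypothetical names, and the sndbuf_free argument stands in for the value that tcp_sndbuf() would report for the connection. The function just decides how many bytes may be handed to tcp_write() right now.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one outgoing message. */
struct send_state {
    const unsigned char *data; /* message to send */
    size_t len;                /* total message length in bytes */
    size_t sent;               /* bytes already queued for transmission */
};

/* Return how many bytes can be queued now, given the free space in the
 * send buffer. Returns 0 when the buffer is full or nothing is left. */
size_t next_chunk(const struct send_state *st, size_t sndbuf_free)
{
    assert(st != NULL && st->sent <= st->len);

    size_t remaining = st->len - st->sent;
    return remaining < sndbuf_free ? remaining : sndbuf_free;
}
```

In a real application the remainder would typically be queued from the callback registered with tcp_sent(), after advancing `sent` by the amount that tcp_write() accepted.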

When two processes read the same file simultaneously, will Linux kernel save one device I/O?

I have a generic question about Linux kernel's handling of file I/O. So far my understanding is that, in an ideal case, after process A reads a file, data is loaded into page cache, and if process B reads the same page before it is reclaimed, it does not need to hit the disk again.
My question is related to how the block device I/O works. Process A's read request will eventually be queued before the I/O actually happens. Now if process B's request (a bio struct) is to be inserted into the request_queue before A's request is executed, the elevator will consider whether to merge B's bio into any existing request. If A and B try to read the same file offset, i.e. the same device block, they are literally the same I/O (or A's and B's requests are not exactly the same but overlap for some blocks), but so far I have not seen this case being considered in the kernel code. (The only relevant thing I saw is a test on whether a bio can be glued to an existing request contiguously.)
kernel 2.6.11
inline int elv_try_merge(struct request *__rq, struct bio *bio)
{
int ret = ELEVATOR_NO_MERGE;
/*
* we can merge and sequence is ok, check if it's possible
*/
if (elv_rq_merge_ok(__rq, bio)) {
if (__rq->sector + __rq->nr_sectors == bio->bi_sector)
ret = ELEVATOR_BACK_MERGE;
else if (__rq->sector - bio_sectors(bio) == bio->bi_sector)
ret = ELEVATOR_FRONT_MERGE;
}
return ret;
}
kernel 5.3.5
enum elv_merge elv_merge(struct request_queue *q, struct request **req,
struct bio *bio)
{
struct elevator_queue *e = q->elevator;
struct request *__rq;
...
/*
* See if our hash lookup can find a potential backmerge.
*/
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
...
}
struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
{
struct elevator_queue *e = q->elevator;
struct hlist_node *next;
struct request *rq;
hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
...
if (rq_hash_key(rq) == offset)
return rq;
}
return NULL;
}
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))
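The contiguity test from the 2.6.11 excerpt can be modelled in user space to see what it does with identical or overlapping ranges. This is only a sketch of that one check: struct io_range and try_merge are hypothetical names, elv_rq_merge_ok is assumed to have passed, and the front-merge condition is rewritten as an addition (equivalent to the subtraction in the excerpt) to avoid unsigned underflow.

```c
#include <assert.h>
#include <stdint.h>

enum merge_kind { NO_MERGE, FRONT_MERGE, BACK_MERGE };

/* A run of sectors: first sector and length. */
struct io_range {
    uint64_t sector;
    uint64_t nr_sectors;
};

/* Model of the contiguity check: a bio is merged only if it starts
 * exactly where the request ends (back merge) or ends exactly where
 * the request starts (front merge). */
enum merge_kind try_merge(struct io_range rq, struct io_range bio)
{
    assert(rq.nr_sectors > 0 && bio.nr_sectors > 0);

    if (rq.sector + rq.nr_sectors == bio.sector)
        return BACK_MERGE;
    if (bio.sector + bio.nr_sectors == rq.sector)
        return FRONT_MERGE;
    return NO_MERGE;
}
```

For a request covering sectors 100 to 107, a bio starting at 108 is a back merge and a bio covering 92 to 99 is a front merge, while a bio for the same sectors 100 to 107, or an overlapping one starting at 104, falls through to NO_MERGE in this model.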
Does that mean the kernel will just do two I/Os? Or (very likely) did I miss something?
thanks!

Resources