Imagine having a stack of protocols and some C/C++ code that neatly covers
sending on each layer. Each send function uses the layer below to add
another header, until the whole message is eventually placed into a
contiguous global buffer on layer 0:
void SendLayer2(void * payload, unsigned int payload_length)
{
    Layer2Header header; /* eat some stack */
    const int msg_length = sizeof(header) + payload_length;
    char msg[msg_length]; /* and have some more */
    memset(msg, 0, sizeof(msg));
    header.whatever = 42;
    memcpy(msg, &header, sizeof(header));
    memcpy(&msg[sizeof(header)], payload, payload_length);
    SendLayer1(msg, msg_length);
}
void SendLayer1(void * payload, unsigned int payload_length)
{
    Layer1Header header; /* eat some stack */
    const int msg_length = sizeof(header) + payload_length;
    char msg[msg_length]; /* and have some more */
    memset(msg, 0, sizeof(msg));
    header.whatever = 42;
    memcpy(msg, &header, sizeof(header));
    memcpy(&msg[sizeof(header)], payload, payload_length);
    SendLayer0(msg, msg_length);
}
Now the data is moved to some global variable and actually transferred:
char globalSendBuffer[MAX_MSG_SIZE];

void SendLayer0(void * payload, unsigned int payload_length)
{
    // Some clever locking for the global buffer goes here
    memcpy(globalSendBuffer, payload, payload_length);
    SendDataViaCopper(globalSendBuffer, payload_length);
}
I'd like to reduce both the stack usage and the number of memcpy()s in this
code, so I imagine something like:
void SendLayer2(void * payload, unsigned int payload_length)
{
    Layer2Header * header = GetHeader2Pointer();
    header->whatever = 42;
    void * buffer = GetPayload2Pointer();
    memcpy(buffer, payload, payload_length);
    ...
}
My idea would be to have something at the bottom that calculates the proper offset for each layer's header and for the actual payload, by successively subtracting from MAX_MSG_SIZE. The upper layer code would then fill in the global buffer directly, from the end / right side.
Does this sound sensible? Are there alternative, perhaps more elegant approaches?
You may be interested in this article: "Network Buffers and Memory Management" by Alan Cox. Basically, you have the buffer and several pointers to different interesting parts of that buffer: protocol headers, data, ... Initially you reserve some space for headers by setting the data pointer to (buffer_start + max_headers_size), and each layer gets a pointer nearer to the start of the buffer.
I'm sure there must be a similar description somewhere for BSD's mbufs.
EDIT:
David Miller (Linux networking maintainer) has this article "How SKBs work"
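A minimal sketch of that headroom-reservation idea in plain C (purely illustrative: the type, size constants and helper names below are invented, and a real sk_buff/mbuf implementation does much more):

#include <stddef.h>

#define MAX_MSG_SIZE     1500   /* illustrative values */
#define MAX_HEADERS_SIZE 64

typedef struct {
    char  buf[MAX_MSG_SIZE];
    char *data;   /* start of payload plus any headers pushed so far */
    char *tail;   /* one past the last valid payload byte */
} msg_t;

/* Reserve headroom: the data pointer starts MAX_HEADERS_SIZE bytes in. */
static void msg_init(msg_t *m)
{
    m->data = m->buf + MAX_HEADERS_SIZE;
    m->tail = m->data;
}

/* The payload is written exactly once, directly into the buffer. */
static void *msg_put_payload(msg_t *m, size_t payload_len)
{
    void *p = m->tail;
    m->tail += payload_len;            /* should check against MAX_MSG_SIZE */
    return p;
}

/* Each layer claims space for its header immediately in front of the
 * current data, moving towards the start of the buffer -- no copying. */
static void *msg_push_header(msg_t *m, size_t header_len)
{
    m->data -= header_len;             /* should check m->data >= m->buf */
    return m->data;
}

Layer 2 would write its header via msg_push_header(), layer 1 would do the same in front of it, and layer 0 finally transmits the region from data to tail; only the single copy of the original payload is ever made.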
This sounds like "Zero Copy." I'm no expert so search for that term and you'll find all sorts of references.
According to the documentation for bpf_perf_event_output found here: http://man7.org/linux/man-pages/man7/bpf-helpers.7.html
"The flags are used to indicate the index in map for which the value must be put, masked with BPF_F_INDEX_MASK."
In the following code:
SEC("xdp_sniffer")
int xdp_sniffer_prog(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
if (data < data_end) {
/* If we have reached here, that means this
 * is a useful packet for us. Pass on-the-wire
 * size and our cookie via metadata.
 */
__u64 flags = BPF_F_INDEX_MASK;
__u16 sample_size;
int ret;
struct S metadata;
metadata.cookie = 0xdead;
metadata.pkt_len = (__u16)(data_end - data);
/* To minimize writes to disk, only
* pass necessary information to userspace;
* that is just the header info.
*/
sample_size = min(metadata.pkt_len, SAMPLE_SIZE);
flags |= (__u64)sample_size << 32;
ret = bpf_perf_event_output(ctx, &my_map, flags,
&metadata, sizeof(metadata));
if (ret)
bpf_printk("perf_event_output failed: %d\n", ret);
}
return XDP_PASS;
}
It works as you would expect and stores the information for the given CPU number.
However, suppose I want all packets to be sent to index 1.
I swap
__u64 flags = BPF_F_INDEX_MASK;
for
__u64 flags = 0x1ULL;
The code compiles correctly and throws no errors, however no packets get saved at all anymore. What am I doing wrong if I want all of the packets to be sent to index 1?
Partial answer: I see no reason why the packets would not be sent to the perf buffer, but I suspect the error is on the user space code (not provided). It could be that you do not “open” the perf event for all CPUs when trying to read from the buffer. Have a look at the man page for perf_event_open(2): check that the combination of values for pid and cpu allows you to read data written for CPU 1.
As a side note, this:
__u64 flags = BPF_F_INDEX_MASK;
is misleading. The mask should be used to mask the index, not to set its value. BPF_F_CURRENT_CPU should be used instead, the former only happens to work because the two enum attributes have the same value.
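For illustration, the flags value would typically be composed like this, reusing the variables from the snippet above (untested sketch):

__u64 flags;

/* Use the per-CPU buffer of the CPU the program is running on: */
flags = BPF_F_CURRENT_CPU;

/* ...or target a fixed ring index (index 1 here), kept within the mask: */
flags = 1 & BPF_F_INDEX_MASK;

/* The upper 32 bits carry how many packet bytes to append to the sample: */
flags |= (__u64)sample_size << 32;

ret = bpf_perf_event_output(ctx, &my_map, flags, &metadata, sizeof(metadata));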
I need a ring buffer (in C) which can hold objects of any type at run time (mostly the data will be different signals' values, such as current (100 ms and 10 ms) and temperature, etc.). I am not sure whether it has to be a fixed size or not, and it needs to be very high performance, even though it runs in a multi-tasking embedded environment.
I need this buffer as a backup: the embedded software works as normal and saves the data into the ring buffer, so that when an error occurs I have a reference of the measured values, can look at them, and determine the problem. I also need a time stamp in the ring buffer, meaning every data item (signal value) stored in the ring buffer is stored together with its measurement time.
Any code or ideas would be greatly appreciated. Some of the operations required are:
create a ring buffer with a specific size.
link it with the whole software.
put at the tail.
get from the head.
at error, read the data and when it happened (time stamp).
return the count.
overwrite when the buffer is full.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for memcpy */

typedef struct ring_buffer
{
    void *buffer;       // data buffer
    void *buffer_end;   // end of data buffer
    void *data_start;   // write (push) pointer
    void *data_end;     // read (pop) pointer
    uint64_t capacity;  // maximum number of items in buffer
    uint64_t count;     // number of items in the buffer
    uint64_t size;      // size of each item in the buffer
} ring_buffer;

void rb_init(ring_buffer *rb, uint64_t size, uint64_t capacity)
{
    rb->buffer = malloc(capacity * size);
    if (rb->buffer == NULL) {
        // handle error
    }
    rb->buffer_end = (char *)rb->buffer + capacity * size;
    rb->capacity = capacity;
    rb->count = 0;
    rb->size = size;
    rb->data_start = rb->buffer;
    rb->data_end = rb->buffer;
}

void rb_free(ring_buffer *rb)
{
    free(rb->buffer);
    // clear out other fields too, just to be safe
}

void rb_push_back(ring_buffer *rb, const void *item)
{
    if (rb->count == rb->capacity) {
        // handle error (or overwrite the oldest entry)
    }
    memcpy(rb->data_start, item, rb->size);
    rb->data_start = (char *)rb->data_start + rb->size;
    if (rb->data_start == rb->buffer_end)
        rb->data_start = rb->buffer;
    rb->count++;
}

void rb_pop_front(ring_buffer *rb, void *item)
{
    if (rb->count == 0) {
        // handle error
    }
    memcpy(item, rb->data_end, rb->size);
    rb->data_end = (char *)rb->data_end + rb->size;
    if (rb->data_end == rb->buffer_end)
        rb->data_end = rb->buffer;
    rb->count--;
}
Creating a ring buffer/FIFO with hardcopies of generic type is highly questionable design for embedded systems. You shouldn't need that high level of abstraction for code so close to the hardware.
Either you make a ring buffer with a data type tag (like an enum) plus a void* to data allocated elsewhere, or you make a ring buffer where all data is of the same type. Everything else is most likely confused program design ("XY problem").
You need some means to lock access to the ring buffer internally, to make it thread-safe/interrupt-safe. This, as well as the time stamp, has to be handled internally by the ring buffer ADT.
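As a rough sketch of what this answer suggests (the signal names, the timestamp source and the locking macros below are placeholders for whatever the target platform provides):

#include <stdint.h>

/* One fixed-size record per sample: a type tag, a time stamp and the value,
 * so every slot in the ring has the same size and no per-item allocation
 * is needed. */
typedef enum {
    SIG_CURRENT_100MS,
    SIG_CURRENT_10MS,
    SIG_TEMPERATURE
} signal_id_t;

typedef struct {
    signal_id_t id;         /* which signal this sample belongs to */
    uint32_t    timestamp;  /* e.g. a millisecond tick counter */
    int32_t     value;      /* raw or scaled measurement */
} log_entry_t;

#define LOG_CAPACITY 256u

static log_entry_t log_buf[LOG_CAPACITY];
static uint32_t    log_head;    /* next slot to write */
static uint32_t    log_count;   /* number of valid entries */

/* Overwriting push: when the buffer is full, the oldest entry is lost.
 * ENTER_CRITICAL()/EXIT_CRITICAL() stand in for the platform's
 * interrupt/thread locking. */
void log_push(const log_entry_t *e)
{
    /* ENTER_CRITICAL(); */
    log_buf[log_head] = *e;
    log_head = (log_head + 1u) % LOG_CAPACITY;
    if (log_count < LOG_CAPACITY)
        log_count++;
    /* EXIT_CRITICAL(); */
}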
I'm creating a very very simple block RAM disk based on sbull.
So far it works fine if I read/write blocks of data using dd, but whenever I try mounting a filesystem on it (and sometimes creating a file system) my driver crashes.
After long weeks of debugging, I finally found out what is wrong, even though I can't really find a way to solve the problem. Hence my question here :)
Whenever a user space application creates a request to the device WITH AN OFFSET, the driver won't work! Let me show you the source code in order to clarify:
First of all, I'm handling requests using mk_request (not using a request_queue):
static void escsi_mk_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct escsi_dev *esd = bdev->bd_disk->private_data;
int rw;
struct bio_vec *bvec;
sector_t sector;
int i;
int err = -EIO;
printk("request received nr. sectors = %lu\n",bio_sectors(bio));
sector = bio->bi_sector;
if (bio_end_sector(bio) > get_capacity(bdev->bd_disk))
goto out;
if (unlikely(bio->bi_rw & REQ_DISCARD)) {
err = 0;
goto out;
}
rw = bio_rw(bio);
if (rw == READA)
rw = READ;
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
err = esd_do_bvec(esd, bvec->bv_page, len, bvec->bv_offset, rw, sector);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
out:
bio_endio(bio, err);
}
The esd_do_bvec function:
static int esd_do_bvec(struct escsi_dev *esd, struct page *page,
unsigned int len, unsigned int off, int rw,
sector_t sector)
{
void *mem;
int err = 0;
unsigned int offset;
int i;
offset = off + sector * 512;
printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
mem = kmap_atomic(page);
if (rw == READ) {
memcpy(mem,esd->data+offset,len);
} else {
memcpy(esd->data+offset,mem,len);
}
kunmap_atomic(mem);
out:
return err;
}
OK, so basically when I read or write data using dd, the variable "off" in esd_do_bvec() is always 0, regardless of where and how many bytes I want to write. The file system obviously always performs I/O in 4KB chunks and will write a full block even when only one byte needs to be replaced.
I am sure that reads and writes are working correctly when there's no offset because I created a file that is the same size as my block RAM disk and dumped the entire file into my device using dd, then got the output of the device (also using dd), and the input and output files are exactly the same. I also wrote the same file into a brd (Linux kernel original block RAM disk driver) and the outputs are the same comparing my device and the brd device.
BUT -- in some specific situations I try to mount or create a new file system on my device and somehow it gets I/O requests with an offset, and at that point my driver fails. I assume that I'm not handling the offset properly. For example, when I try "mount -t ext2 /dev/esda":
linux-xjwl:/home/phil/escsi # mount /dev/esda -t ext2 /mnt/esda1/
mount: wrong fs type, bad option, bad superblock on /dev/esda,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
linux-xjwl:/home/phil/escsi # dmesg|tail -n 10
[ 2239.275901] ESD RW=0, len=4096, off=0, offset=16384, sector=32
[ 2239.275947] request received nr. sectors = 8
[ 2239.275959] ESD RW=0, len=4096, off=0, offset=4096, sector=8
[ 2239.276516] request received nr. sectors = 8
[ 2239.276537] ESD RW=0, len=4096, off=0, offset=2097152, sector=4096
[ 2239.276606] request received nr. sectors = 8
[ 2239.276626] ESD RW=0, len=4096, off=0, offset=28672, sector=56
[ 2239.277535] request received nr. sectors = 2
[ 2239.277535] ESD RW=0, len=1024, off=1024, offset=2048, sector=2
[ 2239.277535] EXT4-fs (esda): VFS: Can't find ext4 filesystem
(p.s.: the output shows "EXT4" but I am running with "-t ext2")
I have checked the contents of sector n. 2 in my device and it does contain the ext2 metadata (since I ran mkfs.ext2 prior to trying to mount, of course). So I believe there's a problem with the offset. So far I can't really debug my driver because I wasn't able to come up with a request which would cause an I/O request with an offset (e.g., if I try writing a single byte into my device, Linux will read the whole block and rewrite it with only one different byte).
Hope it's not too simple a question for you.
Thanks in advance,
Phil
Please see the answer provided by Peter below.
If you're wondering what the esd_do_bvec() function looks like now, here it comes:
static int esd_do_bvec(struct escsi_dev *esd, char *buf,
unsigned int len, int rw, sector_t sector)
{
int err = 0;
unsigned int offset;
// Please notice that we STILL have an offset to deal with, but
// this offset comes in sectors and needs to be converted to a
// a byte offset.
offset = sector << SECTOR_SHIFT; // or multiply by 512
//printk("ESD RW=%d, len=%d, off=%d, offset=%d, sector=%lu\n",rw,len,off,offset,sector);
if (rw == READ) {
memcpy(buf,esd->data+offset,len);
} else {
memcpy(esd->data+offset,buf,len);
}
return err;
}
The offset per segment does not refer to an offset from the block device location, but rather an offset into the page. To cause this to be nonzero, you'll probably need to write your own C program that runs read() and write(). Allocate a page-aligned buffer, then read/write to/from different locations in that buffer, and those should show up as offsets in the bvec.
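As an illustration only (the device path and sizes are placeholders), one way to provoke a nonzero bv_offset is direct I/O into the middle of a page-aligned buffer, since O_DIRECT maps the user pages straight into the bio:

#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("/dev/esda", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Page-aligned buffer, two pages long. */
    if (posix_memalign(&buf, 4096, 2 * 4096)) { perror("posix_memalign"); return 1; }

    /* Read one sector into an address 512 bytes past the page boundary;
     * the resulting bio segment should carry bv_offset == 512. */
    ssize_t n = pread(fd, (char *)buf + 512, 512, 0);
    printf("pread returned %zd\n", n);

    free(buf);
    close(fd);
    return 0;
}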
That said, LWN warns against managing this page offset manually, and recommends instead the macro bio_kmap_irq(), which is called on the bio_for_each_segment() variable bio and takes care of the atomic kmap AND manages the offset entry as well. Source: http://lwn.net/Articles/26404/
Your code will look something like:
bio_for_each_segment(bvec, bio, i) {
unsigned int len = bvec->bv_len;
unsigned long flags;
char *buf = bio_kmap_irq(bio, &flags);
err = esd_do_bvec(esd, buf, len, rw, sector);
bio_kunmap_irq(buf, &flags);
if (err) {
printk("err!\n");
break;
}
sector += len >> SECTOR_SHIFT;
}
Of course this changes the signature of esd_do_bvec to accept the memory buffer directly rather than page/offset.
I am trying to send data between a client and a server; the data looks like:
typedef struct Message
{
    int   id;
    int   message_length;
    char *message_str;
} message;
I am trying to write and read this message between a client and a server, constantly updating the elements in this struct. I have heard writev() may do the trick. I want to send a message to the server, and then have the server pull out the elements and use them as conditions to decide which method to execute.
Assuming you want to do the serialization yourself and not use Google Protocol Buffers or some library to handle it for you, I'd suggest writing a pair of functions like this:
// Serializes (msg) into a flat array of bytes, and returns the number of bytes written
// Note that (outBuf) must be big enough to hold any Message you might have, or there will
// be a buffer overrun! Modifying this function to check for that problem and
// error out instead is left as an exercise for the reader.
int SerializeMessage(const struct Message & msg, char * outBuf)
{
    char * outPtr = outBuf;

    int32_t sendID = htonl(msg.id);  // htonl will make sure it gets sent in big-endian form
    memcpy(outPtr, &sendID, sizeof(sendID));
    outPtr += sizeof(sendID);

    int32_t sendLen = htonl(msg.message_length);
    memcpy(outPtr, &sendLen, sizeof(sendLen));
    outPtr += sizeof(sendLen);

    memcpy(outPtr, msg.message_str, msg.message_length);  // I'm assuming message_length=strlen(message_str)+1 here
    outPtr += msg.message_length;

    return (outPtr-outBuf);
}
// Deserializes a flat array of bytes back into a Message object. Returns 0 on success, or -1 on failure.
int DeserializeMessage(const char * inBuf, int numBytes, struct Message & msg)
{
    const char * inPtr = inBuf;

    if (numBytes < sizeof(int32_t)) return -1;  // buffer was too short!
    int32_t recvID = ntohl(*((int32_t *)inPtr));
    inPtr    += sizeof(int32_t);
    numBytes -= sizeof(int32_t);
    msg.id = recvID;

    if (numBytes < sizeof(int32_t)) return -1;  // buffer was too short!
    int32_t recvLen = ntohl(*((int32_t *)inPtr));
    inPtr    += sizeof(int32_t);
    numBytes -= sizeof(int32_t);
    msg.message_length = recvLen;
    if (msg.message_length > 1024) return -1;   // sanity check: if something got munged we don't want to allocate a giant array
    if (numBytes < msg.message_length) return -1;  // buffer was too short!

    msg.message_str = new char[msg.message_length];
    memcpy(msg.message_str, inPtr, msg.message_length);
    return 0;
}
With these functions, you are now able to convert a Message into a simple char-array and back at will. So now all you have to do is send the char-array over the TCP connection, receive it at the far end, and then Deserialize the array back into a Message struct there.
One wrinkle with this is that your char arrays will be variable-length (due to the presence of a string which can be different lengths), so your receiver will need some easy way to know how many bytes to receive before calling DeserializeMessage() on the array.
An easy way to handle that is to always send a 4-byte integer first, before sending the char-array. The 4-byte integer should always be the size of the upcoming array, in bytes. (Be sure to convert the integer to big-endian first, via htonl(), before sending it, and convert it back to native-endian on the receiver via ntohl() before using it.)
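A minimal sketch of that framing on a connected TCP socket (assuming the buffer was produced by SerializeMessage() above; error handling is kept to a minimum):

#include <arpa/inet.h>    /* htonl, ntohl */
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* send()/recv() may transfer fewer bytes than asked for, so loop. */
static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(fd, buf, len, 0);
        if (n <= 0) return -1;
        buf += n; len -= (size_t)n;
    }
    return 0;
}

static int recv_all(int fd, char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n <= 0) return -1;
        buf += n; len -= (size_t)n;
    }
    return 0;
}

/* Sender: 4-byte big-endian length prefix, then the serialized bytes. */
int send_frame(int fd, const char *body, uint32_t bodyLen)
{
    uint32_t netLen = htonl(bodyLen);
    if (send_all(fd, (const char *)&netLen, sizeof(netLen)) < 0) return -1;
    return send_all(fd, body, bodyLen);
}

/* Receiver: read the length first, then exactly that many bytes, and then
 * hand the buffer to DeserializeMessage(). */
int recv_frame(int fd, char *body, uint32_t maxLen, uint32_t *bodyLen)
{
    uint32_t netLen;
    if (recv_all(fd, (char *)&netLen, sizeof(netLen)) < 0) return -1;
    *bodyLen = ntohl(netLen);
    if (*bodyLen > maxLen) return -1;   /* guard against bogus lengths */
    return recv_all(fd, body, *bodyLen);
}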
Okay, I'll take a stab at this. I'm going to assume that you have a "message" object on the sending side, and what you want to do is somehow send it across to another machine and reconstruct the data there so you can do some computation on it. The part that you may not be clear on is how to encode the data for communication and then decode it on the receiving side to recover the information.
The simplistic approach of just writing the bytes contained in a "message" object (i.e. write(fd, msg, sizeof(*msg)), where "msg" is a pointer to an object of type "message") won't work, because you will end up sending the value of a virtual address in the memory of one machine to a different machine, and there's not much you can do with that on the receiving end. So the problem is to design a way to pass two integers and a character string, bundled up in a way that lets you fish them back out on the other end. There are, of course, many ways to do this. Does this describe what you are trying to do?
You can send structs over a socket, but you have to serialize them before sending; one way to do that is with Boost serialization.
Here is a sample code :
#include <iostream>
#include <unistd.h>
#include <cstring>
#include <string>
#include <sstream>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
using namespace std;
typedef struct {
public:
int id;
int message_length;
string message_str;
private:
friend class boost::serialization::access;
template <typename Archive>
void serialize(Archive &ar, const unsigned int vern)
{
ar & id;
ar & message_length;
ar & message_str;
}
} Message;
int main()
{
Message newMsg;
newMsg.id = 7;
newMsg.message_length = 14;
newMsg.message_str="Hi ya Whats up";
std::stringstream strData;
boost::archive::text_oarchive oa(strData);
oa << newMsg;
std::string serStr = strData.str();   // keep the string alive; calling c_str() on the temporary would leave a dangling pointer
const char *serObj = serStr.c_str();
cout << "Serialized Data ::: " << serObj << "Len ::: " << strlen(serObj) << "\n";
/* Send serObj thru Sockets */
/* recv serObj from socket & deserialize it */
std::stringstream rcvdObj(serObj);
Message deserObj;
boost::archive::text_iarchive ia(rcvdObj);
ia >> deserObj;
cout<<"id ::: "<<deserObj.id<<"\n";
cout<<"len ::: "<<deserObj.message_length<<"\n";
cout<<"str ::: "<<deserObj.message_str<<"\n";
}
You can compile the program with:
g++ -o serial boost.cpp /usr/local/lib/libboost_serialization.a
You must have libboost_serialization.a built on your machine.
Keeping the sockets 'blocking' will be fine, but you will have to devise a way to frame these serialized structs when reading them from the recv buffer.
I have a generic question about Linux kernel's handling of file I/O. So far my understanding is that, in an ideal case, after process A reads a file, data is loaded into page cache, and if process B reads the same page before it is reclaimed, it does not need to hit the disk again.
My question is related to how the block device I/O works. Process A's read request will eventually be queued before the I/O actually happens. Now if process B's request (a bio struct) is to be inserted into the request_queue before A's request is executed, the elevator will consider whether to merge B's bio into any existing request. Now, if A and B try to read the same file offset, i.e. the same device block, they are literally the same I/O (or A and B's requests are not exactly the same but overlap for some blocks), but so far I have not seen this case being considered in the kernel code. (The only relevant thing I saw is a test on whether a bio can be glued to an existing request contiguously.)
kernel 2.6.11
inline int elv_try_merge(struct request *__rq, struct bio *bio)
{
int ret = ELEVATOR_NO_MERGE;
/*
* we can merge and sequence is ok, check if it's possible
*/
if (elv_rq_merge_ok(__rq, bio)) {
if (__rq->sector + __rq->nr_sectors == bio->bi_sector)
ret = ELEVATOR_BACK_MERGE;
else if (__rq->sector - bio_sectors(bio) == bio->bi_sector)
ret = ELEVATOR_FRONT_MERGE;
}
return ret;
}
kernel 5.3.5
enum elv_merge elv_merge(struct request_queue *q, struct request **req,
struct bio *bio)
{
struct elevator_queue *e = q->elevator;
struct request *__rq;
...
/*
* See if our hash lookup can find a potential backmerge.
*/
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
...
}
struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
{
struct elevator_queue *e = q->elevator;
struct hlist_node *next;
struct request *rq;
hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
...
if (rq_hash_key(rq) == offset)
return rq;
}
return NULL;
}
#define rq_hash_key(rq) (blk_rq_pos(rq) + blk_rq_sectors(rq))
Does that mean the kernel will just do two I/Os? Or (very likely) am I missing something?
Thanks!