Confused with read() system call - c

I am very confused about using read() calls to get inotify events.
Here is the code:
#define EVENT_SIZE sizeof(inotify_event)
int fd = inotify_init();
int wd = inotify_add_watch(fd, dir, IN_MODIFY);
void* p = malloc(sizeof(EVENT_SIZE));
read(wd, p, (EVENT_SIZE + 10));
My test file is a.txt.
The output after debugging in gdb is:
{wd = 0, mask = 0, cookie = 0, len = 0, name = 0x558f05d002d0 ""}
Now, when I change the last line to read(fd, p, (EVENT_SIZE + 16));, the output that I get in gdb is:
{wd = 1, mask = 2, cookie = 0, len = 16, name = 0x5625cdd422d0 ".a.txt.swp"}
Q1. Why don't I get an overflow error, given that in both cases I am writing more than the allocated buffer p?
Q2. If there is no error, then my first program should also work, because my filename is shorter than 10 characters; but it doesn't, and it only works with 16. What am I missing here?
compiler - g++ 9.3.0
os - ubuntu 20.04
Thank you.

The behaviour of writing outside the boundaries of an array is undefined. Additionally, read reads up to the given number of bytes.

Q1
1. You get a segfault if your process attempts to access a memory address that does not belong to it.
2. malloc - "Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2)."
3. sbrk - "sbrk() increments the program's data space by increment bytes."
From 1., 2. and 3.: there is (probably) room left that you are able to write to, but the data (if any) after your allocated memory is definitely corrupted!
Q2
Read the documentation for inotify.
The meaning of the fields for inotify_event structure:
mask contains bits that describe the event that occurred (see below).
and
The len field counts all of the bytes in name, including the null bytes; the length of each inotify_event structure is thus sizeof(struct inotify_event)+len.
and
The name field is present only when an event is returned for a file inside a watched directory; it identifies the filename within the watched directory. This filename is null-terminated, and may include further null bytes ('\0') to align subsequent reads to a suitable address boundary.
and
The behavior when the buffer given to read(2) is too small to return information about the next event depends on the kernel version: in kernels before 2.6.21, read(2) returns 0; since kernel 2.6.21, read(2) fails with the error EINVAL. Specifying a buffer of size sizeof(struct inotify_event) + NAME_MAX + 1 will be sufficient to read at least one event.
In your case the reported name is ".a.txt.swp": 10 characters plus a terminating null byte, padded out to len = 16 for alignment. A buffer of EVENT_SIZE + 10 is therefore too small to return the event, so read() fails and never fills your buffer, while EVENT_SIZE + 16 is just large enough. (Note also that your first snippet passes wd rather than fd to read().)
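For reference, here is a minimal sketch (not the poster's program) that sizes the buffer as the man page suggests and walks the variable-length events; the watched path "." is just a placeholder:
#include <limits.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
    /* Big enough for one event with the longest possible name. */
    char buf[sizeof(struct inotify_event) + NAME_MAX + 1];
    int fd = inotify_init();
    inotify_add_watch(fd, ".", IN_MODIFY);      /* "." is a placeholder path */
    ssize_t n = read(fd, buf, sizeof(buf));     /* note: read from fd, not wd */
    char *p = buf;
    while (n > 0 && p < buf + n) {
        struct inotify_event *ev = (struct inotify_event *)p;
        printf("wd=%d mask=0x%x len=%u name=%s\n",
               ev->wd, ev->mask, ev->len, ev->len ? ev->name : "");
        p += sizeof(struct inotify_event) + ev->len;   /* events are variable-length */
    }
    return 0;
}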

Related

How do you inject code into a process without ptrace

Is there a way to inject code into an ELF binary without ptrace? I can't use it, since the program I'm writing this for is using GDB and I don't want to stop the process while it's injecting. I read it's possible by using /proc/pid/mem, but I couldn't quite find anything about how to do it. I don't want to use LD_PRELOAD either, since it would require restarting the program and I want to do it at runtime.
EDIT: I can't use ptrace since the process might already be attached to by gdb
/proc/pid/mem behaves like an image of the process's memory. To read/write the process's memory, just open /proc/pid/mem, then lseek to the desired address and read() or write() however many bytes you want.
For instance, to overwrite the byte at address 0x12345 in the process with 0x90, you can just do
int fd = open("/proc/XXX/mem", O_RDWR);   /* XXX is the target pid */
lseek(fd, 0x12345, SEEK_SET);
unsigned char new = 0x90;
write(fd, &new, 1);
On a 32-bit system, use lseek64 instead (and add #define _LARGEFILE64_SOURCE before the standard includes).
Note that accessing /proc/XXX/mem requires the same permissions as to ptrace the process. In particular, on some systems you may need to be root.
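A slightly fuller, hedged sketch of the same idea, wrapped in a function with the path built from a pid (the function name is mine, and pwrite is used simply to combine the lseek and the write):
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Overwrite one byte in another process's memory via /proc/<pid>/mem. */
int poke_byte(pid_t pid, off_t addr, unsigned char value)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;                                /* needs ptrace-equivalent permissions */
    ssize_t n = pwrite(fd, &value, 1, addr);      /* seek + write in one call */
    close(fd);
    return n == 1 ? 0 : -1;
}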
I decided I'd use process_vm_writev instead, which seems to work. I don't know why it didn't want to write to /proc/pid/mem, which is odd.
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/**
 * @brief write_process_memory Writes to the memory of a given process
 * @param pid Program pid
 * @param address The base memory address
 * @param buffer Buffer to write
 * @param n How many bytes to write
 * @return Returns bytes written
 */
ssize_t write_process_memory(pid_t pid, void *address, void *buffer, ssize_t n) {
    struct iovec local, remote;
    /* this might have to be split so that, if n > _SC_PAGESIZE,
     * local is divided into multiple iovecs, similar to how
     * read_process_memory works; I'm not sure whether that's necessary */
    remote.iov_base = address;
    remote.iov_len = n;
    local.iov_base = buffer;
    local.iov_len = n;
    ssize_t bytes_written = process_vm_writev(pid, &local, 1, &remote, 1, 0);
    return bytes_written;
}
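A hypothetical call site, just to show the intended usage (the pid and address are placeholders, not values from the question):
unsigned char patch[] = { 0x90, 0x90, 0x90 };   /* e.g. three NOP bytes */
if (write_process_memory(1234, (void *)0x12345, patch, sizeof(patch)) != (ssize_t)sizeof(patch))
    perror("process_vm_writev");                /* needs ptrace-equivalent permissions */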

How to get malloc/calloc to fail if request exceeds free physical memory (i.e., don't use swap)

malloc/calloc apparently use swap space to satisfy a request that exceeds available free memory. And that pretty much hangs the system as the disk-use light remains constantly on. After it happened to me, and I wasn't immediately sure why, I wrote the following 5-line test program to check that this is indeed why the system was hanging,
/* --- test how many bytes can be malloc'ed successfully --- */
#include <stdio.h>
#include <stdlib.h>
int main ( int argc, char *argv[] ) {
unsigned int nmalloc = (argc>1? atoi(argv[1]) : 10000000 ),
size = (argc>2? atoi(argv[2]) : (0) );
unsigned char *pmalloc = (size>0? calloc(nmalloc,size):malloc(nmalloc));
fprintf( stdout," %s malloc'ed %d elements of %d bytes each.\n",
(pmalloc==NULL? "UNsuccessfully" : "Successfully"),
nmalloc, (size>0?size:1) );
if ( pmalloc != NULL ) free(pmalloc);
} /* --- end-of-function main() --- */
And that indeed hangs the system if the product of your two command-line args exceeds physical memory. The easiest solution would be some way whereby malloc/calloc automatically just fail. A harder and non-portable one would be to write a little wrapper that popen()'s a free command, parses the output, and only calls malloc/calloc if the request can be satisfied by the available "free" memory, maybe with a little safety factor built in.
Is there any easier and more portable way to accomplish that? (Apparently similar to this question can calloc or malloc be used to allocate ONLY physical memory in OSX?, but I'm hoping for some kind of "yes" answer.)
EDIT
Decided to follow Tom's /proc/meminfo suggestion. That is, rather than popen()'ing "free", just directly parse the existing and easily-parsable /proc/meminfo file. And then, a one-line macro of the form
#define noswapmalloc(n) ( (n) < 1000l*memfree(NULL)/2? malloc(n) : NULL )
finishes the job. memfree(), shown below, isn't as portable as I'd like, but can easily and transparently be replaced by a better solution if/when the need arises, which isn't now.
#include <stdio.h>
#include <stdlib.h>
#define _GNU_SOURCE /* for strcasestr() in string.h */
#include <string.h>
char *strcasestr(); /* non-standard extension */
/* ==========================================================================
* Function: memfree ( memtype )
* Purpose: return number of Kbytes of available memory
* (as reported in /proc/meminfo)
* --------------------------------------------------------------------------
* Arguments: memtype (I) (char *) to null-terminated, case-insensitive
* (sub)string matching first field in
* /proc/meminfo (NULL uses MemFree)
* --------------------------------------------------------------------------
* Returns: ( int ) #Kbytes of memory, or -1 for any error
* --------------------------------------------------------------------------
* Notes: o
* ======================================================================= */
/* --- entry point --- */
int memfree ( char *memtype ) {
/* ---
* allocations and declarations
* ------------------------------- */
static char memfile[99] = "/proc/meminfo"; /* linux standard */
static char deftype[99] = "MemFree"; /* default if caller passes null */
FILE *fp = fopen(memfile,"r"); /* open memfile for read */
char memline[999]; /* read memfile line-by-line */
int nkbytes = (-1); /* #Kbytes, init for error */
/* ---
* read memfile until line with desired memtype found
* ----------------------------------------------------- */
if ( memtype == NULL ) memtype = deftype; /* caller wants default */
if ( fp == NULL ) goto end_of_job; /* but we can't get it */
while ( fgets(memline,512,fp) /* read next line */
!= NULL ) { /* quit at eof (or error) */
if ( strcasestr(memline,memtype) /* look for memtype in line */
!= NULL ) { /* found line with memtype */
char *delim = strchr(memline,':'); /* colon following MemType */
if ( delim != NULL ) /* NULL if file format error? */
nkbytes = atoi(delim+1); /* num after colon is #Kbytes */
break; } /* no need to read further */
} /* --- end-of-while(fgets()!=NULL) --- */
end_of_job: /* back to caller with nkbytes */
if ( fp != NULL ) fclose(fp); /* close /proc/meminfo file */
return ( nkbytes ); /* and return nkbytes to caller */
} /* --- end-of-function memfree() --- */
#if defined(MEMFREETEST)
int main ( int argc, char *argv[] ) {
char *memtype = ( argc>1? argv[1] : NULL );
int memfree();
printf ( " memfree(\"%s\") = %d Kbytes\n Have a nice day.\n",
(memtype==NULL?" ":memtype), memfree(memtype) );
} /* --- end-of-function main() --- */
#endif
malloc/calloc apparently use swap space to satisfy a request that exceeds available free memory.
Well, no.
Malloc/calloc use virtual memory. The "virtual" means that it's not real - it's an artificially constructed illusion made out of fakery and lies. Your entire process is built on these artificially constructed illusions - a thread is a virtual CPU, a socket is a virtual network connection, the C language is really a specification for a "C abstract machine", a process is a virtual computer (that implements the languages' abstract machine).
You're not supposed to look behind the magic curtain. You're not supposed to know that physical memory exists. The system doesn't hang - the illusion is just slower, but that's fine because the C abstract machine says nothing about how long anything is supposed to take and does not provide any performance guarantees.
More importantly; because of the illusion, software works. It doesn't crash because there's not enough physical memory. Failure means that it takes an infinite amount of time for software to complete successfully, and "an infinite amount of time" is many orders of magnitude worse than "slower because of swap space".
How to get malloc/calloc to fail if request exceeds free physical memory (i.e., don't use swap)
If you are going to look behind the magic curtain, you need to define your goals carefully.
For one example, imagine if your process has 123 MiB of code and there's currently 1000 MiB of free physical RAM; but (because the code is in virtual memory) only a tiny piece of the code is using real RAM (and the rest of the code is on disk because the OS/executable loader used memory mapped files to avoid wasting real RAM until it's actually necessary). You decide to allocate 1000 MiB of memory (and because the OS creating the illusion isn't very good, unfortunately this causes 1000 MiB of real RAM to be allocated). Next, you execute some more code, but the code you execute isn't in real memory yet, so the OS has to fetch the code from the file on the disk into physical RAM, but you consumed all of the physical RAM so the OS has to send some of the data to swap space.
For another example, imagine if your process has 1 MiB of code and 1234 MiB of data that was carefully allocated to make sure that everything fits in physical memory. Then a completely different process is started and it allocates 6789 MiB of memory for its code and data; so the OS sends all of your process' data to swap space to satisfy the other process that you have no control over.
EDIT
The problem here is that the OS providing the illusion is not very good. When you allocate a large amount of virtual memory with malloc() or calloc(); the OS should be able to use a tiny piece of real memory to lie to you and avoid consuming a large amount of real memory. Specifically (for most modern operating systems running on normal hardware); the OS should be able to fill a huge area of virtual memory with a single page full of zeros that is mapped many times (at many virtual addresses) as "read only", so that allocating a huge amount of virtual memory costs almost no physical RAM at all (until you write to the virtual memory, causing the OS to allocate the least physical memory needed to satisfy the modifications). Of course if you eventually do write to all of the allocated virtual memory, then you'll end up exhausting physical memory and using some swap space; but this will probably happen gradually and not all at once - many tiny delays scattered over a large period of time are far less likely to be noticed than a single huge delay.
With this in mind; I'd be tempted to try using mmap(..., MAP_ANONYMOUS, ...) instead of the (poorly implemented) malloc() or calloc(). This might mean that you have to deal with the possibility that the allocated virtual memory isn't guaranteed to be initialized to zeros, but (depending on what you're using the memory for) that's likely to be easy to work around.
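A minimal sketch of that idea (not from the answer; the 4 GiB figure is just an example):
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = (size_t)4 << 30;   /* 4 GiB of virtual address space */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* On Linux, these pages are demand-allocated and zero-filled on first
     * touch, so physical memory is consumed only for the parts you write. */
    munmap(p, len);
    return 0;
}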
Expanding on a comment I made to the original question:
If you want to disable swapping, use the swapoff command (sudo swapoff -a). I usually run my machine that way, to avoid it freezing when firefox does something it shouldn't. You can use setrlimit() (or the ulimit command) to set a maximum VM size, but that won't properly compensate for some other process suddenly deciding to be a memory hog (see above).
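A hedged sketch of the setrlimit() route mentioned above (the 2 GiB cap is an arbitrary example):
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = { .rlim_cur = 2UL << 30, .rlim_max = 2UL << 30 };  /* 2 GiB */
    if (setrlimit(RLIMIT_AS, &rl) != 0)           /* cap total virtual address space */
        perror("setrlimit");
    void *p = malloc((size_t)3 << 30);            /* a 3 GiB request should now fail */
    printf("3 GiB malloc %s\n", p ? "succeeded" : "failed");
    free(p);
    return 0;
}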
Even if you choose one of the above options, you should read the rest of this answer to see how to avoid unnecessary initialisation on the first call to calloc().
As for your precise test harness, it turns out that you are triggering an unfortunate exception to GNU calloc()'s optimisation.
Here's a comment (now deleted) I made to another answer, which turns out to not be strictly speaking accurate:
I checked the glibc source for the default gnu/linux malloc library, and verified that calloc() does not normally manually clear memory which has just been mmap'd. And malloc() doesn't touch the memory at all.
It turns out that I missed one exception to the calloc optimisation. Because of the way the GNU malloc implementation initialises the malloc system, the first call to calloc always uses memset() to set the newly-allocated storage to 0. Every other call to calloc() passes through the entire calloc logic, which avoids calling memset() on storage which has been freshly mmap'd.
So the following modification to the test program shows radically different behaviour:
#include <stdio.h>
#include <stdlib.h>
int main ( int argc, char *argv[] ) {
/* These three lines were added */
void* tmp = calloc(1000, 1); /* force initialization */
printf("Allocated 1000 bytes at %p\n", tmp);
free(tmp);
/* The rest is unchanged */
unsigned int nmalloc = (argc>1? atoi(argv[1]) : 10000000 ),
size = (argc>2? atoi(argv[2]) : (0) );
unsigned char *pmalloc = (size>0? calloc(nmalloc,size):malloc(nmalloc));
fprintf( stdout," %s malloc'ed %d elements of %d bytes each.\n",
(pmalloc==NULL? "UNsuccessfully" : "Successfully"),
nmalloc, (size>0?size:1) );
if ( pmalloc != NULL ) free(pmalloc);
}
Note that if you set MALLOC_PERTURB_ to a non-zero value, then it is used to initialise malloc()'d blocks, and forces calloc()'d blocks to be initialised to 0. That's used in the test below.
In the following, I used /usr/bin/time to show the number of page faults during execution. Pay attention to the number of minor faults, which are the result of the operating system zero-initialising a previously unreferenced page in an anonymous mmap'd region (and some other occurrences, like mapping a page already present in Linux's page cache). Also look at the resident set size and, of course, the execution time.
$ gcc -Og -ggdb -Wall -o mall mall.c
$ # A simple malloc completes instantly without page faults
$ /usr/bin/time ./mall 4000000000
Allocated 1000 bytes at 0x55b94ff56260
Successfully malloc'ed -294967296 elements of 1 bytes each.
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 1600maxresident)k
0inputs+0outputs (0major+61minor)pagefaults 0swaps
$ # Unless we tell malloc to initialise memory
$ MALLOC_PERTURB_=35 /usr/bin/time ./mall 4000000000
Allocated 1000 bytes at 0x5648c2436260
Successfully malloc'ed -294967296 elements of 1 bytes each.
0.19user 1.23system 0:01.43elapsed 99%CPU (0avgtext+0avgdata 3907584maxresident)k
0inputs+0outputs (0major+976623minor)pagefaults 0swaps
# Same, with calloc. No page faults, instant completion.
$ /usr/bin/time ./mall 1000000000 4
Allocated 1000 bytes at 0x55e8257bb260
Successfully malloc'ed 1000000000 elements of 4 bytes each.
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 1656maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps
$ # Again, setting the magic malloc config variable changes everything
$ MALLOC_PERMUTE_=35 /usr/bin/time ./mall 1000000000 4
Allocated 1000 bytes at 0x5646f391e260
Successfully malloc'ed 1000000000 elements of 4 bytes each.
0.00user 0.00system 0:00.00elapsed 100%CPU (0avgtext+0avgdata 1656maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps

What does "verifies the write" mean when erasing HDDs?

I'm looking at some code for (securely) erasing hard drives. When looking at the different methods (Infosec 5, DoD 5220.22-M, etc.), I read "Verify the write".
examples:
https://www.lifewire.com/dod-5220-22-m-2625856
Pass 3: Writes a random character and verifies the write
https://en.wikipedia.org/wiki/Infosec_Standard_5#cite_note-5
Regardless of which level is used, verification is needed to ensure
that overwriting was successful.
My question is, what does it technically imply?
Do you just check, when you write a block of data, whether the number of bytes written equals the size of the block (seems fast)?
Do you read back the block you wrote and compare (memcmp?) whether it is exactly the same as the block of data written?
Any other solution I overlooked?
Here is a bit of sample code to illustrate my question. Which solution would count as "verifying the write"? If you only check the return value (from write), can you be sure, as long as it matches the length of the block, that all bytes were indeed written?
// erasure example
// erasure single block, start from 0
// block size
unsigned long block_size = 512;
// data-block we will write
char block[block_size];
// zero the block
bzero(&block, block_size);
// open device
int fd = open("/dev/sdb", O_RDWR);
// write 1st block
int bytes_written = write(fd, &block, block_size);
// verify option: A
if (bytes_written == block_size) {
// all good
}
// verify option B
// go back the number of bytes written from the current pos.
lseek(fd, -1 * bytes_written, SEEK_CUR);
// read the same number of bytes
int bytes_read = read(fd, &block, bytes_written);
// should be same
if (bytes_read == bytes_written) {
// here code to check if block indeed contains zeros
// use memcmp ?
}
What does "verifies the write" mean when erasing HDDs?
From Here
The write(...) function attempts to write nbytes from buffer to the
file associated with handle. On text files, it expands each LF to a
CR/LF.
The function returns the number of bytes written to the file. A return
value of -1 indicates an error, with errno set appropriately.
So in your example, int bytes_written = write(fd, &block, block_size);
the value of bytes_written, if equal to the block size, indicates that all block_size bytes of block have been written to the file.
And in answer to your other questions:
(1) Always check the return value of a non-void function anyway, and there is no significant efficiency or speed loss by simply performing a compare of two int values. You can even do it in one step, but I am not sure it is an efficiency gain (view the assembly to know for sure). So instead of:
// verify option: A
if (bytes_written == block_size)
{
// all good
}
//You can do the comparison in the same line:
if (write(fd, &block, block_size) == block_size) { /* all good */ }
(2) Optional. It provides an additional verification that the block, as read back, contains the new contents.
You can find a more detailed description of write() in this Linux man page.
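As a rough sketch of option B under the same assumptions as the question (a zero-filled block, a device opened read/write), with names of my own choosing; note that without O_DIRECT or a flush the readback may be served from the page cache rather than the disk surface:
#include <string.h>
#include <unistd.h>

/* Write one zeroed block, seek back, read it again and compare. */
int write_and_verify_block(int fd, size_t block_size)
{
    unsigned char block[512] = {0};
    unsigned char check[512];

    if (block_size > sizeof(block))
        return -1;
    if (write(fd, block, block_size) != (ssize_t)block_size)
        return -1;                                   /* short write */
    if (lseek(fd, -(off_t)block_size, SEEK_CUR) == (off_t)-1)
        return -1;
    if (read(fd, check, block_size) != (ssize_t)block_size)
        return -1;                                   /* short read */
    return memcmp(block, check, block_size) == 0 ? 0 : -1;
}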

linux proc size limit problems

I'm trying to write a linux kernel module that can dump the contents of other modules to a /proc file (for analysis). In principle it works but it seems I run into some buffer limit or the like. I'm still rather new to Linux kernel development so I would also appreciate any suggestions not concerning the particular problem.
The memory that is used to store the module is allocated in this function:
char *get_module_dump(int module_num)
{
struct module *mod = unhiddenModules[module_num];
char *buffer;
buffer = kmalloc(mod->core_size * sizeof(char), GFP_KERNEL);
memcpy((void *)buffer, (void *)startOf(mod), mod->core_size);
return buffer;
}
'unhiddenModules' is an array of module structs
Then it is handed over to the proc creation here:
void create_module_dump_proc(int module_number)
{
struct proc_dir_entry *dump_module_proc;
dump_size = unhiddenModules[module_number]->core_size;
module_buffer = get_module_dump(module_number);
sprintf(current_dump_file_name, "%s_dump", unhiddenModules[module_number]->name);
dump_module_proc = proc_create_data(current_dump_file_name, 0, dump_proc_folder, &dump_fops, module_buffer);
}
The proc read function is as follows:
ssize_t dump_proc_read(struct file *filp, char *buf, size_t count, loff_t *offp)
{
char *data;
ssize_t ret;
data = PDE_DATA(file_inode(filp));
ret = copy_to_user(buf, data, dump_size);
*offp += dump_size - ret;
if (*offp > dump_size)
return 0;
else
return dump_size;
}
Smaller Modules are dumped correctly but if the module is larger than 126,796 bytes only the first 126,796 bytes are written and this error is displayed when reading from the proc file:
*** Error in `cat': free(): invalid next size (fast): 0x0000000001f4a040 ***
I seem to have run into some limit but I couldn't find anything on it. The error seems to be related to memory leaks, but the buffer should be large enough, so I don't see where this actually happens.
The procfs has a limit of PAGE_SIZE (one page) for read and write operations. Usually seq_file is used to iterate over the entries (modules in your case ?) to read and/or write smaller chunks. Since you are running into problems with only larger data, I suspect this is the case here.
Please have a look here and here if you are not familiar with seq_files.
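For illustration, a hedged sketch of the seq_file route, reusing the question's proc_create_data() data pointer and its dump_size global (a sketch for the kernel versions the question appears to target, not tested code):
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int dump_show(struct seq_file *m, void *v)
{
    /* m->private is the buffer handed to single_open() below; the seq_file
     * layer grows its internal buffer as needed, so output is not limited
     * to one page. */
    seq_write(m, m->private, dump_size);
    return 0;
}

static int dump_open(struct inode *inode, struct file *file)
{
    return single_open(file, dump_show, PDE_DATA(inode));
}

static const struct file_operations dump_fops = {
    .owner   = THIS_MODULE,
    .open    = dump_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};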
A suspicious thing is that in dump_proc_read you are not using the "count" parameter. I would have expected copy_to_user to take "count" as its third argument instead of "dump_size" (and in the subsequent calculations too). The way you do it, dump_size bytes are always copied to user space, regardless of the data size the application was expecting. The bigger dump_size is, the larger the user area that gets corrupted.
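Concretely, a hedged sketch of what this answer suggests, honouring both *offp and count (still assuming the dump_size global from the question):
ssize_t dump_proc_read(struct file *filp, char __user *buf, size_t count, loff_t *offp)
{
    char *data = PDE_DATA(file_inode(filp));

    if (*offp >= dump_size)
        return 0;                           /* end of file */
    if (count > dump_size - *offp)
        count = dump_size - *offp;          /* clamp to what is left */
    if (copy_to_user(buf, data + *offp, count))
        return -EFAULT;
    *offp += count;
    return count;
}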

Ethernet header from recvfrom is not as expected

I am writing some network packet sniffing code in C (running on an Ethernet LAN). While attempting to print out the Ethernet header, I've run into a bit of confusion. According to Wikipedia the first 8 bytes consist of the preamble and a delimiter and the next 6 are the MAC destination address.
However, when I actually run my code, I see that in the bytes I get from the recvfrom call, the initial 8 bytes (preamble and delimiter) are missing. In other words, I can start reading the destination address from the first byte itself.
Here is the relevant part of the code
char buffer[BUFFERSIZE];
struct addrinfo servinfo;
servinfo.ai_family = PF_PACKET;
servinfo.ai_socktype = SOCK_RAW;
servinfo.ai_protocol = htons(ETH_P_ALL);
int fd = socket(servinfo.ai_family, servinfo.ai_socktype, servinfo.ai_protocol);
int plen = recvfrom(fd, buffer, BUFFERSIZE, 0, &caddr, &clen);
int c = 0;
printf("Destination Address: %02x:%02x:%02x:%02x:%02x:%02x\n",buffer[c], buffer[c+1], buffer[c+2], buffer[c+3], buffer[c+4], buffer[c+5]);
printf("Source Address: %02x:%02x:%02x:%02x:%02x:%02x\n",buffer[c+6], buffer[c+7], buffer[c+8], buffer[c+9], buffer[c+10], buffer[c+11]);
This prints the correct destination address, whereas I expected to get the correct result only after skipping the first 8 bytes of the buffer.
What am I missing here, or doing wrong?
This prints the correct destination address, whereas I expected to get the correct result only after skipping the first 8 bytes of the buffer.
The preamble is a very low-level concept, handled strictly by the NIC. It's not even visible to the OS, let alone returned by recvfrom.
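For reference, a minimal sketch (not part of the answer) of reading the header the received frame actually starts with, using the kernel's struct ethhdr; an unsigned char buffer also avoids sign-extension in the %02x output:
#include <arpa/inet.h>          /* ntohs */
#include <linux/if_ether.h>     /* struct ethhdr */
#include <stdio.h>

/* 'buffer' holds a frame as returned by recvfrom() on a PF_PACKET/SOCK_RAW socket. */
static void print_eth_header(const unsigned char *buffer)
{
    const struct ethhdr *eth = (const struct ethhdr *)buffer;
    printf("Destination Address: %02x:%02x:%02x:%02x:%02x:%02x\n",
           eth->h_dest[0], eth->h_dest[1], eth->h_dest[2],
           eth->h_dest[3], eth->h_dest[4], eth->h_dest[5]);
    printf("Source Address:      %02x:%02x:%02x:%02x:%02x:%02x\n",
           eth->h_source[0], eth->h_source[1], eth->h_source[2],
           eth->h_source[3], eth->h_source[4], eth->h_source[5]);
    printf("EtherType:           0x%04x\n", ntohs(eth->h_proto));
}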
