How to find processes that holds a file in C - c

I want to find which processes holds some file in C code (Linux).
One way that comes to my mind is looking at proc/<PID>/fd for all running processes.
However, it would take so long time and because of sweeping all files under the fd files of all processes.
Could you give another method that is more lightweight?
Thank you in advance.

Enumerating all the numerical pseudo-files under /proc, and then examining the fd/ directory for each one, is the standard way of doing this. It is the way that utilities like "lsof" are typically implemented. All this data is held in memory, so accessing it should be fast enough for most purposes.

Related

Number of threads running? [duplicate]

If I search for counting the number of threads an application has, all the answers involve external programs like top. I want to count the threads within the application itself.
I can't add code at the point of thread creation because it happens inside an immutable library.
I can't read /proc.
It's a C/pthreads program running on a few different Unices.
If you can't read /proc you are a bit in trouble, unless your program communicate with another program which reads /proc
If you don't want to read /proc because of portability concerns, you might use a library which abstracts that a bit, like libproc does
You could write a tiny wrapper for pthread_create that counts created threads and link against that wrapper after you linked against the immutable library.
Use top -H. But chances are, if you can't read proc, top won't work anyway. If thats the case, there is no easy way and it would depend on your specific system.

Handling lots of FILE pointers

TLDR
Is there a clean way to handle 1 to 65535 files through an entire program without allocating global variables where a lot of it is may never used and without using linked lists (mingw-w64 on windows)
Long Story
I have a tcp-server which allocates data from a lot of clients (up to 65535) and saves them in kind of a database. The "database" is a directory/file structure which looks like this: data\%ADDR%\%ADDR%-%DATATYPE%-%UTCTIME%.wwss where %ADDR% is the Address, %DATATYPE% is the type of data and %UTCTIME% is the utc time in seconds when the first data packet arrived on this socket. So every time a new connection is accepted it should create this file as specified.
How do I handle 65535 FILE handles correctly? First thought: Global variable.
FILE * PV_WWSS_FileHandles[0x10000]
//...
void tcpaccepted(uint16_t u16addr, uint16_t u16dataType, int64_t s64utc) {
char cPath[MAX_PATH];
snprintf(cPath, MAX_PATH, "c:\\%05u\\%05u-%04x-%I64d.wwss", u16addr, u16addr, u16dataType, s64utc);
PV_WWSS_FileHandles[u16addr] = fopen(cPath, "wb+");
}
This seems very lazy, as it will likely never happen that all addresses are connected at the same time and so it allocates memory which is never used.
Second thought: Creating a linked list which stores the handles. The bad thing here is, that it could be quite cpu intensive because I want to do this in a multithreading Environment and when f.e. 400 threads receive new data at the same time they all have to go through the entire list to find there FILE handle.
You really should look at other people's code. Apache comes to mind. Let's assume you can open 2^16 file handles on your machine. That's a matter of tuning.
Now... consider first what a file handle is. It's generally a construct of your C standard library... which is keeping an array (the file handle is the index to that array) of open files. You're probably going to want to keep an array, too, if you want to keep other information on those handles.
If you're concerned about the resources you're occupying, consider that each open network filehandle causes the OS to keep a 4k or 8k (it's configurable) buffer x2 (in and out) along with the file handle structure. That's easily a gigabyte of memory in use at the OS level.
When you do your equivalent of select(), if your OS is smart, you'll get the filehandle back --- so you can use that to index your array of "what to do" for that file handle. If your select() is not smart, you'll have to check every open filehandle ... which would make any attempt at performance a laugh.
I said "look at other people's solutions." I mean it. The original apache used one filehandle per process (effectively). When select()'s were dumb, this was a good strategy. Bad in that typically, dumb OS's would wake too many processes --- but that was circa 1999. These days apache defaults to it's hybrid MPM model... which is a hybrid of multi-threading and multi-tasking. It services a certain number of clients per process (threads) and has multiple processes. This keeps the number of files per process more reasonable.
If you go back further, for simplicity, there's the inetd approach. Fork one (say) ftp process per connect. The world's largest ftp server (ftp.freebsd.org) ran that way for many years.
Do not store file handles in files (silly). Do not store file handles in linked lists (your most popular code route will kill you). Take advantage of the fact that file handles are small integers and use an array. realloc() can help here.
Heh... I see other FreeBSD people have chipped in ... in the comments. Anyways... look up FreeBSD and kqueue() if you're going to try keeping that many things open in one process.

Information about file on Linux system?

I want known if a determinate file is in use by process, i.e. if file is open in read-only mode by that process.
I thought about searching through /proc/[pid]/[fd] directory, but this way I waste a lot of time, and I think that doing this is not beautiful.
Is there any way using some Linux API to determinate if X file is open by any process? Or maybe some structures data like /proc but for files?
Not that I know of. The lsof and fuser tools do precisely what you suggest, wander through /proc/*/fd.
Note that it is possible for open files to not have a name, if the file was deleted after being opened, and it is possible for a file to be open without the process holding a file descriptor (through mmap), and even the combination of both (this would be a process-private swap file that is automatically cleaned up on process exit).
Determining if a process is using a file is easy. The inverse less so. The reason is that the kernel does not keep track of the inverse directly. The information that IS kept is:
A file knows how many links refer to itself (inode table)
A processes knows what files it has open (file descriptor table)
This is why lsof's /proc walking is necessary. The file descriptors in use by a particular process are kept in /proj/$PID (among other things), and so lsof can use this (and other things) to spit out all of the pid <-> fd <-> inode relationships.
This is a nice article on lsof. As with any Linux util, you can always check out its source code for all of the details :)
lsof might be the tool you're searching for.
EDIT: I din't realize you are specifically searching for something to be integrated in your application, so my answer appears a little simplistic. But anyway, I think that this question is pretty much related to yours.

How do I count the number of running threads (pthreads)?

If I search for counting the number of threads an application has, all the answers involve external programs like top. I want to count the threads within the application itself.
I can't add code at the point of thread creation because it happens inside an immutable library.
I can't read /proc.
It's a C/pthreads program running on a few different Unices.
If you can't read /proc you are a bit in trouble, unless your program communicate with another program which reads /proc
If you don't want to read /proc because of portability concerns, you might use a library which abstracts that a bit, like libproc does
You could write a tiny wrapper for pthread_create that counts created threads and link against that wrapper after you linked against the immutable library.
Use top -H. But chances are, if you can't read proc, top won't work anyway. If thats the case, there is no easy way and it would depend on your specific system.

Simulating file system access

I am designing a file system in user space and need to test it. I do not want to use the available benchmarking tools as my requirements are different. So to test the file system I wish to simulate file access operation. To do this, I first use the ftw() function to walk through one f my existing file system(experimental) and list all the files and directories in a file.
Then I invoke a simulator to simulate file access by a number of processes. Thus, the simulator randomly starts a process i.e it forks a thread which does what a real process would have done. The thread randomly selects a file operation (read, write, rename etc) selects arguments to this operation from the list(generated by ftw()) . The thread does a number of such file operations and then exits marking the end of a process. The simulator continues to spawn threads; thread execution can overlap just as real processes do. Now, as operations are performed by threads, files get inserted, deleted, renamed and this is updated in the list of files.
I have not yet started coding. Does the plan seem sane? I am also not sure how to code the simulator...how will it spawn threads over a period of time. Should I be using some random delay to do this.
Thanks
Yep, that seems fairly reasonable to me. I would consider attempting to impose a statistical distribution over your file operations (and accesses to particular files) that is somehow matched to your expected workload. You might be able to find some statistics about typical filesystem workloads as a starting point.
That sounds about right for a decent test case just to make sure it's working. You could use sleep() to wait between spawning threads or just spawn them all at once and have them do an operation then wait a bit, then do another operation, etc... IMO if you hit it hard with a lot of requests and it works then there's a likely chance your filesystem will do just fine. Take an example from PostMark which all it does is append like crazy to different files and other benchmarks that do random access reads/writes in different locations to make sure that the page has to be read from disk.

Resources