Registry Search - C

I am trying to use C/C++ (preferably C) to enumerate the entire Windows registry. I was using recursion to do this, but I keep running into stack overflows, which I understand, but I'm unable to think of any way to do this without recursion.
Advice on how to do this without recursion would be great, thanks.

As long as your recursion is just once per level of subkey, I don't see why this should overflow the stack. Sure, the Windows registry is a nightmare, but I don't think its key hierarchies are thousands of levels deep.
I suspect you're using some giant arrays on the stack, which is a bad idea in general but especially with recursion. Try allocating any large data you need with malloc instead.
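For illustration, a minimal sketch of that fix; the function and MAX_KEY_NAME are made up for the example:

#include <stdlib.h>

#define MAX_KEY_NAME 16384   /* illustrative; size it for your data */

void enumerate_level(void)
{
    /* BAD with deep recursion: char name[MAX_KEY_NAME]; every call eats stack */
    char *name = malloc(MAX_KEY_NAME);   /* heap allocation instead */
    if (!name)
        return;
    /* ... enumerate this level using name ... */
    free(name);
}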

A breadth-first search would be an obvious possibility. The basic idea is to use a queue of places to search. Start by putting the root into the queue, then repeat the following steps until the queue is empty:
Get an item from the queue.
Enumerate its contents.
Add any links it contains to the queue.
...where "links" would be "subdirectories" for a file system, "subkeys" for the registry, etc.

Related

How to get added content of a file since last modification

I'm working on a project in golang that needs to index recently added file content (using a framework called bleve), and I'm looking for a solution to get the content of a file since its last modification. My current workaround is to record the last indexed position of each file, and during the indexing process later on only retrieve file content starting from the previously recorded position.
So I wonder if there's any library or built-in functionality for this? (It doesn't need to be restricted to Go; any language could work.)
I'd really appreciate it if anyone has a better idea than my workaround as well!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
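For the append-only case, a minimal C sketch (the asker said any language could work); persisting the offset between runs is left out, and the file name and buffer size are arbitrary:

#include <stdio.h>

long index_from(const char *path, long saved_offset)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return saved_offset;
    fseek(f, saved_offset, SEEK_SET);    /* skip what was already indexed */
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        /* feed buf[0..n) to the indexer here */
    }
    long new_offset = ftell(f);          /* remember this for next time */
    fclose(f);
    return new_offset;
}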
If the changes can happen anywhere, and the changes mostly replace old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing a checksum for small chunks, and only index those chunks that have different checksums.
You can check out crypto package in Go standard library for computing hashes.
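A minimal C sketch of the chunking idea; FNV-1a stands in for whatever hash you pick (the crypto hashes mentioned above work the same way), and the chunk size, file name, and fixed-size hash array are all illustrative:

#include <stdio.h>
#include <stdint.h>

#define CHUNK_SIZE 4096
#define MAX_CHUNKS 1024

static uint64_t fnv1a(const unsigned char *buf, size_t len)
{
    uint64_t h = 1469598103934665603ULL;      /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;                /* FNV prime */
    }
    return h;
}

int main(void)
{
    unsigned char buf[CHUNK_SIZE];
    uint64_t old_hashes[MAX_CHUNKS] = {0};    /* loaded from your index in practice */
    FILE *f = fopen("data.log", "rb");
    if (!f)
        return 1;
    size_t n, chunk = 0;
    while (chunk < MAX_CHUNKS && (n = fread(buf, 1, CHUNK_SIZE, f)) > 0) {
        uint64_t h = fnv1a(buf, n);
        if (h != old_hashes[chunk])
            printf("chunk %zu changed, re-index it\n", chunk);
        old_hashes[chunk++] = h;
    }
    fclose(f);
    return 0;
}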
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
If you're running on a Unix-like system, you could just use tail. If you tell it to follow the file (tail -f), the process will keep waiting after reaching the end of the file. You can invoke this in your program with os/exec and pipe the Stdout to your program. Your program can then read from it periodically or with blocking.
The only way I can think of to do this natively in Go is like how you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail

Open-source solutions for creating a cyclical logfile?

if (!wheel) { wheel = new Wheel(); } // or some such
My google goggles aren't working too well today. I figured this one must have been coded a gazillion times already and was looking for some FOSS code, but couldn't find any.
Before I reinvent the spherical axle-surrounding device, can anyone point me at a URL?
I am coding in C for an embedded system (Atmel UC3), but that shouldn't make any difference, just explain why I need a cyclical logfile (because of limited storage).
I want to log events to a file on an SD card, and when the file reaches a certain size I want to start writing again at the start. Any URLs for that? (Fixed entry size is OK; otherwise it might get nasty on wraparound.)
Thanks a 1,000,000 in advance!
Sourceforge has a project called Cyclic Logs which may be what you need.
If not, it's not the hardest thing to implement. Just treat it like a normal cyclic memory space, but instead of having it be resident in memory, have it reside on the disk. (Maintain pointers to the head and the tail of the log, incrementing them as needed.)
Store those as a header to the log or as another flat file.
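A minimal C sketch of that scheme, assuming fixed-size records and a header at the front of the file holding the write index; all names and sizes are illustrative:

#include <stdio.h>
#include <stdint.h>

#define MAX_RECORDS 1024

typedef struct {
    uint32_t next;    /* index of the slot to write next */
    uint32_t count;   /* number of valid records */
} Header;

typedef struct {
    char text[60];
    uint32_t seq;
} Record;

int log_append(const char *path, const Record *rec)
{
    Header hdr = {0, 0};
    FILE *f = fopen(path, "r+b");
    if (!f) {                                /* create the file on first use */
        f = fopen(path, "w+b");
        if (!f)
            return -1;
        fwrite(&hdr, sizeof hdr, 1, f);
    } else {
        fread(&hdr, sizeof hdr, 1, f);
    }
    /* seek to the slot, write the record, and wrap the index */
    fseek(f, (long)(sizeof hdr + hdr.next * sizeof *rec), SEEK_SET);
    fwrite(rec, sizeof *rec, 1, f);
    hdr.next = (hdr.next + 1) % MAX_RECORDS;
    if (hdr.count < MAX_RECORDS)
        hdr.count++;
    rewind(f);
    fwrite(&hdr, sizeof hdr, 1, f);          /* persist the updated header */
    fclose(f);
    return 0;
}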

LRU caches in C

I need to cache a large (but variable) number of smallish (1 kilobyte to 10 megabytes) files in memory for a C application (in a *nix environment). Since I don't want to eat all my memory, I'd like to set a hard memory limit (say, 64 megabytes), push files into a hash table with the file name as the key, and dispose of the least recently used entries. What I believe I need is an LRU cache.
Really, I'd rather not roll my own, so if someone knows where I can find a workable library, please point the way. Failing that, can someone provide a simple example of an LRU cache in C? Related posts indicated that a hash table with a doubly-linked list is the way to go, but I'm not even clear on how a doubly-linked list keeps track of LRU order.
Side note: I realize this is almost exactly the function of memcache, but it's not an option for me. I also took a look at the source hoping to enlighten myself on LRU caching, with no success.
Related posts indicated that a hash table with a doubly-linked list is the way to go, but I'm not even clear on how a doubly-linked list keeps track of LRU order.
I'm just taking a guess here, but you could do something like this (using pseudo-C here because I'm lazy). Here are the basic data structures:
struct File
{
    // hash key
    char* Name;
    // doubly-linked list
    struct File* Previous;
    struct File* Next;
    // other file data...
};

struct Cache
{
    HashTable* Table;    // some existing hashtable implementation, keyed by name
    struct File* First;  // most recent
    struct File* Last;   // least recent
};
And here's how you'd open and close a file:
File* Open(Cache* cache, string name)
{
    if (look up name in cache->Table succeeds)
    {
        File* found = the node from the hash table lookup
        move found to the front of the list
        return found
    }
    else
    {
        File* newFile = open the file and create a new node for it
        insert newFile at the beginning of the list
        add newFile to cache->Table under name
        if (the cache is over its limit now)
        {
            remove the last file from the list
            close it
            remove it from the hashtable too
        }
        return newFile
    }
}
The hashtable lets you find nodes by name quickly, and the linked-list lets you maintain them in use order. Since they point to the same nodes, you can switch between them. This lets you look a file up by name, but then move it around in the list afterwards.
But I could be totally wrong about all of this.
If you're using Linux, I think the OS will do all you need, especially if you take advantage of the fadvise system call to let the system know what files you plan to use next.
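The portable form of that hint is posix_fadvise(2); a sketch:

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>

/* Tell the kernel we expect to read this file soon so it can start
   prefetching it into the page cache. */
void hint_will_need(int fd)
{
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);   /* offset 0, len 0 = whole file */
}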
koders.com locates a few; the one that's easiest to adapt and reuse (if you're OK with its license conditions) appears to be this one from the FreeType project (will take some figuring out for its, ahem, interesting preprocessor work). At worst, it should show you one approach whereby you can implement a LRU cache in C.
Most reusable LRU cache implementations (and there are many to be found on the net), of course, use handier languages (Java, C++, C#, Python, ...) which offer stronger data structures and, typically, memory management.
It seems you can build a LRU Cache in C with uthash.
What I like most of uthash is that it's a simple header file, with lots of macros, so your extra dependencies are kept to a minimum.
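For illustration, a sketch along the lines of the LRU example in the uthash documentation; MAX_CACHE_SIZE and the entry layout are illustrative. A lookup deletes and re-adds the entry, which moves it to the back of uthash's insertion-ordered list, so the front of the list is always the least recently used entry:

#include <stdlib.h>
#include <string.h>
#include "uthash.h"

#define MAX_CACHE_SIZE 10000

struct CacheEntry {
    char *key;
    char *value;
    UT_hash_handle hh;                /* makes this structure hashable */
};

static struct CacheEntry *cache = NULL;

char *find_in_cache(const char *key)
{
    struct CacheEntry *entry;
    HASH_FIND_STR(cache, key, entry);
    if (!entry)
        return NULL;
    /* remove and re-add: refreshes the entry's position in the order */
    HASH_DELETE(hh, cache, entry);
    HASH_ADD_KEYPTR(hh, cache, entry->key, strlen(entry->key), entry);
    return entry->value;
}

void add_to_cache(const char *key, const char *value)
{
    struct CacheEntry *entry = malloc(sizeof *entry);
    entry->key = strdup(key);
    entry->value = strdup(value);
    HASH_ADD_KEYPTR(hh, cache, entry->key, strlen(entry->key), entry);

    if (HASH_COUNT(cache) > MAX_CACHE_SIZE) {
        struct CacheEntry *oldest, *tmp;
        HASH_ITER(hh, cache, oldest, tmp) {
            /* iteration follows insertion order, so the first entry is
               the least recently used one; evict it and stop */
            HASH_DELETE(hh, cache, oldest);
            free(oldest->key);
            free(oldest->value);
            free(oldest);
            break;
        }
    }
}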
I'm not aware of any general unix environmental libraries in C, but it shouldn't be hard to implement.
For code samples, I suggest looking around at any of the gazillion (oi) hash table implementations out there. Whether the table uses a linked list or a tree structure for the actual processing, it is not uncommon for some form of caching to be used (such as MRU), so it may give you an idea of what an implementation might look like. Some simple Garbage Collectors and various bits of software needing a page replacement algorithm may also be worth a look.
Basically, you mark things when they are accessed and age the references. If you increase the age of an item on access rather than aging every peer of the accessed item, you save a loop on each access and push the work onto the expiration operation. You'll want to do some light profiling to get a general idea of how old an entry must be before it is stale enough to evict for your task. When you get to that point, you just update the cache accordingly.

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500 KB to 1 MB) and have my application open it, loop through it to find the data, and close it, OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
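A sketch of that fan-out naming in C; the one- and two-character prefix levels match the data/F/FO/FOOBAR example above:

#include <stdio.h>

/* Build data/F/FO/FOOBAR from the key FOOBAR. */
void key_to_path(const char *key, char *path, size_t size)
{
    /* use the first one and two characters of the key as directory levels */
    snprintf(path, size, "data/%.1s/%.2s/%s", key, key, key);
}

int main(void)
{
    char path[256];
    key_to_path("FOOBAR", path, sizeof path);
    printf("%s\n", path);                /* prints data/F/FO/FOOBAR */
    return 0;
}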
Alternatively, you can make the single large file perform just as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames; if you've used an index file, you can just create a second index for this situation.
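A sketch of the index-file idea: once the key-offset pairs are loaded and sorted by key, a binary search finds a record's offset without scanning the data file (the 32-byte key limit is arbitrary):

#include <string.h>

typedef struct {
    char key[32];
    long offset;      /* where the record starts in the big file */
} IndexEntry;

/* entries must be sorted by key */
long find_offset(const IndexEntry *entries, size_t n, const char *key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, entries[mid].key);
        if (cmp == 0)
            return entries[mid].offset;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;        /* not found */
}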
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given that your data is only about 1 MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd point out that having one single big file means that your application is doing the management of the records, while having multiple small files means relying on the system and the filesystem to manage the data. The latter can be quite slow, though, because it involves system calls for all your operations.
Opening and closing files in C takes noticeable time. Say you have 500 files of 2 KB each: processing them adds 1,000 extra operations to your application (500 opens and 500 closes), while having one file of 1 MB saves you those 1,000 operations. (That is purely my personal opinion.)
Generally it's better to have multiple small files. It keeps memory usage low, and performance is much better when searching through them.
But it depends on the amount of operations you'll need, because filesystem calls are much more expensive when compared to memory storage for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to linear file search, use a naming convention to pinpoint the file in O(1) time instead).
The general trade-off is that having one big file can be more difficult to update, but having lots of little files is fiddly. My suggestion would be that if you use multiple files and end up having a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put these in one directory it would be a nightmare, but having a directory per user ID makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
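For illustration, a minimal sketch of embedding SQLite in C; it assumes the sqlite3 development headers, linking with -lsqlite3, and made-up table and file names:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("records.db", &db) != SQLITE_OK)
        return 1;
    /* one file on disk, but full SQL: tables, indexes, transactions */
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records(key TEXT PRIMARY KEY, data TEXT);",
        NULL, NULL, NULL);
    sqlite3_exec(db,
        "INSERT OR REPLACE INTO records VALUES('FOOBAR','hello');",
        NULL, NULL, NULL);
    sqlite3_close(db);
    return 0;
}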

Truncate file at front

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
1. As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
2. Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
3. Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don't mind the possibility of things getting out of order in the queue. If you do care about order, there's the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well.
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file systems implementation, and don't see any particular reason this would be difficult. It looks to me like all it would require is another word (dword, perhaps?) per allocation entry to say where the file starts within the block. With 1 terabyte drives under $100 US, it seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?
On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
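A sketch of the collapse operation on Linux; it needs a filesystem that supports it (e.g. ext4 or XFS), and the offset and length must be multiples of the filesystem block size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("queue.dat", O_RDWR);      /* file name is illustrative */
    if (fd < 0)
        return 1;
    /* remove the first 4096 bytes; the rest of the file shifts down */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 4096) != 0)
        perror("fallocate");
    close(fd);
    return 0;
}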
Truncating files at the front seems not too hard to implement at the system level.
But there are issues.
The first one is at the programming level. When opening a file for random access, the current paradigm is to use offsets from the beginning of the file to point to different places in the file. If we truncate at the beginning of the file (or perform insertion or removal in the middle of the file), that is no longer a stable property (while appending or truncating from the end is not a problem).
In other words, truncating the beginning would change the only reference point, and that is bad.
At the system level, uses exist, as you pointed out, but they are quite rare. I believe most uses of files are of the write-once, read-many kind, so even truncate is not a critical feature and we could probably do without it (well, some things would become more difficult, but nothing would become impossible).
If we want more complex access (and there are indeed needs), we open files in random mode and add some internal data structure. This information can also be shared between several files. This leads us to the last issue I see, probably the most important one.
In a sense, when we use random-access files with some internal structure, we are still using files, but we are no longer using the file paradigm. Typical cases are databases, where we want to perform insertion or removal of records without caring at all about their physical location. Databases can use files as a low-level implementation, but for optimisation purposes some database vendors choose to bypass the filesystem completely (think of Oracle partitions).
I see no technical reason why we couldn't do everything that is currently done with files in an operating system using a database as the data-storage layer. I have even heard that NTFS has many points in common with databases in its internals. An operating system can (and probably will, in some not-so-distant future) use a paradigm other than files.
In summary, I believe it's not a technical problem at all, just a change of paradigm. Removing the beginning is definitely not part of the current "file paradigm", but it is not a big and useful enough change to compel changing anything at all.
NTFS can do something like this with its sparse-file support, but it's generally not that useful.
I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)
Actually, there are record-based file systems; IBM has one, and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting at random positions in a file.
There is also a Unix command for this kind of job -- though note that head keeps the front of a file, which is the opposite of lopping it off; tail is what you want. To drop the first 1000 lines:
tail -n +1001 file > file_truncated
You may be able to achieve this goal in two steps: copy the bytes you want to keep down to the front of the file, then truncate. On Linux, sendfile can do the copy, though note that its offset argument is a pointer, that file-to-file copying needs kernel 2.6.33 or later, and that this assumes the source and destination ranges don't overlap:
long fileLength;    //file total length
long reserveLength; //number of bytes at the end of the file to keep
int fd;             //file open for read & write

off_t from = fileLength - reserveLength; //where the kept data starts
lseek(fd, 0, SEEK_SET);                  //write position: start of file
sendfile(fd, fd, &from, reserveLength);  //copy the tail to the front
ftruncate(fd, reserveLength);            //truncate to the kept length
