I'm working on a project in Go that needs to index recently added file content (using a framework called bleve), and I'm looking for a way to get the content added to a file since its last modification. My current workaround is to record the last indexed position of each file; on the next indexing run I only read the file content starting from the previously recorded position.
So I wonder if there's any library or built-in functionality for this? (doesn't need to be restricted to go, any language could work)
I'll really appreciate it if anyone has a better idea than my work-around as well!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
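A minimal sketch of that offset bookkeeping, written in Python rather than Go just to keep it short. The function name read_appended is made up for illustration, and the "file shrank, so re-read everything" fallback is my own assumption about how you would want to handle truncation or rotation:

```python
import os

def read_appended(path, last_offset):
    """Return (new_bytes, new_offset) for content added after last_offset."""
    size = os.path.getsize(path)
    if size < last_offset:
        # Assumption: a shorter file means it was truncated or rotated,
        # so start over from the beginning.
        last_offset = 0
    with open(path, "rb") as f:
        f.seek(last_offset)
        data = f.read()
    # Store the returned offset and pass it in on the next indexing run.
    return data, last_offset + len(data)
```

You would persist the returned offset per file (next to the index, for example) and feed only the returned bytes to the indexer.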
If the changes can happen anywhere, and the changes are mostly replacing old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing a checksum for each small chunk, and only index those chunks that have different checksums.
You can check out crypto package in Go standard library for computing hashes.
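Roughly, the chunk-checksum idea could look like the following. This is a Python sketch using hashlib instead of Go's crypto package; the 4096-byte chunk size and the function names are arbitrary choices of mine. Note that it only works well for in-place overwrites: an insertion shifts every later chunk and makes them all look changed.

```python
import hashlib

def chunk_digests(path, chunk_size=4096):
    """Return one SHA-256 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests

def changed_chunks(old_digests, new_digests):
    """Return indexes of chunks that are new or whose digest differs."""
    changed = []
    for i, digest in enumerate(new_digests):
        if i >= len(old_digests) or old_digests[i] != digest:
            changed.append(i)
    return changed
```

Chunk i covers bytes i * chunk_size up to (i + 1) * chunk_size, so the changed indexes tell you which byte ranges to re-index.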
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
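To give an idea of what a diff-based approach returns, here is a small Python sketch using the standard difflib module (not the delta library linked above, whose API I'm not going to guess at). The function name inserted_lines is hypothetical, and it assumes you kept the previously indexed text around:

```python
import difflib

def inserted_lines(old_text, new_text):
    """Return lines present in new_text that a line diff reports as added."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new lines on the b side.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added
```

Only the returned lines would need to be indexed; deleted lines show up as "delete" opcodes if you also need to remove them from the index.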
If you're running on a Unix-like system, you could just use tail. If you tell it to follow the file (tail -f), the process will keep waiting after reaching the end of the file. You can invoke this from your program with os/exec and pipe its Stdout to your program. Your program can then read from it periodically or with blocking reads.
The only way I can think of to do this natively in Go is like how you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail
Related
So I know that each file may use a bunch of clusters with each cluster having a pointer to the next cluster having the rest of the file, but I don't understand what happens when we try to delete or insert something in the middle of a file in a certain sector. How is this issue resolved in FAT?
The first Idea that came to me, was shifting the data, but that doesn't seem to be a very efficient approach.
So I know that each file may use a bunch of clusters with each cluster
having a pointer to the next cluster having the rest of the file, but
I don't understand what happens when we try to delete or insert
something in the middle of a file in a certain sector. How is this
issue resolved in FAT?
Generally speaking, you can't delete or insert in the middle of a file, by which I mean that these operations are not directly supported by filesystem drivers. You modify a file by writing a block of data starting at a particular offset from the start of the file. You use writes to implement insertions or deletions, and this is managed at the userspace level, not the filesystem level.
The first Idea that came to me, was shifting the data, but that
doesn't seem to be a very efficient approach.
There are two basic options:
you overwrite the tail of the file in place, starting at the starting position of the insertion or deletion. This is effectively a shift, yes, and the program has to manage that itself.
you write a whole new file, then replace the original with it.
The latter is usually the preferred option for modifications other than at the end of the file, because the original file remains in a consistent state throughout the process, and because you don't need additional intermediate storage for the portion of the file contents that need to be shifted.
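As an illustration of the second option, a minimal Python sketch might look like this. The function name insert_bytes is mine, and the sketch assumes the temporary file can live in the same directory as the original so that the final rename stays on one filesystem:

```python
import os
import tempfile

def insert_bytes(path, offset, data):
    """Insert data at offset by writing a new file and swapping it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            dst.write(src.read(offset))      # unchanged head (read in one go)
            dst.write(data)                  # the inserted bytes
            while True:                      # unchanged tail, copied in blocks
                block = src.read(64 * 1024)
                if not block:
                    break
                dst.write(block)
        # The original stays intact until this rename replaces it.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

One thing the sketch ignores: mkstemp creates the new file with restrictive permissions, so real code would probably copy the original's mode and ownership before the rename.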
None of this is specific to FAT. I can't rule out the possibility that there is some esoteric filesystem or storage medium out there somewhere to which different rules apply, but for the most part, this is the nature of persistent storage.
I have this funny idea: write some data (say variable of integer type) to the end of the executable itself and then read it on the next run.
Is this possible? Is it a bad thing to do (I'm pretty sure it's :) )? How one would approach this problem?
Additional:
I would prefer to do this with C under Linux OS, but answers with any combination of programming language/OS would be appreciated.
EDIT:
After some time playing with the idea, it became apparent that Linux won't allow writing to a file while it's being executed. However, it does allow deleting it.
My vision of the writing process at this point:
make a copy of the program from within the program
append data to the end of the copy
make the program delete itself
rename the copy to the original name
Will try to implement that as soon as I have some time.
If anyone is interested in how "delete itself" works under Linux, look up how inodes work. It's not possible to do this under Windows, as far as I know (might be wrong).
EDIT 2:
Have implemented a working example under Linux with C.
It basically uses the strategy described above, i.e. it appends the data to the end of a copy of the program, deletes itself, and renames the copy to the original name. It accepts an integer to save as a single CLI argument, and prints the old data as well.
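This is not the C implementation itself, just a rough Python sketch of the same copy, append and rename steps applied to an arbitrary file path. The 4-byte marker SAVD, the 8-byte footer layout and both function names are made up for illustration. Renaming the copy over the original also does the "delete itself" step, since the old inode is dropped once nothing refers to it:

```python
import os
import shutil
import struct

MAGIC = b"SAVD"  # hypothetical marker that identifies an appended footer

def read_saved_int(path):
    """Return the integer stored in the last 8 bytes, or None if absent."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        if f.tell() < 8:
            return None
        f.seek(-8, os.SEEK_END)
        footer = f.read(8)
    if footer[:4] != MAGIC:
        return None
    return struct.unpack("<i", footer[4:])[0]

def save_int(path, value):
    """Append a footer to a copy of the file, then rename it over the original."""
    copy_path = path + ".tmp"
    shutil.copy2(path, copy_path)            # keeps the permission bits
    with open(copy_path, "ab") as f:
        f.write(MAGIC + struct.pack("<i", value))
    os.replace(copy_path, path)              # old inode is unlinked here
```

Like the version described above, this appends a new footer on every save instead of reusing the old one.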
This surely won't work under Windows (although I found some options on a quick search), but I'm curious how it's gonna behave under OS X.
Efficiency thoughts:
Obviously copying the whole executable isn't efficient. I guess something faster is possible with a separate helper executable that does the same thing after the program stops executing.
It doesn't reuse the old space but just appends new data to the end on each run. This could be fixed by reserving a fixed-size footer (maybe I will try to implement this in the future).
EDIT 3:
Surprisingly, it works with OS X! (ver. 10.11.5, default gcc).
I am writing a program which outputs a file. This file has two parts of the content. The second part however, is computed before the first. I was thinking of creating a temporary file, writing the data to it. And then creating a permanent file and then dumping the temp file content into the permanent one and deleting that file. I saw some posts that this does not work, and it might produce some problems among different compilers or something.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?
A temporary file can be created. Some people say they have problems with this, but I personally have used them with no issues. Using the platform functions to obtain a temporary file is the best option. Don't assume you can write to c:\ etc. on Windows, as this isn't always possible. Don't assume a filename either, in case a file with that name already exists. Using temporary files incorrectly is what causes people problems, rather than temporary files being bad.
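As a sketch of "let the platform pick the temporary file", here is how it could look in Python with the standard tempfile module. The function name and the compute_first_part callback are hypothetical stand-ins for however the first part actually gets produced; the 32-characters-per-line rule comes from the question:

```python
import shutil
import tempfile

def write_second_part_first(out_path, chars, compute_first_part, width=32):
    """Buffer the second part in a temp file, then emit first part + second."""
    # TemporaryFile picks a safe location and name and removes the file on close.
    with tempfile.TemporaryFile(mode="w+") as tmp:
        for i in range(0, len(chars), width):
            tmp.write(chars[i:i + width] + "\n")   # one line per 32 chars
        first_part = compute_first_part()          # computed after the second
        tmp.seek(0)
        with open(out_path, "w") as out:
            out.write(first_part)
            shutil.copyfileobj(tmp, out)
```

C has tmpfile() for the same purpose, which likewise avoids hard-coding a path or name.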
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank, to come back and fill in later? This would eliminate the need for the temporary file.
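The "leave a blank section and fill it in later" variant could be sketched like this in Python. It assumes an upper bound on the size of the first part is known up front; the names and the space padding are my own choices for illustration:

```python
def write_with_reserved_header(out_path, header_size, second_part, make_header):
    """Reserve header_size bytes, write the second part, then fill the header."""
    with open(out_path, "wb") as f:
        f.write(b" " * header_size)        # placeholder for the first part
        f.write(second_part)               # second part goes right after it
        header = make_header()             # first part, computed last
        if len(header) > header_size:
            raise ValueError("header does not fit in the reserved space")
        f.seek(0)
        f.write(header.ljust(header_size)) # pad with spaces to keep the layout
```

The catch is the padding: if the first part is shorter than the reserved space, the leftover bytes stay in the file, so the format has to tolerate them.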
Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.
I'm making a program and one of the things it needs to do is transfer files. I would like to be able to check before I start moving files if the File system supports files of size X. What is the best way of going about this?
Use a function like ftruncate to create a file of the desired size in advance, before moving anything, and do the appropriate error handling in case it fails.
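A hedged Python sketch of that probe, using os.ftruncate on a scratch file in the destination directory. The function name can_hold_file is made up. On filesystems with sparse files this mostly tests the maximum file size rather than the free space, so a later write can still fail:

```python
import os
import tempfile

def can_hold_file(directory, size):
    """Try to extend a scratch file to size bytes; report whether it worked."""
    fd, probe = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size)             # fails if the size is not supported
        return os.fstat(fd).st_size == size
    except OSError:
        return False
    finally:
        os.close(fd)
        os.unlink(probe)                   # always remove the scratch file
```

For example, calling it with a size above 4 GiB on a FAT32 volume would be expected to return False.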
There's no C standard generic API for this. You could simply try creating a file and writing junk to it until it is the size you want, then deleting it, but even that isn't guaranteed to give you the info you need - for instance another process might have come and written a large file in between your test and your transfer, taking up space you were hoping to use.
A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font notepad is using (or whatever text editor).
Try opening it in any archive extractors you have (e.g. zip, 7z, rar, gz, tar etc.) to see if it's just a renamed file format (.PK3 is something like that).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. do a search for "PNG", which appears right after a 0x89 byte at the start of every PNG file, to find any (uncompressed) png files somewhere within).
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
Sometimes you just have to guess, or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
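The header search and the "look for the offset or length near the start" trick from the tips above can be scripted in a few lines of Python. The signatures listed are the well-known magic bytes for those formats; the function names and the guess that pointers are 32-bit little-endian are assumptions, since the real .DAT layout is unknown:

```python
import struct

SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
    "gzip": b"\x1f\x8b\x08",
    "gif": b"GIF8",
}

def find_signatures(data):
    """Return sorted (offset, name) pairs for every known signature found."""
    hits = []
    for name, magic in SIGNATURES.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(magic, pos + 1)
    return sorted(hits)

def find_value(data, value, limit=None):
    """Return offsets where value appears as a 32-bit little-endian integer."""
    needle = struct.pack("<I", value)      # assumed pointer encoding
    haystack = data if limit is None else data[:limit]
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits
```

Short signatures will produce false positives in a large file, so treat the hits as candidates to inspect in the hex editor, not as proof.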
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
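If you don't have a binary compare tool handy, a naive byte-by-byte comparison is easy to script. This Python sketch (the function name is mine) only reports where two files differ at the same position, so unlike the tools recommended above it will not recognise inserts or deletions; everything after a shifted region simply shows up as different:

```python
def differing_ranges(a, b):
    """Return (start, end) byte ranges where a and b differ position by position."""
    ranges = []
    start = None
    common = min(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i                  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))      # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(a) != len(b):
        # Whatever one file has beyond the other's end counts as a difference.
        ranges.append((common, max(len(a), len(b))))
    return ranges
```

A single short range after a small in-game change suggests a plain field; ranges scattered over the whole file suggest compression or encryption.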
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts), and a way to locate those elements within the archive via pointers in the index.
With standard compression algorithms being so ubiquitous, it's mostly a matter of finding where those blocks start, and trying to hunt down the index, or table of contents.
Some will have the index all in one spot (like a file system does), others will simply precede each element within the archive with its identity information. But in the end somewhere, there is information about offsets from one block to another, there is information about data types (for example, if they're storing GIF files, GIF have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of the data using the same format. For example, with a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.