This problem might be a common one, but since I don't know the terms associated with it, I couldn't search for it (unless Google accepted entire paragraphs as search queries).
I have a file. It can be a text file, an MP3, a video clip, or even a huge MKV file.
I have access to this file, and now I have to process it in some way so that I get some kind of value or unique identifier, a hash or something, which I store somewhere. This "hash" has to be small, just a few bytes; it shouldn't be half the file size!
Later on, when I am presented with a file again, I have to verify whether it was the same original file using that value I got in step 1. I will NOT have access to the original file this time. All I have will be that value from step 1.
This algorithm should return true if the second file contains exactly the same data, every single bit, as the first file (basically the same file), even if the file name, attributes, location, etc. have all changed.
Basically, I need to know whether I am dealing with the same file, even if it has been moved, renamed, and had all its attributes changed, but WITHOUT having access to both files at the same time.
This has to be OS- and filesystem-independent.
Is there a way to accomplish this?
What you're looking for are cryptographic hash algorithms. Read about them:
http://en.wikipedia.org/wiki/SHA-1
http://en.wikipedia.org/wiki/MD5
Virtually every mainstream language and standard library offers support for calculating hashes.
Your dilemma is simple: compute an MD5 hash (or any other one-way hash) every time you process a file.
Here it is in simple steps (a C sketch of steps 1 and 2 follows the list):
Step 1: Read the file as a stream of bytes, in chunks, so even huge files never have to fit in memory
Step 2: Feed each chunk to the MD5 algorithm and obtain the final hash
Step 3: Check whether your DB already contains that hash
Step 4: If it does, return true
Step 5: If it does not, return false, process the file, and save the hash
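For illustration, here is a minimal sketch of the chunked hashing in C, assuming OpenSSL 1.1+ is available for the EVP digest API (link with -lcrypto). It uses SHA-256; swapping in EVP_md5() gives the MD5 variant:

/* Hash a file in fixed-size chunks so even huge files never have to
 * fit in memory. `out` must hold at least EVP_MAX_MD_SIZE bytes. */
#include <stdio.h>
#include <openssl/evp.h>

int hash_file(const char *path, unsigned char *out, unsigned int *out_len)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx) { fclose(f); return -1; }
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);

    unsigned char buf[64 * 1024];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);

    EVP_DigestFinal_ex(ctx, out, out_len);
    EVP_MD_CTX_free(ctx);
    fclose(f);
    return 0;
}

Streaming in chunks matters here because the question explicitly mentions huge MKV files; loading the whole file into a byte array first would exhaust memory.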
I am trying to delete all files from a directory apart from two (which will be erased, then re-written). One of these files, not to be deleted, contains the names of all files in the folder/directory (the other contains the number of files in the directory).
I believe there are (possibly?) 2 solutions:
Read the names of each file from the un-deleted file and delete them individually until only the final 2 files remain,
or...
Because all other files end in .txt I could use some sort of filter which would only delete files with this ending.
Which of these 2 would be most efficient and how could it be done?
Any help would be appreciated.
You are going to end up deleting files one by one, regardless of which method you use. Any optimizations you make will be minuscule. Without actually timing your algorithms, I'd say they'd both take about the same amount of time (and this would vary from one computer to the next, based on CPU speed, HDD type, etc.). So, instead of debating that, I'll provide code for both of the ways you've mentioned:
Method 1:
import os

def deleteAll(infilepath):
    # The listing file stores one filename per line; strip the trailing
    # newline before deleting, or os.remove will be handed a bad name.
    with open(infilepath) as infile:
        for line in infile:
            fname = line.strip()
            if fname:
                os.remove(fname)
Method 2:
import os

def deleteAll():
    # The two bookkeeping files to keep; everything else gets deleted.
    blacklist = set(['names/of/files/to/be/deleted', 'number/of/files'])
    for fname in (f for f in os.listdir('.') if f not in blacklist):
        os.remove(fname)
I have a 2D matrix with 1100x1600 data points. Initially, I stored it in an ascii-file which I tar-zipped using the command
tar -cvzf ascii_file.tar.gz ascii_file
Now, I wanted to switch to HDF5 files, but they are too large, at least in the way I am using them... First, I write the array into an HDF5 file using the C procedures
H5Fcreate, H5Screate_simple, H5Dcreate, H5Dwrite
in that order. The data is not compressed within the hdf-file and it is relatively large, so I compressed it using the command
h5repack --filter=GZIP=9 hdf5_file hdf5_file.gzipped
Unfortunately, this HDF5 file with the zipped content is still larger than the compressed ASCII file, by almost a factor of 4; see the following table:
file                 size (bytes)
---------------------------------
ascii_file                5721600
ascii_file.tar.gz          287408
hdf5_file                 7042144
hdf5_file.gzipped         1117033
Now my question(s): Why is the gzipped ascii-file so much smaller and is there a way to make the hdf-file smaller?
Thanks.
Well, after reading Mark Adler's comment, I realized that this question is somewhat ill-posed: in the ASCII case, the values are truncated after a certain number of digits, whereas in the HDF5 case the "real" values ("real" = whatever precision the data type I am using has) are stored.
There was, however, one possibility to further reduce the size of my hdf file: by applying the shuffle filter using the option
--filter=SHUF
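For completeness, the filters can also be enabled at write time, so the file never exists uncompressed and no h5repack pass is needed. Here is a minimal C sketch under that assumption, built around the same calls named in the question; the dataset name "/matrix" and the chunk size are arbitrary choices:

/* Write the 1100x1600 matrix with chunking, the shuffle filter, and
 * gzip enabled via the dataset-creation property list. */
#include "hdf5.h"

int write_compressed(const char *path, const double *data)
{
    hsize_t dims[2]  = {1100, 1600};
    hsize_t chunk[2] = {100, 100};   /* chunk size is a tuning knob */

    hid_t file  = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);    /* filters require chunked layout */
    H5Pset_shuffle(dcpl);            /* byte-shuffle before deflate */
    H5Pset_deflate(dcpl, 9);         /* gzip, level 9 */

    hid_t dset = H5Dcreate2(file, "/matrix", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

The shuffle filter is added to the property list before deflate because HDF5 applies filters in the order they were registered.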
I'm writing a program/utility in C to find (and then move to a new directory) the files in the current directory that have been modified after the last time the utility was run.
What I'm trying to find out is if there is a way to find the last time this utility ran. Or alternatively, a way to store the time in the program (so as to compare the last stored time against the current time, and then update the "last time" variable to current time).
As I type this it occurs to me that I could write the time to a file (overwriting the single entry as the utility is run) and retrieve the value from the file in the program, although I don't know if this would be the best approach.
You can make a small class containing the info and serialize it to a text file; it's easy to access and can store multiple values.
Then, to store new values, simply delete the file and create it again.
Another approach could be a registry key containing the information (Windows-only).
Hope it's useful ;)
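In C (the language of the asker's utility), a minimal sketch of that file-based approach might look like this; the state-file name ".lastrun" and the plain-text format are arbitrary choices:

#include <stdio.h>
#include <time.h>

/* Returns the previously stored time, or 0 if no state file exists yet. */
time_t load_last_run(const char *path)
{
    long long t = 0;
    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%lld", &t);
        fclose(f);
    }
    return (time_t)t;
}

/* Overwrites the state file with the current time. */
void save_last_run(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f) {
        fprintf(f, "%lld\n", (long long)time(NULL));
        fclose(f);
    }
}

Typical usage: call load_last_run(".lastrun") at startup, compare each file's modification time against it, and call save_last_run(".lastrun") just before exiting.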
You can use the last access time from the filesystem (on GNU/Linux you can use ls -lu to see the last access time).
This is not a portable solution, because it depends on the filesystem and the filesystem settings (see JoachimPileborg's edit below).
Moreover, look at this question to get the last access time in C (use atime instead of mtime).
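For reference, here is a minimal POSIX sketch that reads the access time with stat(); keep in mind that filesystems mounted with noatime or relatime may not update atime on every read:

#include <stdio.h>
#include <time.h>
#include <sys/stat.h>

int print_atime(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    /* ctime() formats the time_t and appends a newline itself. */
    printf("last access: %s", ctime(&st.st_atime));
    return 0;
}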
I am reading info (numbers) from a txt file, and after that I am adding to those numbers others I had in another file with the same structure.
At the start of each line in the file is a number that identifies a specific product. That code allows me to search for the same product in the other file. In my program I have to add the other "variables" from one file to the other, and then replace the line in place in one of those files.
I didn't open either of those files with a or a+; I did it with r and r+, because I want to replace the information in lines that may be in the middle of the file, not at the end of it.
The program compiles and runs, but when it comes to replacing the info in the file, it just doesn't do anything.
How should I resolve the problem?
A program can replace (overwrite) text in the middle of a file, but only with replacement text of exactly the same length. Note also that on a stream opened in "r+" mode, the C standard requires a call to fseek (or fflush) when switching between reading and writing; forgetting this is a common reason why the writes silently do nothing.
In order to insert larger or smaller text (and close up the gap), a new text file must be written. This assumes the file is not fixed-width. The fundamental rule is to copy all original text before the insertion point to a new file, write the new text, and finally write the remaining original text. This is a lot of work and will slow down even the simplest programs.
I suggest you design your data layout before you go any further. Also consider using a database, see my post: At what point is it worth using a database?
Your objective is to design the data to minimize duplication and data fetching.
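To illustrate in-place replacement under a fixed-width layout, here is a minimal C sketch; the record width REC_LEN and the zero-based record index are hypothetical choices, and variable-length lines cannot be patched this way:

#include <stdio.h>

#define REC_LEN 32   /* hypothetical fixed record width, newline included */

int replace_record(const char *path, long index, const char *newrec)
{
    FILE *f = fopen(path, "r+");
    if (!f) return -1;

    /* Pad the new record with spaces so it exactly fills its slot. */
    char buf[REC_LEN + 1];
    snprintf(buf, sizeof buf, "%-*s", REC_LEN - 1, newrec);
    buf[REC_LEN - 1] = '\n';

    /* fseek both positions the stream on the record and satisfies the
     * read/write switching rule for "r+" streams. */
    if (fseek(f, index * REC_LEN, SEEK_SET) != 0) { fclose(f); return -1; }
    fwrite(buf, 1, REC_LEN, f);

    fclose(f);
    return 0;
}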
Consider the following task:
1) read a target directory's contents, pass each found dirent structure to some filter function, and remember the filtered elements somehow for later processing
2) some time later, iterate through the filtered elements and process them (do some I/O)
The most obvious way is to save names of sub-directories.
However, I want to keep memory usage to the minimum and to avoid additional I/O.
According to the POSIX manuals, I can save the position of each directory entry using telldir() and restore it later using seekdir(). To keep these positions valid, I have to keep the target directory open and not call rewinddir().
Keeping a directory stream open and storing a list of directory positions (long ints) seems to be an appropriate solution.
However, it is unclear whether stored positions remain valid after the folder is modified. I didn't find any comments on these conditions in the POSIX standard.
1) Do stored positions remain valid when new directory entries are merely added or removed?
2) Do stored positions of unmodified directory entries remain valid if some of the filtered directory entries were removed?
3) Is it possible for a stored position to point to a different directory entry after the folder is modified?
It is easy to test and find the answer to these questions on a particular system, but I would like to know what the standards say on this topic.
Thank you
Until you call rewinddir or close and reopen the directory, your view of the directory contents should not change. Sorry, I don't have the reference handy; I'll find it later if you need it.
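As a sketch of the telldir()/seekdir() approach discussed above (the "data" name prefix is just a stand-in for the real filter function):

#include <stdio.h>
#include <string.h>
#include <dirent.h>

#define MAX_SAVED 1024

int process_filtered(const char *dirpath)
{
    DIR *dir = opendir(dirpath);
    if (!dir) return -1;

    long saved[MAX_SAVED];
    size_t nsaved = 0;
    struct dirent *ent;

    /* Pass 1: record the stream position *before* each matching entry,
     * so a later seekdir() + readdir() returns that entry again. */
    for (;;) {
        long pos = telldir(dir);
        ent = readdir(dir);
        if (!ent) break;
        if (nsaved < MAX_SAVED && strncmp(ent->d_name, "data", 4) == 0)
            saved[nsaved++] = pos;
    }

    /* Pass 2: revisit each saved position and process the entry. The
     * stream stays open and rewinddir() is never called, as the
     * question requires. */
    for (size_t i = 0; i < nsaved; i++) {
        seekdir(dir, saved[i]);
        ent = readdir(dir);
        if (ent)
            printf("processing %s\n", ent->d_name);
    }

    closedir(dir);
    return 0;
}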