There are two computers holding the same large set of files. How do we find out whether any one of the files has changed slightly on one of the computers? The network connection between these computers is very slow.
You can use the md5sum utility. For Windows, see [this](https://support.microsoft.com/en-us/help/889768/how-to-compute-the-md5-or-sha-1-cryptographic-hash-values-for-a-file); for Linux, run md5sum filename on each machine and then compare the hash values.
One idea would be to generate a hash for each file. A hash condenses input of arbitrary length to a fixed-size value. You could further hash the hashes together, then send that single value over and compare. Hashing is used extensively to verify that downloads are not corrupt.
You could hash the files and compare the hashes via the network.
A good hash function is designed so that even a small difference in the input produces a completely different output. Furthermore, most hash functions nowadays have an output length of 160-512 bits. This means that even if you want to compare two files that are several gigabytes in size, you only need to send a small string of at most 512 bits over the network to see whether the hashes match.
If you have millions of files, even that might already be too much traffic. A solution would look like this:
1. Hash each file on each computer.
2. Concatenate the hashes and hash the concatenated string again.
3. Compare this output; if it differs, you know that there is a difference somewhere in those files.
To find which file differs (or even where exactly in the file), you can use binary search:
4. Split the millions of files into two halves and apply steps 1-3 to each half (if you have enough space, you can cache the hash of each file to speed this up).
5. For each half whose hash differs, repeat steps 4-6 recursively.
6. Once you have located a file that differs, split the file itself by the number of lines and proceed as in steps 4-6.
At some point the number of lines will be so small that the hash would be longer than the actual content of the lines; at that point it is of course more efficient to compare the actual content directly.
Assuming only one file differs, this needs only a logarithmic number of hashes to be sent over the network, which minimizes the traffic. A rough sketch of the scheme is given below.
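A minimal sketch of steps 1-3 in C++, using FNV-1a purely as a self-contained stand-in for a real cryptographic hash such as MD5 or SHA-256; the command-line usage and file list are illustrative assumptions:

// Minimal sketch of the hash-and-narrow-down idea. FNV-1a stands in for a
// real cryptographic hash (MD5/SHA-256) only to keep the example
// self-contained; in practice use a proper hash library.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

// FNV-1a over a byte buffer (illustrative stand-in for a cryptographic hash).
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Hash the contents of a whole file.
std::uint64_t hash_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    return fnv1a(contents);
}

// Steps 1-2: hash each file, then hash the concatenation of the hashes.
// Only this single combined value has to cross the slow network link.
std::uint64_t combined_hash(const std::vector<std::string>& paths) {
    std::string concatenated;
    for (const auto& p : paths) {
        concatenated += std::to_string(hash_file(p));
    }
    return fnv1a(concatenated);
}

int main(int argc, char* argv[]) {
    // Hypothetical usage: pass the file list on the command line on each
    // machine and compare the printed value out of band (step 3). If the
    // values differ, split the list in half and repeat (steps 4-6).
    std::vector<std::string> paths(argv + 1, argv + argc);
    std::cout << combined_hash(paths) << "\n";
}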
Suppose I'm designing a file manager and, hypothetically, want to implement searching for a file by its type. Which of these methods would be more efficient?
Use the name of the file and trim the extension from each filename.
Use the specific bytes that identify the type of file we are searching for, for example in the case of JPEG images:
bytes 0xFF, 0xD8 indicate start of image
bytes 0xFF, 0xD9 indicate end of image
Since you have to know a file's name before opening it, the extension-trimming option will probably be faster. However, you could get false results with that method if an extension does not match the actual file type.
Doing it that way will save you some system calls (open, read, maybe fseek, close). A sketch of this extension check is shown below.
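A minimal sketch of such an extension check, assuming C++17's std::filesystem and a placeholder directory path:

// Minimal sketch of the extension-based check using std::filesystem (C++17).
// The directory path is a placeholder for illustration.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main() {
    for (const auto& entry : fs::directory_iterator("/path/to/files")) {
        // extension() returns e.g. ".jpg"; no file content is read.
        if (entry.is_regular_file() && entry.path().extension() == ".jpg") {
            std::cout << entry.path() << "\n";
        }
    }
}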
Assuming your goal really is "search for a file by its type" without further restrictions, you have to do it by checking the actual data.
But you might be OK with some false positives and false negatives. If you search for image files by extension only, you can miss an image saved as "image.jpg?width=1024&height=800" instead of "image.jpg" (a false negative), or match a file named "image.jpg" that is really an executable (a false positive).
You can, on the other hand, check the first few bytes of the file -- most image formats have a distinctive header. This method has far fewer points of failure. You can get a false positive if a chunk of random data happens to start with bytes resembling an image header: possible, but highly unlikely. You can get a false negative if the header got stripped (e.g. in transport, somehow, or by a bad script that produced the file): also possible and also unlikely, perhaps even more so. A sketch of such a header check follows.
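A minimal sketch of the header check, using the JPEG start-of-image marker already quoted in the question (the path is a placeholder):

// Minimal sketch of a header check for JPEG files: read the first two bytes
// and compare them with the 0xFF 0xD8 start-of-image marker.
#include <cstddef>
#include <cstdio>

bool looks_like_jpeg(const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    unsigned char header[2] = {0, 0};
    std::size_t n = std::fread(header, 1, 2, f);
    std::fclose(f);
    return n == 2 && header[0] == 0xFF && header[1] == 0xD8;
}

int main() {
    return looks_like_jpeg("/path/to/candidate") ? 0 : 1;
}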
The small Unix tool file does exactly that, and it once had an easy-to-parse text file you could reuse for your own project. Nowadays it is a large folder of individual files that doesn't even get installed, only shipped in a precompiled form. You can find the folder with the text files online, e.g. at http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/file/saucy/files/head:/magic/Magdir/ The format is described in the manpage magic(5), which is also online, e.g. at https://linux.die.net/man/5/magic
I wish to compare a value at a particular location in a binary file (say, the value at index n * i, where i = 0, 1, 2, 3, ... and n is any fixed number, say 10).
I want to see whether that value is equal to another, say m. The value of interest only ever occurs at offsets of the form n * i.
I can think of three methods to do this:
I maintain a temp variable which stores the value of n * i, use fseek to go directly to that index, and check whether the value there is equal to m.
I do an fseek for the value of m in the file.
I search for the value of m in locations 0, n, 2n, 3n,... and so on using fseek.
I don't know how each of these operations works, but which one of them is the most efficient with respect to space and time taken?
Edit:
This process is a part of a bigger process, which has many more files and hence time is important.
If there is any other way than using fseek, please do tell.
Any help appreciated. Thanks!
Without any prior knowledge of the values and ordering in the file you are searching, the quickest way is just to look through the file linearly and compare values.
You could do this by calling fseek() repeatedly, but repeatedly calling fseek and read may be slower than reading the file in big chunks and scanning them in memory, because system calls have a lot of overhead.
However, if you are doing a lot of searches over the same files, you would be better off building an index and/or sorting your records. One way to do this would be to put the data into a relational database with built-in indexes (pretty much any SQL database). A chunked linear scan might look like the sketch below.
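A minimal sketch of the chunked-read linear scan, assuming one-byte values at fixed offsets; the file name, the stride n and the target m are placeholders:

// Minimal sketch: read the file in large chunks and check the byte at every
// offset that is a multiple of n, avoiding one fseek/read pair per record.
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 10;          // record stride in bytes (placeholder)
    const unsigned char m = 0x42;      // value to look for (placeholder)

    FILE* f = std::fopen("data.bin", "rb");
    if (!f) return 1;

    std::vector<unsigned char> buf(1 << 16);   // 64 KiB per read
    std::size_t file_offset = 0;
    std::size_t bytes_read;
    while ((bytes_read = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
        // Visit every offset that is a multiple of n inside this chunk.
        std::size_t first = (file_offset % n == 0) ? 0 : n - file_offset % n;
        for (std::size_t i = first; i < bytes_read; i += n) {
            if (buf[i] == m) {
                std::printf("match at offset %zu\n", file_offset + i);
            }
        }
        file_offset += bytes_read;
    }
    std::fclose(f);
}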
Edit:
Since you know your file is sorted, you can use a binary search.
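A minimal sketch of binary search over the file with fseek, under the assumption that it consists of fixed-size n-byte records sorted by a one-byte key at the start of each record (the file name, n and the target key are placeholders):

// Minimal sketch: binary search over sorted fixed-size records in a file.
#include <cstdio>

int main() {
    const long n = 10;                 // record size in bytes (placeholder)
    const unsigned char target = 0x42; // key to look for (placeholder)

    FILE* f = std::fopen("data.bin", "rb");
    if (!f) return 1;

    std::fseek(f, 0, SEEK_END);
    long record_count = std::ftell(f) / n;

    long lo = 0, hi = record_count - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        unsigned char key;
        std::fseek(f, mid * n, SEEK_SET);   // jump straight to record mid
        if (std::fread(&key, 1, 1, f) != 1) break;
        if (key == target) {
            std::printf("found at record %ld (offset %ld)\n", mid, mid * n);
            break;
        } else if (key < target) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    std::fclose(f);
}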
In my program, I'm using file.open(path_to_file);.
On the server side, I have a directory that contains plenty of files, and I'm afraid the program will take longer to run as the directory grows bigger, because of the file.open();
//code:
#include <fstream>

std::ofstream file;
file.open("/mnt/srv/links/154"); // 154 is the link id; the directory /mnt/srv/links contains plenty of files
// write to file
file.close();
Question: can the time to execute file.open() vary with the number of files in the directory?
I'm using debian, and I believe my filesystem is ext3.
I'm going to try to answer this - however, it is rather difficult, as it would depend on, for example:
What filesystem is used - in some filesystems, a directory consists of an unsorted list of files, in which case the time to find a particular file is O(n) - so with 900000 files, it would be a long list to search. Others use a hash table or a sorted list, allowing O(1) and O(log2(n)) lookups respectively - of course, each component of a path has to be looked up individually. For 900k entries, O(n) is 900000 times slower than O(1), and log2(900k) is just under 20, so a sorted lookup is roughly 45000 times "faster" than a linear one. However, with 900k files even a binary search may take some doing, because if each directory entry is around 100 bytes [1], we're talking about 85MB of directory data, so several sectors have to be read in even if we only touch 19 or 20 different places.
The location of the file itself - a file located on my own hard-disk will be much quicker to get to than a file on my Austin,TX colleague's file-server, when I'm in England.
The load of any file-server and comms links involved - naturally, if I'm the only one using a decent setup of a NFS or SAMBA server, it's going to be much quicker than using a file-server that is serving a cluster of 2000 machines that are all busy requesting files.
The amount of memory and overall memory usage on the system with the file, and/or the amount of memory available in the local machine. Most modern OS's will have a file-cache locally, and if you are using a server, also a file-cache on the server. More memory -> more space to cache things -> quicker access. Particularly, it may well cache the directory structure and content.
The overall performance of your local machine. Although nearly all of the above factors are important, the simple effort of searching files may well be enough to make some difference with a huge number of files - especially if the search is linear.
[1] A directory entry will have, at least:
A date/time for access, creation and update. With 64-bit timestamps, that's 24 bytes.
Filesize - at least 64-bits, so 8 bytes
Some sort of reference to where the file is - another 8 bytes at least.
A filename - variable length, but one can assume an average of 20 bytes.
Access control bits, at least 6 bytes.
That comes to 66 bytes. But I feel that 100 bytes is probably more typical.
Yes, it can. That depends entirely on the filesystem, not on the language. The times for opening/reading/writing/closing files are all dominated by the times of the corresponding syscalls. C++ should add relatively little overhead, even though you can get surprises from your C++ implementation.
There are a lot of variables which might affect the answer to this, but the general answer is that the number of files will influence the time taken to open a file.
The biggest variable is the filesystem used. Modern filesystems use directory index structures such as B-trees, so looking up a known file is a relatively fast operation. On the other hand, listing all the files in the directory or searching for subsets using wildcards can take much longer.
Other factors include:
Whether symlinks need to be traversed to identify the file
Whether the file is local or mounted over a network
Caching
In my experience, on a modern filesystem, an individual file can be located in a directory containing hundreds of thousands of files in well under a second. One way to check this on your own system is sketched below.
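If in doubt, you can simply measure it: a minimal sketch that times the open() call from the question with std::chrono (the path is taken from the question and may need adjusting):

// Minimal sketch: time a single open()/close() pair on a file in the
// large directory to see whether directory size matters on your system.
#include <chrono>
#include <fstream>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();

    std::ofstream file;
    file.open("/mnt/srv/links/154");   // same call as in the question
    file.close();

    auto end = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "open+close took " << us.count() << " us\n";
}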
It's said that DES is insecure. I guess that's because the key is 55 bits long, so brute force would take at most 2^55 iterations to find the key, which is not many nowadays. But if we iterate over all 2^55 keys, how do we know when to stop?
It's 2^56, not 2^55.
There are a couple of options for how to know when to stop. One is that you're doing a known-plaintext attack -- i.e., you know the actual text of a specific message, use that to learn the key, then use the key to read other messages. In many cases you won't know the full text, but may know some pieces anyway -- for example, you may know something about an address block that's used with all messages of the type you care about, or an encrypted file may have a recognizable header even though the actual content is unknown.
If you don't know (any of) the text of any message, you generally rely on the fact that natural languages are fairly redundant -- quite a bit is known about their structure. For a few examples, in English you know that a space is generally the most common character, e is the most common letter, almost no word has more than two identical letters in a row, nearly all words contain at least one vowel, and so on. In a typical case you do a couple of levels of statistical analysis: a really simple one that rules out most possibilities very quickly, and for those that pass, a second analysis that rules out the vast majority of the rest fairly quickly as well.
When you're done, it's possible you may need human judgement to choose between a few possibilities -- but in all honesty, that's fairly unusual. Statistical analysis is generally entirely adequate.
I should probably add that some people find statistical analysis problematic enough that they attempt to prevent it, such as by compressing the data with an algorithm like Huffman compression to maximize entropy in the compressed data.
It depends on the content. Any key will produce some output, so there's no automatic way to know you've found the correct key unless you can guess what sort of thing you're looking for. If you expect that the encrypted data is text, you can check whether each decrypt contains mostly ASCII letters; similarly, if you expect that it's a JPEG file, you can check whether the decrypt starts with a JPEG header (the bytes 0xFF 0xD8, with "JFIF" appearing shortly afterwards).
If you expect that the data is not compressed, you can run various entropy tests on the decrypts, looking for decrypts with low entropy.
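A minimal sketch of such a plausibility test, scoring a candidate decrypt by its fraction of printable ASCII bytes; the 0.95 threshold is an arbitrary illustrative choice, and in a real attack this would run once per candidate key inside the brute-force loop:

// Minimal sketch of a plaintext-recognition heuristic for candidate decrypts.
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <vector>

bool looks_like_text(const std::vector<unsigned char>& decrypt) {
    if (decrypt.empty()) return false;
    std::size_t printable = 0;
    for (unsigned char b : decrypt) {
        // Count printable characters plus common whitespace.
        if (std::isprint(b) || b == '\n' || b == '\r' || b == '\t') {
            ++printable;
        }
    }
    return static_cast<double>(printable) / decrypt.size() > 0.95;
}

int main() {
    std::vector<unsigned char> sample = {'H', 'e', 'l', 'l', 'o', '!'};
    std::printf("%s\n", looks_like_text(sample) ? "plausible plaintext" : "reject");
}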
I have a repository where I store all my image files. I know that many of the images are duplicated, and I want to delete all but one copy of each.
I thought that if I generate a checksum for each image file and rename the file to its checksum, I can easily find duplicates by examining the filenames. But the problem is that I cannot decide which checksum algorithm to use. For example, if I generate the checksums using MD5, can I really trust that identical checksums mean the files are exactly the same?
Judging from the response to a similar question on the security forum (https://security.stackexchange.com/a/3145), the collision rate is about 1 collision per 2^64 messages. If your files are different and your collection is not huge (i.e. nowhere near this number), MD5 can be used safely.
Also, see response to a very similar question here: How many random elements before MD5 produces collisions?
The chance of getting the same checksum for two different files is extremely slim, but collisions can never be absolutely ruled out (pigeonhole principle). An indication of how slim: Git uses SHA-1 checksums for software development source code, including the Linux kernel, and this has never caused any known problems, so I would say you are safe. If you are really paranoid, I would use SHA-1 instead of MD5 because it is slightly stronger.
To be really sure, you had best follow a two-step procedure: first calculate a checksum for every file. If the checksums differ, you know the files are not identical. If you find files with equal checksums, there is no way around a bit-by-bit comparison to be 100% sure they are really identical. This holds regardless of the hashing algorithm used.
What you gain is a massive time saving, as doing a bit-by-bit comparison for every possible pair of files would take forever and a day, while comparing a handful of candidates is fairly easy. A sketch of this two-step procedure is below.
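A minimal sketch of that two-step procedure, using FNV-1a purely as a self-contained stand-in for MD5 or SHA-1 (the directory path is a placeholder):

// Minimal sketch: group files by a hash of their contents (step 1), then
// confirm candidate duplicates with a byte-by-byte comparison (step 2).
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

std::string read_all(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(in)),
                       std::istreambuf_iterator<char>());
}

std::uint64_t fnv1a(const std::string& data) {   // illustrative stand-in hash
    std::uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

int main() {
    // Step 1: bucket files by checksum.
    std::unordered_map<std::uint64_t, std::vector<fs::path>> buckets;
    for (const auto& entry : fs::directory_iterator("/path/to/images")) {
        if (entry.is_regular_file())
            buckets[fnv1a(read_all(entry.path()))].push_back(entry.path());
    }

    // Step 2: within each bucket, confirm with a byte-by-byte comparison.
    for (const auto& [checksum, paths] : buckets) {
        for (std::size_t i = 0; i + 1 < paths.size(); ++i) {
            for (std::size_t j = i + 1; j < paths.size(); ++j) {
                if (read_all(paths[i]) == read_all(paths[j]))
                    std::cout << paths[j] << " duplicates " << paths[i] << "\n";
            }
        }
    }
}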