Find duplicate files using C

I am attempting to write a C program that searches for duplicate files, groups them, and then reports any files that are duplicates. The user can either enter a file path or specify the files to check on the command line (argc/argv). I am going to use stat() as I traverse the file system, and I know I need to use a hash table to bin the files. However, I am a bit lost on how to actually check whether files are duplicates.
I know there are already programs that will do this for you, but this is an academic exercise that I need to complete. I am not looking for the coding answer, just a higher level answer on how I should go about solving the problem. Any feedback is appreciated, including any suggestions other than the ones I have listed above (again, I have to write this program from scratch).
Thanks.

Related

My program relies on hashes to identify files, some are repeated. How can I work around this?

Sorry for the messy title, but I can't come up with something that really describes what's happening here. So I'm making a program that fetches .cue files for PlayStation 1 ROMs. To do this, the program creates a SHA-1 hash of the file and checks it in a database. The database can be found in the "psx.hash" file in this repo. This has been working fine, but I suddenly stumbled upon a very nasty problem: there are plenty of files that have the same hash, because they are essentially the same file.
Let me break down the problem a bit. PSX ROMs are essentially CD files, and they can come in tracks. These tracks usually contain audio, and the .cue file is used to tell the emulator where each audio track is located [in the disc file]. So what I do is identify each and every track file (based on its SHA-1 hash), see if it matches the database, and then construct a link based on its name (minus the track text) to get to the original cue file. Then I read the text and add it to the cue, simple as that. Well, apparently many games use the same track for some reason? Exactly 175 of them.
So... what can I do to differentiate them? This leads to the problem that I fetch the wrong cue file whenever this hash comes into play. This is the hash by the way: "d9f92af296360772e62caa4cb276de3fa74f5538". I tried other algorithms to see if it was just an extremely unlikely coincidence, but nope, all gave the same results. SHA-256 gave the same result, CRC gave the same result, MD5 gave the same result (by the same result I mean the same between files; of course the results of different algorithms for the same file will be different).
So I don't know what to do. This is a giant bug in my program that I have no idea on how to fix, any insight is welcome. I'm afraid I explained myself poorly, if so, I apologize, but I have a hard time seeing where I may not be clear enough, so if you have any doubts please, do ask.
It's worth noting that the database was not constructed by myself, but by redump.org, also, here's the code I'm using to retrieve the hashes of the files:
import hashlib

def getSha1(file):
    # Hash the file in 4 KiB chunks so large track files never have to fit in memory.
    hashSha1 = hashlib.sha1()
    with open(file, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hashSha1.update(chunk)
    return hashSha1.hexdigest()
The correct solution would be to construct the hash file in such a way that I can differentiate between track files for each game, but I ended up doing the following:
Sort the list of Tracks to have them ordered.
Get the first track file and retrieve its hash (this one will always be unique, since it contains the game).
For every next track file that isn't Track 1, assume it belongs to the game before it. So if the next file is Track 2, assume it belongs to the previous file that had Track 1.
This nicely avoids the issue, although it's circumventing the bigger problem of not having properly formatted data.
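The steps above could be sketched roughly as follows. This is only an illustration, not the actual code from the program: the function name assign_tracks and the file naming pattern "Title (Track N).bin" are assumptions made for the example.

```python
import re

# Assumption: track files are named like "Some Game (Track 2).bin".
TRACK_RE = re.compile(r"^(?P<title>.*) \(Track (?P<num>\d+)\)\.bin$")


def _sort_key(name):
    """Order by title, then by numeric track number (so Track 2 sorts before Track 10)."""
    m = TRACK_RE.match(name)
    if m is None:
        return (name, 0)
    return (m.group("title"), int(m.group("num")))


def assign_tracks(filenames):
    """Map each Track 1 file to the later tracks that follow it in sorted order.

    Only the Track 1 file would need to be hashed and looked up; every other
    track is assumed to belong to the most recent Track 1 seen.
    """
    games = {}
    current = None
    for name in sorted(filenames, key=_sort_key):
        m = TRACK_RE.match(name)
        if m is None:
            continue  # not a track file under the assumed naming scheme
        if int(m.group("num")) == 1:
            current = name
            games[current] = []
        elif current is not None:
            games[current].append(name)
    return games
```

With this grouping, a shared audio track is never looked up by its own hash, so the duplicate hash no longer matters.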

How to get added content of a file since last modification

I'm working on a project in Go that needs to index recently added file content (using a framework called bleve), and I'm looking for a way to get the content added to a file since its last modification. My current workaround is to record the last indexed position of each file, and during a later indexing pass to retrieve only the content starting from the previously recorded position.
So I wonder if there's any library or built-in functionality for this? (doesn't need to be restricted to go, any language could work)
I'll really appreciate it if anyone has a better idea than my work-around as well!
Thanks
It depends on how the files change.
If the files are append-only, then you only need to record the last offset where you stopped indexing, and start from there.
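For the append-only case, a minimal sketch might look like this (in Python, since the question allows any language). The function name read_new_content is made up for the example, and restarting from offset 0 when the file has shrunk is an assumption about how truncation should be handled.

```python
import os


def read_new_content(path, last_offset):
    """Return (new_bytes, new_offset) for data appended after last_offset."""
    # Assumption: if the file is now shorter than the recorded offset, it was
    # truncated or replaced, so indexing restarts from the beginning.
    if os.path.getsize(path) < last_offset:
        last_offset = 0
    with open(path, "rb") as f:
        f.seek(last_offset)
        data = f.read()
        return data, f.tell()
```

The caller would persist the returned offset per file and pass it back on the next indexing pass.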
If the changes can happen anywhere, and the changes are mostly replacing old bytes with new bytes (like changing pixels of an image), then perhaps you can consider computing checksums for small chunks, and only index those chunks that have different checksums.
You can check out the crypto package in the Go standard library for computing hashes.
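The chunk-checksum idea could be sketched like this. It uses Python's hashlib rather than Go's crypto package, and the function names and the 4096-byte default chunk size are arbitrary choices for the example.

```python
import hashlib


def chunk_digests(data, chunk_size=4096):
    """Return one SHA-256 digest per fixed-size chunk of data."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_digests, data, chunk_size=4096):
    """Return (indices of chunks that are new or differ, fresh digest list)."""
    new_digests = chunk_digests(data, chunk_size)
    changed = [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]
    return changed, new_digests
```

Note that this only helps when bytes are replaced in place or appended. Inserting or deleting bytes shifts every later chunk, so all of them would be reported as changed.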
If the changes are line insertion/deletion to text files (like changes to source code), then maybe a diff algorithm can help you find the differences. Something like https://github.com/octavore/delta.
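As a rough illustration of the diff approach, the sketch below uses Python's standard difflib module as a stand-in (it is not the library linked above). The function name added_lines is made up for the example, and it assumes the previously indexed lines of the file are still available for comparison.

```python
import difflib


def added_lines(old_lines, new_lines):
    """Return the lines of new_lines that were inserted or that replaced old ones."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    result = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both introduce content that was not indexed before.
        if tag in ("insert", "replace"):
            result.extend(new_lines[j1:j2])
    return result
```

Only the returned lines would then need to be passed to the indexer.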
If you're running in a Unix-like system, you could just use tail. If you specify to follow the file, the process will keep waiting after reaching end of file. You can invoke this in your program with os/exec and pipe the Stdout to your program. Your program can then read from it periodically or with blocking.
The only way I can think of to do this natively in Go is like how you described. There's also a library that tries to emulate tail in Go here: https://github.com/hpcloud/tail

Read own source file and text file line number?

We have an assignment and the teacher doesn't go into depth when explaining things, so I'm a bit confused since I haven't really done much programming before. We have to write a program that, when executed, reads its own source file and creates another text file that is the same as the source file but with a line number on each line. My problem is that I don't understand how to begin. Could someone give me an example of how to get started and what steps to take? I'm not asking for someone to do the programming for me, just to give an example. Thanks in advance.
Roughly the steps you'll want to take are:
Read each line of the input text file
Prepend the line number to the beginning of each line.
Write your modified lines into a new text file.
There's a lot of good information on how to read/write to files here, and string concatenation (for how to prepend the line number) here. You may also want to look into for loops so that you can hit every line in the input file.
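Since the question does not name a language, here is a minimal sketch of those three steps in Python. The function name number_lines and the "N: " prefix format are assumptions made for the example.

```python
def number_lines(src_path, dst_path):
    """Copy src_path to dst_path, prepending a 1-based line number to each line."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # enumerate() supplies the line number; each line keeps its own newline.
        for number, line in enumerate(src, start=1):
            dst.write(f"{number}: {line}")
```

A Python script run from a file could call number_lines(__file__, "numbered.txt") to number its own source, since __file__ holds the path of the running script in that case.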
There are really two parts to your question: "Who am I?" (what file are you) and "Write a copy of myself with line numbers"
The part that you describe above is the first -- "Who am I?" and for that, something external to your source code has to provide the info because the language itself can reside in any file.
Often, information about what's being compiled is made available by the preprocessor (just like it sounds, it's something that is run before your source code is compiled). In this case, "preprocessor macros" commonly give you this sort of environmental data.
Take a look at this link for GNU C: https://gcc.gnu.org/onlinedocs/cpp/Standard-Predefined-Macros.html to start researching what is available under what conditions. Your compiler, if not gcc, should have similar docs.

Temporary File in C

I am writing a program which outputs a file. The file's content has two parts; the second part, however, is computed before the first. I was thinking of creating a temporary file and writing the data to it, then creating the permanent file, dumping the temp file's content into it, and deleting the temp file. I saw some posts saying that this does not work and might cause problems across different compilers or something.
The data is a bunch of chars. Every 32 chars have to appear on a different line. I can store it in a linked list or something, but I do not want to have to write a linked list for that.
Does anyone have any suggestions or alternative methods?
A temporary file can be created. Although some people say they have problems with this, I personally have used them with no issues. Using the platform functions to obtain a temporary file is the best option. Don't assume you can write to c:\ etc. on Windows, as this isn't always possible. Don't assume a filename, in case the file is already in use. Not using temporary files correctly is what causes people problems, rather than temporary files being bad.
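To illustrate letting the platform choose the temporary file, here is a small sketch in Python (in C, the standard tmpfile() function plays a similar role). The function name write_output and the make_header callback are hypothetical, and the 32-characters-per-line wrapping follows the question.

```python
import shutil
import tempfile


def write_output(path, body_chars, make_header):
    """Write make_header(count) followed by the body, wrapped at 32 chars per line."""
    # TemporaryFile picks a safe name and location and removes the file on close.
    with tempfile.TemporaryFile(mode="w+") as tmp:
        count = 0
        for ch in body_chars:
            tmp.write(ch)
            count += 1
            if count % 32 == 0:
                tmp.write("\n")
        # The first part can only be produced once the second part is known.
        header = make_header(count)
        tmp.seek(0)
        with open(path, "w") as out:
            out.write(header)
            shutil.copyfileobj(tmp, out)
```

No filename is ever assumed for the temporary data, which avoids the pitfalls listed above.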
Is there any reason you cannot just keep the second part in RAM until you are ready for the first? Otherwise, can you work out the size needed for the first part and leave that section of the file blank to come back and fill in later? This would eliminate the need for the temporary file.
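The "leave a blank section and fill it in later" idea could look roughly like this. It is a sketch that assumes the first part fits in a fixed-width field; the function name, the 16-byte width, and using the body length as the header are all made up for the example.

```python
def write_with_reserved_header(path, body, header_width=16):
    """Reserve header_width bytes, write the body, then go back and fill in the header."""
    with open(path, "w+b") as out:
        # Placeholder: spaces plus a newline, exactly header_width bytes long.
        out.write(b" " * (header_width - 1) + b"\n")
        out.write(body)
        # Hypothetical first part: here simply the length of the body.
        header = str(len(body)).encode("ascii")
        if len(header) > header_width - 1:
            raise ValueError("header does not fit in the reserved space")
        out.seek(0)
        out.write(header.ljust(header_width - 1) + b"\n")
```

This only works when an upper bound on the size of the first part is known before the second part is written.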
Both solutions you propose could work. You can output intermediate results to a temporary file, and then later append that file to the file that contains the dataset that you want to present first. You could also store your intermediate data in memory. The right data structure depends on how you want to organize the data.
As one of the other answerers notes, files are inherently platform specific. If your code will only run on a single platform, then this is less of a concern. If you need to support multiple platforms, then you may need to special case some or all of those platforms, if you go with the temporary file solution. Whether this is a deal-breaker for you depends on how much complexity this adds compared to structuring and storing your data in memory.

Recommendations for encrypting/decrypting scripts elegantly?[Now on Sourceforge]

Update2:
Thanks for the input. I have implemented the algorithm and it is available for download at SourceForge. It is my first open source project so be merciful.
Update:
I am not sure I was clear enough or everyone responding to this understands how shells consume #! type of input. A great book to look at is Advanced Unix Programming. It is sufficient to call popen and feed its standard input as demonstrated here.
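A rough sketch of that pipe-based idea, written in Python with subprocess instead of C's popen. The function name run_decrypted and the decrypt callback are placeholders, and no particular cipher is implied.

```python
import subprocess


def run_decrypted(interpreter_argv, encrypted, decrypt):
    """Decrypt a script in memory and feed it to an interpreter's standard input."""
    # The plaintext exists only in this process's memory and in the pipe,
    # never in a temporary file.
    script = decrypt(encrypted)
    return subprocess.run(
        interpreter_argv,
        input=script,
        capture_output=True,
        check=False,
    )
```

For example, run_decrypted(["/bin/sh"], data, my_decrypt) would rely on sh reading its program from standard input when no script file is given, where data and my_decrypt are supplied by the caller. One limitation is that the script cannot then use standard input for its own purposes.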
Original Question:
Our scripts run in highly distributed environment with many users. Using permissions to hide them is problematic for many reasons.
Since the first line can be used to designate the "interpreter" for a script, the initial line can be used to define a decrypter:
#!/bin/decryptandrun
*(&(*S&DF(*SD(F*SDJKFHSKJDFHLKJHASDJHALSKJD
SDASDJKAHSDUAS(DA(S*D&(ASDAKLSDHASD*(&A*SD&AS
ASD(*A&SD(*&AS(D*&AS(*D&A(SD&*(A*S&D(A*&DS
Given that I can write the script to encrypt and place the appropriate header I want to decrypt the script (which itself may have an interpreter line such as #!/bin/perl at the top of it) without doing anything dumb like writing it out to a temporary file. I have found some silly commercial products to do this. I think this could be accomplished in a matter of hours. Is there a well known method to do this with pipes rather than coding the system calls? I was thinking of using execvp but is it better to replace the current process or to create a child process?
If your users can execute the decryptandrun program, then they can read it (and any files it needs to read such as decryption keys). So they can just extract the code to decrypt the scripts themselves.
You could work around this by making decryptandrun suid. But then any bug in it could lead to the user getting root privileges (or at least the privileges of the account that holds the decryption keys). So that's probably not a good idea. And of course, if you've gone to all the trouble of hiding the contents or keys of these decryption scripts by making them not readable to the user... then why can't you do the same with the contents of the scripts you're trying to hide?
Also, you can't have a #! interpreted executable as an interpreter for another #! interpreted executable.
And one of the fundamental rules of cryptography is, don't invent your own encryption algorithm (or tools) unless you're an experienced cryptanalyst.
Which leads me to wonder why you feel the need to encrypt scripts that your users will be running. Is there anything wrong with them seeing the contents of the scripts?
Brian Campbell's answer has the right idea, I'll spell it out:
You need to make your script unreadable but executable by the user (jbloggs), and to make decodeandrun setuid. You could make it setuid root, but it would be much safer to make it setgid for some group decodegroup instead, and then set the script file's group to decodegroup. You need to make sure that decodegroup has both read and execute permissions on the script file and that jbloggs is not a member of this group.
Note that decodegroup needs read permission for decodeandrun to be able to read the text of the script file.
With this setup, it is then possible (on Linux at least) for jbloggs to execute the script but not to look at it. But observe that this makes the decryption process itself unnecessary -- the script file might as well be plaintext, since jbloggs can't read it.
[UPDATE: Just realised that this strategy doesn't handle the case where the encrypted contents is itself a script that starts with #!. Oh well.]
You're solving the wrong problem. The problem is that you have data which you don't want your users to access, and that data's stored in a location to which the users have access. Start by attempting to fix the problem of users with more access than they require...
If you can't protect the whole script, you may want to look into just protecting the data. Move it to a separate location and encrypt it. Encrypt the data with a key only accessible by a specific ID (preferably not root), and write a small suid program to access the data. In your setuid program, do your validation of who should be running the program, and compare the name / checksum of the calling program (you can inspect the command line for the process in combination with the calling process's cwd to find the path, use lsof or the /proc filesystem) with the expected value before decrypting.
If it takes more than that, you really need to reevaluate the state of users on the system - they either have too much access or you have too little trust. :)
All of the exec()-family functions you link to accept a filename, not a memory address. I'm not sure at all how you would go about doing what you want, i.e. "hooking" in a decryption routine and then re-directing to the decrypted script's #! interpreter.
This would require you to decrypt the script into a temporary file, and pass that filename to the exec() call, but you (very reasonably) said you didn't want to expose the script by putting it in a temporary file.
If it were possible to tell the kernel to replace a new process with an existing one in memory, you would have a path to follow, but as far as I know, it isn't. So I don't think it will be very easy to do this "chained" #! following.

Resources