Unix: combine a bunch of text files without taking extra disk space? - file

I have a bunch of text files I need to temporarily concatenate so that I can pass a single file (representing all of them) to some post-processing script.
Currently I am doing:
zcat *.rpt.gz > tempbigfile.txt
However, this tempbigfile.txt is 3.3GB, while the original size of the folder with all the *.rpt.gz files is only 646MB! So I'm temporarily quadrupling the disk space used. Of course, once I have called myscript.pl with tempbigfile.txt, I'm done with it and can rm tempbigfile.txt.
Is there a solution to not create such a huge file and still get all those files together in one file object?

You're decompressing the files with zcat, so you should compress the text once more with gzip:
zcat *.rpt.gz | gzip > tempbigfile.txt
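If myscript.pl can read from a pipe rather than requiring a regular (seekable) file, you may also be able to avoid the temporary file entirely. This is only a sketch and depends on how the script opens its input; for example, with bash process substitution:
myscript.pl <(zcat *.rpt.gz)
or with a named pipe (tempbigpipe is just a placeholder name):
mkfifo tempbigpipe
zcat *.rpt.gz > tempbigpipe &
myscript.pl tempbigpipe
rm tempbigpipe
Either way, the concatenated data is streamed to the script and never stored on disk.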

Related

How do I check for a word in multiple txt files at once?

I have a huge data set of around 120 GB that contains several txt files, each 2-3 GB or larger. I want to look for a string inside all those files at once.
I tried Notepad++'s find-in-folder option, but it doesn't allow me to open such large txt files.
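A command-line grep is one way to do this (a sketch, not from the original question); it streams each file rather than loading it into an editor:
grep -l "search string" *.txt
lists the files that contain the string, and
grep -rn "search string" /path/to/dataset
searches a whole folder recursively and prints the matching lines.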

Want to compare files recursively using rsync when I have the file list in a text file on the left-hand side and a folder on the right-hand side

I have a list of files on the left-hand side, in a text file, and on the right-hand side I have a separate folder containing the physical files. I have to compare the left-hand FileList.txt with the right-hand directory's files (recursively) and copy the delta using rsync. I am using the command below but am not getting any files to copy.
Below is the dry-run attempt:
rsync -rvnc --include-from=/cygdrive/c/Users/SG066221/Desktop/scripts/diff_Lib_WITH_EMPLTY.txt /cygdrive/c/Users/SG066221/Desktop/scripts/FROM_LIST_2_ANOTHER/ 1>C:\Users\SG066221\Desktop\scripts\diff_FINAL.txt
Output is :
sending incremental file list
drwx------ 0 2018/11/12 14:26:18 .
sent 38 bytes received 64 bytes 204.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
The correct syntax for rsync is:
rsync <options> <include> <exclude> src/ dest/
Your problems:
If you only list one directory, nothing will happen.
If you have includes without excludes then it'll include everything.
(You have dry-run set, but you probably knew that.)
Try this:
rsync -rvc --include-from=file.txt --exclude='*' src/ dest/
Make sure that file.txt contains only the filenames within src/ (i.e. with "src/" stripped off). Make sure that any sub-directories you want files copied from are listed too, on their own line (alternatively, add --include='*/' before the exclude).
What it says is, copy from src to dest, including files in file.txt, and excluding everything else.
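As a hypothetical example (the file names here are placeholders, not from the question), if file.txt contains:
docs/
docs/report1.txt
report2.txt
then
rsync -rvc --include-from=file.txt --exclude='*' src/ dest/
copies src/docs/report1.txt and src/report2.txt to dest/ and skips everything else. Adding --include='*/' before the exclude lets you drop the docs/ line, since all directories are then traversed automatically.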

How would I store different types of data in one file

I need to store data in a file in this format
word, audio, jpeg
How would I store all of that in one file? Is it even possible, or would I need to store links to other data files in place of the audio and jpeg? Would I need a custom file format?
1. Your own filetype
As mentioned by @Ken White, you would need to create your own custom file format for this sort of thing, which would then mean writing your own parser. This could be done in almost any language you wanted, but since you are planning on using the Word format, C# might be the best fit for you. However, this technique could be quite complicated, and thoroughly testing your file compressor/decompressor could take a relatively large amount of time, but it may be the best option depending on your needs.
2. Command line utilities
Another way to go about this would be to use a script to combine all of the files into one file, and then split it back apart at the other end. For example, the steps could involve:
Combine the files using the Windows copy / Linux cat command on the command line
Create a metadata file of your own that says how many files are inside this custom file and how many bytes each one takes up (this could be a short XML or JSON file, for example...)
Use the Linux split command, or install a Windows command-line file splitter program (here's just one example), to split the file back into whatever components made it up.
This way you only have to create a really small file format, and let the OS utilities handle the combining and splitting for you.
Example on Windows:
Copy all of the files in your current directory into one output file called 'file.custom'
copy /b * file.custom
Generate your own custom metadata file describing the contents (i.e. get each file's size on disk; see the C# example here). A short JSON file is one way to do it; a minimal sketch follows these steps (copy-paste it into an editor or online JSON viewer if the formatting is hard to read).
Use a Windows / Linux command-line tool to extract each file back out at the exact length (and under the exact name) specified in the JSON metadata file. (More info on splitting files in this post.)
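As a rough illustration (entirely hypothetical names and sizes), the metadata might look like:
{"files": [{"name": "word.docx", "size": 10240}, {"name": "audio.wav", "size": 204800}, {"name": "image.jpg", "size": 51200}]}
and on Linux each component could be carved back out of file.custom with dd, using a running byte offset:
dd if=file.custom of=word.docx bs=1 count=10240
dd if=file.custom of=audio.wav bs=1 skip=10240 count=204800
dd if=file.custom of=image.jpg bs=1 skip=215040 count=51200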
3. ZIP files
You could always store all of the files in a compressed ZIP file, and then just use a ZIP compressor/expander as and when you like to retrieve any number of the file formats stored within.
I found a couple of examples of:
Combining multiple files into one ZIP file using only C# .NET,
Unzipping ZIP files in C#
Zipping & unzipping with only Windows built-in utilities
Zipping & unzipping on the Linux command line
A good zipping/unzipping library in Java
Zipping/unzipping in Python
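For instance, using command-line tools alone (a sketch with placeholder file names):
zip bundle.zip word.docx audio.wav image.jpg
unzip bundle.zip -d extracted/
would bundle the three files into bundle.zip and later expand them into an extracted/ directory.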

How to modify a single file inside a very large zip without re-writing the entire zip?

I have large zip files that contain huge files. There are "metadata" text files within the zip archives that need to be modified. However, it is not possible to extract the entire zip and re-compress it. I need to locate the target text file inside the zip, edit it, and possibly append the change to the zip file. The file name of the text file is always the same, so it can be hard-coded. Is this possible? Is there a better way?
There are two approaches. First, if you're just trying to avoid recompression of the entire zip file, you can use any existing zip utility to update a single file in the archive. This will entail effectively copying the entire archive and creating a new one with the replaced entry, then deleting the old zip file. This will not recompress the data not being replaced, so it should be relatively fast. At least, it should take about the same time as copying the zip archive.
If you want to avoid copying the entire zip file, then you can effectively delete the entry you want to replace by changing the name within the local and central headers in the zip file (keeping the name the same length) to a name that you won't use otherwise and that indicates that the file should be ignored. E.g. replacing the first character of the name with a tilde. Then you can append a new entry with the updated text file. This requires rewriting the central directory at the end of the zip file, which is pretty small.
(A suggestion in another answer to not refer to the unwanted entry in the central directory will not necessarily work, depending on the utility being used to read the zip file. Some utilities will read the local headers for the zip file entry information, and ignore the central directory. Other utilities will do the opposite. So the local and central entry information should be kept in sync.)
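As a concrete illustration of the first approach (a sketch assuming the Info-ZIP zip utility, with metadata.txt standing in for the hard-coded file name), updating the one entry is just:
zip huge-archive.zip metadata.txt
This rewrites the archive with the new entry but, as described above, does not recompress the untouched entries.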
There are "metadata" text files within the zip archives that need to be modified.
However, it is not possible to extract the entire zip and re-compress it.
This is a good lesson why, when dealing with huge datasets, keeping the metadata in the same place with the data is a bad idea.
The .zip file format isn't particularly complicated, and it is definitely possible to replace something inside it. The problem is that the new data might be larger and no longer fit into the location of the old data, which is why there is no standard routine or tool to accomplish this.
If you are skilled enough, you could in theory write your own zip-handling functions to provide the "replace file" routine. If it is only about the (smallish) metadata, you do not even need to compress it. The .zip "central directory" is located at the end of the file, after the compressed data (the format was optimized for appending new files). The general concept is: read the central directory into memory, append the modified file after the compressed data, update the in-memory central directory with the new offset of the modified file, and write the central directory back out after the modified file. (The old file would still be sitting somewhere inside the .zip, but would no longer be referenced by the central directory.) All of these operations happen at the end of the file, without touching the rest of the archive's content.
But practically speaking, I would recommend simply keeping the data and the metadata separate.

Delete all files except

I have a folder with a few files in it; I like to keep my folder clean of any stray files that can end up in it. Such stray files may include automatically generated backup files or log files, but it could be as simple as someone accidentally saving to the wrong folder (my folder).
Rather than having to pick through all this all the time, I would like to know if I can create a batch file that keeps only a number of specified files (by name and location) but deletes anything not on the "list".
[edit] Sorry, when I first saw the question I read bash instead of batch. I'm not deleting this not-so-useful answer since, as was pointed out in the comments, it could still be done with Cygwin.
You can list the files, exclude the ones you want to keep with grep, and then pass the rest to rm.
If all the files are in one directory:
ls | grep -v -f ~/.list_of_files_to_exclude | xargs rm
or in a directory tree
find . | grep -v -f ~/.list_of_files_to_exclude | xargs rm
where ~/.list_of_files_to_exclude is a file with the list of patterns to exclude from deletion, one per line (i.e. the files you want to keep).
Before testing it, make a backup copy and substitute echo for rm to see whether the output is really what you want.
Whitelists for file survival are an incredibly dangerous concept. I would strongly suggest rethinking that.
If you must do it, might I suggest that you actually implement it thus:
Move ALL files to a backup area (one created per run, such as a directory named with the current date and time).
Use your whitelist to copy back the files that you wanted to keep, such as with copy c:\backups\2011_04_07_11_52_04\*.cpp c:\original_dir.
That way, you keep all the non-whitelisted files in case you screw up (and you will at some point, trust me), and you don't have to worry about negative logic in your batch file (remove all files that aren't any of these types), instead using the simpler option (copy back every file that is of each type).
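A rough sketch of that move-then-restore idea as a shell script (e.g. under Cygwin, as mentioned in the other answer; the paths and the keep.list whitelist file are placeholders, not part of the original answer):
backup=/cygdrive/c/backups/$(date +%Y_%m_%d_%H_%M_%S)
mkdir -p "$backup"
mv /cygdrive/c/mydir/* "$backup"/
# restore only the whitelisted files, one name per line in keep.list
while read -r name; do cp "$backup/$name" /cygdrive/c/mydir/; done < keep.list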

Resources