Why are compressed files modified at the end of compression?

Using 7-Zip I compressed ~15 GB of pictures, organized in folders, into 15 volumes of 1024 MB each.
Compression method: LZMA2; Level: Ultra; Dictionary size: 64M;
At the end of compression some of the files had their "last modified" time changed to the time of completion, while others didn't.
Why is this?
And if I have already uploaded most of the files, will I be able to unarchive them successfully?

You would need to ask the author of the program for an explanation of why it modifies volumes at the end of the operation. If I had to make an educated guess, it might be because 7-zip doesn't know which is the last volume until it's finished (because this would depend on the compression ratio of the files being archived, which can't be predicted), and so it needs to go back and update parts of the volume file headers accordingly.
In general, though, quoting the relevant 7-zip help file entry:
NOTE: Please don't use volumes (and don't copy volumes) before finishing archiving. 7-Zip can change any volume (including first volume) at the end of archiving operation.
The only safe assumption is that you can't reliably use any of your individual 1 GB volumes until 7-Zip has finished processing the whole 15 GB archive.
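If you want to check whether the volumes you already uploaded form a usable archive, you can gather all the parts in one directory and test them with the 7z command-line tool. A minimal sketch, assuming 7z is on the PATH and that the archive is named pictures.7z (both are placeholders):

import subprocess

# "7z t" tests the integrity of the whole archive; for a multi-volume set,
# point it at the first part and keep all other parts in the same directory.
result = subprocess.run(["7z", "t", "pictures.7z.001"])
if result.returncode == 0:
    print("Archive tested OK")
else:
    print("Archive is incomplete or damaged")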

Related

Btrieve file only shows partial data

Almost ready to throw up the white flag, but I thought I'd throw it out there. I have an OLD program from 1994 that uses a Btrieve DB and renders basic membership info for a gym. The .btr file that holds the data will open in Notepad, and I can search and find all records, although the formatting is nearly unreadable. When it is opened in the program, there is a huge chunk of records missing. It seems to stop at specific records when scrolling up and down.
I know almost nothing about btrieve as it predates my IT career by many years and I've honestly never seen it. Any suggestions on where I should troubleshoot or tools that may help would be much appreciated.
This sounds like the file may be corrupted, although I would expect errors if it were. One way to rebuild the file is to use BUTIL (and a couple of OS commands).
The steps to rebuild are:
Make a backup of the original file to another directory.
Rename the original file. I like to change the extension to .OLD.
Delete the original file. It will be recreated in the next step.
Issue the BUTIL -CLONE command (BUTIL -CLONE newFile oldFile).
Issue the BUTIL -COPY command (BUTIL -COPY oldFile newFile).
The rebuild is complete.
I've used the commands below in the past (changing 'filename' and the extensions to match my files).
copy filename.btr someother\location\filename.btr
ren filename.btr filename.old
del filename.btr
butil -clone filename.btr filename.old
butil -copy filename.old filename.btr

Date in NLog file name and limit the number of log files

I'd like to achieve the following behaviour with NLog for rolling files:
1. prevent renaming or moving the file when starting a new file, and
2. limit the total number or size of old log files to avoid capacity issues over time
The first requirement can be achieved e.g. by adding a timestamp like ${shortdate} to the file name. Example:
logs\trace2017-10-27.log <-- today's log file to write
logs\trace2017-10-26.log
logs\trace2017-10-25.log
logs\trace2017-10-24.log <-- keep only the last 2 files, so delete this one
According to other posts, however, it is not possible to use a date in the file name together with archive parameters like maxArchiveFiles. If I use maxArchiveFiles, I have to keep the log file name constant:
logs\trace.log <-- today's log file to write
logs\archive\trace2017-10-26.log
logs\archive\trace2017-10-25.log
logs\archive\trace2017-10-24.log <-- keep only the last 2 files, so delete this one
But in this case, on the first write of each day, it moves yesterday's trace to the archive and starts a new file.
The reason I'd like to prevent moving the trace file is that we use a Splunk log monitor that watches the files in the log folder for updates, reads the new lines and feeds them to Splunk.
My concern is that if I have an event written at 23:59:59.567, the next event at 00:00:00.002 clears the previous content before the log monitor is able to read it in that fraction of a second.
To be honest, I haven't tested this scenario, as it would be complicated to set up (my team doesn't own Splunk, etc.) - so please correct me if this cannot happen.
I also know that it is possible to feed Splunk directly in other ways, such as via a network connection, but the current Splunk setup at our company reads from log files, so it would be easier that way.
Any idea how to solve this with NLog?
When using NLog 4.4 (or older), you have to go into Halloween mode and do some trickery.
This example makes hourly log files in the same folder and ensures archive cleanup is performed after 840 hours (35 days):
fileName="${logDirectory}/Log.${date:format=yyyy-MM-dd-HH}.log"
archiveFileName="${logDirectory}/Log.{#}.log"
archiveDateFormat="yyyy-MM-dd-HH"
archiveNumbering="Date"
archiveEvery="Year"
maxArchiveFiles="840"
archiveFileName - Using {#} allows the archive cleanup to generate a proper file wildcard.
archiveDateFormat - Must match the ${date:format=} of the fileName (so remember to update both date formats if a change is needed).
archiveNumbering=Date - Configures the archive cleanup to support parsing of file names as dates.
archiveEvery=Year - Activates the archive cleanup, but also the archive file operation. Because the configured fileName already ensures the archive file operation, we don't want any additional archive operations (e.g. to avoid generating extra empty files at midnight).
maxArchiveFiles - How many archive files to keep around.
With NLog 4.5 (still in beta), it will be a lot easier, as you just have to specify MaxArchiveFiles. See also https://github.com/NLog/NLog/pull/1993

How to modify a single file inside a very large zip without re-writing the entire zip?

I have large zip files that contain huge files. There are "metadata" text files within the zip archives that need to be modified. However, it is not possible to extract the entire zip and re-compress it. I need to locate the target text file inside the zip, edit it, and possibly append the change to the zip file. The file name of the text file is always the same, so it can be hard-coded. Is this possible? Is there a better way?
There are two approaches. First, if you're just trying to avoid recompression of the entire zip file, you can use any existing zip utility to update a single file in the archive. This will entail effectively copying the entire archive and creating a new one with the replaced entry, then deleting the old zip file. This will not recompress the data not being replaced, so it should be relatively fast. At least, about the same time required to copy the zip archive.
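As an illustration of this first approach, here is a minimal sketch that shells out to an external zip utility (Info-ZIP's zip is assumed, and the archive and file names are placeholders); the utility rewrites the archive with the one entry replaced, copying the other entries without recompressing them:

import subprocess

# Replace (or add) metadata.txt inside huge_archive.zip.
# The other entries are copied as-is, so this costs roughly one full copy.
subprocess.run(["zip", "huge_archive.zip", "metadata.txt"], check=True)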
If you want to avoid copying the entire zip file, then you can effectively delete the entry you want to replace by changing the name within the local and central headers in the zip file (keeping the name the same length) to a name that you won't use otherwise and that indicates that the file should be ignored. E.g. replacing the first character of the name with a tilde. Then you can append a new entry with the updated text file. This requires rewriting the central directory at the end of the zip file, which is pretty small.
(A suggestion in another answer to not refer to the unwanted entry in the central directory will not necessarily work, depending on the utility being used to read the zip file. Some utilities will read the local headers for the zip file entry information, and ignore the central directory. Other utilities will do the opposite. So the local and central entry information should be kept in sync.)
There are "metadata" text files within the zip archives that need to be modified.
However, it is not possible to extract the entire zip and re-compress it.
This is a good lesson why, when dealing with huge datasets, keeping the metadata in the same place with the data is a bad idea.
The .zip file format isn't particularly complicated, and it is definitely possible to replace something inside it. The problem is that the size of the new data might increase, and not fit anymore into the location of the old data. Thus there is no standard routine or tool to accomplish that.
If you are skilled enough, you can, theoretically, write your own zip-handling functions to provide a "replace file" routine. If it is only about the (smallish) metadata, you do not even need to compress it. The .zip's "central directory" is located at the end of the file, after the compressed data (the format was optimized for appending new files). The general concept is: read the "central directory" into memory, append the new modified file after the compressed data, update the central directory in memory with the new file offset of the modified file, and write the central directory back after the modified file. (The old file would still be sitting somewhere inside the .zip, but no longer referenced by the "central directory".) All the operations happen at the end of the file, without touching the rest of the archive's content.
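As a rough illustration of this append-and-rewrite-the-directory idea, Python's standard zipfile module in append mode behaves very similarly: it leaves the existing data untouched, appends the new entry, and rewrites the central directory at the end of the file. (It adds a second directory entry with the same name rather than updating the old one in place, so the caveat in the previous answer about stale entries applies; the file names below are placeholders.)

import zipfile

# Append an updated copy of the metadata file to an existing archive.
# Only the tail of the zip (the new entry plus the central directory) is
# rewritten; the old copy of metadata.txt stays inside the archive as dead weight.
with open("metadata.txt", "rb") as f:
    new_metadata = f.read()

with zipfile.ZipFile("huge_archive.zip", mode="a") as zf:
    zf.writestr("metadata.txt", new_metadata)  # emits a "Duplicate name" warning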
But practically speaking, I would recommend simply keeping the data and the metadata separate.

Replay a file-based data stream

I have a live stream of data based on files in different formats. Data comes over the network and is written to files in certain subdirectories in a directory hierarchy. From there it is picked up and processed further. I would like to replay e.g. one day of this data stream for testing and simulation purposes. I could duplicate the data stream for one day to a second machine and "record" it this way, by just letting the files pile up without processing or moving them.
I need something simple like a Perl script which takes a base directory, looks at all the files contained in its subdirectories and their creation times, and then copies each file at the same time of day to a different base directory.
Simple example: I have files a/file.1 2012-03-28 15:00, b/file.2 2012-03-28 09:00, c/file.3 2012-03-28 12:00. If I run the script/program on 2012-03-29 at 08:00 it should sleep until 09:00, copy b/file.2 to ../target_dir/b/file.2, then sleep until 12:00, copy c/file.3 to ../target_dir/c/file.3, then sleep until 15:00 and copy a/file.1 to ../target_dir/a/file.1.
Does a tool like this already exist? It seems I’m missing the right search keywords to find it.
The environment is Linux, command line preferred. For one day it would be thousands of files with a few GB in total. The timing does not have to be ultra-precise. Second resolution would be good, minute resolution would be sufficient.
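A minimal sketch of such a replay script, using Python rather than Perl (the command-line arguments and the use of file modification time, since Linux does not generally expose a true creation time, are assumptions):

#!/usr/bin/env python3
import os
import shutil
import sys
import time
from datetime import datetime

def collect_files(base_dir):
    # Return (seconds since midnight, path relative to base_dir) for every file.
    entries = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
            seconds = mtime.hour * 3600 + mtime.minute * 60 + mtime.second
            entries.append((seconds, os.path.relpath(path, base_dir)))
    return sorted(entries)

def replay(base_dir, target_dir):
    for seconds, rel_path in collect_files(base_dir):
        now = datetime.now()
        now_seconds = now.hour * 3600 + now.minute * 60 + now.second
        wait = seconds - now_seconds
        if wait > 0:
            time.sleep(wait)  # sleep until the original time of day
        dest = os.path.join(target_dir, rel_path)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(os.path.join(base_dir, rel_path), dest)

if __name__ == "__main__":
    replay(sys.argv[1], sys.argv[2])

It would be started shortly after midnight (e.g. replay.py /recorded/base_dir /target_dir) so that no file's original time of day has already passed; files whose time has passed are copied immediately.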

Which file types are worth compressing (zipping) for remote storage? For which of them is the compressed size/original size ratio << 1?

I am storing documents in SQL Server in varbinary(max) fields; I optionally use FILESTREAM when a user has:
(DB_Size + Docs_Size) ~> 0.8 * ExpressEdition_Max_DB_Size
I am currently zipping all the files; this is done because the document read/write code was developed 10 years ago, when storage was more expensive than it is now.
Many files, when zipped, are almost as big as the original (a zipped PDF is about 95% of the original size). And unzipping has some overhead, which doubles when I also need to check in/update the file, because then I have to zip it again.
So I was thinking of giving users the option to choose whether each file type will be zipped or not, while providing some meaningful default values. From my experience I would impose the following rules:
1) zip by default: txt, bmp, rtf
2) do not zip by default: jpg, jpeg, Microsoft Office files, Open Office files, png, tif, tiff
Could you suggest other file types chosen among the most common or comment on the ones I listed here?
.doc and .mdb files actually tend to compress rather well, if I remember correctly. The Office 2007 equivalents (.docx and .accdb), though, are zip files already... so compressing them is pretty much useless.
Don't forget HTML and XML files. Zip by default.
I commend you on being able to recognize what are and aren't compressed file types. You probably already understand this, but I'll rant here:
Do not double up compression methods! Each compression method adds its own header, increasing the file size, and since the data has already had its statistical redundancies eliminated as well as possible by one method, it probably can't be compressed further by another. Take this set of files for example:
46,494,380 level0.wav
43,209,258 level1.wav.zip
43,333,266 level2.wav.zip.rar
43,339,894 level3.wav.zip.rar.gz
43,533,989 level4.wav.zip.rar.gz.bz2
All of these files contain the same data.
The first compression method worked well to eliminate redundancies, but each successive compression method just added to the file size, not to mention the headache of unpacking the file later.
The best method of compression is usually the first one applied.
28,259,406 level1.wav.flac <~ using a compression method meant for the file.
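If you want a data-driven default instead of a fixed extension list, one option is to test-compress a small sample of each file and only store it zipped when the ratio is clearly below 1. A rough sketch (the 1 MB sample size and the 0.9 threshold are arbitrary assumptions):

import zlib

def worth_compressing(path, sample_size=1024 * 1024, threshold=0.9):
    # Compress the first sample_size bytes and compare sizes; only files
    # whose sample shrinks below the threshold ratio are worth zipping.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if not sample:
        return False
    compressed = zlib.compress(sample, 6)
    return len(compressed) / len(sample) < threshold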
