My program builds a file system tree from a zip file (a file system viewer), but the performance is terrible.
To get the real offset of the file data, I read the extra field length from the local file header.
I tried to compute the real offset without reading the local file header, by taking the extra field length from the central directory entry plus the fixed size of the local file header, but the files didn't extract correctly. Apparently the local file header's extra field differs from the central directory's extra field; I also noticed that in the simple case (stored, no encryption) the local file header's extra field length is 0.
So I wrote a POC that creates the file streams from the central directory alone, ignoring the local file headers, and the performance improved a lot.
Is there a way to build the file system from the central directory alone that is more robust (i.e. that also works when the local file header's extra field length is not 0)?
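For reference, a middle-ground sketch (my own illustration, not the asker's POC): take each entry's `header_offset` from the central directory, then read only the 30-byte fixed part of its local file header to learn the true name and extra field lengths. That avoids trusting the central-directory extra field, stays correct when the local extra field length is non-zero, and reads almost nothing per entry. The function name `data_offset` is mine:

```python
import struct
import zipfile

def data_offset(path, name):
    """Return the offset of the compressed data for `name` inside the zip.

    Reads only the 30-byte fixed part of the local file header, since the
    local extra field length can differ from the central directory's.
    """
    with zipfile.ZipFile(path) as zf:
        off = zf.getinfo(name).header_offset  # from the central directory
    with open(path, 'rb') as f:
        f.seek(off)
        hdr = f.read(30)  # fixed-size part of the local file header
        sig, = struct.unpack('<I', hdr[:4])
        assert sig == 0x04034b50, 'not a local file header'
        name_len, extra_len = struct.unpack('<HH', hdr[26:30])
        return off + 30 + name_len + extra_len
```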
We have a database with a data file exceeding 2GB; this resulted in a .^01 file being generated with the same file name. We now have a .DAT file and a .^01 file with the same name.
I have subsequently deleted the unnecessary data (old history, no longer required), and the .DAT file is now only 372MB, but the .^01 file remains.
I would like to clone the .DAT file, save the data, and reload it into the cloned (blank) file. I normally use BUTIL (-CLONE, -SAVE and -LOAD), but am unsure what I need to do with the .^01 file, as BUTIL -SAVE FileName.^01 FileName.seq returns an error because it does not recognise the ^:
BUTIL-14: The file that caused the error is FileName.01.
BUTIL-100: MicroKernel error = 12. The MicroKernel cannot find the specified file.
I would greatly appreciate some direction/input in this regard.
Thank you and kind regards,
You don't need to do anything with the .^XX file(s). They are called Extended files and are automatically handled by the PSQL engine. A BUTIL -CLONE / -COPY will read all of the data (original file and extended file(s)) and copy it to the new file.
To rebuild it, you should do something like:
BUTIL -CLONE <NEWFILE.DAT> <OLDFILE.DAT>
BUTIL -COPY <OLDFILE.DAT> <NEWFILE.DAT>
Also, if the file grows above 2GB again, the Extended File (.^01) will come back.
For example, suppose we have a file file.txt that after compression is now file.new (.new being the new extension). How can we recover the .txt extension that was lost?
I need it to decompress the file.
In general, if you lose the file name extension you can't get it back. It's as simple as that.
However, there might be a chance depending on the compression format. Some formats do store the original file name (along with other information) in the compressed file, and the decompressor will be able to recreate those properties.
Anyway, it's good practice to name a compressed file with an additional extension, in your case file.txt.new.
Oh, and you don't need to know the file name extension to uncompress the file. Just uncompress it and give it a temporary name. As #MarcoBonelli said, file contents and file name extensions have no fixed relation; extensions are just a convention for handling files conveniently.
For example: you can rename an EXE to DOCX. Windows will show the Word icon, but it is still an executable. Windows will not attempt to run it, though.
Determining what a file contains can be difficult. The magic numbers Marco linked to might give you a hint.
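As a small illustration of the magic-number idea (the mapping below covers only a few well-known formats and is mine, not exhaustive):

```python
# A few well-known magic numbers mapped to likely extensions (illustrative only).
MAGIC = {
    b'PK\x03\x04': '.zip',
    b'\x89PNG':    '.png',
    b'\x1f\x8b':   '.gz',
    b'%PDF':       '.pdf',
}

def guess_extension(path):
    """Guess a file's extension from its leading bytes; None if unknown."""
    with open(path, 'rb') as f:
        head = f.read(8)
    for magic, ext in MAGIC.items():
        if head.startswith(magic):
            return ext
    return None
```

Tools like the Unix `file` command do the same thing with a much larger magic database.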
As per the Hadoop source code, the following descriptions are pulled from the classes:
appendToFile
"Appends the contents of all the given local files to the
given dst file. The dst file will be created if it does not exist."
put
"Copy files from the local file system into fs. Copying fails if the file already exists, unless the -f flag is given.
Flags:
-p : Preserves access and modification times, ownership and the mode.
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces
replication factor of 1. This flag will result in reduced
durability. Use with care.
-d : Skip creation of temporary file(<dst>._COPYING_)."
I am trying to regularly update a file in HDFS as it is being updated dynamically by a streaming source on my local file system.
Which one should I use, appendToFile or put, and why?
appendToFile modifies the existing file in HDFS, so only the new data needs to be streamed/written to the filesystem.
put rewrites the entire file, so the entire new version of the file needs to be streamed/written to the filesystem.
You should favor appendToFile if you are only appending to the file (e.g. adding log lines at the end); it will be faster for that use case. If the file changes in ways other than simple appends, use put (slower, but you won't lose data or corrupt the file).
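To make the cost difference concrete, here is a local file system analogy (plain Python, not HDFS; the function names are mine). `append_update` mirrors `hdfs dfs -appendToFile` and only writes the new bytes; `put_update` mirrors `hdfs dfs -put -f` and rewrites the whole file:

```python
def append_update(path, new_data):
    """Analogue of appendToFile: only the new data is written."""
    with open(path, 'ab') as f:
        f.write(new_data)
    return len(new_data)      # bytes transferred this call

def put_update(path, full_data):
    """Analogue of put -f: the entire new version is written."""
    with open(path, 'wb') as f:
        f.write(full_data)
    return len(full_data)     # bytes transferred this call
```

With a 1 MB file growing by 1 KB per update, the append path moves roughly a thousandth of the data per update; that ratio is why appendToFile wins for append-only streams.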
I have large zip files that contain huge files. There are "metadata" text files within the zip archives that need to be modified. However, it is not possible to extract the entire zip and re-compress it. I need to locate the target text file inside the zip, edit it, and possibly append the change to the zip file. The file name of the text file is always the same, so it can be hard-coded. Is this possible? Is there a better way?
There are two approaches. First, if you're just trying to avoid recompression of the entire zip file, you can use any existing zip utility to update a single file in the archive. This will entail effectively copying the entire archive and creating a new one with the replaced entry, then deleting the old zip file. This will not recompress the data not being replaced, so it should be relatively fast. At least, about the same time required to copy the zip archive.
If you want to avoid copying the entire zip file, then you can effectively delete the entry you want to replace by changing the name within the local and central headers in the zip file (keeping the name the same length) to a name that you won't use otherwise and that indicates that the file should be ignored. E.g. replacing the first character of the name with a tilde. Then you can append a new entry with the updated text file. This requires rewriting the central directory at the end of the zip file, which is pretty small.
(A suggestion in another answer to not refer to the unwanted entry in the central directory will not necessarily work, depending on the utility being used to read the zip file. Some utilities will read the local headers for the zip file entry information, and ignore the central directory. Other utilities will do the opposite. So the local and central entry information should be kept in sync.)
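A hedged sketch of the header-renaming trick described above, assuming the standard zip header layouts and using Python's zipfile only to locate the entry (the function name and the tilde convention are from this answer; reading the whole archive into memory is a simplification for small-to-medium files):

```python
import struct
import zipfile

def hide_entry(path, name):
    """Rename `name` in BOTH the local and central headers to start with '~',
    keeping the same length, so the entry can be treated as a throwaway."""
    with zipfile.ZipFile(path) as zf:
        local_off = zf.getinfo(name).header_offset
    data = bytearray(open(path, 'rb').read())   # simplification for a sketch
    old = name.encode()
    new = b'~' + old[1:]
    # Local file header: filename length at +26, filename starts at +30.
    nlen, = struct.unpack('<H', data[local_off + 26:local_off + 28])
    assert data[local_off + 30:local_off + 30 + nlen] == old
    data[local_off + 30:local_off + 30 + nlen] = new
    # Central directory entry: signature PK\x01\x02, filename length at +28,
    # local-header offset at +42, filename starts at +46.
    i = data.find(b'PK\x01\x02')
    while i >= 0:
        cn, = struct.unpack('<H', data[i + 28:i + 30])
        off, = struct.unpack('<I', data[i + 42:i + 46])
        if off == local_off and data[i + 46:i + 46 + cn] == old:
            data[i + 46:i + 46 + cn] = new
            break
        i = data.find(b'PK\x01\x02', i + 4)
    with open(path, 'wb') as f:
        f.write(data)
```

After this, a replacement entry with the original name can be appended, and both local and central information stay in sync.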
There are "metadata" text files within the zip archives that need to be modified.
However, it is not possible to extract the entire zip and re-compress it.
This is a good lesson why, when dealing with huge datasets, keeping the metadata in the same place with the data is a bad idea.
The .zip file format isn't particularly complicated, and it is definitely possible to replace something inside it. The problem is that the size of the new data might increase, and not fit anymore into the location of the old data. Thus there is no standard routine or tool to accomplish that.
If you are skilled enough, you can in theory write your own zip handling functions to provide a "file replace" routine. If it is only the (smallish) metadata, you do not even need to compress it. The .zip "central directory" is located at the end of the file, after the compressed data (the format was optimized for appending new files). The general concept is: read the central directory into memory, append the modified file after the compressed data, update the central directory in memory with the new offset of the modified file, and write the central directory back after the modified file. (The old file would still be sitting somewhere inside the .zip, but no longer referenced by the central directory.) All the operations happen at the end of the file, without touching the rest of the archive's content.
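Python's zipfile actually implements exactly this concept in append mode: it writes the new entry after the existing data and rewrites only the central directory at the end. A minimal sketch (the archive and member names are placeholders; note the archive ends up with two entries of the same name, and readers that resolve names through the central directory see the newer one):

```python
import zipfile

def replace_metadata(archive_path, member, new_text):
    """Append a fresh copy of `member`; 'a' mode leaves the existing
    compressed data untouched and rewrites only the central directory."""
    with zipfile.ZipFile(archive_path, 'a') as zf:
        zf.writestr(member, new_text)
```

The stale copy still occupies space inside the archive, as described above, so this trades disk space for speed.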
But practically speaking, I would recommend simply keeping the data and the metadata separate.
A friend of mine was hit with the Krypto virus. Thankfully they had Carbonite installed, so they went ahead and restored only the affected files (.xls, .doc, .jpg, scans, etc.; there were a bunch). Unfortunately they did not restore everything, which would have simplified this considerably.
A restore from Carbonite has been done to the directories. Now we have files that are the right ones (they all have the string " (Restored)" starting at the 28th character from the end), plus other valid files that Krypto did NOT encrypt.
Unfortunately, whoever did the Carbonite restore restored only selected file types (.xls, .doc, .ppt, etc.) and left the others as-is, so I can't just delete all the files. When I truncate a restored file's name, the other file already exists (those are the ones frozen by the Krypto virus), so I just want them gone. But I have to leave the files that were not affected in place.
So let's assume I have the following files in a directory:
afile1 (Restored) 11-23-2010 14.07.DOC   (this filename would be truncated at the " (",
                                          but there is already a file called afile1.doc
                                          in the directory)
afile1.doc                               (this file needs to be DELETED BEFORE the
                                          truncation of the name above, so there is no
                                          duplicate-name conflict)
break.txt                                (no matching file with (Restored) in the name;
                                          leave alone)
cat.zip                                  (no matching file with (Restored) in the name;
                                          leave alone)
fred (Restored) 01-14-14 13.28.JPG       (this filename would be truncated at the " (",
                                          but first the file below has to be deleted)
fred.jpg                                 (this file needs to be deleted before the
                                          truncation of the name above, so it won't
                                          create a duplicate file)
So far, I am trying to figure out how to do the following (or anything else I might have missed!):
1. Test for the existence of a matching file, i.e. where the name up to the " (Restored)" already exists as a file in the directory (as shown in the example above: afile1.doc and fred.jpg).
2. Delete that already-existing file (or at least prepend "deleteme" to it, so we can delete it manually later).
3. Truncate the last 28 characters of the filename containing "(Restored)", so the file is properly named and in the right place, preventing a duplicate-file situation or, worse, appending characters to the end of a valid file.
I have to do this for all files under a directory tree, too (there are countless directories where files are stored).
I need to be left with the following files:
afile1.doc (this was the file that had (Restored) in the name and was truncated)
break.txt
cat.zip
fred.jpg (this was another file that had (Restored) in the name and was truncated)
I hope that makes sense.
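One way to sketch this (a Python illustration, not a tested recovery tool; the " (Restored) <date> <time>.<ext>" pattern is assumed from the examples above, and the regex is more robust than counting 28 characters if the suffix length ever varies — test it on a copy first):

```python
import os
import re

# Assumed restored-name shape: "stem (Restored) <date> <time>.<ext>"
RESTORED = re.compile(r'^(.*) \(Restored\) .*\.([^.]+)$')

def clean_tree(root):
    """Delete each Krypto-frozen original, then rename its restored copy."""
    for dirpath, _, filenames in os.walk(root):
        by_lower = {f.lower(): f for f in filenames}  # case-insensitive lookup
        for fn in filenames:
            m = RESTORED.match(fn)
            if not m:
                continue                       # unaffected file: leave alone
            target = f'{m.group(1)}.{m.group(2).lower()}'
            clash = by_lower.get(target.lower())
            if clash:                          # the encrypted original
                os.remove(os.path.join(dirpath, clash))
            os.rename(os.path.join(dirpath, fn), os.path.join(dirpath, target))
```

Swapping `os.remove` for a rename to `'deleteme_' + clash` would give the manual-review variant described in step 2.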