Hexadecimal virus signatures database - database

Over the past couple of weeks, I was in the process of developing a simple virus scanner. It works great but my question is does anybody know where I can get a database (a single file) that contains 8000 or more virus signatures WITH their names, and possibly risk meter (high, low, unknown)?

Try the ClamAV database. This also includes some more complex signatures, but some are just byte sequences.
The CVD file format is a compressed tar file with a header block attached; see here for header information, or this PDF for the real details.
As I understand it, you should be able to decompress it with
dd if=file.cvd bs=512 skip=1 | tar zxvf -
This will unpack to a collection of various files; for files that have simple hex signatures, these will be found in a file with the extension .db. Not all of these signatures are pure hex -- many of them contain wildcards such as ?? for "allow any byte here", * for "allow any number of intervening bytes here", (-4096) for "allow up to 4k of intervening bytes here", and so forth.

Related

(Text-) File that supports encoding

The project I'm working on takes xml files and input streams and converts them to pdf's and text. In the unit tests I compare this generated text with a .txt file that has the expected output.
I'm now facing the issue of these .txt files not being encoded in UTF-8 and been written without persisting this information (namely umlauts).
I have read few articles on the topic of persisting and encoding .txt files. Including correcting the encoding, saving and opening files in Visual Studio with encoding, and some more.
I was wondering if there is a text file format that supports meta information about encoding like xml or html for example does.
I'm looking for a solution that is:
Easy adaptable to any coworker on the same team
It being persitant and not depending on me choosing an encoding in an editor
Does not require any additional exotic program
Can be read without or only little modification of the File class and it's input reading of C#
Does at least support UTF-8 encoding
A Unicode Byte Order Mark (BOM) is sometimes used for this purpose. Systems that process Unicode are required to strip off this metadata when passing on the text. File.ReadAllText etc do this. A BOM should exist only at the beginning of files and streams.
A BOM is sometimes conflated with encoding because both affect the file format and BOM applies only to Unicode encodings. In Visual Studio, with UTF-8, it's called "Unicode (UTF-8 with signature) - Codepage 65001".
Some C# code that demonstrates these concepts:
var path = Path.GetTempFileName() + ".txt";
File.WriteAllText(path, "Test", new UTF8Encoding(true, true));
Debug.Assert(File.ReadAllBytes(path).Length == 7);
Debug.Assert(File.ReadAllText(path).Length == 4); // slightly mushy encoding detection
However, this doesn't get anyone past the agreement required when using text files. The fundamental rule is that a text file must be read with the same encoding it was written with. A BOM is not a communication that suffices as a complete agreement for text files in general.
Test editors almost universally adopt the principle that they should guess a file's character encoding first, and—for the most part—allow users to correct them later. Some IDEs with project systems allow recording which encoding a file actually uses.
A reasonable text editor would preserve both the encoding and the presence of a Unicode BOM for existing files.
It seems that you're after a universal strategy. Unfortunately, the history of the concept of a text file doesn't allow one.

Transferring external images & files to Google Cloud Storage

If I use the Google Cloud Storage File Transfer console
https://console.cloud.google.com/storage/transfer?project=XXXX
How do I generate an MD5 string for my image? Say my image is located at https://www.planwallpaper.com/static/images/desktop-year-of-the-tiger-images-wallpaper.jpg for example.
I can easily get the bytes value, but how would I generate the MD5 for this?
The docs were a bit vague. Any ideas?
An MD5 hash is used to ensure the data transferred into GCS is imported correctly. HTTPS data transfers include a variety of built-in checksums, but for very large imports of many, many files, errors can and do show up, and so GCS wants to be sure that each object that it downloads is exactly what you think it is.
An MD5 is a 128 bit number that is the result of running the MD5 algorithm on an object. This number can be represented in a variety of ways (the popular md5sum command uses hexadecimal strings). GCS asks that you represent this number as a base64 encoding. Here's a command that can generate an MD5 sum in the right format:
openssl md5 -binary NameOfSourceFile | openssl enc -base64
There's a standard GCS object that can be used to validate your MD5 logic. The object https://storage.googleapis.com/md5-test/md5-test has a base64'd MD5 string of BfnRTwvHpofMOn2Pq7EVyQ==.

Is there a tool to search differences in hex files?

I have two hex files, and want to search for values that have changed from 9C in one file to 9D in the next one.
Even a tool that synchronizes the windows would work, as I can't seem to find any that enable me to do this. All I've found are tools that let me search for the same value and same address between two files.
Thanks!
"There's no such thing as a hex file."
Oh well perhaps I should delete all my hex files.
I use Motorola and Intel formats of hex files (they are just text files representing bytes, but include a record structure for addresses, checksum etc per line. These files need to be treated as binary (in .gitattributes) since they cannot be merged. I use Kdiff3 to see differences (or any other text diff tool).

Does file type rely on file extension?

As a general question: What's the role of file extension when determining file types?
For example, I can change .jpeg file to .png extension and even .txt. Of course, in the case of changing to .txt, it will neither be opened as picture, nor readable.
To determine file type, it seems the safe way is to parse the first few bytes of the file. If extension is not trustable, extension is no more than file name.
As a general rule, you should ALWAYS parse the COMPLETE file in order to be sure that the file is what the extension says. As you can easily imagine, it is pretty simple to create a binary file resembling a e.g. BMP (with a correct header) but then containing something different.
You should never trust the extension neither the header because otherwise a malicious user could exploit some of your code to generate e.g. a buffer overflow, and this is absolutely paramount if you are writing programs that must run at root/admin privilege.
Having said the obvious, the file extension nowadays is mainly used so that the OS can associate a program to that particular file (usually calling the program and passing the selected file as first parameter), and then it's up to the program to determine the file content.
It is a little bit different when talking about executable files. Under Unix, in order to be executable a file has to have the "x" flag set, otherwise it would not run, regardless of the extension. Under Windows, there is not such thing and the OS relies on only a few extensions (EXE, COM, BAT, etc.) to determine which files can be executed.
The EXE file, for example, has to start with "MZ" followed by some information for its allocation and size (http://www.delorie.com/djgpp/doc/exe/) and the OS surely checks its internal headers. Other formats (e.g. the COM executable format of the MS-DOS era) is just "pure" assembly code, so there is no check done by the OS. It just interprets those opcodes, hoping that everything will be fine.
So, to summarize:
File extension is mainly used so that the OS can call the appropriate program to open it (and passing the filename as the first parameter, argc/argv in C language for example)
Windows relies on some file extension to know if a file is executable, while Unix/Mac relies on a particular flag (x) associated with the file
Two things that are not well known about file extensions: directory names can have extension too, and extension can be way longer than the usual 3 characters.
With the help of file extension, you know how to read the first few and all the rest of the bytes. You also know what program to use to read the file. Or if it is an executable, you know that it is to be executed and not shown as a picture.
Yes you can change the file extension, but what does it mean then? It only means that OS (or any program that tried to read the file) is working correctly. Only you are providing bad data to it.
File extension is not something that some bytes of data inherently have. Extensions are given to those bytes depending upon the protocol followed to write them that way. After you have encoded the letters in binary form, you provide that binary form with .txt extension so that the text reader knows that these bytes convert to letters. That's the role of file extension. With bad file extension, this role is not fulfilled, resulting in incomprehension of the data you saved in binary.
As a general question: What's the role of file extension when determining file types?
The file extension usually identifies the application that opens a file.
If you rename a .JPG to a .PNG and while having JPG and PNG opened by the same application (usually an image viewer) that application can read the image stream and process it correctly regardless of having an incorrect file stream.
The problem arises if you rename the file in such a way that the file gets routed to an application that cannot handle the file's content.
If you rename a .DOCX (word) file to an Autocad extension (.DWG), opening the word file in autocad is likely to produce errors (unless per chance autocad can read word files).

Header and structure of a tar format

I have a project for school which implies making a c program that works like tar in unix system. I have some questions that I would like someone to explain to me:
The dimension of the archive. I understood (from browsing the internet) that an archive has a define number of blocks 512 bytes each. So the header has 512 bytes, then it's followed by the content of the file (if it's only one file to archive) organized in blocks of 512 bytes then 2 more blocks of 512 bytes.
For example: Let's say that I have a txt file of 0 bytes to archive. This should mean a number of 512*3 bytes to use. Why when I'm doing with the tar function in unix and click properties it has 10.240 bytes? I think it adds some 0 (NULL) bytes, but I don't know where and why and how many...
The header chcksum. As I know this should be the size of the archive. When I check it with hexdump -C it appears like a number near the real size (when clicking properties) of the archive. For example 11200 or 11205 or something similar if I archive a 0 byte txt file. Is this size in octal or decimal? My bets are that is in octal because all information you put in the header it needs to be in octal. My second question at this point is what is added more from the original size of 10240 bytes?
Header Mode. Let's say that I have a file with 664, the format file will be 0, then I should put in header 0664. Why, on a authentic archive is printed 3 more 0 at the start (000064) ?
There have been various versions of the tar format, and not all of the extensions to previous formats were always compatible with each other. So there's always a bit of guessing involved. For example, in very old unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended but there wasn't space, so the file name got split in 2 parts; even later, gnu tar introduced the ##LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) Tar was originally a *T*ape *Ar*chiver. To achieve constant througput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. 20 Blocks of 512 bytes were the default, and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware and using wrong blocking factors made the resulting tape unusable. This is why tar appends \0-filled blocks until the tar size is a multiple of the block size.
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but then stored in the header as well. So the act of storing the checksum would change the header, thus invalidate the checksum. That's why you store all other header fields first, set the checksum to spaces, then calculate the checksum, then replace the spaces with your calculated value.
Note that the header of a tarred file is pure ascii. This way, In those old days, when a tar file (whose components were plain ascii) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
3) Tar files can store block devices, character devices, directories and such stuff. Unix stores these file modes in the same place as the permission flags, and the header file mode contains the whole file mode, including file type bits. That's why the number is longer than the pure permission.
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.

Resources