How to specify the workload to run on a SPEC CPU 2006 benchmark?

I'm currently trying to benchmark a new FDO toolchain with SPEC CPU 2006.
Some benchmarks are run with several workloads. For instance 473.astar is run with rivers and BigLakes.
When I use FDO on this benchmark, it is compiled, then trained with BigLakes, then trained with rivers. The problem is that the tool does not aggregate the two training profiles, so the first training run is effectively wasted.
I would like to be able to do something like running astar.rivers and astar.BigLakes2048 as two separate benchmarks. For instance, astar.rivers would do:
compile
train astar with rivers
compile
benchmark astar with rivers
Is something like that possible with runspec, or do I have to write a script that does it all myself without using the SPEC tools?
I have read several papers where the results were presented this way, so I suppose it is possible.
Thanks a lot

I've found a solution to the problem.
The way to do it is to copy the benchmark directory, rename it, and then remove the input files that are not wanted. It is important to remove both the ref and train input files. Moreover, it is necessary to edit the Spec/object.pm file to set the new name. It is also necessary to add the new benchmark to either the CFP2006.bset or the CINT2006.bset file; otherwise runspec will not display any results, even in the raw file.
It is not pretty, but it works well.

How to add (and use) binary data to compiled executable?

There are several questions dealing with some aspects of this problem, but none seems to answer it wholly. The whole problem can be summarized as follows:
You have an already compiled executable (obviously expecting the use of this technique).
You want to add arbitrarily sized binary data to it (not necessarily added by the executable itself, which would be another nasty problem to deal with).
You want the already compiled executable to be able to access this added binary data.
My particular use case would be an interpreter, where I would like to let the user produce a single-file executable out of an interpreter binary and the code he supplies (the interpreter binary being the executable that would have to be patched with the user-supplied code as binary data).
A similar case is self-extracting archives, where a program (the archiving utility, such as zip) is capable of constructing such an executable containing a pre-built decompressor (the already compiled executable) and user-supplied data (the contents of the archive). Obviously no compiler or linker is involved in this process (thanks, Mathias, for the note and for pointing out 7-zip).
Using existing questions, a particular path toward a solution shows up in the following examples:
appending data to an exe - This deals with the aspect of adding arbitrary data to arbitrary exes, without covering how to actually access it (a simple append usually works, which is also true of Unix's ELF format).
Finding current executable's path without /proc/self/exe - In companion with the above, this would allow getting a file name to use for opening the exe and accessing the added data. There are many more questions of this kind, but none focuses specifically on getting a path suitable for actually opening the binary as a file (a goal which alone might (?) be easier to accomplish - strictly you don't even need the path, just the binary opened for reading).
There may also be other, probably more elegant, ways around this problem than appending data to the binary and opening the file to read it back in. For example, could the executable be built so that it becomes trivial to patch it later with the arbitrarily sized data, so that the data appears "within" it in some proper data segment? (I couldn't really find anything on this; for fixed-size data it should be trivial, unless the executable carries some integrity hash.)
Can this be done reasonably well with as little deviation from standard C as possible? Even more or less cross-platform (at least from a maintenance standpoint)? Note that it would be preferred if the program performing the addition of the binary data didn't rely on compiler tools to do it (which the user might not have), but solutions requiring those might also be useful.
Note the already-compiled-executable criterion (the first point in the list above), which requires a completely different approach than the solutions described in questions like C/C++ with GCC: Statically add resource files to executable/library or SDL embed image inside program executable, which ask for embedding data at compile time.
Additional notes:
The problems with the obvious approach outlined above and suggested in some comments (just append to the binary and use that) are as follows:
Opening the currently running program's binary doesn't seem trivial (opening the executable for reading is, but finding the path to supply to the file open call is not, at least not in a reasonably cross-platform manner).
The method of acquiring the path may provide an attack surface that probably wouldn't exist otherwise. This means that a potential attacker could trick the program into seeing different binary data (supplied by the attacker) than what the executable actually carries, exposing any vulnerability that might reside in the parser of the data.
It depends on how you want other systems to see your binary.
Digital signing in Windows
The exe format allows verifying that the file has not been modified since publishing. This would allow you to:
Compile your file
Add your data packet
Sign your file and publish it.
The advantage of following this system is that "everybody" agrees your file has not been modified since signing.
The easiest way to achieve this scheme is to use a resource. Windows resources can be added post-linking. They are protected by the Authenticode digital signature, and your program can extract the resource data from itself.
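A rough sketch of that resource route, assuming the blob is stored as an RCDATA resource with ID 1 (an arbitrary choice here): the packaging step attaches it with the UpdateResource family of calls, and the program reads it back through FindResource. Signing would still have to happen after the update.

    #include <windows.h>

    /* Attach a data blob to an already linked EXE as an RCDATA resource
       (run by the packaging tool, before signing). */
    static BOOL attach_blob(const char *exe_path, const void *data, DWORD size)
    {
        HANDLE h = BeginUpdateResourceA(exe_path, FALSE);   /* keep existing resources */
        if (!h) return FALSE;
        if (!UpdateResourceA(h, (LPCSTR)RT_RCDATA, MAKEINTRESOURCEA(1),
                             MAKELANGID(LANG_NEUTRAL, SUBLANG_NEUTRAL),
                             (LPVOID)data, size))
        {
            EndUpdateResourceA(h, TRUE);                    /* discard changes */
            return FALSE;
        }
        return EndUpdateResourceA(h, FALSE);                /* commit changes */
    }

    /* Read the blob back from inside the running executable. */
    static const void *read_blob(DWORD *size_out)
    {
        HMODULE self = GetModuleHandleA(NULL);              /* the running EXE */
        HRSRC   res  = FindResourceA(self, MAKEINTRESOURCEA(1), (LPCSTR)RT_RCDATA);
        if (!res) return NULL;
        HGLOBAL mem  = LoadResource(self, res);
        if (!mem) return NULL;
        *size_out = SizeofResource(self, res);
        return LockResource(mem);                           /* valid for the process lifetime */
    }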
It used to be possible to enlarge the signature itself to include binary data. Unfortunately this has been disallowed: there were binaries that stored data in the signature section, and this was abused maliciously. Some details are in an MSDN blog post.
Breaking the signature
If re-signing is not an option, then the result would have to be treated as insecure. It is worth noting here that appended data is insecure and can be modified without people being able to tell, but so is the code in your binary.
Appending data to a binary does break the digital signature, and also means the end-user can't tell if the code has been modified.
This means that any self-protection you add to your code to ensure the data blob is still secure, would not prevent your code from being modified to remove the check.
Running module
On Windows, GetModuleFileName allows the path of the running executable to be found.
Linux offers /proc/self/exe (or /proc/<pid>/exe).
Other Unix systems do not seem to have a reliable method.
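For the Linux case, a minimal sketch (the Windows branch would call GetModuleFileName instead) is simply to read the /proc/self/exe symlink; the buffer size and error handling below are illustrative only.

    #include <stdio.h>
    #include <limits.h>
    #include <unistd.h>

    /* Return the running executable's path via /proc/self/exe (Linux only). */
    static int self_exe_path(char *buf, size_t bufsize)
    {
        ssize_t n = readlink("/proc/self/exe", buf, bufsize - 1);
        if (n < 0)
            return -1;          /* /proc not mounted, or not Linux */
        buf[n] = '\0';          /* readlink does not NUL-terminate */
        return 0;
    }

    int main(void)
    {
        char path[PATH_MAX];
        if (self_exe_path(path, sizeof path) == 0)
            printf("running from: %s\n", path);
        return 0;
    }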
Data reading
The approach of the zip format is to write a directory at the end of the file. This means the directory can be found by starting at the end of the file and scanning backwards for its start. The advantage here is that the data blob is signposted from the end of the file, rather than from its natural start.
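A minimal sketch of that end-of-file idea, using a home-grown footer (an 8-byte magic string plus a 64-bit payload length) rather than the actual zip directory format: the packer appends payload + footer to a copy of the executable, and the program later seeks backwards from the end of its own file (found via the methods above) to recover the payload.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAGIC "BLOB0001"            /* 8 bytes, an arbitrary marker of our own */

    /* Packer side: append payload plus footer to an existing executable. */
    static int append_payload(const char *exe_path, const void *data, uint64_t size)
    {
        FILE *f = fopen(exe_path, "ab");
        if (!f) return -1;
        int ok = fwrite(data, 1, size, f) == size
              && fwrite(&size, sizeof size, 1, f) == 1   /* assumes packer and reader share endianness */
              && fwrite(MAGIC, 1, 8, f) == 8;
        fclose(f);
        return ok ? 0 : -1;
    }

    /* Runtime side: exe_path would come from GetModuleFileName / /proc/self/exe. */
    static void *load_payload(const char *exe_path, uint64_t *size_out)
    {
        FILE *f = fopen(exe_path, "rb");
        if (!f) return NULL;

        char magic[8];
        uint64_t size;
        void *buf = NULL;

        /* Footer occupies the last 16 bytes: 8-byte length, then 8-byte magic. */
        if (fseek(f, -16L, SEEK_END) == 0
            && fread(&size, sizeof size, 1, f) == 1
            && fread(magic, 1, 8, f) == 8
            && memcmp(magic, MAGIC, 8) == 0
            && fseek(f, -(long)(16 + size), SEEK_END) == 0)
        {
            buf = malloc(size);
            if (buf && fread(buf, 1, size, f) != size) {
                free(buf);
                buf = NULL;
            }
        }
        fclose(f);
        if (buf) *size_out = size;
        return buf;
    }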

How to get in Linux application statistics about Ethernet?

Is it possible to retrieve various statistics about errors (such as rx_crc_errors) in a C application, similar to those given by ethtool? I've searched for some time for how to do this, but did not find whether it can be done from a C application.
You should probably use /proc/, so read proc(5); perhaps read (sequentially) some file under /proc/sys/net/ (or, as commented by Nominal Animal, /proc/net/dev).
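A small sketch of the /proc/net/dev route suggested above; the column parsing is simplified, and for driver-level counters such as rx_crc_errors you would more likely need the per-interface files under /sys/class/net/<iface>/statistics/ or the SIOCETHTOOL ioctl that ethtool itself uses (neither shown here).

    #include <stdio.h>

    /* Print per-interface receive byte and error counters from /proc/net/dev. */
    int main(void)
    {
        FILE *f = fopen("/proc/net/dev", "r");
        if (!f) {
            perror("/proc/net/dev");
            return 1;
        }

        char line[512];
        /* Skip the two header lines. */
        fgets(line, sizeof line, f);
        fgets(line, sizeof line, f);

        while (fgets(line, sizeof line, f)) {
            char ifname[64];
            unsigned long long rx_bytes, rx_packets, rx_errs;
            /* Layout: "iface: rx_bytes rx_packets rx_errs ..." */
            if (sscanf(line, " %63[^:]: %llu %llu %llu",
                       ifname, &rx_bytes, &rx_packets, &rx_errs) == 4)
                printf("%-8s rx_bytes=%llu rx_errors=%llu\n",
                       ifname, rx_bytes, rx_errs);
        }
        fclose(f);
        return 0;
    }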

Is there any way to open a file using a diff without patching the original?

For example, I have a 40 MB file, and I want to make some minor changes to it, maybe 20 KB of changes.
I can create a diff between the resulting file and the original, simply enough, either by writing it manually with the application that is making the change, or by taking both the original file and the resulting file and generating the diff from that (using Rabin's polynomial fingerprinting algorithm for example)...
The issue is that, in order to read the effective outcome of that diff (the new file), I have to apply the diff to the original and create the resulting new file and read that... which leaves me with two 40 MB files with only 20 KB of difference between them. It seems logical that one could use the initial file combined with the diff and parse (for reading anyway) the resulting final file without having to create a whole new copy of it.
I have looked through xdiff and it has the functions to create a diff given 2 files, or to apply a diff as a patch to a file, but none to get a simple file handle when provided with the original file and a diff file.
Does such a thing exist? It would be tremendously helpful for storage space savings on larger files, even if only for read-only (write operations could write to a new diff, possibly).
Examples in any language would be fine, although c, python or php would be great if readily available.
Using TortoiseMerge to View Diffs:
You could use TortoiseMerge to view the diff without patching the original.
The TortoiseMerge guide gives an overview of what that looks like and where to download it. If that doesn't suit you, there are plenty of alternative diff tools to choose from.
Further Consideration:
Depending on how often you are making changes and your interest in file size savings you may want to consider using a version control system (perhaps you do already). Common options include SVN, Git, and Mercurial.
What you are describing is source control with delta storage: you store many versions of a file, only the deltas are saved, and you can then request entire files, which are recomposed on the fly; you can choose either to access them directly (for example with the appropriate library) or to save them locally before access.
Look at Subversion, Git, Mercurial and so on, and at how they implement their delta storage, and you'll have working examples. Git has an internal maintenance task that does exactly this, using delta storage when it considers it profitable. Git is programmed in C.
That will clearly give you a sample of how to access this kind of file sequentially. Once you have that, composing patches is relatively simple, and if the patch command list can be accessed efficiently you can build a random-access solution as well (as long as the literal parts of the patch and the original are accessible).
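To illustrate, here is a minimal sketch of such a random-access reader over a hypothetical, simplified delta format made of COPY (reuse a range of the original) and INSERT (literal bytes carried in the delta) commands, which is roughly how VCDIFF/xdelta-style deltas are organized; the struct names and the linear scan are placeholder choices, not any library's actual API.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* One delta command: either copy a range from the original file,
       or insert literal bytes carried by the delta itself. */
    typedef struct {
        int       is_copy;                 /* 1 = COPY from original, 0 = INSERT literal */
        uint64_t  src_offset;              /* offset in the original (COPY only) */
        const unsigned char *literal;      /* literal bytes (INSERT only) */
        uint64_t  length;                  /* output length of this command */
    } delta_cmd;

    typedef struct {
        FILE            *original;         /* the untouched 40 MB file */
        const delta_cmd *cmds;             /* commands in output order */
        size_t           ncmds;
    } patched_view;

    /* Read `len` bytes of the *patched* file starting at `offset`,
       without ever materializing the patched file on disk. */
    static size_t patched_read(patched_view *v, uint64_t offset,
                               unsigned char *out, size_t len)
    {
        size_t   done = 0;
        uint64_t pos  = 0;                 /* output offset where the current command starts */

        for (size_t i = 0; i < v->ncmds && done < len; i++) {
            const delta_cmd *c = &v->cmds[i];
            uint64_t end = pos + c->length;

            if (offset + done < end) {     /* request overlaps this command */
                uint64_t in_cmd = offset + done - pos;
                size_t   chunk  = (size_t)(end - (offset + done));
                if (chunk > len - done)
                    chunk = len - done;

                if (c->is_copy) {
                    fseek(v->original, (long)(c->src_offset + in_cmd), SEEK_SET);
                    size_t got = fread(out + done, 1, chunk, v->original);
                    done += got;
                    if (got != chunk)
                        break;             /* read error: stop early */
                } else {
                    memcpy(out + done, c->literal + in_cmd, chunk);
                    done += chunk;
                }
            }
            pos = end;
        }
        return done;                       /* bytes actually produced */
    }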

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500 KB to 1 MB) and have my application open it, loop through it to find a record, and close it, OR would it be faster to have the records separated into files named using some appropriate convention, so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
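A tiny sketch of that naming convention, assuming keys are plain strings of at least two characters and a two-level fan-out (first letter, then first two letters):

    #include <stdio.h>

    /* Build "data/F/FO/FOOBAR" style paths from a record key. */
    static void record_path(const char *key, char *out, size_t outsize)
    {
        snprintf(out, outsize, "data/%.1s/%.2s/%s", key, key, key);
    }

    int main(void)
    {
        char path[256];
        record_path("FOOBAR", path, sizeof path);
        printf("%s\n", path);   /* prints: data/F/FO/FOOBAR */
        return 0;
    }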
Alternatively, you can make the single large file perform just as well by building an index file that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on a key different from the one you used to create the filenames - if you've used an index file, you can just create a second index for that situation.
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given that your data is around 1 MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd consider that having one single big file means your application is doing the management of the lines, while having multiple small files means relying on the system and the filesystem to manage the data. The latter can be quite slow, though, because it involves system calls for all your operations.
Opening and closing files in C takes a noticeable amount of time. Say you have 500 files of 2 KB each: processing them adds 1000 extra operations to your application (500 opens and 500 closes), while having a single 1 MB file would save you those 1000 operations. (That is purely my personal opinion...)
Generally it's better to have multiple small files. Keeps memory usage low and performance is much better when searching through it.
But it depends on the number of operations you'll need, because filesystem calls are much more expensive than memory access, for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to a linear file search; use a naming convention to pinpoint the file in O(1) time instead.)
The general trade-off is that having one big file can be more difficult to update, but having lots of little files is fiddly. My suggestion would be that if you use multiple files and end up with a lot of them, it can get very slow traversing a directory with a million files in it. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put these in one directory it would be a nightmare, but having a directory per user ID makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
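To give a flavour of how little code that takes, here is a minimal sketch against the SQLite C API; the table layout and the records.db file name are placeholders.

    #include <stdio.h>
    #include <sqlite3.h>

    /* Print each row returned by the query. */
    static int print_row(void *unused, int ncols, char **values, char **names)
    {
        (void)unused;
        for (int i = 0; i < ncols; i++)
            printf("%s=%s ", names[i], values[i] ? values[i] : "NULL");
        printf("\n");
        return 0;   /* returning non-zero would abort the query */
    }

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("records.db", &db) != SQLITE_OK) {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS records(key TEXT PRIMARY KEY, body TEXT);"
            "INSERT OR REPLACE INTO records VALUES('FOOBAR', 'some record text');",
            NULL, NULL, &err);
        if (err) { fprintf(stderr, "%s\n", err); sqlite3_free(err); }

        sqlite3_exec(db, "SELECT key, body FROM records WHERE key = 'FOOBAR';",
                     print_row, NULL, &err);
        if (err) { fprintf(stderr, "%s\n", err); sqlite3_free(err); }

        sqlite3_close(db);
        return 0;   /* build with: gcc example.c -lsqlite3 */
    }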

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font notepad is using (or whatever text editor).
Try opening it in any archive extractors you have (e.g. zip, 7z, rar, gz, tar, etc.) to see if it's just a renamed archive format (.PK3 files, for instance, are just renamed zip archives).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. search for the PNG signature bytes \x89PNG, or the IHDR chunk, to find any uncompressed PNG files embedded within); see the signature-scan sketch below.
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
Sometimes you just have to guess or intuit what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
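As a starting point for that kind of header hunting, here is a small sketch that scans a file for a handful of well-known signatures (PNG, zip, gzip, RIFF); the signature table is easy to extend with whatever formats you suspect the game uses.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A few well-known magic numbers to hunt for inside an unknown .DAT file. */
    static const struct { const char *name; const char *magic; size_t len; } sigs[] = {
        { "PNG image",      "\x89PNG\r\n\x1a\n", 8 },
        { "zip entry",      "PK\x03\x04",        4 },
        { "gzip stream",    "\x1f\x8b",          2 },
        { "RIFF (wav/avi)", "RIFF",              4 },
    };

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file.dat\n", argv[0]); return 1; }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror(argv[1]); return 1; }

        /* Read the whole file into memory; adjust for very large files. */
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fseek(f, 0, SEEK_SET);
        unsigned char *data = malloc(size);
        if (!data || fread(data, 1, size, f) != (size_t)size) { fclose(f); return 1; }
        fclose(f);

        for (size_t s = 0; s < sizeof sigs / sizeof sigs[0]; s++)
            for (long off = 0; off + (long)sigs[s].len <= size; off++)
                if (memcmp(data + off, sigs[s].magic, sigs[s].len) == 0)
                    printf("possible %-14s at offset 0x%08lx\n", sigs[s].name, off);

        free(data);
        return 0;
    }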
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self identifying" so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts) plus a way to locate those elements within the archive via pointers in the index.
With the ubiquity of the standard compression algorithms, it's mostly a matter of finding where those blocks start and trying to hunt down the index, or table of contents.
Some formats will have the index all in one spot (like a file system does); others will simply precede each element within the archive with its identity information. But in the end, somewhere there is information about offsets from one block to another, information about data types (for example, if they're storing GIF files, GIFs have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of data using the same format. For example, with a game you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.
