Could someone explain these cache hits to me? [C]

I have been working on this for weeks, and I'm so close; I just need to understand how this last part is supposed to work.
So, I have a file which contains:
10
20
22
18
E10
210
12
Ran with: ./lrucache -m 64 -s 4 -e 0 -b 4 -i address01 -r lru
Where,
m is bit size
2^s is the number of sets
2^e is the number of lines
2^b is the block size
I need to find the hits/misses using the LRU (least recently used) algorithm. 18 and 22 should be hits.
With what I have now, I get all misses, which makes sense to me: as it goes through the file, no values are the same, so every time a new one is added (and the LRU one evicted), it's just another miss. I've tested by adding another 210 at the end, and that gives me a hit. So my program essentially works; it's just missing an important part.
I'm leaving out any code because I think I might just be ignorant of a certain part of this. Am I supposed to cache it twice? As in cache it without printing hit/miss, then check again? Because I don't see how I would ever get a hit without repeating values.
Could someone maybe explain to me how 18 and 22 are supposed to be hits?
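For reference, here is a worked sketch of the address decomposition under the given parameters (assuming the trace values are hexadecimal, which the E10 entry suggests): with b = 4 the block size is 16 bytes, so a reference hits whenever its block is already cached, even if the exact address never repeats.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* trace addresses, assumed hexadecimal */
    const uint64_t addrs[] = { 0x10, 0x20, 0x22, 0x18, 0xE10, 0x210, 0x12 };
    const int s = 4, b = 4;                                  /* from -s 4 -b 4 */
    for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
        uint64_t offset = addrs[i] & ((1u << b) - 1);        /* low b bits  */
        uint64_t set    = (addrs[i] >> b) & ((1u << s) - 1); /* next s bits */
        uint64_t tag    = addrs[i] >> (s + b);               /* rest is tag */
        printf("0x%-4llx -> tag %llu, set %llu, offset %llu\n",
               (unsigned long long)addrs[i], (unsigned long long)tag,
               (unsigned long long)set, (unsigned long long)offset);
    }
    return 0;
}

Under that assumption, 0x20 and 0x22 decompose to the same tag and set (so 22 hits), as do 0x10 and 0x18 (so 18 hits): two different addresses can land in the same cached block, which is why hits appear even though no value in the trace repeats.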

How to read lines with an irregular number of records line by line in Fortran

I have a test.csv file (300 lines) as below
10 20 100 2 5 4 5 7 9 10 ....
55 600 7000 500 25
3 10
2 5 6
....
Each line has a different number of integers (maximum number of records = 1000) and I need to process these records line by line. I tried the following:
integer, dimension(1000) :: rec
integer :: i, j
open(unit=5, file="test.csv", status="old", action="read")
do i = 1, 300
   read(unit=5, fmt=*) (rec(j), j=1,1000)
   ! do some procedure with rec
enddo
close(unit=5)
but it seems that the rec array is not filled line by line: when i=n, rec gets numbers that are not from the nth line. How can I solve this problem?
Thank you.
List directed formatting (as specified by the star in the read statement) reads whatever it needs to satisfy the list (hence it is "list directed"). As shown, your code will try to read 1000 values each iteration, consuming as many records (lines) as required to do so.
(List directed formatting has a number of surprising features beyond that, which may have made sense with card based input forms 40 years ago, but are probably misplaced today. Before using list directed input you should understand exactly what the rules around it say.)
A realistic and robust approach to this sort of situation is to read in the input line by line, then "manually" process each line, tokenising that line and extracting values as per whatever rules you are following.
(You should get in touch with whoever is naming files that have absolutely no commas or semicolons with an extension ".csv", and have a bit of a chat.)
(As a general rule, low value unit numbers are to be avoided. Due to historical reasons they may have been preconnected for other purposes. Reconnecting them to a different file is guaranteed to work for that specific purpose, but other statements in your program may implicitly be assuming that the low value unit is still connected as it was before the program started executing - for example, PRINT and READ statements intended to work with the console might start operating on your file instead.)
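A minimal sketch of that line-by-line approach (Fortran 2003 for the list-directed internal read; the unit number, line length and value-counting rule are assumptions, and the data is taken to be blank-separated as shown in the question):

program read_irregular
   implicit none
   integer, parameter :: maxrec = 1000
   integer :: rec(maxrec)
   character(len=20000) :: line
   integer :: ios, n, i

   open(unit=20, file="test.csv", status="old", action="read")
   do
      read(20, '(A)', iostat=ios) line       ! one whole record into a string
      if (ios /= 0) exit                     ! end of file (or error)
      n = count_values(line)                 ! how many integers on this line
      read(line, *) (rec(i), i=1, n)         ! list-directed internal read (F2003)
      ! ... do some procedure with rec(1:n) ...
   end do
   close(20)
contains
   integer function count_values(s)
      character(len=*), intent(in) :: s
      integer :: j
      logical :: inword
      count_values = 0
      inword = .false.
      do j = 1, len_trim(s)
         if (s(j:j) /= ' ' .and. .not. inword) count_values = count_values + 1
         inword = (s(j:j) /= ' ')
      end do
   end function count_values
end program read_irregular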

Cut a file in parts (without using more space) [duplicate]

I have a file, say 100MB in size. I need to split it into (for example) 4 different parts.
Let's say first file from 0-20MB, second 20-60MB, third 60-70MB and last 70-100MB.
But I do not want to do a safe split into 4 new output files. I would like to do it in place, so the output files use the same space on the hard disk that is occupied by this one source file, literally splitting it without making a copy (so at the moment of the split, we lose the original file).
In other words, the input file is the output files.
Is this possible, and if yes, how?
I was thinking maybe to manually add a record to the filesystem, that a file A starts here, and ends here (in the middle of another file), do it 4 times and afterwards remove the original file. But for that I would probably need administrator privileges, and probably wouldn't be safe or healthy for the filesystem.
Programming language doesn't matter, I'm just interested if it would be possible.
The idea is not as mad as some comments paint it. It would certainly be possible to have a file system API that supports such reinterpreting operations (to be sure, the desired split is probably not exactly aligned to block boundaries, but you could reallocate just those few boundary blocks and still save a lot of temporary space).
None of the common file system abstraction layers support this; but recall that they don't even support something as reasonable as "insert mode" (which would rewrite only one or two blocks when you insert something into the middle of a file, instead of all blocks), only an overwrite and an append mode. The reasons for that are largely historical, but the current model is so entrenched that it is unlikely a richer API will become common any time soon.
As I explain in this question on SuperUser, you can achieve this using the technique outlined by Tom Zych in his comment.
bigfile="mybigfile-100Mb"
chunkprefix="chunk_"
# Chunk offsets
OneMegabyte=1048576
chunkoffsets=(0 $((OneMegabyte*20)) $((OneMegabyte*60)) $((OneMegabyte*70)))
currentchunk=$((${#chunkoffsets[@]}-1))
while [ $currentchunk -ge 0 ]; do
# Print current chunk number, so we know it is still running.
echo -n "$currentchunk "
offset=${chunkoffsets[$currentchunk]}
# Copy the end of $bigfile to a new chunk file
tail -c +$((offset+1)) "$bigfile" > "$chunkprefix$currentchunk"
# Chop the end off $bigfile
truncate -s $offset "$bigfile"
currentchunk=$((currentchunk-1))
done
You need to give the script the starting position (offset in bytes, zero means a chunk starting at bigfile's first byte) of each chunk, in ascending order, like on the fifth line.
If necessary, automate it using seq: the following command gives a chunkoffsets with one chunk at 0, one starting at 100 kB, then one for every megabyte over the range 1-10 MB (note the -1 on the last parameter, so the 10 MB endpoint itself is excluded), then one chunk every two megabytes over the range 10-20 MB.
OneKilobyte=1024
OneMegabyte=$((1024*OneKilobyte))
chunkoffsets=(0 $((100*OneKilobyte)) $(seq $OneMegabyte $OneMegabyte $((10*OneMegabyte-1))) $(seq $((10*OneMegabyte-1)) $((2*OneMegabyte)) $((20*OneMegabyte-1))))
To see which chunks you have set:
for offset in "${chunkoffsets[@]}"; do echo "$offset"; done
0
102400
1048576
2097152
3145728
4194304
5242880
6291456
7340032
8388608
9437184
10485759
12582911
14680063
16777215
18874367
20971519
This technique has the drawback that it needs at least the size of the largest chunk available (you can mitigate that by making smaller chunks, and concatenating them somewhere else, though). Also, it will copy all the data, so it's nowhere near instant.
As to the fact that some hardware video recorders (PVRs) manage to split videos within seconds, they probably only store a list of offsets for each video (a.k.a. chapters), and display these as independent videos in their user interface.

Unit testing a binary format reader in C

I am writing a C library that reads a binary file format. I don't control the binary format; it's produced by a proprietary data acquisition program and is relatively complicated. As it is one of my first forays into C programming and binary file parsing, I am having a bit of trouble figuring out how to structure the code for testing and portability.
For testing purposes, I thought the easiest course of action was to build the library to read an arbitrary byte stream. But I ended up implementing a stream data type that encapsulates the type of stream (memstream, filestream, etc). The interface has functions like stream_read_uint8 such that the client code doesn't have to know anything about where the bytes are coming from. My tests are against a memstream, and the filestream stuff is essentially just a wrapper around FILE* and fread, etc.
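Concretely, the sort of interface described might look something like this (a minimal sketch: only stream_read_uint8 is named above; the struct names, fields and memstream backing are assumptions):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct stream stream;

struct stream {
    size_t (*read)(stream *s, void *buf, size_t n);  /* returns bytes actually read */
    void *state;
};

/* Memory-backed implementation, used by the tests */
typedef struct {
    const uint8_t *data;
    size_t len, pos;
} memstate;

static size_t mem_read(stream *s, void *buf, size_t n) {
    memstate *m = s->state;
    size_t avail = m->len - m->pos;
    if (n > avail) n = avail;
    memcpy(buf, m->data + m->pos, n);
    m->pos += n;
    return n;
}

/* The client-facing primitive: 0 on success, -1 on end of stream */
int stream_read_uint8(stream *s, uint8_t *out) {
    return s->read(s, out, 1) == 1 ? 0 : -1;
}

/* Usage: wrap a byte buffer and pull bytes out one at a time */
int main(void) {
    static const uint8_t bytes[] = { 0xDE, 0xAD };
    memstate m = { bytes, sizeof bytes, 0 };
    stream s = { mem_read, &m };
    uint8_t v;
    while (stream_read_uint8(&s, &v) == 0)
        printf("0x%02X\n", v);
    return 0;
}

The filestream variant would carry a FILE * in state and call fread from its read function.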
From an OOP perspective, I think this is a reasonable design. However, I get the feeling that I am stuffing the wrong paradigm into the language and ending up with overly abstracted, overly complicated code as a result.
So my question: is there a simpler, more idiomatic way to do binary format reading in plain C while preserving automated tests?
Note: I realize that FILE* is essentially an abstract stream interface. But the implementation of memory streams (fmemopen) is non-standard and I want Standard C for portability.
What you described is low-level I/O functionality. Since fmemopen() is not 100% portable (off Linux, it creaks, I suspect), you need to provide yourself with something portable that you write, sufficiently close to the native functions that you can use your surrogate functions (only) when necessary and the native functions when possible. Of course, you should be able to force the use of your functions even in your native habitat so that you can test your code.
This code can be tested with known data to ensure that you pick up all the characters in the input streams and can faithfully return them. If the raw data is in a specific endianness, you can ensure that your 'larger' types -- hypothetically, functions such as stream_read_uint2(), stream_read_uint4(), stream_read_string(), etc -- all behave appropriately. For this phase, you don't really need the actual data; you can manufacture data to suit yourself and your testing.
Once you've got that in place, you will also need to write code for reading the data with the larger types, and for ensuring that these higher-level functions can actually interpret the binary data accurately and invoke appropriate actions. For this, you finally need examples of the format as actually supplied; up until this phase you could probably get away with data you manufactured. But once you're reading the actual files, you need examples of those to work on, or you'll have to manufacture them from your understanding and test as best you can. How easy this is depends on how clearly documented the binary format is.
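For instance, the endianness of a 'larger' read can be pinned down with manufactured bytes, with no real data files involved (a sketch; read_uint32_le is a made-up name, and in the library it would sit behind the stream interface sketched earlier):

#include <stdint.h>
#include <assert.h>

/* Decode a 32-bit little-endian value from raw bytes */
static uint32_t read_uint32_le(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Test against manufactured data: bytes 78 56 34 12 must come back as 0x12345678 */
static void test_read_uint32_le(void) {
    static const uint8_t bytes[] = { 0x78, 0x56, 0x34, 0x12 };
    assert(read_uint32_le(bytes) == 0x12345678u);
}

int main(void) { test_read_uint32_le(); return 0; }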
One of the key testing and debugging tools will be canonical 'dump' functions that can present data for you. The scheme I use is:
extern void dump_XyzType(FILE *fp, const char *tag, const XyzType *data);
The stream is self-evident; usually it is stderr, but by making it an argument, you can get the data to any open file. The tag is included in the information printed; it should be unique to identify the location of call. The last argument is a pointer to the data type. You can analyze and print that. You should take the opportunity to assert all validity checks that you can think of, to head off problems.
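The body of such a function is typically just formatted prints plus whatever assertions apply; here is a sketch with a hypothetical XyzType (the fields are made up):

#include <assert.h>
#include <stdio.h>

/* Hypothetical type, used only to illustrate the dump scheme */
typedef struct {
    int id;
    double value;
} XyzType;

void dump_XyzType(FILE *fp, const char *tag, const XyzType *data) {
    assert(fp != NULL && tag != NULL && data != NULL);
    fprintf(fp, "XyzType: %s -- address %p\n", tag, (const void *)data);
    assert(data->id >= 0);   /* any validity checks you can think of */
    fprintf(fp, "  id: %d, value: %g\n", data->id, data->value);
}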
You can extend the interface with , const char *file, int line, const char *func and arrange to add __FILE__, __LINE__ and __func__ to the calls. I've never quite needed it, but if I were to do it, I'd use:
#define DUMP_XyzType(fp, tag, data) \
dump_XyzType(fp, tag, data, __FILE__, __LINE__, __func__)
As an example, I deal with a type DATETIME, so I have a function
extern void dump_datetime(FILE *fp, const char *tag, const ifx_dtime_t *dp);
One of the tests I was using this week could be persuaded to dump a datetime value, and it gave:
DATETIME: Input value -- address 0x7FFF2F27CAF0
Qualifier: 3594 -- type DATETIME YEAR TO SECOND
DECIMAL: +20120913212219 -- address 0x7FFF2F27CAF2
E: +7, S = 1 (+), N = 7, M = 20 12 09 13 21 22 19
You might or might not be able to see the value 2012-09-13 21:22:19 in there. Interestingly, this function itself calls on another function in the family, dump_decimal(), to print out the decimal value. One year, I'll upgrade the qualifier print to include the hex version, which is a lot easier to read: 3594 is 0x0E0A, which is readily understandable by those in the know as 14 digits (E), starting with YEAR (the second 0) and running to SECOND (A), which is certainly not so obvious from the decimal version. Of course, the information is there in the type string: DATETIME YEAR TO SECOND. (The decimal format is somewhat inscrutable to the outsider, but pretty clear to an insider who knows there's an exponent (E), a sign (S), a number of (centesimal) digits (N = 7), and the actual digits (M = ...). Yes, the name decimal is strictly a misnomer as it uses a base-100 or centesimal representation.)
The test doesn't produce that level of detail by default, but I simply had to run it with a high-enough level of debugging set (by command line option). I'd regard that as another valuable feature.
The quietest way of running the tests produces:
test.bigintcvasc.......PASS (phases: 4 of 4 run, 4 pass, 0 fail)(tests: 92 run, 89 pass, 3 fail, 3 expected failures)
test.deccvasc..........PASS (phases: 4 of 4 run, 4 pass, 0 fail)(tests: 60 run, 60 pass, 0 fail)
test.decround..........PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 89 run, 89 pass, 0 fail)
test.dtcvasc...........PASS (phases: 25 of 25 run, 25 pass, 0 fail)(tests: 97 run, 97 pass, 0 fail)
test.interval..........PASS (phases: 15 of 15 run, 15 pass, 0 fail)(tests: 178 run, 178 pass, 0 fail)
test.intofmtasc........PASS (phases: 2 of 2 run, 2 pass, 0 fail)(tests: 12 run, 8 pass, 4 fail, 4 expected failures)
test.rdtaddinv.........PASS (phases: 3 of 3 run, 3 pass, 0 fail)(tests: 69 run, 69 pass, 0 fail)
test.rdtimestr.........PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 16 run, 16 pass, 0 fail)
test.rdtsub............PASS (phases: 1 of 1 run, 1 pass, 0 fail)(tests: 19 run, 15 pass, 4 fail, 4 expected failures)
Each program identifies itself and its status (PASS or FAIL) and summary statistics. I've been bug hunting and fixing a bug other than the ones I found coincidentally, so there are some 'expected failures'. That should be a temporary state of affairs; it allows me to claim legitimately that the tests are all passing. If I wanted more detail, I could run any of the tests, with any of the phases (sub-sets of the tests which are somewhat related, though the 'somewhat' is actually arbitrary), and see the results in full, etc. As shown, it takes less than a second to run that set of tests.
I find this helpful where there are repetitive calculations - but I've had to calculate or verify the correct answer for every single one of those tests at some point.

Updating Output

I need to create a utility that "updates" its output, much like curl, which keeps changing its last line:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 8434 100 8434 0 0 4064 0 0:00:02 0:00:02 --:--:-- 7695
I think using something like curses is not the way to go here. I don't want to manipulate the window, I want to simply change my last line of output.
The solution I have in mind is to print enough backspaces to rewrite the line, but I haven't tested this yet. I'd like to know if this is a "correct" way of doing it, or if there is a better one.
Also, in my case I only need to update the last line, so I wouldn't need that many backspaces (if that's the solution); however, to make it generic, if I needed to update the 10th line from the bottom, rewriting the same content from the 9th-from-bottom line onward might not be that efficient (or maybe it is...).
You can backspace over the line, or (generally easier) print a carriage return and just re-print the entire line. When you do, be sure to rewrite the whole line though -- if, for example, you have a number counting down to 0, when it drops from 100 to 99 it won't necessarily overwrite the last digit of the old value unless you ensure that a space gets printed there.
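In C that amounts to something like this (a minimal sketch; usleep is POSIX and is only there to slow the loop down):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    for (int n = 100; n >= 0; n--) {
        printf("\rRemaining: %3d", n);   /* fixed field width keeps old digits covered */
        fflush(stdout);                  /* stdout is usually line-buffered; force it out */
        usleep(50000);                   /* 50 ms, just so the update is visible */
    }
    printf("\n");
    return 0;
}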
In DOS, you can simply print the carriage return without the line feed and overwrite the last line.

Only decompress a specific bzip2 block

Say I have a bzip2 file (over 5 GB), and I want to decompress only block #x, because that is where my data is (the block is different every time). How would I do this?
I thought about making an index of where all the blocks are, then cut the block I need from the file and apply bzip2recover to it.
I also thought about compressing say 1MB at a time, then appending this to a file (and recording the location), and simply grabbing the file when I need it, but I'd rather keep the original bzip2 file intact.
My preferred language is Ruby, but any language's solution is fine by me (as long as I understand the principle).
There is a tool at http://bitbucket.org/james_taylor/seek-bzip2
Grab the source, compile it.
Run with
./seek-bzip2 32 < bzip_compressed.bz2
to test.
The only parameter is the bit offset of the desired block's header. You can get it by searching for the hex string "31 41 59 26 53 59" in the binary file. THIS WAS INCORRECT: a block start may not be aligned to a byte boundary, so you should search for every possible bit shift of the "31 41 59 26 53 59" pattern, as bzip2recover does - http://www.bzip.org/1.0.3/html/recovering.html
32 is the bit size of the "BZh1" stream header, where the 1 can be any digit from "1" to "9" (in classic bzip2) - it gives the (uncompressed) block size in hundreds of kB (not exact).
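For illustration, the bit-shifted search for that block magic could be sketched like this in C (a rough sketch; it assumes the compressed data is already in a memory buffer, and the function name is made up):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Report the bit offset of every occurrence of the 48-bit bzip2 block magic
 * 0x314159265359, checking all bit alignments (as bzip2recover does). */
void find_block_magics(const unsigned char *buf, size_t len) {
    const uint64_t magic = 0x314159265359ULL;
    const uint64_t mask  = (1ULL << 48) - 1;
    uint64_t window = 0;
    for (size_t bit = 0; bit < len * 8; bit++) {
        int b = (buf[bit / 8] >> (7 - bit % 8)) & 1;   /* bzip2 is MSB-first */
        window = ((window << 1) | (uint64_t)b) & mask;
        if (bit >= 47 && window == magic)
            printf("block magic at bit offset %zu\n", bit - 47);
    }
}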
It's true that bzip-table is almost as slow as decompressing but of course you only have to do it once and you can store the output in some fashion to use as an index. This is perfect for what I need but may not be what everybody needs.
I did need a little help getting it to compile on Windows though.
