Your machine has an L1 cache and memory with the following properties.
Memory address space: 24 bits
Cache block size: 16 bytes
Cache associativity: direct-mapped
Cache size: 256 bytes
I am asked to determine the following: 1. the number of tag bits, 2. the number of bits in the cache index, 3. the number of bits for the cache size.
tag bits: t = m - (s + b)
m = 24. The number of sets is S = C/(B*E); with E = 1 for a direct-mapped cache, S = 256/16 = 16, so s = log2 S = log2 16 = 4. The block size is B = 16, so b = log2 B = log2 16 = 4. With s = 4, b = 4 and m = 24, that gives t = 24 - (4 + 4) = 16 total tag bits.
For 2., I am not sure how to figure this out.
I believe the number of bits for the cache size is just C * (bits per byte) = 256 * 8 = 2048.
Can anyone help me figure out 2., and determine whether the logic in 1. and 3. is correct?
1) This is correct: with m = 24, t = 24 - (4 + 4) = 16.
2) The number of index bits is the number of bits needed to address a block in the cache when it's direct-mapped, since the index identifies the set (which consists of only one block in this case). If it were 2-way, one bit less would be needed for the index (and added to the tag bits). For this problem: since there are 16 sets, you need to distinguish 16 values, which can be represented in log2 16 = 4 index bits.
3) It is not completely clear how to interpret this question. I would understand it as the number of bits needed to address a block in the cache, which would be 4 in this case. If indeed, as you assume, the total number of bits in the cache was meant, you would have to add 16 * 16 bits for the tags to your solution.
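To make the arithmetic concrete, here is a small C sketch (my own illustration, not part of the original question) that derives all three quantities from the given parameters:

#include <stdio.h>
#include <math.h>

int main(void)
{
    int m = 24;    /* address bits */
    int C = 256;   /* cache size in bytes */
    int B = 16;    /* block size in bytes */
    int E = 1;     /* lines per set (direct-mapped) */

    int S = C / (B * E);    /* number of sets = 16 */
    int s = (int)log2(S);   /* index bits = 4 */
    int b = (int)log2(B);   /* offset bits = 4 */
    int t = m - (s + b);    /* tag bits = 16 */

    printf("index bits: %d, offset bits: %d, tag bits: %d\n", s, b, t);
    printf("data bits in the cache: %d\n", C * 8);   /* 2048 */
    return 0;
}

Compiled with -lm, this prints s = 4, b = 4, t = 16 and 2048 data bits, matching the reasoning above.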
I'm writing a cache simulation program in C on Linux using gcc as the compiler, and I'm done for the most part. Only a few test cases go wrong (out of the thousands of addresses fed in, a few that should be hitting are missing). I specify the cache properties on the command line. I suspect the error in my code has to do with the tag (if things aren't hitting, then their tags aren't matching up when they should). So my question is: am I calculating the tag right?
//setting sizes of bits
int offsetSize = log2(lineSize);
int indexSize = 0;
if (strcmp(associativity, "direct") == 0) {   //direct-mapped
    indexSize = log2(numLines);
} else if (assocNum == numLines) {            //fully associative
    indexSize = 0;
} else {                                      //set associative
    indexSize = log2(assocNum);
}
address = (int) strtol(readAddress,&eptr,16);
unsigned long long int mask = 0;
//get the offset Bits
mask = (1 << offsetSize) - 1;
offsetBits = address & mask;
//get the index bits
mask = (1 << (indexSize)) - 1;
mask = mask << offsetSize;
indexBits = (address & mask) >> offsetSize;
//get tag bits
tagBits = address >> (offsetSize+indexSize);
The addresses that are being fed in are usually 48 bits, so the variables address and mask are of type unsigned long long int. I think the problem I'm having is that I'm taking all the upper bits of the address, when I should only be taking a small set of bits from the large address.
For example: I have 32 cache lines in a 4-way set associative cache with a block size of 4.
offsetSize = log2(4) = 2
indexSize = log2(4) = 2
My code currently takes the upper bits of the address no matter the address size, minus the last 4 bits. Should I be taking only the upper 28 bits instead? (tagSize = (8*4)-3-2)
The tag has to contain all of the upper bits, so that comparing tags can determine whether an access is or isn't a cache hit.
If addresses are 48 bits and are split into 3 fields, you'd have a 2-bit "offset in cache line" field, a 2-bit "index in cache" field and a 44-bit "upper bits that have to be stored in the tag" field. If you only store 28 bits in the tag, then you get cache hits when you should get cache misses (because the entry in the cache happens to contain data for a different address whose 28 bits happen to match).
Note that you can/should think of "associativity" as the number of sets of cache lines that operate in parallel (where direct mapped is just "associativity = 1", and fully associative is just "associativity = total_cache_size / cache_line_size"). The associativity has no direct effect on the index size; only the size of one set of cache lines matters for the index size. The problem you're having is probably the line indexSize = log2(assocNum);, which doesn't make sense.
In other words:
if( direct_mapped ) {
    associativity = 1;
} else {
    max_associativity = total_cache_size / cache_line_size;
    if( fully_associative || (associativity > max_associativity) ) {
        associativity = max_associativity;
    }
}
set_size = total_cache_size / associativity;
number_of_lines_in_set = set_size / cache_line_size;
offset_size = log2(cache_line_size);
index_size = log2(number_of_lines_in_set);
tag_size = address_size - index_size - offset_size;
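As a sanity check, this is roughly what the corrected pseudocode looks like as real C for the 48-bit example (32 lines of 4 bytes, 4-way set associative); the variable names and the sample address are my own:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t address = 0x123456789ABCULL;   /* an arbitrary 48-bit address */

    int cache_line_size = 4;
    int total_cache_size = 32 * cache_line_size;   /* 128 bytes */
    int associativity = 4;
    int address_size = 48;

    int set_size = total_cache_size / associativity;   /* 32 bytes per set */
    int lines_in_set = set_size / cache_line_size;     /* 8 */
    int offset_size = 2;   /* log2(cache_line_size) */
    int index_size = 3;    /* log2(lines_in_set), NOT log2(associativity) */
    int tag_size = address_size - index_size - offset_size;   /* 43 */

    uint64_t offset = address & ((1ULL << offset_size) - 1);
    uint64_t index  = (address >> offset_size) & ((1ULL << index_size) - 1);
    uint64_t tag    = address >> (offset_size + index_size);

    printf("tag=%llx index=%llx offset=%llx (%d tag bits)\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset, tag_size);
    return 0;
}

Note that the index comes out as 3 bits here (8 lines per set), not the 2 bits that log2(assocNum) would give.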
I need to create a large binary matrix that is over the array size limit for MATLAB.
By default, MATLAB stores numeric arrays as double precision arrays. But since my matrix is binary, I am hoping that there is a way to create an array of bits instead of doubles and consume far less memory.
I created a random binary matrix A and converted it to a logical array B:
A = randi([0 1], 1000, 1000);
B = logical(A);
I saved both as .mat files. They take up about the same space on disk, so I don't think MATLAB is using a more compact data type for logicals, which seems very wasteful. Any ideas?
Are you sure that the variables take the same amount of space? logical matrices / arrays inherently use 1 byte per element, whereas randi produces double precision values, which take 8 bytes per element. A simple call to whos will show you how much memory each variable takes:
>> A = randi([0 1], 1000, 1000);
>> B = logical(A);
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
B 1000x1000 1000000 logical
As you can see, A takes 8 x 1000 x 1000 = 8M bytes, whereas B takes up 1 x 1000 x 1000 = 1M bytes. There are most certainly memory savings between them.
The drawback with logicals is that they take 1 byte per element, and you're looking for 1 bit instead. The best thing I can think of is to use an unsigned integer type and pack chunks of N bits, where N is the bit width of the data type (uint8, uint16, uint32, etc.), into a single array. With uint32, 32 binary digits can be packed per number, and you can save this final matrix.
Going off on a tangent - Images
In fact, this is how Java packs colour pixels when reading images in using its BufferedImage class. Each pixel in an RGB image is 24 bits, with 8 bits per colour channel: red, green and blue. Each pixel is represented as a proportion of red, green and blue, and the trio of 8-bit values is concatenated into a single 24-bit integer. Usually, integers are represented as 32 bits, so you may think 8 extra bits are being wasted. There is in fact an alpha channel that represents the transparency of each pixel, and that takes the other 8 bits. If you don't use transparency, these are assumed to be all 1s, and so these four 8-bit fields together constitute 32 bits per pixel. There are, however, compression algorithms that reduce the size of each pixel on average to significantly less than 32 bits per pixel, but that's outside the scope of what I'm talking about.
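As a rough illustration of that packing (in C rather than Java, and not tied to any particular image library):

#include <stdint.h>

/* Pack four 8-bit channels into one 32-bit ARGB pixel, as described above. */
uint32_t pack_argb(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}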
Going back to our discussion, one way to represent this binary matrix in bit form would be perhaps in a for loop like so:
Abin = zeros(1, ceil(numel(A)/32), 'uint32');
for ii = 1 : numel(Abin)
    val = A((ii-1)*32 + 1 : ii*32);
    dec = bin2dec(sprintf('%d', val));
    Abin(ii) = dec;
end
Bear in mind that this will only work for matrices where the total number of elements is divisible by 32. I won't go into how to handle the case where it isn't because I solely want to illustrate the point that you can do what you ask, but it requires a bit of manipulation. Your case of 1000 x 1000 = 1M is certainly divisible by 32 (you get 1M / 32 = 31250), and so this will work.
This is probably not the most optimized code, but it gets the point across. Basically, we take chunks of 32 numbers (0/1), going column-wise from left to right, and determine the 32-bit unsigned integer representation of each chunk. We then store this in a single location in the matrix Abin. What you get in the end, given your 1000 x 1000 matrix, is 31250 32-bit unsigned integers, which corresponds to 1000 x 1000 bits, or 1M bits = 125,000 bytes.
Try looking at the size of each variable now:
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
Abin 1x31250 125000 uint32
B 1000x1000 1000000 logical
To perform a reconstruction, try:
Arec = zeros(size(A));
for ii = 1 : numel(Abin)
    val = dec2bin(Abin(ii), 32) - '0';
    Arec((ii-1)*32 + 1 : ii*32) = val(:);
end
Also not the most optimized, but it gets the point across. Given the "compressed" matrix Abin that we calculated before, we reconstruct the original 32-bit number for each element, then assign those numbers back into Arec in 32-bit chunks.
You can verify that Arec is indeed equal to the original matrix A:
>> isequal(A, Arec)
ans =
1
Also, check out the workspace with whos:
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
Abin 1x31250 125000 uint32
Arec 1000x1000 8000000 double
B 1000x1000 1000000 logical
You are storing your data in a compressed file format. For .mat files in versions 7.0 and 7.3, gzip compression is used. The uncompressed data has different sizes, but both arrays are compressed down to roughly the same size on disk. That happens because both contain only 0s and 1s, which can be compressed efficiently.
For a 32-bit integer, divide the range into 32 bins of consecutive integers such that each successive bin holds twice as many integers as the one before. The first bin contains the single item 0, the second the items 0..1, and so on up to 0..2^31-1.
The fastest algorithm I could come up with, given a 32-bit integer i, is 5 cycles on an i7 (bit scan is 3 cycles):
// bin is given by the position of the msb; then we clear the msb to get the item
bin_index = bsr(i)
item = i ^ (1 << bin_index)
Or equivalently (well, it stores the items 0..2^31-1 in bin 0 and 0 in bin 31, but that doesn't matter):
// bin is the number of trailing zeroes, and then we shift down by that many bits + 1
bin_index = bsf(i)
item = i >> (bin_index + 1)
In each case the bin index is encoded as the number of leading/trailing zero bits, with a 1 to separate them from the item number. You could do the same with leading or trailing ones and a zero to separate them. Neither works with i=0, but that's not important.
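For what it's worth, here is how the two variants might look in C with GCC builtins (my own sketch; on x86, __builtin_clz and __builtin_ctz compile down to bsr/bsf-style instructions):

#include <stdint.h>

/* Variant 1: bin index = position of the most significant set bit (bsr). */
static void bin_bsr(uint32_t i, unsigned *bin, uint32_t *item)
{
    *bin  = 31 - __builtin_clz(i);   /* undefined for i == 0 */
    *item = i ^ (1u << *bin);        /* clear the msb to get the item */
}

/* Variant 2: bin index = number of trailing zeroes (bsf). */
static void bin_bsf(uint32_t i, unsigned *bin, uint32_t *item)
{
    *bin  = __builtin_ctz(i);        /* undefined for i == 0 */
    *item = i >> (*bin + 1);         /* shift out the separator 1 bit */
}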
The mapping between integers and the bins/items can be completely arbitrary, so long as twice as many consecutive integers end up in each successive bin and the total number of integers in the bins sums to 2^32-1. Can you think of a more efficient algorithm to bin 32-bit integers on an i7? Keep in mind an i7 is superscalar, so any operations that don't depend on each other can execute in parallel, up to the throughput limit for each instruction type.
You can improve your algorithm by partially sorting the data before counting zeros.
For example, compare the value to 2^31 first, and if it's greater or equal, put it in the top bin; otherwise go on and do the bit scan. With this you now have half your data set put into its bin in 2 instructions, probably two cycles. The other half takes a bit longer, but the net result should be an improvement; see the sketch below. You can likely optimize even further along this line of thought.
I guess this would also depend on the efficiency of branch prediction.
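A sketch of that idea, under the same assumptions as the builtin-based code above (hypothetical; real cycle counts would need measuring):

/* Handle the top bin with a single compare; only the remaining half of
   the inputs pays for the bit scan (bsr variant shown). */
static void bin_fast(uint32_t i, unsigned *bin, uint32_t *item)
{
    if (i >= 0x80000000u) {              /* i >= 2^31: top bin */
        *bin  = 31;
        *item = i ^ 0x80000000u;
    } else {
        *bin  = 31 - __builtin_clz(i);   /* undefined for i == 0 */
        *item = i ^ (1u << *bin);
    }
}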
I'm making an associative cache simulator in C:
4 ways,
524288 bytes total size,
64 bytes block size,
32 bits addresses.
In this address:
00001000000000000000000100001100
what is the decimal value of the tag, the set and the word?
I think it's tag: 256, set: 4, word: 12, but I have some errors in hits and misses and I think this might be the problem.
Thanks for your time.
I get this:
000010000000000 00000000100 001100
000010000000000 = tag = 1024
00000000100 = set number = 4
001100 = offset in block = 12
Note: This is the same as your results, except that the tag value is different by a factor of 4, which happens to be the same as the associativity. I'd be tempted to assume that this isn't a coincidence (and that you're consuming some bits for the "way" when you shouldn't be).
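A quick way to check this (my own snippet, assuming 32-bit addresses and the geometry above: 524288 / 64 = 8192 lines, 4 ways, hence 2048 sets, 11 index bits and 6 offset bits):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr = 0x0800010Cu;   /* 00001000000000000000000100001100 */

    unsigned offset = addr & 0x3F;           /* low 6 bits   */
    unsigned set    = (addr >> 6) & 0x7FF;   /* next 11 bits */
    unsigned tag    = addr >> 17;            /* top 15 bits  */

    printf("tag=%u set=%u offset=%u\n", tag, set, offset);   /* 1024 4 12 */
    return 0;
}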
I want a hash function that takes a long number (64 bits) and produces a 10-bit result. What is the best hash function for this purpose? Inputs are basically addresses of variables (addresses are 64 bits, or 8 bytes, on Linux), so my hash function should be optimized for that case.
I would say something like this:
uint32_t hash(uint64_t x)
{
x >>= 3;
return (x ^ (x>>10) ^ (x>>20)) & 0x3FF;
}
The least significant 3 bits are not very useful, as most variables are 4-byte or 8-byte aligned, so we remove them.
Then we take the next 30 bits and mix them together (XOR) in blocks of 10 bits each.
Naturally, you could also fold in (x>>30)^(x>>40)^(x>>50), but I'm not sure if that would make any difference in practice.
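For example, it could be used to index a 1024-entry table (illustrative usage; the names are mine):

#include <stdint.h>

#define NBUCKETS 1024
static void *buckets[NBUCKETS];

void note_address(void *p)
{
    uint32_t h = hash((uint64_t)(uintptr_t)p);   /* hash() as defined above */
    buckets[h] = p;                              /* h is always in 0..1023 */
}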
I wrote a toy program to see some real addresses on the stack, data area, and heap. Basically I declared 4 globals, 4 locals and did 2 mallocs. I dropped the last two bits when printing the addresses. Here is an output from one of the runs:
20125e8
20125e6
20125e7
20125e4
3fef2131
3fef2130
3fef212f
3fef212c
25e4802
25e4806
What this tells me:
The LSB in this output (the 3rd bit of the address) toggles between 'on' and 'off' frequently, so I wouldn't drop it when calculating the hash. Dropping 2 LSBs seems enough.
We also see that there is more entropy in the lower 8-10 bits. We must use those when calculating the hash.
We know that on a 64-bit machine, virtual addresses are never more than 48 bits wide.
What I would do next:
/* Drop two LSBs. */
a >>= 2;
/* Get rid of the MSBs. Keep 46 bits. */
a &= 0x3fffffffffff;
/* Get the 14 MSBs and fold them in to get a 32 bit integer.
The MSBs are mostly 0s anyway, so we don't lose much entropy. */
msbs = (a >> 32) << 18;
a ^= msbs;
Now we pass this through a decent 'half avalanche' hash function, instead of rolling our own. 'Half avalanche' means each bit of the input gets a chance to affect bits at the same position and higher:
uint32_t half_avalanche( uint32_t a)
{
a = (a+0x479ab41d) + (a<<8);
a = (a^0xe4aa10ce) ^ (a>>5);
a = (a+0x9942f0a6) - (a<<14);
a = (a^0x5aedd67d) ^ (a>>3);
a = (a+0x17bea992) + (a<<7);
return a;
}
For a 10-bit hash, use the 10 MSBs of the uint32_t returned. The hash function continues to work fine if you pick the N MSBs for an N-bit hash, effectively doubling the bucket count with each additional bit.
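Putting the pieces together, the complete 10-bit hash might look like this (a sketch combining the folding fragment above with half_avalanche):

uint32_t hash10(uint64_t a)
{
    a >>= 2;                        /* drop the two LSBs */
    a &= 0x3fffffffffff;            /* keep 46 bits */
    a ^= (a >> 32) << 18;           /* fold the 14 MSBs back in */
    return half_avalanche((uint32_t)a) >> 22;   /* top 10 of 32 bits */
}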
I was a little bored, so I wrote a toy benchmark for this. Nothing fancy, it allocates a bunch of memory on the heap and tries out the hash I described above. The source can be had from here. An example result:
1024 buckets, 256 values generated, 29 collisions
1024 buckets, 512 values generated, 103 collisions
1024 buckets, 1024 values generated, 370 collisions
Next: I tried out the other two hashes answered here. Both have similar performance. Looks like: just pick the fastest one ;)
Best for most distributions is mod by a prime; 1021 is the largest 10-bit prime. There's no need to strip low bits.
#include <stdint.h>

static inline int hashaddress(void *v)
{
    return (uintptr_t)v % 1021;
}
If you think performance might be a concern, have a few alternates on hand and race them in your actual program. Microbenchmarks are a waste; a difference of a few cycles is almost certain to be swamped by cache effects, and size matters.