I need to generate a matrix of 3 columns and 11 rows where each position can hold 0, 1 or 2. The problem is that if I do this in a loop, I end up with 3^33 possible combinations, which I cannot store or even iterate through. What I want to do is to generate all of those matrices and operate on them in the minimum amount of time, without the kernel dying.
Thank you very much! :)
Well, each matrix carries about 53 bits of data (log_2(3^33) ≈ 52.3 < 53), so your question is essentially "how do I generate all possible 53-bit integers".
If that doesn't sound alarming to you, then I don't really know what to tell you, but most of the time you don't have to have them in memory (or even on disk) AT ONCE.
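As a minimal sketch of what "not all at once" could look like: treat each matrix as the base-3 digits of a running counter and decode it on the fly (the helper names here are made up for illustration, and iterating the full 3^33 range would still take an enormous amount of time):

    #include <stdint.h>
    #include <stdio.h>

    #define ROWS 11
    #define COLS 3
    #define CELLS (ROWS * COLS)   /* 33 cells with values 0..2 -> 3^33 matrices */

    /* Decode one matrix from its index in 0 .. 3^33-1 (its base-3 digits). */
    static void decode_matrix(uint64_t index, int m[ROWS][COLS]) {
        for (int cell = 0; cell < CELLS; ++cell) {
            m[cell / COLS][cell % COLS] = (int)(index % 3);
            index /= 3;
        }
    }

    int main(void) {
        int m[ROWS][COLS];
        /* Stream matrices one at a time; nothing is kept after use.
           In practice you would only walk a sub-range or sample, since
           all 3^33 of them is ~5.6e15 iterations. */
        for (uint64_t i = 0; i < 1000; ++i) {   /* demo: first 1000 only */
            decode_matrix(i, m);
            /* ... operate on m here ... */
        }
        return 0;
    }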
If you're hard committed to generating all of them and holding them in memory at once for some reason (and I can't really see why), then even in integer format we're talking about 53 * 3^33 bits of data (quite possibly 64 * 3^33, as I'd use long integers, but let's use 53-bit integers). That's ~3 * 10^17 bits, ~3.7 * 10^16 bytes, ~3.6 * 10^13 KiB, ~35122658018 MiB, ~34299470 GiB, ~33496 TiB, ~33 PiB.
33 PiB is A LOT of data. Unless you run something on the scale of NASDAQ or are a federal agency (like the NSA ;) ), you don't have access to hardware that can process all of it at the same time.
Whatever you're trying to accomplish with this approach, it's a dead end. It sounds like you're facing a problem that brute force shouldn't be expected to solve.
Related
I'm trying to find the average value of an array of floating point values using multiple threads on a single machine. I'm not concerned with the size of the array or memory constraints (assume a moderately sized array, large enough to warrant multiple threads). In particular, I'm looking for the most efficient scheduling algorithm. It seems to me that a static block approach would be the most efficient.
So, given that I have x machine cores, it would seem reasonable to chunk the array into array.size/x values and have each core sum the results for their respective array chunk. Then, the summed results from each core are added and the final result is this value divided by the total number of array elements (note: in the case of the # of array elements not being exactly divisible by x, I am aware of the optimization to distribute the elements as evenly as possible across the threads).
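The static block split described above is language-agnostic even though the question is about Java; purely as an illustration, here is a rough sketch in C with POSIX threads (the array contents, its size and the thread count of 4 are arbitrary choices for the example):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4

    typedef struct {
        const double *data;
        size_t begin, end;   /* half-open range [begin, end) owned by this thread */
        double partial;      /* partial sum, written only by this thread */
    } chunk_t;

    static void *sum_chunk(void *arg) {
        chunk_t *c = arg;
        double s = 0.0;
        for (size_t i = c->begin; i < c->end; ++i)   /* contiguous: good locality */
            s += c->data[i];
        c->partial = s;
        return NULL;
    }

    int main(void) {
        size_t n = 1000000;
        double *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; ++i) a[i] = (double)i;   /* placeholder data */

        pthread_t tid[NTHREADS];
        chunk_t chunk[NTHREADS];
        size_t base = n / NTHREADS, extra = n % NTHREADS, start = 0;

        for (int t = 0; t < NTHREADS; ++t) {
            size_t len = base + ((size_t)t < extra ? 1 : 0);   /* spread the remainder */
            chunk[t] = (chunk_t){ a, start, start + len, 0.0 };
            start += len;
            pthread_create(&tid[t], NULL, sum_chunk, &chunk[t]);
        }

        double total = 0.0;
        for (int t = 0; t < NTHREADS; ++t) {
            pthread_join(tid[t], NULL);
            total += chunk[t].partial;   /* reduce the partial sums after joining */
        }
        printf("average = %f\n", total / (double)n);
        free(a);
        return 0;
    }

Compile with something like gcc -O2 -pthread avg.c. Each thread reads only its own contiguous slice and writes only its own partial field, so, as noted below, no locking is required.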
The array will obviously be shared between threads, but since there are no writes involved, I won't need to involve any locking mechanisms or worry about synchronization issues.
My question is: is this actually the most efficient approach for this problem?
In contrast, for example, consider the static interleaved approach. In this case, if I had four cores (threads), then thread one would operate on array elements 0, 4, 8, 12... whereas thread two would operate on elements 1, 5, 9, 13... This would seem worse, since each core would be continually getting cache misses, whereas the static block approach means that each core operates on successive values and takes advantage of data locality. Some tests that I've run seem to back this up.
So, can anyone point out a better approach than static block, or confirm that this is most likely the best approach?
Edit:
I'm using Java and Linux (Ubuntu). I'm trying not to think too much about the language/platform involved, and just look at the problem abstractly from a scheduling point of view that involves manually assigning workload to multiple threads. But I understand that the language and platform are important factors.
Edit-2:
Here are some timings (nanoTime / 1000) for varying array sizes (doubles).
Sequential timings used a single java thread.
The others implemented their respective scheduling strategies using all available (4) cores working in parallel.
1,000,000 elements (10 runs each):

Run   Sequential   Static Block   Static Interleaved
1           5765          15857                 9692
2           1642           4571                 4578
3           1565           1489                 3071
4           1485           1529                 7204
5           1444           1547                 5312
6           1511           1496                 2298
7           1446           1445                 4518
8           1448           1415                 2427
9           1465           1452                 1874
10          1443           1661                 1900

50,000,000 elements (10 runs each):

Run   Sequential   Static Block   Static Interleaved
1          73757          62827               179592
2          69280          52705               306106
3          70255          55393               239443
4          78510          53843               145630
5          74520          57408               171871
6          69001          56276               303050
7          69593          56083               233730
8          69586          57366               141827
9          69399          57081               162240
10         69665          57787               292421
Your system doesn't seem to have the memory bandwidth to take advantage of 4 threads on this problem. Doing floating point addition of elements is just not enough work to keep the CPU busy at the rate memory can deliver data... your 4 cores are sharing the same memory controller/DRAM... and are waiting on memory. You probably will see the same speedup if you use 2 threads instead of 4.
Interleaving is a bad idea, as you said and as your timings confirm: why waste precious memory bandwidth bringing data into a core and then use only one fourth of it? If you are lucky and the threads run somewhat in sync, you will get some reuse of data in the Level 2 or Level 3 cache, but you will still bring data into the L1 cache and use only a fraction of it.
Update: when adding 50 million elements, one concern is loss of precision. Log base 2 of 50 million is about 26 bits, and double precision floating point has 53 effective significand bits (52 explicit and 1 implied). The best case is when all elements have similar exponents (similar magnitudes). Things get worse if the numbers in the array span a large range of exponents; in the worst case the range is large and they are sorted in descending order of magnitude. The precision of your final average can be improved by sorting the array and adding in ascending order of magnitude. See this SO question for more insight into the precision problem when adding a large number of items: Find the average within variable number of doubles.
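A rough sketch of that sort-then-sum idea (not the questioner's code; the helper names are made up and the array in main is just a placeholder):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sort by ascending magnitude so small terms are accumulated before
       they can be swamped by large ones. */
    static int by_magnitude(const void *pa, const void *pb) {
        double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
        return (a > b) - (a < b);
    }

    static double careful_average(double *a, size_t n) {
        qsort(a, n, sizeof *a, by_magnitude);
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i)
            sum += a[i];
        return sum / (double)n;
    }

    int main(void) {
        double a[] = { 2.5, 1e12, -3.25, 0.125, 7.0 };   /* widely varying magnitudes */
        printf("average = %g\n", careful_average(a, sizeof a / sizeof a[0]));
        return 0;
    }

Note the trade-off: the sort costs O(n log n) and touches all the memory an extra time, so it only pays off when precision genuinely matters more than throughput.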
I need to allocate memory on the order of 10^15 elements to store integers which can be of long long type.
If I use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can I allocate such a huge amount of memory?
Really large arrays generally aren't a job for memory, more one for disk. 10^15 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8 GB memory modules for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million.
In addition, with upcoming DDR4 clocked at up to about 4 GT/s (giga-transfers per second), even if each transfer were a 64-bit value, it would still take about 250,000 seconds just to initialise that array to zero. Do you really want to be waiting around for roughly three days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
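As a toy illustration of the sparse-array idea (all names and sizes here are made up, and a real implementation would resize and handle deletion), here is a fixed-capacity open-addressing map keyed by the 64-bit index, where unset entries read back as zero:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy sparse "array": only indices that were explicitly set consume memory.
       Open addressing, power-of-two capacity (~24 MB of slots), no deletion, no resizing. */
    #define CAPACITY (1u << 20)

    typedef struct { uint64_t index; long long value; int used; } slot_t;

    static slot_t table[CAPACITY];

    static size_t slot_for(uint64_t index) {
        uint64_t h = index * 0x9E3779B97F4A7C15ull;    /* cheap integer hash */
        size_t i = (size_t)(h & (CAPACITY - 1));
        while (table[i].used && table[i].index != index)
            i = (i + 1) & (CAPACITY - 1);              /* linear probing */
        return i;
    }

    static void sparse_set(uint64_t index, long long value) {
        size_t i = slot_for(index);
        table[i].index = index;
        table[i].value = value;
        table[i].used = 1;
    }

    static long long sparse_get(uint64_t index) {
        size_t i = slot_for(index);
        return table[i].used ? table[i].value : 0;     /* unset entries read as 0 */
    }

    int main(void) {
        sparse_set(999999999999999ull, 42);            /* index near 10^15 */
        printf("%lld %lld\n", sparse_get(999999999999999ull), sparse_get(7));
        return 0;
    }

Memory use is then proportional to the number of indices you actually write, not to 10^15.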
You can't. You don't have that much memory, and you won't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use a library that works with mass-storage data, like stxxl, but it will be a lot slower, and you are still limited by disk size.
MPI is what you need; that's actually a small size for parallel-computing problems. The Blue Gene/Q monster at Lawrence Livermore National Labs holds around 1.5 PB of RAM. You need to use block decomposition to divide up your problem, and voilà!
The basic approach is dividing the array into equal blocks or chunks among many processors, as sketched below.
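Purely as an illustration of block decomposition with MPI (the problem size and the value_at stand-in are arbitrary; no rank ever holds more than its own slice):

    #include <mpi.h>
    #include <stdio.h>

    /* Block decomposition sketch: each rank owns only its slice of the (huge)
       logical array, so no single node ever holds the whole thing. value_at()
       is a stand-in for however the data is actually produced or loaded. */
    static long long value_at(long long i) { return i % 1000; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const long long total = 1000000000LL;        /* scaled-down demo size */
        long long chunk = total / nprocs;
        long long begin = rank * chunk;
        long long end   = (rank == nprocs - 1) ? total : begin + chunk;

        long long local_sum = 0, global_sum = 0;
        for (long long i = begin; i < end; ++i)
            local_sum += value_at(i);

        MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %lld\n", global_sum);
        MPI_Finalize();
        return 0;
    }

Build and run with something like mpicc sum.c && mpirun -np 8 ./a.out; each rank works on its own [begin, end) range and MPI_Reduce combines the partial sums on rank 0.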
You need to upgrade to a 64-bit system, then get a 64-bit-capable compiler, then put an L (or LL) suffix at the end of 100000000000000000.
Have you heard of sparse matrix implementations? With a sparse matrix you only use a very small part of the matrix, despite the matrix being huge.
Here are some libraries for you.
Here is some basic info about sparse matrices. You don't actually use all of it, just the few points that are needed.
I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float with on the order of 10^13 elements. Is it practically possible to do so on a machine with 20 GB of memory?
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 10^13 elements are naïvely going to require 4×10^13 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the values which are anything else; if your data is sufficiently sparse, that'll let you fit everything in. Also be aware that processing 10^13 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 10^4 seconds (several hours) and I'd be willing to bet that in any non-trivial situation you'll not be able to get anything near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
I suppose if you had a 64 bit machine with a lot of swap space, you could just declare an array of size 10^13 and it may work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.
I would like to represent a structure containing 250 M states (1 bit each) in as little memory as possible (100 k maximum). The operations on it are set/get. I could not say whether it's dense or sparse; it may vary.
The language I want to use is C.
I looked at other threads here to find something suitable as well. A probabilistic structure like a Bloom filter, for example, would not fit because of the possible false answers.
Any suggestions please?
If you know your data might be sparse, then you could use run-length encoding. But otherwise, there's no way you can compress it.
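As a rough sketch of what run-length encoding the bit array could look like (the function names are made up; a real encoder would also need variable-length counts and a matching decoder):

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Encode a bit array as alternating run lengths, starting with a run of 0s
       (a zero-length first run is emitted if the data starts with a 1). */
    static size_t rle_encode(const uint8_t *bits, size_t nbits,
                             uint32_t *runs, size_t max_runs) {
        size_t nruns = 0;
        int current = 0;
        uint32_t len = 0;
        for (size_t i = 0; i < nbits; ++i) {
            int b = (bits[i >> 3] >> (i & 7)) & 1;
            if (b == current) {
                ++len;
            } else {
                if (nruns == max_runs) return 0;   /* out of space: data too "random" */
                runs[nruns++] = len;
                current = b;
                len = 1;
            }
        }
        if (nruns == max_runs) return 0;
        runs[nruns++] = len;
        return nruns;
    }

    int main(void) {
        uint8_t bits[4] = { 0x00, 0xFF, 0x0F, 0x00 };   /* 8 zeros, 12 ones, 12 zeros */
        uint32_t runs[16];
        size_t n = rle_encode(bits, 32, runs, 16);
        for (size_t i = 0; i < n; ++i) printf("%u ", runs[i]);
        printf("\n");
        return 0;
    }

For random data the runs average only about 2 bits each, so storing a count per run ends up far larger than the raw bitmap; RLE only helps when the data has long uniform stretches.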
The size of the structure depends on the entropy of the information. You cannot squeeze information into less than a given size if there is no repeated pattern. The worst case would still be about 31 MB of storage in your case. If you know something about the relationship between the bits then it may be possible...
I don't think it's possible to do what you're asking. If you need to cover 250 million states of 1 bit each, you'd need 250Mbits/8 = 31.25MBytes. A far cry from 100KBytes.
You'd typically create a large array of bytes, and use functions to determine the byte (index >> 3) and bit position (index & 0x07) to set/clear/get.
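A minimal sketch of that flat byte-array bitset (note this is the ~31 MB version; it does not meet the 100 k constraint):

    #include <stdint.h>
    #include <stdio.h>

    #define NBITS  250000000u                    /* 250 M states */
    #define NBYTES ((NBITS + 7u) / 8u)           /* ~31.25 MB */

    static uint8_t bits[NBYTES];

    static void bit_set(uint32_t i)   { bits[i >> 3] |=  (uint8_t)(1u << (i & 7u)); }
    static void bit_clear(uint32_t i) { bits[i >> 3] &= (uint8_t)~(1u << (i & 7u)); }
    static int  bit_get(uint32_t i)   { return (bits[i >> 3] >> (i & 7u)) & 1u; }

    int main(void) {
        bit_set(123456789u);
        printf("%d %d\n", bit_get(123456789u), bit_get(0u));
        bit_clear(123456789u);
        printf("%d\n", bit_get(123456789u));
        return 0;
    }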
250M bits will take 31.25 megabytes to store (assuming 8 bits/byte, of course), much much more than your 100k goal.
The only way to beat that is to start taking advantage of some sparseness or pattern in your data.
The max number of bits you can store in 100K of mem is 819,200 bits. This is assuming that 1 K = 1024 bytes, and 1 byte = 8 bits.
Are files possible in your environment? If so, you might swap, for example, 4 K-sized segments of the bit buffer in and out. Your solution should access those bits in a serialized way to minimize disk load/save operations.
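A rough sketch of that idea, keeping a single 4 KiB segment of the bit array in RAM and the rest in a backing file (the file name, page size and helper names are arbitrary choices for the example, and a real version would flush the last dirty page on shutdown):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_BYTES 4096u
    #define PAGE_BITS  (PAGE_BYTES * 8u)

    static FILE    *backing;
    static uint8_t  page[PAGE_BYTES];
    static long     page_no = -1;
    static int      dirty;

    static void load_page(long wanted) {
        if (wanted == page_no) return;
        if (dirty && page_no >= 0) {                 /* write the old page back */
            fseek(backing, page_no * (long)PAGE_BYTES, SEEK_SET);
            fwrite(page, 1, PAGE_BYTES, backing);
        }
        memset(page, 0, PAGE_BYTES);
        fseek(backing, wanted * (long)PAGE_BYTES, SEEK_SET);
        size_t got = fread(page, 1, PAGE_BYTES, backing);
        (void)got;                                   /* short read = fresh page of zeros */
        page_no = wanted;
        dirty = 0;
    }

    static void paged_set(uint32_t i, int v) {
        load_page((long)(i / PAGE_BITS));
        uint32_t off = i % PAGE_BITS;
        if (v) page[off >> 3] |=  (uint8_t)(1u << (off & 7u));
        else   page[off >> 3] &= (uint8_t)~(1u << (off & 7u));
        dirty = 1;
    }

    static int paged_get(uint32_t i) {
        load_page((long)(i / PAGE_BITS));
        uint32_t off = i % PAGE_BITS;
        return (page[off >> 3] >> (off & 7u)) & 1;
    }

    int main(void) {
        backing = fopen("bits.dat", "w+b");          /* hypothetical backing file */
        if (!backing) return 1;
        paged_set(200000000u, 1);
        printf("%d %d\n", paged_get(200000000u), paged_get(5u));
        fclose(backing);
        return 0;
    }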
Is there a historical reason or something? I've seen quite a few times things like char foo[256]; or #define BUF_SIZE 1024. Even though I mostly use 2^n-sized buffers myself, mainly because I think it looks more elegant and that way I don't have to think of a specific number, I'm not quite sure if that's the reason most people use them. More information would be appreciated.
There may be a number of reasons, although many people will, as you say, just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8-bit microcontrollers). By using a 2^n buffer in this case, the modulo is simply a matter of bit-masking the upper bits, or, in the case of say a 256-byte buffer, simply using an 8-bit index and letting it wrap around.
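For instance, a 256-byte ring buffer can rely on an 8-bit index wrapping around by itself (a minimal sketch; no full/empty bookkeeping is shown):

    #include <stdint.h>
    #include <stdio.h>

    /* 256-byte ring buffer: an 8-bit index wraps from 255 back to 0 for free,
       so no '%' (and no divide) is ever needed. */
    static uint8_t ring[256];
    static uint8_t head, tail;

    static void ring_put(uint8_t byte) {
        ring[head++] = byte;              /* head++ wraps at 256 by itself */
    }

    static uint8_t ring_get(void) {
        return ring[tail++];              /* caller must ensure data is available */
    }

    int main(void) {
        ring_put(11); ring_put(22); ring_put(33);
        printf("%d\n", ring_get());       /* 11 */
        printf("%d\n", ring_get());       /* 22 */
        printf("%d\n", ring_get());       /* 33 */
        return 0;
    }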
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some power of 2 in size (often 32 or 64 bytes). Data that is an integral multiple of that number can fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance, so I think people who design their structures that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise AND (&) rather than with a much slower divide-class instruction implementing the % operator.
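A tiny sketch of the equivalence (the table size here is arbitrary, but it must be a power of two for the mask form to be correct):

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 1024u                  /* must be a power of two for the mask trick */

    /* For power-of-two sizes these two are equivalent, but the mask avoids a divide. */
    static uint32_t slow_index(uint32_t hash) { return hash % TABLE_SIZE; }
    static uint32_t fast_index(uint32_t hash) { return hash & (TABLE_SIZE - 1u); }

    int main(void) {
        for (uint32_t h = 0; h < 5000; h += 777)
            printf("%u: %u %u\n", h, slow_index(h), fast_index(h));
        return 0;
    }

With a compile-time-constant power-of-two size a modern compiler performs this strength reduction itself; the trick matters most when the size is a run-time value that is known to be a power of two.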
Looking at an old Intel i386 book, and takes 2 cycles and div takes 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocations available directly from the operating system were (and still are) a specific number of pages, so a power of two would be likely to make the most use of that allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
1. 2^n is a very common value in computing. It is directly related to the way bits are represented (2 possible values), which means variables tend to have ranges of values whose boundaries are powers of 2.

2. Because of the point above, you'll often find the value 256 as the size of a buffer. A byte can hold 256 distinct values, so if you want to store a string together with its size, storing it as SIZE_BYTE+ARRAY, where the size byte tells you the size of the array, is most efficient. The array can then be any size from 0 to 255.

3. Many other times, sizes are chosen based on physical things (for example, the amount of memory an operating system can address is related to the size of the CPU's registers, etc.) and these are also going to be a specific number of bits. That is, the amount of memory you can use will usually be some value of 2^n (for a 32-bit system, 2^32).

4. There might be performance benefits/alignment issues for such values. Most processors can access a certain number of bytes at a time, so even if you have a variable whose size is (let's say) 20 bits, a 32-bit processor will still read 32 bits, no matter what. So it's often more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain number of bytes (because they can't read memory from, for example, odd addresses). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read: low cost) of base-2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base-2 arithmetic. Buses (control & data) used to access memory structures are often aligned on powers of 2. The cost of implementing logic in electronics (e.g. in a CPU) makes arithmetic in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X are a direct consequence of the attributes of the layers sitting below it, i.e. layers < X. The reason I am stating this stems from some comments I received regarding my post.
E.g. the properties that can be manipulated at the "compiler" level are inherited and derived from the properties of the system below it, i.e. the electronics in the CPU.
I was going to use the shift argument, but couldn't think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index;                // increment the index.
index &= (BUFSIZE - 1); // Make sure it stays in the buffer; works because BUFSIZE is a power of two.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It makes a nice, round number in base 2, just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it weren't a power of 2 (or something close, such as 96 = 64 + 32 or 192 = 128 + 64), then you might wonder why there's the added precision. A size that isn't rounded to base 2 can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collisions in a certain way. In general, when there is a key collision, you either make a substructure, e.g. a list, of all entries with the same hash value, or you find another free slot. You could just add 1 to the slot index until you find a free slot, but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n, h2) = 1; then add h2 to the slot index until you find a free slot (with wrap-around). If n is a power of 2, finding an h2 that fulfills gcd(n, h2) = 1 is easy: every odd number will do.
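A small sketch of that probing scheme (the two hash functions and the table size are arbitrary choices for the example; forcing the step to be odd is what guarantees gcd(n, h2) = 1):

    #include <stdint.h>
    #include <stdio.h>

    #define N 16u                                /* table size, a power of two */

    static int      used[N];
    static uint32_t keys[N];

    /* Double hashing: the step h2 is forced odd, so gcd(N, h2) = 1 and the
       probe sequence visits every slot before repeating. */
    static uint32_t h1(uint32_t k) { return k * 2654435761u; }
    static uint32_t h2(uint32_t k) { return (k * 40503u) | 1u; }   /* always odd */

    static int insert(uint32_t k) {
        uint32_t idx = h1(k) & (N - 1u), step = h2(k);
        for (uint32_t tries = 0; tries < N; ++tries) {
            if (!used[idx]) { used[idx] = 1; keys[idx] = k; return 1; }
            idx = (idx + step) & (N - 1u);       /* wrap around with the mask */
        }
        return 0;                                /* table full */
    }

    int main(void) {
        for (uint32_t k = 1; k <= 10; ++k)
            printf("insert %u -> %d\n", k, insert(k));
        return 0;
    }

Because the step is odd and the table size is a power of two, the probe sequence visits every slot exactly once before repeating.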