I would like to represent a structure containing 250 M states(1 bit each) somehow into as less memory as possible (100 k maximum). The operations on it are set/get. I cold not say that it's dense or sparse, it may vary.
The language I want to use is C.
I looked at other threads here to find something suitable also. A probabilistic structure like Bloom filter for example would not fit because of the possible false answers.
Any suggestions please?
If you know your data might be sparse, then you could use run-length encoding. But otherwise, there's no way you can compress it.
The size of the structure depends on the entropy of the information. You cannot squeeze information something in less than a given size if you have no repeated pattern. The worst case would still be about 32Mb of storage in your case. If you know something about the relation between the bits then it's maybe possible...
I don't think it's possible to do what you're asking. If you need to cover 250 million states of 1 bit each, you'd need 250Mbits/8 = 31.25MBytes. A far cry from 100KBytes.
You'd typically create a large array of bytes, and use functions to determine the byte (index >> 3) and bit position (index & 0x07) to set/clear/get.
250M bits will take 31.25 megabytes to store (assuming 8 bits/byte, of course), much much more than your 100k goal.
The only way to beat that is to start taking advantage of some sparseness or pattern in your data.
The max number of bits you can store in 100K of mem is 819,200 bits. This is assuming that 1 K = 1024 bytes, and 1 byte = 8 bits.
are files possible in your environment ?
if so, you might swap, say for example 4k sized segmented bit buffer.
your solution shoud access those bits in a serialized way to
minimize disk load/save operation.
Related
I need to allocate memory of order of 10^15 to store integers which can be of long long type.
If i use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can i allocate such a huge amount of memory.
Really large arrays generally aren't a job for memory, more one for disk. 1015 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8G memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million dollars.
In addition, with upcoming DDR4 being clocked up to about 4GT/s (giga-transfers), even if each transfer was a 64-bit value, it would still take about one million seconds just to initialise that array to zero. Do you really want to be waiting around for eleven and a half days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
You can't. You don't have all this memory, and you'll don't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use some library that work with mass storage data, like stxxl, but it will work a lot slower, and you have always disk size limits.
MPI is what you need, that's actually a small size for parallel computing problems the blue gene Q monster at Lawerence Livermore National Labs holds around 1.5 PB of ram. you need to use block decomposition to divide up your problem and viola!
the basic approach is dividing up the array into equal blocks or chunks among many processors
You need to uppgrade to a 64-bit system. Then get 64-bit-capable compiler then put a l at the end of 100000000000000000.
Have you heard of sparse matrix implementation? In one of the sparse matrices, you just use very little part of the matrix despite of the matrix being huge.
Here are some libraries for you.
Here is a basic info about sparse-matrices You dont actually use all of it. Just the needed few points.
I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float with an order of 1013 elements. Is it practically possible to do so on a machine with 20GB memory.
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 1013 elements are naïvely going to require 4×1013 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the values which are anything else; if your data is sufficiently sparse, that'll let you fit everything in. Also be aware that processing 1013 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 104 seconds (several hours) and I'd be willing to bet that in any non-trivial situation you'll not be able to get anything near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
I suppose if you had a 64 bit machine with a lot of swap space, you could just declare an array of size 10^13 and it may work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.
Can anybody give pointers how I can implement lzw compression/decompression in low memory conditions (< 2k). is that possible?
The zlib library that everyone uses is bloated among other problems (for embedded). I am pretty sure it wont work for your case. I had a little more memory maybe 16K and couldnt get it to fit. It allocates and zeros large chunks of memory and keeps copies of stuff, etc. The algorithm can maybe do it but finding existing code is the challenge.
I went with http://lzfx.googlecode.com The decompression loop is tiny, it is the older lz type compression that relies on the prior results so you need to have access to the uncompressed results...The next byte is a 0x5, the next byte is a 0x23, the next 15 bytes are a copy of the 15 200 bytes ago, the next 6 bytes are a copy of 127 ago...the newer lz algorithm is variable width table based that can be big or grow depending on how implemented.
I was dealing with repetitive data and trying to squeeze a few K down into a few hundred, I think the compression was about 50%, not great but did the job and the decompression routine was tiny. The lzfx package above is small, not like zlib, like two main functions that have the code right there, not dozens of files. You could likely change the depth of the buffer, perhaps improve the compression algorithm if you so desire. I did have to modify the decompression code (like 20 or 30 lines of code perhaps) it was pointer heavy and I switched it to arrays because in my embedded environment the pointers were in the wrong place. Burns maybe an extra register or not depending on how you implement it and your compiler. I also did that so I could abstract the fetches and the stores of the bytes as I had them packed into memory that wasnt byte addressable.
If you find something better please post it here or ping me through stackoverflow, I am also very interested in other embedded solutions. I searched quite a bit and the above was the only useful one I found and I was lucky that my data was such that it compressed well enough using that algorithm...for now.
Can anybody give pointers how I can implement lzw compression/decompression in low memory conditions (< 2k). is that possible?
Why LZW? LZW needs lots of memory. It is based on a hash/dictionary and compression ratio is proportional to the hash/dictionary size. More memory - better compression. Less memory - output can be even larger than input.
I haven't touched encoding for very long time, but IIRC Huffman coding is little bit better when it comes to memory consumption.
But it all depends on type of information you want to compress.
I have used LZSS. I used code from Haruhiko Okumura as base. It uses the last portion of uncompressed data(2K) as dictionary. The code I linked can be modified to use almost no memory if you have all the uncompressed data available in memory. With a bit of googling you will find that a lot of different implementations.
If the choice of compression algorithm isn't set in stone, you might try gzip/LZ77 instead. Here's a very simple implementation I used and adapted once:
ftp://quatramaran.ens.fr/pub/madore/misc/myunzip.c
You'll need to clean up the way it reads input, error handling, etc. but it's a good start. It's probably also way too big if your data AND code need to fit in 2k, but at least the data size is small already.
Big plus is that it's public domain so you can use it however you like!
It has been over 15 years since I last played with the LZW compression algorithm, so take the following with a grain of salt.
Given the memory constraints, this is going to be difficult at best. The dictionary you build is going to consume the vast majority of what you have available. (Assuming that code + memory <= 2k.)
Pick a small fixed size for your dictionary. Say 1024 entries.
Let each dictionary entry take the form of ....
struct entry {
intType prevIdx;
charType newChar;
};
This structure makes the dictionary recursive. You need the item at the previous index to be valid in order for it to work properly. Is this workable? I'm not sure. However, let us assume for the moment that it is and find out where it leads us ....
If you use the standard types for int and char, you are going to run out of memory fast. You will want to pack things together as tightly as possible. 1024 entries will take 10 bits to store. Your new character, will likely take 8 bits. Total = 18 bits.
18 bits * 1024 entries = 18432 bits or 2304 bytes.
At first glance this appears too large. What do we do? Take advantage of the fact that the first 256 entries are already known--your typical extended ascii set or what have you. This means we really need 768 entries.
768 * 18 bits = 13824 bits or 1728 bytes.
This leaves you with about 320 bytes to play with for code. Naturally, you can play around with the dictionary size and see what's good for you, but you will not end up with very much space for your code. Since you are looking at so little code space, I would expect that you would end up coding in assembly.
I hope this helps.
My best recommendation is to examine the BusyBox source and see if their LZW implementation is sufficiently small to work in your environment.
The lowest dictionary for lzw is trie on linked list. See original implementation in LZW AB. I've rewrited it in fork LZWS. Fork is compatible with compress. Detailed documentation here.
n bit dictionary requires (2 ** n) * sizeof(code) + ((2 ** n) - 257) * sizeof(code) + (2 ** n) - 257.
So:
9 bit code - 1789 bytes.
12 bit code - 19709 bytes.
16 bit code - 326909 bytes.
Please be aware that it is a requirements for dictionary. You need to have about 100-150 bytes for state or variables in stack.
Decompressor will use less memory than compressor.
So I think that you can try to compress your data with 9 bit version. But it won't provide good compression ratio. More bits you have - ratio is better.
typedef unsigned int UINT;
typedef unsigned char BYTE;
BYTE *lzw_encode(BYTE *input ,BYTE *output, long filesize, long &totalsize);
BYTE *lzw_decode(BYTE *input ,BYTE *output, long filesize, long &totalsize);
I have a bit array that can be very dense in some parts and very sparse in others. The array can get as large as 2**32 bits. I am turning it into a bunch of tuples containing offset and length to make it more efficient to deal with in memory. However, this sometimes is less efficient with things like 10101010100011. Any ideas on a good way of storing this in memory?
If I understand correctly, you're using tuples of (offset, length) to represent runs of 1 bits? If so, a better approach would be to use runs of packed bitfields. For dense areas, you get a nice efficient array, and in non-dense areas you get implied zeros. For example, in C++, the representation might look like:
// The map key is the offset; the vector's length gives you the length
std::map<unsigned int, std::vector<uint32_t> >
A lookup would consist of finding the key before the bit position in question, and seeing if the bit falls in its vector. If it does, use the value from the vector. Otherwise, return 0. For example:
typedef std::map<unsigned int, std::vector<uint32_t> > bitmap; // for convenience
typedef std::vector<uint32_t> bitfield; // also convenience
bool get_bit(const bitmap &bm, unsigned int idx) {
unsigned int offset = idx / 32;
bitmap::const_iterator it = bm.upper_bound(offset);
// bm is the element /after/ the one we want
if (it == bm.begin()) {
// but it's the first, so we don't have the target element
return false;
}
it--;
// make offset be relative to this element start
offset -= it.first;
// does our bit fall within this element?
if (offset >= it.second.size())
return false; // nope
unsigned long bf = it.second[offset];
// extract the bit of interest
return (bf & (1 << (offset % 32))) != 0;
}
It would help to know more. By "very sparse/dense," do you mean millions of consecutive zeroes/ones, or do you mean local (how local?) proportions of 0's very close to 0 or 1? Does one or the other value predominate? Are there any patterns that might make run-length encoding effective? How will you use this data structure? (Random access? What kind of distribution of accessed indexes? Are huge chunks never or very rarely accessed?)
I can only guess you aren't going to be randomly accessing and modifying all 4 billion bits at rates of billions of bits/second. Unless it is phenomenally sparse/dense on a local level (such as any million consecutive bits are likely to be the same except for 5 or 10 bits) or full of large scale repetition or patterns, my hunch is that the choice of data structure depends more on how the array is used than on the nature of the data.
How to structure things will be dependent on what is your data. For trying to represent large amounts of data, you will need to have long runs of zeros or ones. This would eliminate the need to have it respresented. If this is not the case and you have approxiately the same amount of one's and zeros, you would be better off with all of the memory.
It might help to think of this as a compression problem. For compression to be effective there has to be a pattern (or a limit set of items used out of an entire space) and an uneven distribution in order for compression to work. If all the elements are used and evenly distributed, compression is hard to do, or could take more space then the actual data.
If there are only runs of zero and ones, (more then just one), using offset and length might make some sense. If there is inconsistent runs, you could just copy the bits as a bit array where you have offset, length, and values.
How efficient the above is will depend upon if you have a large runs of ones or zeros. You will want to be careful to make sure you are not using more memory to reperesent your memory, then just using memory itself, (i.e. your are using more memory to represent the memory then just placing it into memory).
Check out bison source code. Look at biset implementation. It provides several flavors of implementations to deal with bit arrays with different densities.
How many of these do you intend to keep in memory at once?
As far as I can see, 2**32 bits = 512M, only half a gig, which isn't very much memory nowadays. Do you have anything better to do with it?
Assuming your server has enough ram, allocate it all at startup, then keep it in memory, the network handling thread can execute in just a few instructions in constant time - it should be able to keep up with any workload.
Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.
There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?
Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.
Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.
Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.
I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.
Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.
I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.
It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.
It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.
In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.