I know there is an upper limit on array size. I want to understand what determines the maximum size of an array. Can someone give a simple, detailed explanation?
It depends on the language, the library, and the operating system.
For an in-memory array which is what most languages offer by default, the upper limit is the address space given to the process. For Windows this is 2 or 3 GB for 32-bit applications, and for 64-bit applications it is the smaller of 8 TB and (physical RAM + page file size limit).
For a custom library using disk space to (very slowly) access an array, the limit will probably be the size of the largest storage volume. With RAID and 10+ TB drives that could be a very large number.
Once you know the memory limit for an array, the upper limit on the number of elements is (memory / element size). If the element is small, the actual limit will often be lower than that, since the array indexing might use 32-bit unsigned integers, which can only address about 4 billion (2^32) elements.
This is for the simple contiguous, typed arrays offered by languages like C++. Languages like PHP, where you can write a['hello'] = 'bobcat'; a[12] = 3.14;, are really more like maps and can use much more memory per element, since they store the key (index) along with each value.
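As a rough illustration of the (memory / element size) rule and the 32-bit index cap, here is a small C sketch; the 2 GiB of usable address space and the 8-byte element size are assumed numbers for the example, not figures from the answer above.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* assumed for illustration: 2 GiB of usable address space,
       8-byte elements, 32-bit unsigned indexing */
    uint64_t memory_bytes = 2ULL * 1024 * 1024 * 1024;
    uint64_t element_size = 8;

    uint64_t by_memory = memory_bytes / element_size;  /* 268,435,456 elements */
    uint64_t by_index  = UINT32_MAX;                   /* about 4.29 billion elements */

    uint64_t max_elements = by_memory < by_index ? by_memory : by_index;
    printf("upper limit: %llu elements\n", (unsigned long long)max_elements);
    return 0;
}

With small elements (say 1 byte each), the 32-bit index cap would be the binding limit instead of the memory figure.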
In Java and other JVM-based languages there's often a need to serialize things into an array of bytes. For example, when you want to send something over the network, you first serialize it into an array of bytes.
Do people do that in Swift? Or how is data usually serialized when you want to send it over the network?
The problem is that byte[] and other arrays are indexed using ints, and when you create an array you also use an int, for example: byte[] a = new byte[your int goes here]. You can't create an array using a long (a 64-bit integer), therefore your maximum array size is limited by the maximum int value, which is 2,147,483,647 (in reality the maximum array size is a few elements lower than that, depending on the JVM), so the biggest array of type byte[] can only store about 2 GB of data.
When I use Spark (a distributed computing library) and create a Dataset, each element has to occupy no more than 2 GB of RAM, because internally it gets serialized when data is sent from one node of the cluster to another, and when I am working with huge objects I am forced to split one big object into smaller ones to avoid serialization exceptions.
I think C# and many other languages have the same issue. Is the 32-bit .NET max byte array size < 2GB?
Am I right when saying that Swift doesn't suffer from this issue, since arrays are indexed using Int (which is 64 bits on a 64 bit system), and byte arrays can be of size min(Int.max, maximum_number_available_by_the_os)?
Yes, you are correct. Swift's Int type, the preferred type for integer bounds, is word-sized, i.e., 64 bits on a 64-bit machine and 32 bits on a 32-bit machine. This means that when indexing into an Array, you can go well beyond the 2^31 - 1 limit.
And while idiomatically, higher-level types like Foundation's Data or NIOCore.ByteBuffer from swift-nio are typically preferred as "bag of bytes" types over [UInt8] (the Swift equivalent to byte[]), these types are also indexed using Int, and so are also not limited to 2GiB in size.
I need to allocate memory on the order of 10^15 elements to store integers of long long type.
If I use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can I allocate such a huge amount of memory?
Really large arrays generally aren't a job for memory, more one for disk. 10^15 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8 GB memory sticks for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million.
In addition, with upcoming DDR4 being clocked at about 4 GT/s (giga-transfers per second), even if every transfer carried a full 64-bit value, it would still take 10^15 / (4 × 10^9) = 250,000 seconds, roughly three days, just to initialise that array to zero. Do you really want to be waiting around for days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000, and you'd possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
You can't. You don't have that much memory, and you won't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use a library that works with data on mass storage, like stxxl, but it will be a lot slower, and you will still be limited by the disk size.
MPI is what you need. That's actually a small size for parallel computing problems: the Blue Gene/Q monster at Lawrence Livermore National Labs holds around 1.5 PB of RAM. You need to use block decomposition to divide up your problem, and voilà!
The basic approach is dividing the array into equal blocks or chunks among many processors, as in the sketch below.
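Here is a minimal, hypothetical sketch of 1-D block decomposition with MPI in C; the global element count is a placeholder (the question's 10^15 would need roughly 8 PB of aggregate RAM across the cluster), and each rank only allocates its own slice.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long total = 1000000000LL;  /* placeholder global element count */

    /* split [0, total) into nprocs nearly equal blocks */
    long long base  = total / nprocs;
    long long extra = total % nprocs;
    long long lo    = rank * base + (rank < extra ? rank : extra);
    long long count = base + (rank < extra ? 1 : 0);

    /* each rank allocates only its own block */
    long long *block = malloc(count * sizeof *block);
    if (!block) MPI_Abort(MPI_COMM_WORLD, 1);

    for (long long i = 0; i < count; i++)
        block[i] = lo + i;                 /* placeholder "work" */

    printf("rank %d owns elements [%lld, %lld)\n", rank, lo, lo + count);

    free(block);
    MPI_Finalize();
    return 0;
}

Each rank then works only on its own [lo, lo + count) range and exchanges whatever boundary data it needs with its neighbours.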
You need to upgrade to a 64-bit system, then get a 64-bit-capable compiler, then put an LL at the end of 1000000000000000.
Have you heard of sparse matrix implementations? With a sparse matrix you only store the small part of the matrix that is actually used, even though the matrix itself is huge.
Here are some libraries for you.
Here is some basic info about sparse matrices. You don't actually use all of the matrix, just the few points you need.
Good day everyone,
I'm new to C programming and I don't have a lot of knowledge about how to handle a very huge matrix in C, e.g. a matrix of size 30,000 x 30,000.
My first approach is to allocate the memory dynamically:
#include <stdlib.h>

#define R 30000   /* rows    (30,000 x 30,000 example) */
#define P 30000   /* columns (30,000 x 30,000 example) */

int main()
{
    int **mat;
    int j;
    mat = (int **)malloc(R * sizeof(int *));   /* one pointer per row */
    for (j = 0; j < R; j++)
        mat[j] = (int *)malloc(P * sizeof(int));
}
And this is good enough to handle a matrix of roughly 8,000 x 8,000, but not bigger. So I want to ask for any light on how to handle this kind of huge matrix, please.
As I said before: I am new to C, so please don't expect too much experience.
Thanks in advance for any suggestion,
David Alejandro.
PS: My laptop configuration is Linux Ubuntu, 64-bit, i7, and 4 GB of RAM.
For a matrix as large as that, I would try to avoid all those calls to malloc. This will reduce the time to set up the data structure and remove the memory overhead of dynamic allocation (malloc stores additional information, such as the size of the chunk).
Just use malloc once, i.e.:
#include <stdlib.h>
int *matrix = malloc(R * P * sizeof(int));
Then compute the index as
index = column + row * P;
Also access the memory sequentially, i.e. with the column index varying fastest in your loops; that gives better performance from the cache. A minimal sketch putting these pieces together is below.
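This sketch assumes the 30,000 x 30,000 dimensions from the question and does nothing but zero the matrix; the names are illustrative only.

#include <stdio.h>
#include <stdlib.h>

#define R 30000   /* rows */
#define P 30000   /* columns */

int main(void)
{
    int *matrix = malloc((size_t)R * P * sizeof(int));   /* one ~3.6 GB block */
    if (matrix == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* column index varies fastest, so memory is touched sequentially */
    for (int row = 0; row < R; row++)
        for (int col = 0; col < P; col++)
            matrix[col + (size_t)row * P] = 0;

    /* element (row, col) lives at matrix[col + row * P] */
    free(matrix);
    return 0;
}

Note that on the 4 GB laptop from the question this single 3.6 GB allocation may still fail or swap heavily, which is why the sparse and on-disk approaches in the other answers are worth considering.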
Well, a two-dimensional array (roughly analogous C representation of a matrix) of 30000 * 30000 ints, assuming 4 bytes per int, would occupy 3.6 * 10^9 bytes, or ~3.35 gigabytes. No conventional system is going to allow you to allocate that much static virtual memory at compile time, and I'm not certain you could successfully allocate it dynamically with malloc() either. If you only need to represent a small numerical range, then you could drastically (i.e., by a factor of 4) reduce your program's memory consumption by using char. If you need to do something like, e.g., assign boolean values to specific numbers corresponding to the indices of the array, you could perhaps use bitsets and further curtail your memory consumption (by a factor of 32). Otherwise, the only viable approach would involve working with smaller subsets of the matrix, possibly saving intermediate results to disk if necessary.
If you could elaborate on how you intend to use these massive matrices, we might be able to offer some more specific advice.
Assuming you are declaring your values as float rather than double, your array will be about 3.4 GB in size. As long as you only need one, and you have virtual memory on your Ubuntu system, I think you could just code this in the obvious way.
If you need multiple matrices this large, you might want to think about:
Putting a lot more RAM into your computer.
Renting time on a computing cluster, and using cluster-based processing to compute the values you need.
Rewriting your code to work on subsets of your data, and write each subset out to disk and free the memory before reading in the next subset.
You might want to do a Google search for "processing large data sets"
I don't know how to add comments, so I'm dropping an answer here.
One thing I can think of: you are not going to type those values into the running program; they will come from some file. So instead of taking all the values at once, keep reading the 30,000 x 2 values one by one so they never all have to come into memory.
For a 30k x 30k matrix, if the initial value is 0 (or the same) for all elements, what you can do is: instead of creating the whole matrix, create a matrix of 60k x 3 (the 3 columns being: row number, column number, and value). This is because you will have at most 60k different locations that are actually affected.
I know this is going to be a little slow, because you always need to check whether the element has already been added or not. So, if speed is not your concern, this will work; a rough sketch is below.
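A minimal sketch of that row/column/value idea in C; the MAX_ENTRIES capacity and the linear search are assumptions for illustration, not part of the answer above.

#include <stdio.h>

#define MAX_ENTRIES 60000   /* assumed upper bound on locations actually set */

struct entry { int row, col; long long value; };

static struct entry entries[MAX_ENTRIES];
static int nentries = 0;

/* return the stored value, or 0 if (row, col) was never set */
long long get(int row, int col)
{
    for (int i = 0; i < nentries; i++)
        if (entries[i].row == row && entries[i].col == col)
            return entries[i].value;
    return 0;
}

/* overwrite an existing entry, or append a new one */
void set(int row, int col, long long value)
{
    for (int i = 0; i < nentries; i++) {
        if (entries[i].row == row && entries[i].col == col) {
            entries[i].value = value;
            return;
        }
    }
    if (nentries < MAX_ENTRIES)
        entries[nentries++] = (struct entry){ row, col, value };
}

int main(void)
{
    set(29999, 15, 42);
    printf("%lld\n", get(29999, 15));   /* prints 42 */
    printf("%lld\n", get(0, 0));        /* prints 0, the default */
    return 0;
}

Keeping the entries sorted, or using a hash table, would remove the linear search if the lookups turn out to be too slow.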
Numerous sources I have found suggest that the size of arrays in VBA code depends upon the amount of memory in the machine. That, however, hasn't been the case for me. I'm running the following very simple code to test:
Sub test6()
Dim arr(500, 500, 500) As Boolean
End Sub
However, if I change the size to 600 x 600 x 600, I get an out-of-memory error. The machine I'm using has 16 GB of RAM, so I doubt that physical RAM is the issue.
I'm using Excel 2007. Is there a trick to getting VBA to use more RAM?
It would be nice if there was an Application.UseMoreMemory() function that we could just call :-)
Alas, I know of none.
All the docs I've seen say that it's limited by memory, but it's not physical memory that's the issue, it's the virtual address space you have available to you.
You should keep in mind that, while the increase from 500 to 600 only looks like a moderate increase (though 20% is large enough on its own), because you're doing that in three dimensions, it works out to be close to double the storage requirements.
From memory, Excel 2007 used short integers (16 bits) for the Boolean type so, at a minimum, your 500^3 array will take up about 250 MB (500 x 500 x 500 x 2 bytes).
Increasing all dimensions to 600 would give you 600 x 600 x 600 x 2, or about 432 MB.
All well within the 2 GB usable address space that you probably have on a 32-bit machine (I don't know whether Excel 2007 had a 64-bit version), but these things are not small, and you have to share that address space with other things as well.
It'd be interesting to see at what point you started getting the errors.
As a first step, I'd be looking into the need for such a large array. It may be doable a different way, such as partitioning the array so that only part of it is in memory at any one time (sort of manual virtual memory).
That's unlikely to perform that well for truly random access but shouldn't be too bad for more sequential access and will at least get you going (a slow solution is preferable to a non-working one).
Another possibility is to abstract away the bit handling so that your booleans are actually stored as bits rather than words.
You would have to provide functions for getBool and setBool, using bitmask operators on an array of words and, again, the performance wouldn't be that crash-hot, but you would at least be able to then go up to the equivalent of:
' Bits instead of 16-bit words store 16 times as many flags in the same memory, '
' roughly the equivalent of: '
Dim arr(1500, 1500, 1500) As Boolean
As always, it depends on what you need the array for, and its usage patterns.
Having run some tests it looks like there is a limit of about 500MB for 32-bit VBA and about 4GB for 64-bit VBA (Excel 2010-64). I don't know if these VBA limits decrease if you are using a lot of workbook/pivot memory within the same Excel instance.
As #paxdiablo mentioned, the array is about 400+ MB in size, with a theoretical maximum of 2 GB for 32-bit Excel. Most probably VBA macros are limited in how much memory they can use. Also, the memory block for an array must be a contiguous block of memory, which makes allocating it even harder. So it's possible that you can allocate ten arrays of 40 MB each, but not one of 400 MB. Check it.
I would like to represent a structure containing 250M states (1 bit each) in as little memory as possible (100 KB maximum). The operations on it are set/get. I could not say whether it is dense or sparse; it may vary.
The language I want to use is C.
I looked at other threads here to find something suitable. A probabilistic structure like a Bloom filter, for example, would not fit because of the possible false positives.
Any suggestions please?
If you know your data might be sparse, then you could use run-length encoding. But otherwise, there's no way you can compress it.
The size of the structure depends on the entropy of the information. You cannot squeeze information into less than a given size if it has no repeated pattern. The worst case would still be about 32 MB of storage in your case. If you know something about the relation between the bits then maybe it's possible...
I don't think it's possible to do what you're asking. If you need to cover 250 million states of 1 bit each, you'd need 250 Mbit / 8 = 31.25 MB. A far cry from 100 KB.
You'd typically create a large array of bytes, and use functions to determine the byte (index >> 3) and bit position (index & 0x07) to set/clear/get, as in the sketch below.
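A minimal sketch of that byte-array approach in C; the index used in main is arbitrary, and of course the table itself is 31.25 MB, not 100 KB, so this only illustrates the set/clear/get mechanics, not a way around the size limit.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_STATES 250000000ULL              /* 250M one-bit states */
#define NUM_BYTES  ((NUM_STATES + 7) / 8)    /* about 31.25 MB of storage */

static uint8_t *bits;

void set_bit(uint64_t index, int value)
{
    if (value)
        bits[index >> 3] |=  (uint8_t)(1u << (index & 0x07));
    else
        bits[index >> 3] &= (uint8_t)~(1u << (index & 0x07));
}

int get_bit(uint64_t index)
{
    return (bits[index >> 3] >> (index & 0x07)) & 1;
}

int main(void)
{
    bits = calloc(NUM_BYTES, 1);   /* all states start cleared */
    if (!bits) return 1;

    set_bit(123456789, 1);
    printf("%d %d\n", get_bit(123456789), get_bit(42));   /* prints 1 0 */

    free(bits);
    return 0;
}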
250M bits will take 31.25 megabytes to store (assuming 8 bits/byte, of course), much much more than your 100k goal.
The only way to beat that is to start taking advantage of some sparseness or pattern in your data.
The max number of bits you can store in 100K of mem is 819,200 bits. This is assuming that 1 K = 1024 bytes, and 1 byte = 8 bits.
Are files possible in your environment? If so, you could swap segments of the bit buffer to and from disk, say in 4 KB chunks. Your solution should access those bits in a serialized (sequential) way to minimize disk load/save operations. A rough sketch of the idea is below.
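Here is a rough, hypothetical sketch of that segmented, file-backed approach in C; the backing file name, the single-segment cache, and the 4 KB segment size are all assumptions for illustration.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEGMENT_SIZE 4096ULL                           /* bytes per on-disk segment */
#define SEGMENT_BITS (SEGMENT_SIZE * 8)

static FILE    *backing;                               /* file holding all segments */
static uint8_t  segment[SEGMENT_SIZE];                 /* the one segment kept in RAM */
static uint64_t current = UINT64_MAX;                  /* index of the loaded segment */
static int      dirty   = 0;

static void flush_segment(void)
{
    if (dirty && current != UINT64_MAX) {
        fseek(backing, (long)(current * SEGMENT_SIZE), SEEK_SET);
        fwrite(segment, 1, SEGMENT_SIZE, backing);
        dirty = 0;
    }
}

static void load_segment(uint64_t seg)
{
    if (seg == current) return;                        /* already cached */
    flush_segment();                                   /* write back the old segment */
    fseek(backing, (long)(seg * SEGMENT_SIZE), SEEK_SET);
    if (fread(segment, 1, SEGMENT_SIZE, backing) != SEGMENT_SIZE)
        memset(segment, 0, SEGMENT_SIZE);              /* segment never written yet */
    current = seg;
}

void set_bit(uint64_t index, int value)
{
    load_segment(index / SEGMENT_BITS);
    uint64_t bit = index % SEGMENT_BITS;
    if (value) segment[bit >> 3] |=  (uint8_t)(1u << (bit & 7));
    else       segment[bit >> 3] &= (uint8_t)~(1u << (bit & 7));
    dirty = 1;
}

int get_bit(uint64_t index)
{
    load_segment(index / SEGMENT_BITS);
    uint64_t bit = index % SEGMENT_BITS;
    return (segment[bit >> 3] >> (bit & 7)) & 1;
}

int main(void)
{
    backing = fopen("states.bin", "w+b");              /* hypothetical backing file */
    if (!backing) return 1;

    set_bit(200000000, 1);
    printf("%d %d\n", get_bit(200000000), get_bit(7)); /* prints 1 0 */

    flush_segment();
    fclose(backing);
    return 0;
}

Only one 4 KB segment is ever held in memory, so the RAM use stays tiny, but access patterns that jump between segments will be dominated by the load/save traffic, which is why serialized access matters.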