Given a 2D array sparsely filled with objects containing position and size data (X, Y, W, H), representing objects in 2D space, that has been transformed into a 1D array with the excess data (any empty spaces) removed: is it possible to index into the one-dimensional array, given a starting point (X, Y) corresponding to a location in the original array, using some minimal set of metadata acquired from the original array?
Essentially, I'm searching for a way to represent large sets of 2D spatial data without consuming an excess of memory and computing power (a 16,000 by 16,000 array would contain 256,000,000 objects), so getting rid of potentially hundreds of millions of "empty spaces" would represent a large performance gain in both memory usage and cycles wasted looping over empty space. Computations are made in real time, so keeping the data in memory is preferred.
Linked is an example of what I'm trying to accomplish. Hope it makes sense.
example
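For illustration, one scheme matching this description (a rough Python sketch with made-up names, not from the question itself; it is essentially the CSR, compressed sparse row, layout) keeps one offset per row plus the column of each kept cell:

# Sketch: pack a sparse 2D grid (None = empty) into a 1D array plus
# per-row offsets and per-cell columns (CSR-style metadata).
def compress(grid):
    data, cols, row_start = [], [], [0]
    for row in grid:
        for x, cell in enumerate(row):
            if cell is not None:
                data.append(cell)    # the packed 1D array, empties removed
                cols.append(x)       # metadata: column of each kept cell
        row_start.append(len(data))  # metadata: where each row begins in data
    return data, cols, row_start

def lookup(data, cols, row_start, x, y):
    # Scan only the slice of the 1D array that belongs to row y.
    for k in range(row_start[y], row_start[y + 1]):
        if cols[k] == x:
            return data[k]
    return None  # (x, y) was an empty cell

The metadata here is one integer per row (row_start) and one per stored object (cols); for a 16,000 by 16,000 grid that is 16,001 row offsets regardless of sparsity, instead of 256,000,000 slots.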
Why is an array considered a data structure? How is an array a data structure in terms of efficiency? Please explain by giving some examples.
It's a data structure because it's a collection of data plus the tools to work with it.
Primary features:
Extremely fast lookup by index.
Extremely fast index-order traversal.
Minimal memory footprint (not so with the optional modifications I mentioned).
Insertion is normally O(N) because you may need to copy the array when you reallocate it to make space for new elements. However, you can bring the cost of appending down to amortized O(1) by over-allocating, i.e. by doubling the size of the array every time you reallocate (see the sketch below).[1]
Deletion is O(N) because you will need to shift N/2 elements on average. You could keep track of the number of unused elements at the start and end of the array to make removals from the ends O(1).[1]
Lookup by index is O(1). It's a simple pointer addition.
Lookup by value is O(N). If the data is ordered, one can use a binary search to reduce this to O(log N).
Keeping track of the first used element and the last used element would technically qualify as a different data structure because the functions to access the data structure are different, but it would still be called an array.
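As a sketch of the over-allocation trick from the insertion point above (illustrative Python, not production code):

# Growable array with capacity doubling: append is amortized O(1).
class GrowableArray:
    def __init__(self):
        self._buf = [None] * 4  # over-allocated backing storage
        self._len = 0

    def append(self, value):
        if self._len == len(self._buf):
            # The O(N) reallocate-and-copy step; doubling makes it rare
            # enough that the cost averages out to O(1) per append.
            self._buf = self._buf + [None] * len(self._buf)
        self._buf[self._len] = value
        self._len += 1

    def __getitem__(self, i):  # O(1) lookup by index
        if not 0 <= i < self._len:
            raise IndexError(i)
        return self._buf[i]

For the O(log N) lookup-by-value case on ordered data, Python's standard bisect module is the usual shortcut.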
I have a very large array stored in an HDF5 file. I am trying to load it and manage it as a Dask array.
At the moment my challenge is that I need to shuffle this array from time to time in a process; shuffling an array bigger than memory is a challenge in itself.
So what I am trying to do, without success, is to shuffle the dask array chunks.
# Prepare data
import h5py
import dask.array as da

f = h5py.File('Data.hdf5', 'r')
dset = f['/Data']
dk_array = da.from_array(dset, chunks=dset.chunks)
So, given the context above, how can I shuffle the chunks?
If your array is tabular in nature then you might consider adding a column of random data (see da.concatenate and da.random), turning it into a dask.dataframe, and setting that column as the index.
As a warning, this will be somewhat slow as it will need to do an on-disk shuffle.
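A minimal sketch of that approach (assuming the array is 2-D with the rows to be shuffled; the column names here are made up):

import dask.array as da
import dask.dataframe as dd

# Stand-in for the array loaded from HDF5 above.
dk_array = da.random.random((1_000_000, 4), chunks=(100_000, 4))

# Add a column of random values to shuffle by.
key = da.random.random((dk_array.shape[0], 1), chunks=(dk_array.chunks[0][0], 1))
with_key = da.concatenate([dk_array, key], axis=1)

# Turn it into a dataframe and set the random column as the index;
# set_index is what performs the (on-disk) shuffle.
df = dd.from_dask_array(with_key, columns=['a', 'b', 'c', 'd', 'key'])
shuffled = df.set_index('key')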
Do maps with char-type keys have faster access time than normal arrays?
The reason I think this is true is because normal arrays have integer-type indexing while the maps I think about have char-type indexing.
Integers are 4 bytes while chars are only 1 byte, so it seems reasonable to believe that accessing a map item at a given char key is faster than accessing a normal array item at a given integer index. In other words, the CPU has fewer bytes of the index/key value to examine to determine which element in the array is being referred to in memory.
Maps are slower than arrays. That's because maps are themselves implemented on top of arrays, so a map access does extra work before it ever touches the underlying storage.
But for larger amounts of data you can use a HashMap, since you get rid of comparisons (if used correctly).
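A quick way to check the intuition (a Python sketch, so treat the absolute numbers as illustrative only; the point is the extra per-lookup work a map does, not the width of the key):

import timeit

arr = list(range(256))
table = {chr(i): i for i in range(256)}  # a char-keyed map

# Array access: just index arithmetic.
print(timeit.timeit(lambda: arr[65], number=1_000_000))

# Map access: hash the key, then probe the table; more work per lookup
# even though the char key is smaller than the int index.
print(timeit.timeit(lambda: table['A'], number=1_000_000))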
Is there a function in Fortran that deletes a specific element in an array, such that the array upon deletion shrinks its length by the number of elements deleted?
Background:
I'm currently working on a project which contains sets of populations with corresponding descriptions of the individuals (i.e., age, death age, and so on).
A method I use is to loop through the array, find the elements I need, place them in another array, and deallocate the previous array; before the next time step, this array is moved back into the original one before going through the subroutines that once again find the elements not needed.
You can use the PACK intrinsic function and intrinsic assignment to create an array value that is comprised of selected elements from another array. Assuming array is allocatable, and the elements to be removed are nominated by a logical mask logical_mask that is the same size as the original value of array:
array = PACK(array, .NOT. logical_mask)
Succinct syntax for a single element nominated by its index is:
array = [array(:index-1), array(index+1:)]
Depending on your Fortran processor, the above statements may result in the compiler creating temporaries that may impact performance. If this is problematic then you will need to use the subroutine approach that you describe.
Maybe you want to look into linked lists. You can insert and remove items and the list automatically resizes. This resource is pretty good.
http://www.iag.uni-stuttgart.de/IAG/institut/abteilungen/numerik/images/4/4c/Pointer_Introduction.pdf
To continue the discussion, the solution you might want to implement depends on the number of delete and access operations you do, where you insert/delete the elements (at the start, at the end, randomly in the set?), how you access the data (from first to last, randomly in the set?), and what your efficiency requirements are in terms of CPU and memory.
Then you might want to go for a linked list or for static or dynamic vectors (other types of data structures might also fit your needs better).
For example:
a static vector can be used when you want to access a lot of elements randomly and know the maximum number nmax of elements in the vector. Simply use an array of nmax elements with an associated length variable that tracks the last used element. A deletion can then be done simply and quickly by exchanging the last element with the deleted one and reducing the length (see the sketch after this list).
a dynamic vector can be implemented when you don't know the maximum number of elements. In order to avoid a systematic allocate+copy+deallocate of the array for each deletion/insertion, you fix a maximum number of elements (as above) and only grow the size (e.g. nmax becomes 10*nmax, then reallocate and copy) when reaching the limit (the reverse scheme can also be implemented to shrink the storage).
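For instance, the static-vector deletion could look like this (sketched in Python for brevity, but the idea maps directly onto a Fortran array plus a length variable; the names are made up):

# Fixed-capacity vector with O(1) unordered delete.
class StaticVector:
    def __init__(self, nmax):
        self.buf = [None] * nmax  # array of nmax elements
        self.length = 0           # tracks one past the last used element

    def push(self, value):
        self.buf[self.length] = value
        self.length += 1

    def delete(self, i):
        # Exchange the deleted element with the last one and
        # reduce the length; element order is not preserved.
        self.length -= 1
        self.buf[i] = self.buf[self.length]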
Is there a better approach than using multidimensional arrays to compute values to be displayed in a table? Please note that each dimension of the array is huge but sparse. Can something like a HashTable be considered?
Output Table after the computation looks like this
This answer is outdated, because the OP added the information that the data is a sparse matrix.
Not really. Maybe a one-dimensional array (that would save the pointers to the dimensions, but that's, err... pointless).
An array is the data structure with the least metadata (because there is no metadata at all), so your approach can't be optimized much if you really need to store all that data in memory.
Any other data structure (tree, linked lists, etc.) would contain extra metadata and would therefore consume more memory.
The only way for you to use less memory is to actually use less memory (by loading into memory only the data you really need and leaving the rest on your hard drive or wherever).
You want to display a table, so maybe you can limit the rows you save in memory to an area slightly bigger than the viewport of your table (so you can scroll through the table fluently). Then you can dynamically compute and overwrite the rows according to the scroll state of your table.
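That windowing idea could be as simple as the following (a Python sketch with made-up names; compute_row stands for whatever produces one table row):

def window(first_visible, viewport_rows, margin, total_rows, compute_row):
    # Keep an area slightly bigger than the viewport so scrolling stays fluent.
    start = max(0, first_visible - margin)
    stop = min(total_rows, first_visible + viewport_rows + margin)
    # Only these rows are ever held in memory; recompute on scroll.
    return {i: compute_row(i) for i in range(start, stop)}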
There are a number of different ways to manage memory for a sparse matrix. I would start by defining a struct to hold an individual entry in your matrix:
struct sparse_matrix_data {
    int i;
    int j;
    int /* or double or whatever */ value;
};
so that you would store the two indices and the value for each non-zero entry. From there, you need to decide what data structure works best for the computations you need to do: hash table on one or both indices, array of these structs, linked list, ...
Note that this will only decrease the memory required if the additional memory required to store the indices is less than the memory you used to store the zeros in your original multidimensional array.
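For example, the hash-table-on-both-indices option is straightforward (a Python sketch, where a dict keyed by the (i, j) pair plays the role of the hash table):

# Sparse matrix as a hash table keyed by (i, j); zero entries are simply absent.
def set_entry(m, i, j, value):
    if value == 0:
        m.pop((i, j), None)  # never store zeros
    else:
        m[(i, j)] = value

def get_entry(m, i, j):
    return m.get((i, j), 0)  # absent entries read as zero

m = {}
set_entry(m, 3, 7, 42)
assert get_entry(m, 3, 7) == 42
assert get_entry(m, 0, 0) == 0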