Saving memory, huge array alternative c programming - c

I'm using two arrays (unsigned int), each with dimensions 20000x20000.
There is a lot of empty space inside the arrays: many zeros or nulls.
Is there something I can do to save memory? I'm running out of it.
I tried reading from a list in a file, but it's extremely slow.
I have heard that other languages have vectors.

You are looking for a sparse matrix, which basically works by storing entries as a list of (index1, index2, value), and only has entries for nonzero elements.
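The idea does not depend on the language, so here is a minimal sketch in Python rather than C, assuming a dictionary keyed by (row, col). The class name SparseMatrix and its methods are made up for illustration; in C the same layout could be an array of structs or a hash table.

```python
class SparseMatrix:
    """Dictionary-of-keys sparse matrix: only nonzero entries are stored."""

    def __init__(self, n_rows, n_cols):
        self.shape = (n_rows, n_cols)
        self._entries = {}  # maps (row, col) -> nonzero value

    def _check(self, row, col):
        if not (0 <= row < self.shape[0] and 0 <= col < self.shape[1]):
            raise IndexError("index out of range")

    def get(self, row, col):
        self._check(row, col)
        # Missing keys are implicit zeros.
        return self._entries.get((row, col), 0)

    def set(self, row, col, value):
        self._check(row, col)
        if value == 0:
            # Storing a zero would waste memory, so drop the entry instead.
            self._entries.pop((row, col), None)
        else:
            self._entries[(row, col)] = value

    def nnz(self):
        """Number of stored (nonzero) entries."""
        return len(self._entries)


m = SparseMatrix(20000, 20000)
m.set(3, 17, 42)
m.set(19999, 0, 7)
m.set(3, 17, 0)  # writing zero removes the entry again
```

Memory use grows with the number of nonzero entries rather than with 20000 x 20000, at the cost of slower element access than a plain array.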

Related

Numpy concatenate is slow: any alternative approach?

I am running the following code:
for i in range(1000):
    My_Array = numpy.concatenate((My_Array, New_Rows[i]), axis=0)
The above code is slow. Is there any faster approach?
This is basically what is happening in all algorithms based on arrays.
Each time you change the size of the array, it needs to be resized and every element needs to be copied. This is happening here too. (some implementations reserve some empty slots; e.g. doubling space of internal memory with each growing).
If you have your data at np.array creation time, just add it all at once (memory will be allocated only once then!)
If not, collect the pieces with something like a linked list (allowing O(1) append operations). Then read them into your np.array at once (again only one memory allocation).
This is not much of a numpy-specific topic, but much more about data structures.
Edit: as this quite vague answer got some upvotes, I feel the need to make clear that my linked-list approach is one possible example. As indicated in the comment, Python's lists are more array-like (and definitely not linked lists). But the core fact is: list.append() in Python is fast (amortized O(1)), while that's not true for numpy arrays! There is also a small part about the internals in the docs:
How are lists implemented?
Python’s lists are really variable-length arrays, not Lisp-style linked lists. The implementation uses a contiguous array of references to other objects, and keeps a pointer to this array and the array’s length in a list head structure.
This makes indexing a list a[i] an operation whose cost is independent of the size of the list or the value of the index.
When items are appended or inserted, the array of references is resized. Some cleverness is applied to improve the performance of appending items repeatedly; when the array must be grown, some extra space is allocated so the next few times don’t require an actual resize.
(bold annotations by me)
Maybe creating an empty array with the correct size and then populating it?
If you have a list of arrays with the same dimensions, you could do:
import numpy as np
arr = np.zeros((len(l),) + l[0].shape)  # l is the list of equally shaped arrays
for i, v in enumerate(l):
    arr[i] = v
This works much faster for me; it only requires one memory allocation.
It depends on what New_Rows[i] is, and what kind of array you want. If you start with lists (or 1d arrays) that you want to join end to end (to make a long 1d array), just concatenate them all at once. Concatenate takes a list of any length, not just 2 items.
np.concatenate(New_Rows, axis=0)
or maybe use an intermediate list comprehension (for more flexibility)
np.concatenate([row for row in New_Rows])
or closer to your example.
np.concatenate([New_Rows[i] for i in range(1000)])
But if New_Rows elements are all the same length, and you want a 2d array, one New_Rows value per row, np.array does a nice job:
np.array(New_Rows)
np.array([i for i in New_Rows])
np.array([New_Rows[i] for i in range(1000)])
np.array is designed primarily to build an array from a list of lists.
np.concatenate can also build in 2d, but the inputs need to be 2d to start with. vstack and stack can take care of that. But all those stack functions use some sort of list comprehension followed by concatenate.
In general it is better/faster to iterate or append with lists, and apply np.array (or concatenate) just once. Appending to a list is fast; much faster than making a new array.
I think @thebeancounter's solution is the way to go.
If you do not know the exact size of your numpy array ahead of time, you can also take an approach similar to how vector class is implemented in C++.
To be more specific, you can wrap the numpy ndarray into a new class which has a default size which is larger than your current needs. When the numpy array is almost fully populated, copy the current array to a larger one.
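A minimal sketch of such a wrapper, assuming a 2D buffer that doubles its capacity whenever it fills up. GrowableArray, append and view are hypothetical names chosen for this example, not part of numpy.

```python
import numpy as np


class GrowableArray:
    """Append-only 2D buffer that doubles its capacity when full."""

    def __init__(self, n_cols, dtype=float, capacity=16):
        self._data = np.empty((capacity, n_cols), dtype=dtype)
        self._size = 0  # number of rows actually in use

    def append(self, row):
        if self._size == self._data.shape[0]:
            # Full: allocate twice the capacity and copy the used rows once.
            bigger = np.empty((2 * self._data.shape[0], self._data.shape[1]),
                              dtype=self._data.dtype)
            bigger[:self._size] = self._data[:self._size]
            self._data = bigger
        self._data[self._size] = row
        self._size += 1

    def view(self):
        """Return only the populated rows (a view, not a copy)."""
        return self._data[:self._size]


buf = GrowableArray(n_cols=3, capacity=2)
for i in range(5):
    buf.append([i, i + 1, i + 2])
result = buf.view()
```

Because the capacity doubles, the number of reallocations grows only logarithmically with the number of appended rows, which is the same trick that makes list.append() cheap on average.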
Assume you have a large list of 2D numpy arrays, with the same number of columns and different number of rows like this :
x = [numpy_array1(r_1, c),......,numpy_arrayN(r_n, c)]
concatenate like this:
while len(x) != 1:
    if len(x) == 2:
        x = np.concatenate((x[0], x[1]))
        break
    for i in range(0, len(x), 2):
        # Merge each adjacent pair; an unpaired last array is carried over
        # unchanged so that it is not concatenated twice.
        if i + 1 < len(x):
            x[i] = np.concatenate((x[i], x[i+1]))
    x = x[::2]

Storing strings of different sizes in a MATLAB array?

I want to be able to store a series of strings of different sizes such as
userinput=['AJ48 NOT'; 'AH43 MANA'; 'AS33 NEWEF'];
This of course returns an error as the number of columns differs per row. I'm aware that all that is needed for this to work is adequate spaces in the first and second rows. However I need to be able to put this into an array without forcing the user to add these spaces on his/her own. Is there a command that allows me to do this? If possible I'd also like to know why this problem doesn't arise with numbers e.g.
a=[1; 243; 23524];
You cannot do this with standard Matlab arrays. A string is really just a vector of characters in Matlab. And you cannot have a matrix with rows of different lengths.
You can, however, use a cell array:
userinput={'AJ48 NOT'; 'AH43 MANA'; 'AS33 NEWEF'};
disp(userinput{1});
Be aware that there are many situations where cell arrays don't work like normal arrays.
To answer just the last part of your question: this is simply because strings may have variable length, but numbers (in Matlab) have a fixed length. One of the main ideas of arrays is that they hold only fixed-size entities (for example, because of the need for efficient lookup); see more on the topic here.

Sparse matrix conversion in C

I'm trying to develop a program in C to convert a sparse matrix file into a dense matrix. From what I've read, the best approach would be the use of linked lists, but I have no experience with them and haven't found a good online resource explaining the subject. I'm not looking for a quick solution but rather a website or text source that can explain how the process works so I can apply it to this project. The resources I have seen suggest using three arrays to handle the values in the matrix (the row, column, and individual value) and two arrays for the vector (one for the row, the other for the column). Thanks!
The file format you've specified is for a dense matrix. A 10x10 matrix with 100 elements is dense. A sparse matrix has fewer than n*m elements and all "missing" elements are assumed to be 0. The point of doing it this way is so that matrices that are almost all zero (which happens in a lot of applications) will use less space. But using a sparse matrix format to store a dense matrix will use far more space than just a plain array.
One common sparse matrix file format is called MatrixMarket and it looks very similar to what you described. After the header and any comment lines, the first line has three values: # of rows, # of columns, # of nonzero elements (called nnz). Then you have nnz lines of the actual elements, each a triplet: (row #) (column #) (value)
If your sparse matrix is in a similar format then you don't need any sparse matrix in memory. Just scan the values and fill in your dense array directly.
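As a rough illustration of "scan the values and fill in your dense array directly", here is a sketch in Python rather than C. It assumes a size line followed by nnz triplets, 1-based indices as in MatrixMarket coordinate files, and comment lines starting with %. The function name dense_from_triplets is made up for this example.

```python
def dense_from_triplets(text):
    """Build a dense row-major matrix (list of lists) from triplet text."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("%")]
    # First non-comment line: number of rows, columns and nonzero entries.
    n_rows, n_cols, nnz = (int(tok) for tok in lines[0].split())
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for ln in lines[1:1 + nnz]:
        r, c, v = ln.split()
        # Assumed 1-based indices, as in MatrixMarket coordinate files.
        dense[int(r) - 1][int(c) - 1] = float(v)
    return dense


sample = """%%MatrixMarket matrix coordinate real general
3 4 3
1 1 5.0
2 3 -1.5
3 4 2.0
"""
dense = dense_from_triplets(sample)
```

No intermediate sparse structure is built: every element not named in the file simply keeps its initial zero.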
If you do want to have a sparse matrix in memory then there are several options for how to store it. Triplets is the easiest, and it's just an in-memory version of the MatrixMarket file. 3 arrays, or 1 array of structs.
The most common structure for linear algebra operations is Compressed Sparse Columns (CSC) or Compressed Sparse Rows (CSR). I'll let you look that up, but if you want a C implementation to play with you should look at Tim Davis' CSparse. This is also how MATLAB stores sparse matrices; Tim was one of the people who wrote that part of MATLAB.
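To give a feel for the CSR layout, here is a small Python sketch that converts 0-based triplets into the three CSR arrays and uses them for a matrix-vector product. The function names are hypothetical and this is not the CSparse API; it only illustrates the data layout.

```python
def triplets_to_csr(n_rows, triplets):
    """Convert (row, col, value) triplets (0-based) to CSR arrays."""
    triplets = sorted(triplets)  # order by row, then column
    row_ptr = [0] * (n_rows + 1)
    col_idx, values = [], []
    for r, c, v in triplets:
        row_ptr[r + 1] += 1      # count entries per row
        col_idx.append(c)
        values.append(v)
    for r in range(n_rows):      # prefix sum turns counts into offsets
        row_ptr[r + 1] += row_ptr[r]
    return row_ptr, col_idx, values


def csr_matvec(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector x."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]


row_ptr, col_idx, values = triplets_to_csr(
    3, [(2, 3, 2.0), (0, 0, 5.0), (1, 2, -1.5), (0, 3, 1.0)])
y = csr_matvec(row_ptr, col_idx, values, [1.0, 1.0, 2.0, 3.0])
```

Row r occupies positions row_ptr[r] up to (but not including) row_ptr[r + 1] in col_idx and values, so walking a row needs no searching. CSC is the same idea with the roles of rows and columns swapped.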
It sounds like a linked list may not be what you're looking for, but this site offers a pretty comprehensive tutorial on the subject. It may help shed some light on whether or not it would be appropriate for your problem... Good luck!

How can I efficiently copy 2-dimensional arrays of bytes into a larger 2D array?

I have a structure called Patch that represents a 2D array of data.
type Size = (Int, Int)
data Patch = Patch Size Strict.ByteString
I want to construct a larger Patch from a set of smaller Patches and their assigned positions. (The Patches do not overlap.) The function looks like this:
type Position = (Int, Int)
combinePatches :: [(Position, Patch)] -> Patch
combinePatches plan = undefined
I see two sub-problems. First, I must define a function to translate 2D array copies into a set of 1D array copies. Second, I must construct the final Patch from all those copies.
Note that the final Patch will be around 4 MB of data. This is why I want to avoid a naive approach.
I'm fairly confident that I could do this horribly inefficiently, but I would like some advice on how to efficiently manipulate large 2D arrays in Haskell. I have been looking at the "vector" library, but I have never used it before.
Thanks for your time.
If the spec is really just a one-time creation of a new Patch from a set of previous ones and their positions, then this is a straightforward single-pass algorithm. Conceptually, I'd think of it as two steps: first, combine the existing patches into a data structure with reasonable lookup for any given position. Next, write your new structure lazily by querying the compound structure. This should be roughly O(n log(m)), n being the size of the new array you're writing, and m being the number of patches.
This is conceptually much simpler if you use the Vector library instead of a raw ByteString. But it is simpler still if you simply use Data.Array.Unboxed. If you need arrays that can interop with C, then use Data.Array.Storable instead.
If you ditch purity, at least locally, and work with an ST array, you should be able to trivially do this in O(n) time. Of course, the constant factors will still be worse than using fast copying of chunks of memory at a time, but there's no way to keep that code from looking low-level.
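The index arithmetic behind "copying chunks of memory at a time" can be sketched independently of Haskell. The following is Python, not Haskell, and assumes row-major data with sizes and positions given as (width, height) and (x, y); the question does not state either convention. In Haskell the same loop would run over a mutable ST array or buffer.

```python
def combine_patches(out_size, plan):
    """Paste non-overlapping patches into one row-major byte buffer.

    out_size: (width, height) of the result.
    plan: list of ((x, y), ((w, h), data)) with data a row-major bytes object.
    """
    out_w, out_h = out_size
    out = bytearray(out_w * out_h)  # zero-filled destination
    for (x, y), ((w, h), data) in plan:
        for row in range(h):
            src = row * w                 # start of this row in the patch
            dst = (y + row) * out_w + x   # start of this row in the output
            # One contiguous 1D copy per patch row.
            out[dst:dst + w] = data[src:src + w]
    return bytes(out)


plan = [((0, 0), ((2, 2), b"AABB")),
        ((2, 0), ((1, 2), b"CD")),
        ((0, 2), ((3, 1), b"EEE"))]
combined = combine_patches((3, 3), plan)
```

Each patch of height h turns into h one-dimensional copies, so the total work is proportional to the number of bytes written.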

Matrices and databases

I went through the topic and found this link quite useful and simple at the same time.
Storing matrices in a relational database
But can you please let me know if the way mentioned as
A B C D
E F G H
I J K L
[A B C D E F G H I J K L]
is the best, simplest, or even a reliable way of storing the matrix elements in the database? Moreover, I need to multiply two matrices and make the operation dynamic. So will storing the data this way create any problems for the task?
In postgresql you can actually have multidimensional arrays, define your own types and define your own functions on those types. For instance one could simply do:
CREATE TABLE tictactoe (
squares integer[3][3]
);
See The PostgreSQL manual for info on how to create your own types.
I think it pretty much depends on how you want to use the matrices in your application.
Is the DB only for persistence for the same application, speed is important, and sizes cannot be known in advance? Make your own serialization scheme, and save the binary blob.
Is the DB for sharing in between applications, with the size not known in advance? Use the comma delimited list.
Are you concerned with data integrity, type safety, and would like to query individual cells? Then use the (row, col, cell value) schema.
Do you know that your matrices are of fixed size and relatively small, for example 4x4 transformation matrices, and will have a 1 to 1 relationship to whatever element you have in the DB? Then you could actually have 16 rows in your table, laid out in line.
Think about your use cases, and experiment!
is the best, simplest, or even a reliable way of storing the matrix elements in the database? Moreover, I need to multiply two matrices and make the operation dynamic. So will storing the data this way create any problems for the task?
I'll start by saying both approaches are valid, but the second one is not sufficient as you wrote it. You need some other information, like the length of the rows or the (row, col) indexes of each element, to store a matrix as a 1D array. This is commonly done for sparse matrices, where there are lots of zeros surrounding values clustered on either side of the diagonal.
Persisting the matrix in a database and operating on it in memory are two separate things.
Tasks like multiplying require (row, col) indexes. Storing the matrix as a 2D array means that you'll have them, so no other info is needed. The 1D array needs this info too, so you'll have to supply it.
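As a sketch of what "supplying the info" means for a dense matrix stored as a flat 1D list: assuming row-major order and known dimensions, element (i, j) sits at index i * n_cols + j. The function name matmul_flat and its parameters are made up for this example.

```python
def matmul_flat(a, a_rows, a_cols, b, b_cols):
    """Multiply row-major flat matrices: (a_rows x a_cols) @ (a_cols x b_cols)."""
    if len(a) != a_rows * a_cols or len(b) != a_cols * b_cols:
        raise ValueError("flat length does not match the given dimensions")
    out = [0] * (a_rows * b_cols)
    for i in range(a_rows):
        for k in range(a_cols):
            a_ik = a[i * a_cols + k]      # element (i, k) of A
            for j in range(b_cols):
                out[i * b_cols + j] += a_ik * b[k * b_cols + j]
    return out


a = [1, 2, 3,
     4, 5, 6]           # 2 x 3
b = [7, 8,
     9, 10,
     11, 12]            # 3 x 2
product = matmul_flat(a, 2, 3, b, 2)
```

Without the row and column counts the flat list [A B C D E F G H I J K L] is ambiguous: it could be 3 x 4, 4 x 3, 2 x 6 and so on, which is why the dimensions have to be stored alongside it.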
The advantage swings to the 1D array for sparse matrices. You don't have to store zero values outside the bandwidth in that case, but operations like addition and multiplication become more complex to code.
