Efficient methods for storing many similar arrays?

Recently I was faced with having to store many 'versions' of an array in memory (think of an undo system or changes to a file in version-control - but could apply elsewhere too).
In case this isn't clear:
Arrays may be identical, share some data, or none at all.
Elements may be added or removed at any point.
The goal is to avoid storing an entirely new array when there are large sections of the array that are identical.
For the purpose of this question, changes such as adding a number to each value can be ignored (treated as different data).
I've looked into writing my own solution; in principle this can be done fairly simply (a rough sketch follows the list below):
divide the array into small blocks.
if nothing changes, reuse the blocks in each new version of the array.
if only one block changes, make a new block with changed data.
retrieving an array can be done by allocating the memory, then filling it with the data from each block.
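For illustration, here is a minimal sketch of that block scheme in C. It assumes fixed-size blocks of ints and reference-counted sharing; the names (Block, ArrayVersion, version_expand) are made up for this sketch, not taken from the library linked in the update below.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 64                 /* elements per block */

typedef struct Block {
    int refcount;                     /* how many versions share this block */
    uint64_t hash;                    /* hash of the data, used to find reuse candidates */
    int data[BLOCK_SIZE];
} Block;

typedef struct ArrayVersion {
    size_t length;                    /* total element count of this version */
    size_t nblocks;
    Block **blocks;                   /* pointers into a shared pool of blocks */
} ArrayVersion;

/* Retrieve a flat copy of one version: allocate, then fill from each block. */
int *version_expand(const ArrayVersion *v)
{
    int *out = malloc(v->length * sizeof *out);
    size_t copied = 0;
    for (size_t i = 0; out && i < v->nblocks; i++) {
        size_t n = v->length - copied;
        if (n > BLOCK_SIZE)
            n = BLOCK_SIZE;
        memcpy(out + copied, v->blocks[i]->data, n * sizeof *out);
        copied += n;
    }
    return out;
}

An unchanged version would share every Block pointer with its predecessor and only bump reference counts; a changed block would be copied, edited, and re-hashed so later versions can find it again.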
Things become more involved when the array length changes or when the data is reordered.
Then it becomes a trade-off: how much time is it worth spending searching for duplicate blocks (in my case I hashed some data at the beginning of each block to help identify candidate blocks to reuse)?
I've got my implementation working (and can link to it if it's useful, though I'd rather avoid discussing my specific code, since it distracts from the general case).
I suspect my own code could be improved (by using tried-and-tested memory hashing and searching methods). Possibly I'm not using the right terms, but I wasn't able to find information on this by searching online.
So my questions are:
Which methods are most efficient for recognizing and storing arrays that share some contiguous data?
Are there known, working methods which are considered best-practice to solve this problem?
Update: I wrote a small(ish) single-file library and tests, as well as a Python reference version.

Related

How to manage memory for inserting and deleting content? [closed]

For convenience, I'll just use plain text as an example. For the sentence "I have a cat", for example, I need to malloc 13 slots of char so that it stores all the letters plus the final \0.
However, what if I now want to insert "lovely" before "cat"? It seems that I have to create a new array that is large enough and copy everything over.
Worse, since the computer can't predict how much content will be added, it seems that I have to do this re-malloc-and-copy dance each time a new letter is added, that is, do the whole thing for each letter of l o v e l y, which is not a smart solution. (The computer does not know the word 'lovely' ahead of time, eh?)
A "better" solution seems to be to create a large enough array in the first place, so that every time a new letter is inserted the program only copies and shifts everything after it. However, this is still inefficient, especially when the document is long and I'm inserting near the beginning.
The same applies to 'delete': every time a letter is deleted I have to copy everything after it over and shrink the array, it seems.
Using nodes instead of arrays for storing content seems an equally awful solution, since now every time I want to do something in the middle of the content I have to walk all the way from the beginning.
So what is the correct, or efficient, way to manage the memory in this case? I want answers for programming at a low level such as C, which requires direct memory allocation and de-allocation without "magic" functions or libraries that handle everything for you already.
Using linked list of chunks of memory sounds like a good intermediate solution. Each node would be a "page" of memory of certain size. To speed up modifying content in middle pages you could have an index array which would contain page pointers to absolute positions in the whole document.
Deletion should only be performed when an entire page is empty. At that moment you should do something like:
prevPage->next = nextPage;    /* unlink the now-empty page from the list */
pageFree(page_to_delete);     /* then release its memory */
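A rough sketch of that page structure (the names are illustrative, following the snippet above rather than any real library), with in-page insertion done via memmove:

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct Page {
    struct Page *prev, *next;
    size_t used;                      /* bytes currently stored in this page */
    char text[PAGE_SIZE];
} Page;

/* Insert one character at offset `at` within a page that still has room. */
int page_insert(Page *p, size_t at, char c)
{
    if (p->used == PAGE_SIZE || at > p->used)
        return -1;                    /* page full or bad offset: caller must split or allocate */
    memmove(p->text + at + 1, p->text + at, p->used - at);
    p->text[at] = c;
    p->used++;
    return 0;
}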
If you want to handle character insertion and deletion easily without re-mallocing over and over, I think the best solution is a doubly linked list.
Check this out: DoublyLinkedListExample (I learned it at school, but I think this tutorial explains quite simply how it works and how to use it).
These are just structs (nodes) holding your data, a pointer to the previous element, and a pointer to the next element. If you don't understand how it works, check out a tutorial on singly linked lists first and then it will be easier.
Just practice it, because it's quite hard to understand at the beginning. Keep training and you will get it :)
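As a minimal sketch (not the linked tutorial's code), a character node and an insert-after operation might look like this:

#include <stdlib.h>

typedef struct Node {
    char data;
    struct Node *prev, *next;
} Node;

/* Insert a new character after `pos`; pass NULL to start a new list. */
Node *insert_after(Node *pos, char c)
{
    Node *n = malloc(sizeof *n);
    if (!n)
        return NULL;
    n->data = c;
    n->prev = pos;
    n->next = pos ? pos->next : NULL;
    if (pos) {
        if (pos->next)
            pos->next->prev = n;
        pos->next = n;
    }
    return n;
}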
One efficient solution is to use a circular array list.
http://en.wikipedia.org/wiki/Circular_buffer
After pre-allocating some size of array, you also keep track of a pointer to the 'beginning' of the list (at first the index of 'c' in "cat", then, after prepending, the index of 'l'). This way, to insert or delete at the beginning you can add to the physical end of the memory and just change the pointer.
To index into the array, you simply index into array[(beginning pointer + index)%size].
If the number of letters becomes too large, you still have to copy to a new array.
In terms of how much to pre-allocate, a scheme that doesn't cost too much time is to double the size of the array each time it becomes full. This doesn't add too much overhead.
Edit:
A circular array list won't be useful if you need to insert data into the middle of the list. However, it is useful for adding data to the beginning and end of the list and modifying or accessing the middle.
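A small sketch of that idea (the names are illustrative): indexing goes through the modulo formula above, and prepending just steps the begin index backwards.

#include <stddef.h>

typedef struct {
    char  *buf;
    size_t capacity;                  /* allocated slots */
    size_t begin;                     /* index of the logical first element */
    size_t length;                    /* elements currently stored */
} CircBuf;

char circ_get(const CircBuf *c, size_t i)
{
    return c->buf[(c->begin + i) % c->capacity];
}

/* Prepend a character; assumes length < capacity (otherwise grow and copy first). */
void circ_push_front(CircBuf *c, char ch)
{
    c->begin = (c->begin + c->capacity - 1) % c->capacity;
    c->buf[c->begin] = ch;
    c->length++;
}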
Given what you responded to in the comments clarifying your use case, my suggestion would be to consider a linked list of content, where in the metaphor of your plain text example, the elements of the linked list are words or paragraphs or pages, and the words themselves are contiguous arrays.
While navigation between them isn't super fast, it seemed that your performance imperative was quick insertion and deletion. By keeping the contiguous words small, the O(n) cost of reallocating/shrinking and copying is kept down because n stays small; this is achieved by having many small pieces, which are the linked-list elements.
This blends the performance benefit of spatial locality within the 'individual' pieces of content with the freedom to pick an upper-level list/tree structure that helps with temporal locality.
The one thing this really doesn't address is what needs to be done to this data afterwards to process it, and what level of performance is truly tolerable. Constant malloc calls will be bad for latency because they can block; so you could further consider another solution already mentioned, such as circular buffers, or managing your own bigger chunks of memory that you distribute to these elements yourself. That way you'd only have to malloc when you needed a much larger chunk of memory to work with, and you still wouldn't necessarily have to recopy everything from page to page, just the smaller chunk that didn't fit.
Again as I said in my comment, people write dissertations about this kinda thing, and it's a major component of OS design and systems understanding. So take this all with a grain of salt. There are a very large number of things to consider that can't be covered here.
It is not completely clear what your use case is.
Since you mention text manipulation and want efficient insert, delete, and random-access operations, I guess you could use a rope data structure, which is a binary tree that (roughly speaking) stores short string fragments in its nodes. For the details see the linked article.
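For a flavour of what a rope looks like, here is a minimal and deliberately simplified node plus indexing routine; internal nodes store the total length of their left subtree as a weight, leaves store a string fragment.

#include <stddef.h>

typedef struct RopeNode {
    struct RopeNode *left, *right;    /* both NULL for a leaf */
    size_t weight;                    /* leaf: fragment length; internal: total length of left subtree */
    const char *str;                  /* fragment text, NULL for internal nodes */
} RopeNode;

/* Return the character at position i by walking down the tree. */
char rope_index(const RopeNode *n, size_t i)
{
    while (n->str == NULL) {          /* descend until we reach a leaf */
        if (i < n->weight) {
            n = n->left;
        } else {
            i -= n->weight;
            n = n->right;
        }
    }
    return n->str[i];
}

Insertion and deletion then become splitting and concatenating subtrees instead of copying the whole text.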

Is it bad form to shuffle data instead of pointers to it?

This isn't actually a homework question per se, just a question that keeps nagging me as I do my homework. My book sometimes gives an exercise about rearranging data, and will explicitly say to do it by changing only pointers, not moving the data (for example, in a linked list making use of a "node" struct with a data field and a next/pointer field, only change the next field).
Is it bad form to move data instead? Sometimes it seems to make more sense (either for efficiency or clarity) to move the data from one struct to another instead of changing pointers around, and I guess I'm just wondering if there's a good reason to avoid doing that, or if the textbook is imposing that constraint to more effectively direct my learning.
Thanks for any thoughts. :)
Here are 3 reasons:
Genericness / Maintainability:
If you can get your algorithm to work by modifying pointers only, then it will always work regardless of what kind of data you put in your "node".
If you do it by modifying data, then your algorithm will be married to your data structure, and may not work if you change your data structure.
Efficiency:
Further, you mention efficiency, and you will be hard-pressed to find a more efficient operation than copying a pointer, which is just an integer, typically already the size of a machine word.
Safety:
And further still, the pointer-manipulation route will not cause confusion with other code which has its own pointers to your data, as #caf points out.
It depends. It generally makes sense to move the smaller thing, so if the data being shuffled is larger than a pointer (which is usually the case), then it makes more sense to shuffle pointers rather than data.
In addition, if other code might have retained pointers to the data, then it wouldn't expect the data to be changed from underneath, so this again points towards shuffling pointers rather than data.
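To make the size argument concrete, here is a hypothetical swap of two large records by value versus by pointer:

typedef struct {
    char payload[4096];               /* stand-in for a "large" record */
} BigRecord;

/* Copies roughly 3 * 4096 bytes. */
void swap_by_value(BigRecord *a, BigRecord *b)
{
    BigRecord tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Copies three machine words, regardless of how big BigRecord grows. */
void swap_by_pointer(BigRecord **a, BigRecord **b)
{
    BigRecord *tmp = *a;
    *a = *b;
    *b = tmp;
}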
Shuffling pointers or indexes is done when copying or moving the actual objects is difficult or inefficient. There's nothing wrong with shuffing the objects themselves if that's more convenient.
In fact by eliminating the pointers you eliminate a whole bunch of potential problems that you get with pointers, such as whether and when and how to delete them.
Moving data takes more time, and depending on its nature, the data may not tolerate relocation (for example, a structure containing pointers into itself for whatever reason).
If you have pointers, I assume the data they point to already lives in dynamic memory...
In other words, it just exists... so why bother copying the data from one place to another, reallocating if necessary?
Usually, the purpose of a list is to gather values from all over memory into one continuous logical list.
With such a structure, you can rearrange and reorder the list without having to move the data.
You have to understand that moving data implies reading and writing memory (not to mention reallocation).
That is resource-consuming... so reordering only the addresses is a lot more efficient!
It depends on the data. If you're just moving around ints or chars, it is no more expensive to shuffle the data than the pointer. However, once you pass a certain size or complexity, you start to lose efficiency quickly. Moving objects by pointer works for any contained data, so getting used to using pointers, even on the toy structs used in your assignments, will help you handle those large, complex objects without extra cost.
It is especially idiomatic to handle things by pointer when dealing with something like a linked list. The whole point of the linked list is that the Node part can be as large or complex as you like, and the semantics of shuffling, sorting, inserting, or removing nodes all stay the same. This is the key to templated containers in C++ (which I know is not the primary target of this question). C++ also encourages you to consider and limit the number of times you shuffle things by data, because that involves calling a copy constructor on each object each time you move it. This doesn't work well with many C++ idioms, such as RAII, which makes a constructor a rather expensive but very useful operation.

How to automatically translate pure code into code that uses mutable arrays for efficiency?

This is a Haskell question, but I'd also be interested in answers about other languages. Is there a way to automatically translate purely functional code, written to process either lists or immutable arrays without doing any destructive updates, into code that uses mutable arrays for efficiency?
In Haskell the generated code would either run in the ST monad (in which case it would all be wrapped in runST or runSTArray) or in the IO monad, I assume.
I'm most interested in general solutions which work for any element type.
I thought I'd seen this before, but I can't remember where. If it doesn't already exist, I'd be interested in creating it.
Implementing a functional language using destructive updates is a memory management optimization. If an old value will no longer be used, it is safe to reuse its memory to hold a new value. Detecting that a value will not be used anymore is a difficult problem, which is why reuse is still managed manually.
Linear type inference and uniqueness type inference discover some useful information. These analyses discover variables that hold the only reference to some object. After the last use of that variable, either the object is transferred somewhere else, or the object can be reused to hold a new value.
Several languages, including Sisal and SAC, attempt to reuse old array memory to hold new arrays. In SAC, programs are first converted to use explicit memory management (specifically, reference counting) and then the memory management code is optimized.
You say "either lists or immutable arrays", but those are actually two very different things, and in many cases algorithms naturally suited to lists would be no faster (and possibly slower) when used with mutable arrays.
For instance, consider an algorithm consisting of three parts: Constructing a list from some input, transforming the list by combining adjacent elements, then filtering the list by some criterion. A naive approach of fully generating a new list at each step would indeed be inefficient; a mutable array updated in place at each step would be an improvement. But better still is to observe that only a limited number of elements are needed simultaneously and that the linear nature of the algorithm matches the linear structure of a list, which means that all three steps can be merged together and the intermediate lists eliminated entirely. If the initial input used to construct the list and the filtered result are significantly smaller than the intermediate list, you'll save a lot of overhead by avoiding extra allocation, instead of filling a mutable array with elements that are just going to be filtered out later anyway.
Mutable arrays are most likely to be useful when making a lot of piecemeal, random-access updates to an array, with no obvious linear structure. When using Haskell's immutable arrays, in many cases this can be expressed using the accum function in Data.Array, which I believe is already implemented using ST.
In short, a lot of the simple cases either have better optimizations available or are already handled.
Edit: I notice this answer was downvoted without comment and I'm curious why. Feedback is appreciated, I'd like to know if I said something dumb.

Persisting a Large List for Membership Testing in C

Each item is an array of 17 32-bit integers. I can probably produce 120-bit unique hashes for them.
I have an algorithm that produces 9,731,643,264 of these items, and want to see how many of these are unique. I speculate that at most 1/36th of these will be unique but can't be sure.
At this size, I can't really do this in memory (as I only have 4 gigs), so I need a way to persist a list of these, do membership tests, and add each new one if it's not already there.
I am working in C(gcc) on Linux so it would be good if the solution can work from there.
Any ideas?
This reminds me of some of the problems I faced working on a solution to "Knight's Tour" many years ago. (A math problem which is now solved, but not by me.)
Even your hash isn't that much help... at nearly the size of a GUID, they could easily be unique across all the known universe.
It will take approximately 0.75 terabytes just to hold the list on disk... 4 gigs of memory or not, you'd still need a huge disk just to hold them. And you'd need double that much disk or more for the sort/merge solutions I talk about below.
If you could SORT that list, then you could just go through the list one item at a time looking for duplicate copies next to each other. Of course, sorting that much data would require a custom sort routine (that you wrote) since it is binary (converting to hex would double the size of your data, but would allow you to use standard routines)... though even those would probably choke on that much data... so you are back to your own custom routines.
Some things to think about:
Sorting that much data will take weeks, months or perhaps years. While you can do a nice heap sort or whatever in memory, because you only have so much disk space, you will likely be doing a "bubble" sort of the files regardless of what you do in memory.
Depending on what your generation algorithm looks like, you could generate "one memory load" worth of data, sort it in place, then write it out to disk as a sorted file. Once that was done, you would just have to "merge" all those individual sorted files, which is a much easier task (even though there would be thousands of files, it would still be a relatively easy task).
If your generator can tell you ANYTHING about your data, use that to your advantage. For instance, in my case, as I processed the Knight's moves, I knew my output values were constantly getting bigger (because I was always adding one bit per move); that small piece of knowledge allowed me to optimize my sort in some unique ways. Look at your data and see if you know anything similar.
Making the data smaller is always good, of course. For instance, you talk about a 120-bit hash, but is that hash reversible? If so, sort the hashes since they are smaller. If not, the hash might not be that much help (at least for my sorting solutions).
I am interested in the mechanics of issues like this and I'd be happy to exchange emails on the subject just to bang around ideas and possible solutions.
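As a rough sketch of the sort-then-merge approach described above, here is a two-file merge that also drops duplicates, assuming (purely for illustration) that each item has been reduced to a 64-bit key and each input file is already sorted, one key per record:

#include <stdint.h>
#include <stdio.h>

void merge_sorted_files(FILE *a, FILE *b, FILE *out)
{
    uint64_t x, y, last = 0;
    int have_x = fread(&x, sizeof x, 1, a) == 1;
    int have_y = fread(&y, sizeof y, 1, b) == 1;
    int have_last = 0;

    while (have_x || have_y) {
        uint64_t v;
        if (have_x && (!have_y || x <= y)) {
            v = x;
            have_x = fread(&x, sizeof x, 1, a) == 1;
        } else {
            v = y;
            have_y = fread(&y, sizeof y, 1, b) == 1;
        }
        if (!have_last || v != last) {            /* skip duplicates across both inputs */
            fwrite(&v, sizeof v, 1, out);
            last = v;
            have_last = 1;
        }
    }
}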
You can probably make your life a lot easier if you can place some restrictions on your input data: Even assuming only 120 significant bits, the high number of duplicate values suggests an uneven distribution, as an even distribution would make duplicates unlikely for a given sample size of 10^10:
2^120 = (2^10)^12 > (10^3)^12 = 10^36 >> 10^10
If you have continuous clusters (instead of sparse, but repeated values), you can gain a lot by operating on ranges instead of atomic values.
What I would do:
fill a buffer with a batch of generated values
sort the buffer in-memory
write ranges to disk, i.e. each entry in the file consists of the start and end value of a continuous group of values
Then, you need to merge the individual files, which can be done online - i.e. as the files become available - the same way a stack-based mergesort operates: associate with each file a counter equal to the number of ranges in the file and push each new file onto a stack. When the file on top of the stack has a counter greater than or equal to that of the previous file, merge the two into a new file whose counter is the number of ranges in the merged file.
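A sketch of the per-batch step (sort in memory, then write ranges), again assuming for illustration that each value has been reduced to a 64-bit key:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort one in-memory batch and write it out as [start, end] ranges of consecutive keys. */
void write_sorted_ranges(uint64_t *batch, size_t n, FILE *out)
{
    qsort(batch, n, sizeof batch[0], cmp_u64);
    for (size_t i = 0; i < n; i++) {
        uint64_t start = batch[i], end = start;
        while (i + 1 < n && (batch[i + 1] == end || batch[i + 1] == end + 1))
            end = batch[++i];                      /* swallow duplicates and consecutive keys */
        fwrite(&start, sizeof start, 1, out);
        fwrite(&end, sizeof end, 1, out);
    }
}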

which one to use linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the records returned by the select query in an array of my structure's data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each record in my structure in a loop.
Method 2: use single linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or by having a higher-level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access: you just need to break your index into two parts.
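For the random-access part of that scheme, a hypothetical sketch: a top-level array of chunk pointers, with the index split into a chunk number and an offset.

#include <stddef.h>

#define CHUNK_SIZE 256                /* records per chunk */

typedef struct {
    void **chunks;                    /* top-level array of pointers to fixed-size chunks */
    size_t nchunks;
    size_t record_size;
    size_t count;                     /* total records stored */
} ChunkedArray;

/* Random access: split the flat index into (chunk, offset). */
void *chunked_get(const ChunkedArray *a, size_t index)
{
    if (index >= a->count)
        return NULL;
    char *chunk = a->chunks[index / CHUNK_SIZE];
    return chunk + (index % CHUNK_SIZE) * a->record_size;
}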
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
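A minimal sketch of that doubling scheme in C, using a stand-in record type (the asker's record_struct would go here):

#include <stdlib.h>

typedef struct {
    int  id;
    char name[64];
} record_struct;                       /* stand-in for the asker's record type */

typedef struct {
    record_struct *data;
    size_t count;
    size_t capacity;
} record_vec;

int record_vec_push(record_vec *v, const record_struct *r)
{
    if (v->count == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 16;
        record_struct *p = realloc(v->data, new_cap * sizeof *p);
        if (!p)
            return -1;                 /* the old buffer is still valid on failure */
        v->data = p;
        v->capacity = new_cap;
    }
    v->data[v->count++] = *r;
    return 0;
}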
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's better tested than anything you or I could roll up from scratch. :)
record_struct *record;                 /* assumes get_record() returns a pointer, NULL at the end */
record_struct *records_array = NULL;
size_t records = 0;

while ((record = get_record()) != NULL) {
    records++;
    records_array = realloc(records_array, sizeof(record_struct) * records);
    records_array[records - 1] = *record;   /* copy the record into the array */
}
This is strictly an example; please don't use realloc() this way in production.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
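As a sketch of what hiding the representation behind a suitable API could look like in C (illustrative names only, not Hanson's interfaces), callers would only ever see an opaque handle and a few functions:

#include <stddef.h>

typedef struct record_list record_list;        /* opaque: the layout is private to one .c file */

record_list   *record_list_new(void);
int            record_list_append(record_list *l, const void *record, size_t size);
size_t         record_list_count(const record_list *l);
void          *record_list_get(record_list *l, size_t i);
void           record_list_free(record_list *l);

Whether the .c file implements this with a linked list, a doubling array, or chunked blocks can then change later without touching the callers.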

Resources