Thread-safe cache for sparse, lazy, immutable arrays in C

I have an application that involves a collection of arrays which can be very large (indices up to the maximum value of an int), but which are lazy - their contents are calculated on the fly and are not actually known until requested. The arrays are also immutable - the value of each element of each array is constant throughout the life of the program. The arrays are sparse in the sense that often only a small subset of all array elements are ever requested (the arrays do not contain large blocks of zeros and are not "sparse" in that sense.)
Looking up (and possibly calculating in the process) an array element can be expensive, so I want to add a caching layer. The cache should implement the following interface:
void point_cache_store (gpointer data, gsize idx, gdouble value);
gdouble point_cache_fetch (gpointer data, gsize idx);
where data serves as a unique handle for each array (there can be many of these). point_cache_fetch() should return the value argument passed to point_cache_store() with the same data and idx arguments, or indicate a cache miss by returning the special value DATUM_UNKNOWN_VALUE (the caller will never call point_cache_store with DATUM_UNKNOWN_VALUE).
The question is: how can I implement point_cache_fetch() and point_cache_store()? (They are currently no-op stubs.)
Points to consider:
The cache implementation must be thread-safe. Several threads are running simultaneously and any of these can call point_cache_store() or point_cache_fetch() with any data or idx arguments.
The cache truly is a cache; it's always OK for point_cache_fetch() to return DATUM_UNKNOWN_VALUE, even if it once knew that value. The caller will just perform an ordinary lookup in that case.
Remember, the arrays are immutable - for given data and idx arguments, the caller will always provide the same value argument.
I realize that there are many ways to do this and that there are tradeoffs involved. For this question, though, I am going to evaluate answers by one very specific criterion: whether they improve performance in one particular benchmark in the application that inspired the question. If you want to go the extra mile and run the benchmark yourself, here is how to do it:
git clone git://github.com/gbenison/starparse
git clone git://github.com/gbenison/burrow-owl.git -b point-cache-base
The functions point_cache_fetch() and point_cache_store() are found in "burrow/spectrum/point_cache.c". The relevant benchmark is "benchmarks/b_cache".

A "very large sparse lazy array"? Sounds like you need a hash table.
From your point_cache_fetch function prototype and all through your question, I am confused about whether your cached values are doubles or immutable arrays.
I'm not going to provide an implementation, as this is not a 'coding challenge' website. You should try to find and reuse existing libraries of thread-safe hash tables and compare their performance for your specific needs.
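That said, since the question does name a concrete interface, here is a minimal sketch of one possibility: a single GHashTable guarded by a GMutex, with the (data, idx) pair combined into one hash key. The GLib types in the prototypes (gpointer, gsize, gdouble) suggest GLib is already available; DATUM_UNKNOWN_VALUE and the prototypes are assumed to come from the project's own header. This is not the author's actual code, just an illustration of the hash-table approach.

#include <glib.h>
/* DATUM_UNKNOWN_VALUE and the two prototypes are assumed to be
 * declared in the project's point_cache header. */

typedef struct {
  gpointer data;   /* unique handle for the array */
  gsize    idx;    /* element index */
} CacheKey;

static guint
cache_key_hash (gconstpointer k)
{
  const CacheKey *key = k;
  return g_direct_hash (key->data) ^ (guint) key->idx;
}

static gboolean
cache_key_equal (gconstpointer a, gconstpointer b)
{
  const CacheKey *ka = a, *kb = b;
  return ka->data == kb->data && ka->idx == kb->idx;
}

static GHashTable *cache = NULL;
static GMutex      cache_lock;  /* one coarse lock, kept simple on purpose */

void
point_cache_store (gpointer data, gsize idx, gdouble value)
{
  CacheKey *key = g_new (CacheKey, 1);
  gdouble  *val = g_new (gdouble, 1);

  key->data = data;
  key->idx  = idx;
  *val      = value;

  g_mutex_lock (&cache_lock);
  if (cache == NULL)
    cache = g_hash_table_new_full (cache_key_hash, cache_key_equal,
                                   g_free, g_free);
  g_hash_table_replace (cache, key, val);  /* idempotent: values are immutable */
  g_mutex_unlock (&cache_lock);
}

gdouble
point_cache_fetch (gpointer data, gsize idx)
{
  CacheKey  key = { data, idx };
  gdouble  *val;
  gdouble   result = DATUM_UNKNOWN_VALUE;  /* a miss is always a legal answer */

  g_mutex_lock (&cache_lock);
  if (cache != NULL && (val = g_hash_table_lookup (cache, &key)) != NULL)
    result = *val;
  g_mutex_unlock (&cache_lock);

  return result;
}

Because the arrays are immutable, concurrent stores for the same (data, idx) can only ever write the same value, so the single global lock is correct, just not optimally concurrent. A real implementation might shard the table per data handle or use striped locks to reduce contention, and evict entries to bound memory.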

Related

Is Array of Objects Access Time O(1)?

I know that accessing an element of an array takes O(1) time if the array is of one type, using the mathematical formula address(array[n]) = start address + n * sizeof(type). But assume you have an array of objects. These objects could have any number of fields, including nested objects. Can we consider the access time to be constant?
Edit: I am mainly asking about Java, but I would like to know if there is a difference if I choose another mainstream language like Python, C++ or JavaScript.
For example, in the code below:
class tryInt {
    int a;
    int b;
    String s;

    public tryInt() {
        a = 1;
        b = 0;
        s = "Sdaas";
    }
}

class tryobject {
    public class tryObject1 {
        int a;
        int b;
        int c;
    }

    public tryobject() {
        tryObject1 o = new tryObject1();
        sss = "dsfsdf";
    }

    String sss;
}

public class Main {
    public static void main(String[] args) {
        System.out.println("Hello World!");
        Object[] arr = new Object[5];
        arr[0] = new tryInt();
        arr[1] = new tryobject();
        System.out.println(arr[0]);
        System.out.println(arr[1]);
    }
}
I want to know, since a tryInt object should take less space than a tryobject object, how an array can still use the formula address(array[n]) = start address + n * sizeof(type). The type is no longer the same, so this formula should/will fail.
The answer to your question is it depends.
If it's possible to random-access the array when you know the index you want, then yes, it's an O(1) operation.
On the other hand, if each item in the array is a different length, or if the array is stored as a linked list, it's necessary to start looking for your element at the beginning of the array, skipping over elements until you find the one corresponding to your index. That's an O(n) operation.
In the real world of working programmers and collections of data, this O(x) stuff is inextricably bound up with the way the data is represented.
Many people reserve the word array to mean a randomly accessible O(1) collection. The pages of a book are an array. If you know the page number you can open the book to the page. (Flipping the book open to the correct page is not necessarily a trivial operation. You may have to go to your library and find the right book first. The analogy applies to multi-level computer storage ... hard drive / RAM / several levels of processor cache)
People use list for a sequentially accessible O(n) collection. The sentences of text on a page of a book are a list. To find the fifth sentence, you must read the first four.
I mention the meaning of the words list and array here for an important reason for professional programmers. Much of our work is maintaining existing code. In our justifiable rush to get things done, sometimes we grab the first collection class that comes to hand, and sometimes we grab the wrong one. For example, we might grab a list O(n) rather than an array O(1) or a hash O(1, maybe). The collection we grab works well for our tests. But, boom!, performance falls over just when the application gets successful and scales up to holding a lot of data. This happens all the time.
To remedy that kind of problem we need a practical understanding of these access issues. I once inherited a project with a homegrown hashed dictionary class that took O(n³) time when inserting lots of items into the dictionary. It took a lot of digging to get past the snazzy collection-class documentation to figure out what was really going on.
In Java, the Object type is a reference to an object rather than an object itself. That is, a variable of type Object can be thought of as a pointer that says “here’s where you should go to find your Object” rather than “I am an actual, honest-to-goodness Object.” Importantly, the size of this reference - the number of bytes used up - is the same regardless of what type of thing the Object variable refers to.
As a result, if you have an Object[], then the cost of indexing into that array is indeed O(1), since the entries in that array are all the same size (namely, the size of an object reference). The sizes of the objects being pointed at might not all be the same, as in your example, but the pointers themselves are always the same size and so the math you’ve given provides a way to do array indexing in constant time.
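The same point can be made outside Java. In C terms, an Object[] behaves like an array of pointers: the indexing arithmetic operates on uniformly sized pointer slots, not on the objects they reference. A small illustrative sketch (the struct names simply mirror the Java example above):

#include <stdio.h>
#include <stdlib.h>

/* Two object types of different sizes, like tryInt and tryobject. */
typedef struct { int a, b; const char *s; } TryInt;
typedef struct { int a, b, c; const char *sss; } TryObject;

int main(void) {
    /* The array stores pointers, and every pointer is the same size,
       so arr[n] = start + n * sizeof(void *) still holds: O(1). */
    void *arr[5];
    arr[0] = malloc(sizeof(TryInt));
    arr[1] = malloc(sizeof(TryObject));  /* bigger object, same slot size */

    printf("each slot is %zu bytes\n", sizeof arr[0]);

    free(arr[0]);
    free(arr[1]);
    return 0;
}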
The answer depends on context.
It's really common in some textbooks to treat array access as O(1) because it simplifies analysis.
And in fairness, it takes O(1) CPU instructions on today's architectures.
But:
As the dataset gets larger tending to infinity, it doesn't fit in memory. If the "array" is implemented as a database structure spread across multiple machines, you'll end up with a tree structure and probably have logarithmic worst case access times.
If you don't care about data size going to infinity, then big O notation may not be the right fit for your situation
On real hardware, memory accesses are not all equal -- there are many layers of caches and cache misses cost hundreds or thousands of cycles. The O(1) model for memory access tends to ignore that
In theory work, random-access machines access memory in O(1), but Turing machines cannot. Hierarchical cache effects tend to be ignored, though some models, like the transdichotomous RAM, try to account for them.
In short, this is a property of your model of computation. There are many valid and interesting models of computation and what to choose depends on your needs and your situation.
In general, array denotes a fixed-size memory range storing elements of the same size. If we consider this usual concept of an array, then if objects are members of the array, under the hood the array stores the object references, and referencing the i-th element of the array yields the reference/pointer it contains, with O(1) complexity; the address it points to is something for the language to find.
However, there are arrays which do not comply with this definition. For example, in JavaScript you can easily add items to an array, which makes me think that in JavaScript arrays are somewhat different from an allocated fixed-size range of elements of the same size/type. Also, in JavaScript you can add elements of any type to an array. So, in general I would say the complexity is O(1), but there are quite a few important exceptions to this rule, depending on the technology.

Why are lists used infrequently in Go?

Is there a way to create an array/slice in Go without a hard-coded array size? Why is List ignored?
In all the languages I've worked with extensively: Delphi, C#, C++, Python - Lists are very important because they can be dynamically resized, as opposed to arrays.
In Golang, there is indeed a list.List struct, but I see very little documentation about it - whether in Go by Example or the three Go books that I have (Summerfield, Chisnall and Balbaert), they all spend a lot of time on arrays and slices and then skip to maps. In source code examples I also find little or no use of list.List.
It also appears that, unlike Python, range is not supported for List - a big drawback IMO. Am I missing something?
Slices are lovely, but they still need to be based on an array with a hard-coded size. That's where List comes in.
Just about always when you are thinking of a list - use a slice instead in Go. Slices are dynamically re-sized. Underlying them is a contiguous slice of memory which can change size.
They are very flexible as you'll see if you read the SliceTricks wiki page.
Here is an excerpt:
Copy
b = make([]T, len(a))
copy(b, a) // or b = append([]T(nil), a...)
Cut
a = append(a[:i], a[j:]...)
Delete
a = append(a[:i], a[i+1:]...) // or a = a[:i+copy(a[i:], a[i+1:])]
Delete without preserving order
a[i], a = a[len(a)-1], a[:len(a)-1]
Pop
x, a = a[len(a)-1], a[:len(a)-1]
Push
a = append(a, x)
Update: Here is a link to a blog post all about slices from the go team itself, which does a good job of explaining the relationship between slices and arrays and slice internals.
I asked this question a few months ago, when I first started investigating Go. Since then, every day I have been reading about Go, and coding in Go.
Because I did not receive a clear-cut answer to this question (although I had accepted one answer), I'm now going to answer it myself, based on what I have learned since I asked it:
Is there a way to create an array/slice in Go without a hard-coded array size?
Yes. Slices do not require a hard-coded array to slice from:
var sl []int = make([]int, len, cap)
This code allocates slice sl of size len with a capacity of cap; len and cap are variables that can be assigned at runtime.
Why is list.List ignored?
It appears the main reasons list.List seems to get little attention in Go are:
As has been explained in @Nick Craig-Wood's answer, there is virtually nothing that can be done with lists that cannot be done with slices, often more efficiently and with a cleaner, more elegant syntax. For example, the range construct:
for i := range sl {
    sl[i] = i
}
cannot be used with a list - a C-style for loop is required. And in many cases, C++ collection-style syntax must be used with lists: push_back etc.
Perhaps more importantly, list.List is not strongly typed - it is very similar to Python's lists and dictionaries, which allow mixing various types together in the collection. This seems to run contrary to the Go approach to things. Go is a very strongly typed language - for example, implicit type conversions are never allowed in Go; even an upcast from int to int64 must be explicit. But all the methods for list.List take empty interfaces - anything goes.
One of the reasons that I abandoned Python and moved to Go is this sort of weakness in Python's type system, although Python claims to be "strongly typed" (IMO it isn't). Go's list.List seems to be a sort of "mongrel", born of C++'s vector<T> and Python's List(), and is perhaps a bit out of place in Go itself.
It would not surprise me if at some point in the not too distant future, we find list.List deprecated in Go, although perhaps it will remain, to accommodate those rare situations where, even using good design practices, a problem can best be solved with a collection that holds various types. Or perhaps it's there to provide a "bridge" for C family developers to get comfortable with Go before they learn the nuances of slices, which are unique to Go, AFAIK. (In some respects slices seem similar to stream classes in C++ or Delphi, but not entirely.)
Although, coming from a Delphi/C++/Python background, in my initial exposure to Go I found list.List more familiar than Go's slices, as I have become more comfortable with Go I have gone back and changed all my lists to slices. I haven't found anything yet that slice and/or map do not allow me to do, such that I need to use list.List.
I think that's because there's not much to say about them: the container/list package is rather self-explanatory once you have absorbed the chief Go idiom for working with generic data.
In Delphi (without generics) or in C you would store pointers or TObjects in the list, and then cast them back to their real types when obtaining from the list. In C++ STL lists are templates and hence parameterized by type, and in C# (these days) lists are generic.
In Go, container/list stores values of type interface{} - a special type capable of representing values of any other (real) type by storing a pair of pointers: one to the type info of the contained value, and one to the value itself (or the value directly, if its size is no greater than the size of a pointer). So when you want to add an element to the list, you just do that, as function parameters of type interface{} accept values of any type. But when you extract values from the list and want to work with their real types, you have to either type-assert them or do a type switch on them - both approaches are just different ways to do essentially the same thing.
Here is an example taken from here:
package main

import (
    "container/list"
    "fmt"
)

func main() {
    var x list.List
    x.PushBack(1)
    x.PushBack(2)
    x.PushBack(3)

    for e := x.Front(); e != nil; e = e.Next() {
        fmt.Println(e.Value.(int))
    }
}
Here we obtain an element's value using e.Value and then type-assert it as int, the type of the originally inserted value.
You can read up on type assertions and type switches in "Effective Go" or any other introductory book. The container/list package's documentation summarizes all the methods lists support.
Note that Go slices can be expanded via the append() builtin function. While this will sometimes require making a copy of the backing array, it won't happen every time, since Go will over-size the new array giving it a larger capacity than the reported length. This means that a subsequent append operation can be completed without another data copy.
While you do end up with more data copies than with equivalent code implemented with linked lists, you remove the need to allocate elements in the list individually and the need to update the Next pointers. For many uses the array based implementation provides better or good enough performance, so that is what is emphasised in the language. Interestingly, Python's standard list type is also array backed and has similar performance characteristics when appending values.
That said, there are cases where linked lists are a better choice (e.g. when you need to insert or remove elements from the start/middle of a long list), and that is why a standard library implementation is provided. I guess they didn't add any special language features to work with them because these cases are less common than those where slices are used.
From: https://groups.google.com/forum/#!msg/golang-nuts/mPKCoYNwsoU/tLefhE7tQjMJ
It depends a lot on the number of elements in your lists whether a true list or a slice will be more efficient when you need to do many deletions in the 'middle' of the list.
#1: The more elements, the less attractive a slice becomes.
#2: When the ordering of the elements isn't important, it is most efficient to use a slice, deleting an element by replacing it with the last element in the slice and reslicing the slice to shrink the len by 1 (as explained in the SliceTricks wiki).
So, use a slice:
1. if the order of the elements in the list is not important and you need to delete: just swap the element to delete with the last element and re-slice to (length-1);
2. when there are many elements (whatever "many" means).
There are ways to mitigate the deletion problem - e.g. the swap trick above, or just marking elements as logically deleted. But it's impossible to mitigate the slowness of walking a linked list.
So, use a slice:
1. if you need speed in traversal.
Unless the slice is updated way too often (deleting or adding elements at random locations), the memory contiguity of slices will offer an excellent cache hit ratio compared to linked lists.
Scott Meyers' talk on the importance of caches:
https://www.youtube.com/watch?v=WDIkqP4JbkE
list.List is implemented as a doubly linked list. Array-based lists (vectors in C++, or slices in Go) are a better choice than linked lists in most conditions if you don't frequently insert into the middle of the list. The amortized time complexity of append is O(1) for both an array list and a linked list, even though the array list has to extend its capacity and copy over existing values. Array lists have faster random access and a smaller memory footprint, and, more importantly, they are friendly to the garbage collector because they keep no pointers inside the data structure.

What are the advantages of using Uint8List over List<int> when dealing with byte arrays in Dart?

I'm writing a Dart library in which I'm very regularly dealing with byte arrays or byte strings. Since Dart doesn't have a byte type nor an array type, I'm using List<int> for all byte arrays.
Is this a good practice? I only recently found out about the existence of Uint8List in the dart:typed_data package. It's clear that this class aims to be the go-to implementation for byte arrays.
But does it have any direct advantages?
I can imagine that it always performs checks on new items, so that the user can be sure no non-byte-value integers are inside the list. But are there other advantages or differences?
There is also a class named ByteArray, but it seems to be quite an inefficient alternative to List<int>...
The advantage should be that a Uint8List consumes less memory than a normal List, because it is known from the beginning that each element's size is a single byte.
A Uint8List can also be mapped directly to an underlying optimized typed-array type (e.g. Uint8Array in JavaScript).
Copies of list slices are also easier to perform, because all bytes are laid out contiguously in memory, so a slice can be copied in a single operation to another Uint8List (or equivalent) type.
However, whether this advantage is fully realized depends on how good the implementation of Uint8List in Dart is.
John McCutchan of the Dart team explains that the Dart VM relies on 3 different integer representations - rather like the Three Musketeers, there are the small machine integer (smi), the medium integer (mint) and the big heavy integer (bint). The VM takes care of switching automatically between the three depending on the size of the integer in play.
Within the smi range, which depends on the CPU architecture, integers fit in a register and can therefore be loaded and stored directly in a field instead of being fetched from memory. They also never require memory allocation. This leads to the performance side of the story: within the smi range, storing an integer in an object list is faster than putting it in a typed list.
Typed lists have to tag and untag - the VM's operations for boxing and unboxing smi values without allocating memory or loading the value from an object. The leaner, the better.
On the other hand, typed lists have two big capabilities to consider. Garbage-collection pressure is very low, as typed lists can never store object references, only numbers. Typed lists can also be much more dense, so an Int8List requires much less memory and makes better use of the CPU's cache. The smi range principle applies in typed lists too, so playing with numbers within that range provides the best performance.
All in all, what remains of this is that we need to benchmark each approach to find out which works best in each situation.

Which data-structure is used widely in C? [closed]

A couple of days ago I asked myself which data structure I should use in a function in C. I usually write in C++, and there the choice would have been std::vector.
There are some possible choices:
a static (big enough) array
a dynamic array which grows when needed (e.g. by doubling its size)
your own list implementation as a struct with a next pointer
The last option seems to be unusual. Are there any bigger projects where someone uses their own structures like lists?
Is there a general rule for the decision between arrays and custom data structures?
When I need a tree structure I don't think twice and just write a tree. Are there any widely used libs with such structures (like Boost for C++)? Or is this considered bad style because you would have to store a void* instead of the actual type?
Thanks a lot for your experience!
Different data structures offer different computational complexity for insertions, lookups and other operations.
To take your specific example, there is a number of differences between an array and a linked list:
lookup by index is O(1) in an array and O(n) in a list (see the sketch after this list);
insertion into a list is O(1) or O(n) (depending on whether it requires traversal), while appending to an array is amortized O(1) if done well;
deletions from an array are O(n) whereas certain types of list deletions can be done in O(1) time;
arrays offer better locality of reference and consequently better cache performance.
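To make the first bullet concrete, here is a minimal C sketch contrasting the two lookups (the types and function names are invented for the example):

#include <stddef.h>

typedef struct lnode {
    int value;
    struct lnode *next;
} lnode;

/* O(n): must follow idx links before reaching the element. */
int list_get(const lnode *head, size_t idx) {
    while (idx-- > 0)
        head = head->next;
    return head->value;
}

/* O(1): a single address computation, regardless of idx. */
int array_get(const int *arr, size_t idx) {
    return arr[idx];
}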
You might find the following page useful: http://essays.hexapodia.net/datastructures/
As a general rule, when choosing a data structure I first consider whether I have strong reasons to believe that performance of the code in question is going to be important:
if I don't, I choose the simplest data structure that would do the job, or the one that lends itself to the clearest code;
if I do, I think carefully about what operations I am going to perform on the structure and choose accordingly, possibly followed by profiling.
As for recommendations for good C data structure libraries, take a look at Are there any open source C libraries with common data structures?
It completely depends on which operations you are going to perform on the data structure. If you will be retrieving data by index (eg, data[ 3 ]), then a list is a horrible idea since each read will require you to walk the list. If you will be inserting into the first position a lot (eg, data[ 0 ] = x), then an array will be terrible because you will be moving all the data for each insertion.
If you were going to use std::vector, then a dynamic array is probably the best replacement. But perhaps std::vector would not have been the correct choice.
Each has its own advantages and disadvantages.
Static arrays: perfect for lookup tables and data that doesn't change its size throughout the program execution. They get allocated on program startup, so you don't have to manage this memory in any way. A disadvantage is that you cannot resize or free a static array - it stays there until the program terminates.
Dynamically growing arrays: an easy-to-manage data structure that can almost be a substitute for std::vector in C++. A disadvantage is the overhead of allocating memory at runtime, but that can be alleviated by allocating bigger chunks at once.
Linked lists: take care if you are going to have a lot of elements because allocating each one separately without using a memory pool can lead to memory waste and fragmentation.
Dynamic linked lists are widely used in C when the number of items to be stored is not known.
A vector is a dynamic array, implemented internally as one. I would suggest you use a vector, as machines are optimized to retrieve contiguous memory locations, while a linked list will not be stored contiguously.
Having said that, if you don't require fast retrieval of elements by index in your use case, then you can go for a linked list as well. A linked list also has the benefit of not wasting space, unlike a dynamic array (which doubles when it is about to get full), and insertion at the start or between elements is cheaper than in an array.
In pure C I mostly use static arrays. This is connected with programming embedded devices that have problems with allocating and freeing memory - heap fragmentation.
But if there is a need to use a list, I implement one myself - sometimes the implementation is based on static arrays (again).
I think there are libraries for C that offer decent implementations of more complex data structures. AFAIK GLib is widely used and offers at least linked lists.
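For instance, a typical GList usage sketch (GList is GLib's doubly linked list; the string payloads here are purely for illustration):

#include <glib.h>
#include <stdio.h>

int main (void)
{
  GList *list = NULL;

  /* Prepending is O(1); g_list_append would walk to the tail, O(n). */
  list = g_list_prepend (list, "world");
  list = g_list_prepend (list, "hello");

  for (GList *l = list; l != NULL; l = l->next)
    printf ("%s\n", (const char *) l->data);

  g_list_free (list);
  return 0;
}

As the question anticipates, the payload is carried as a gpointer (void*), so the caller is responsible for casting back to the actual type.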
I remember I read a document from Apple some time ago saying that such a thing could be naively implemented with a structure similar to:
#include <stdlib.h>

typedef struct {
    void **data;    /* array of object pointers; a concrete type is easier */
    int length;     /* number of used slots */
    int capacity;   /* number of allocated slots */
} MyArray;

MyArray *MyArrayCreate(void) {
    MyArray *array = malloc(sizeof *array);
    array->length = 0;
    array->capacity = 8;
    array->data = calloc(array->capacity, sizeof *array->data);
    return array;
}

void MyArrayRelease(MyArray *array) {
    free(array->data);
    free(array);
}

And implement a function that checks the capacity of the array; if it is not large enough, a big enough block is allocated, the former data is copied over, and the new data is added to it:

void MyArrayInsertAt(MyArray *array, int index, void *object) {
    if (index >= array->capacity) {
        /* mallocate again (doubling until large enough), copy data */
        while (array->capacity <= index)
            array->capacity *= 2;
        array->data = realloc(array->data,
                              array->capacity * sizeof *array->data);
    }
    array->data[index] = object;
    if (index >= array->length)
        array->length = index + 1;  /* update length */
}
So it can be used like so:
MyArray *array = MyArrayCreate();
MyArrayInsertAt(array, 5, something);
MyArrayRelease(array);
The MyArrayInsertAt() function won't win a prize for its performance, but it could be a solution for less demanding programs/applications.
I just can't find the link... maybe someone read it too?
ADDED:
I have found that the GNU implementation of the NSMutableArray (Objective-C) methods is done in C and does the same as above: it doubles the size of the array each time an object has to be added and won't fit. See lines 131 through 145.

C - How to implement Set data structure?

Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type, and there is a huge amount of RAM available.
As I know, for integers it can be done really fast'N'easy using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.
There are multiple ways of implementing set (and map) functionality, for example:
tree-based approach (ordered traversal)
hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Be aware of the advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
the hash function for the type being hashed
an equality function for the type being used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.
You can also use open addressing as an alternative to maintaining and managing buckets.
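As a concrete illustration of the bucket scheme described above, here is a minimal sketch of a chained hash set over void* items (all names are invented for the example; a real one would also need remove and cleanup, and could grow the bucket array):

#include <stdlib.h>

#define NBUCKETS 1024   /* size generously to keep chains short */

typedef struct setnode {
    void *item;
    struct setnode *next;        /* chain of items sharing a bucket */
} setnode;

typedef struct hashset {
    setnode *buckets[NBUCKETS];  /* hash-value-indexed array */
    unsigned (*hash)(const void *);            /* caller-supplied */
    int (*equal)(const void *, const void *);  /* caller-supplied */
} hashset;

hashset *hashset_new(unsigned (*hash)(const void *),
                     int (*equal)(const void *, const void *)) {
    hashset *s = calloc(1, sizeof *s);  /* all buckets start empty */
    s->hash = hash;
    s->equal = equal;
    return s;
}

int hashset_contains(const hashset *s, const void *item) {
    const setnode *n = s->buckets[s->hash(item) % NBUCKETS];
    for (; n != NULL; n = n->next)
        if (s->equal(n->item, item))
            return 1;
    return 0;
}

void hashset_insert(hashset *s, void *item) {
    if (hashset_contains(s, item))
        return;                  /* members are unique */
    setnode *n = malloc(sizeof *n);
    n->item = item;
    n->next = s->buckets[s->hash(item) % NBUCKETS];
    s->buckets[s->hash(item) % NBUCKETS] = n;
}

Since items are plain pointers, nothing stops you from inserting a pointer to another hashset - or to the set itself - which gives the self-inclusion the question asks about.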
Sets are usually implemented as some variety of binary tree. Red-black trees have good worst-case performance.
These can also be used to build a map to allow key/value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get large enough to compete with pointers, this has a very small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There are similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
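A sketch of what such a tagged value could look like (names invented for illustration):

#include <string.h>

/* One tag per kind of member, as described above. */
typedef enum { TAG_POINTER, TAG_INT, TAG_STRING } tag_t;

typedef struct {
    tag_t tag;
    union {
        void       *ptr;  /* identity objects: compared by address */
        long        i;    /* rvalue integers: compared by value */
        const char *str;  /* strings: the set should store its own copy */
    } u;
} value_t;

/* Equality that respects the tag, usable by a generic set. */
int value_equal(const value_t *a, const value_t *b) {
    if (a->tag != b->tag)
        return 0;
    switch (a->tag) {
    case TAG_POINTER: return a->u.ptr == b->u.ptr;
    case TAG_INT:     return a->u.i == b->u.i;
    case TAG_STRING:  return strcmp(a->u.str, b->u.str) == 0;
    }
    return 0;
}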
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some sort of other data structure (or function) to translate from the member data type to a position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal, complement) very, very easy. It is only suitable for relatively small sets, though; you wouldn't want to use it for sets of 32-bit integers, I don't suppose.
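For example, a minimal sketch of such a bit-array set over a small integer universe (the bound of 1024 is an assumption for the example):

#include <limits.h>
#include <stddef.h>

#define MAX_ELEMS 1024  /* cardinality of the member type */

typedef struct {
    /* bit n is 1 iff element n is in the set; {0} is the empty set */
    unsigned char bits[MAX_ELEMS / CHAR_BIT];
} bitset;

void bitset_insert(bitset *s, unsigned n) {
    s->bits[n / CHAR_BIT] |= (unsigned char) (1u << (n % CHAR_BIT));
}

void bitset_remove(bitset *s, unsigned n) {
    s->bits[n / CHAR_BIT] &= (unsigned char) ~(1u << (n % CHAR_BIT));
}

int bitset_contains(const bitset *s, unsigned n) {
    return (s->bits[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1;
}

/* Union, intersection, difference and complement are plain bitwise
   operations over the bytes; union shown here. */
void bitset_union(bitset *out, const bitset *a, const bitset *b) {
    for (size_t i = 0; i < sizeof out->bits; i++)
        out->bits[i] = a->bits[i] | b->bits[i];
}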
