I am writing an application where more than one linked list is shared among threads. The operations on the linked lists are the usual ones: searching, inserting, deleting, and modifying node contents.
I came across an implementation that keeps the entire procedure of a linked list operation "thread-safe": http://www.cs.cf.ac.uk/Dave/C/node31.html#SECTION003100000000000000000
But I was wondering if I can do it as follows:
lock(mutex)
link list operation
unlock(mutex)
I.e. I associate a mutex with each linked list and use it as above whenever I start an operation.
I would be grateful for views.
It is possible to do it this way, but you sacrifice liveness: the linked list can now only be touched by one thread at a time, which may make the list a bottleneck in your program.
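A minimal sketch of the approach from the question, assuming POSIX threads (the node and list types are illustrative, not from the question):

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative node/list types. */
struct node {
    int value;
    struct node *next;
};

struct locked_list {
    struct node *head;
    pthread_mutex_t lock;   /* one mutex guarding the whole list */
};

void list_insert(struct locked_list *l, int value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    pthread_mutex_lock(&l->lock);      /* lock(mutex) */
    n->next = l->head;                 /* linked list operation */
    l->head = n;
    pthread_mutex_unlock(&l->lock);    /* unlock(mutex) */
}

int list_contains(struct locked_list *l, int value)
{
    int found = 0;
    pthread_mutex_lock(&l->lock);
    for (struct node *n = l->head; n; n = n->next)
        if (n->value == value) { found = 1; break; }
    pthread_mutex_unlock(&l->lock);
    return found;
}
```

Every operation, including a pure search, holds the single mutex for its whole duration, which is exactly the bottleneck described above.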
Think about the interface of the linked list (what methods may be called by the threads) and how you can keep the list safe, but also allow as many threads as possible to use it at once.
For example, if you're using the list as a queue, one thread could be enqueueing items at the tail of the list, while another thread dequeues an item.
There are lots of challenges in creating thread safe utilities, but you should try to be as surgical as possible to make sure you aren't sacrificing the performance that you're trying to gain by parallelising your software in the first place! Have fun!
It depends. If your threads will mainly search through the lists without modifying them, it may be worthwhile to implement a reader/writer lock. There's no reason to prevent other threads from reading the information as long as none of them modifies it. If, however, the most common operations involve modifying the lists or the information in them, there may not be much to gain, and a simple lock/do operation/unlock scheme should work just as well.
I am currently developing a hash table to serve as a database data structure, which may contain a large number of elements and which must be as efficient as possible (especially for the operations that add a new element or update an existing one).
I am also forced to use only C (avoiding C++ or other languages whose existing classes or data structures could really help in this case).
What I need to develop is a hash table with linked lists in which each entry has a timeout (say, a few minutes), after which it should delete itself automatically (or, alternatively, old entries should be "garbage collected" at some point in time, since elements may be added at a very fast rate and I do not want to use too much memory for entries that are too old).
I was thinking about adding a timer field to each entry of the Hash Table:
struct HTnode {
// Hash table entry ID
long int id;
// Pointer to the next element of the linked list (when hash is the same for two different IDs)
struct HTnode * next;
// Other fields...
// Timer for each entry
timer_t entryTimer;
};
Then, when a new entry is added, I start a timer (this project will only run on Linux, which is why I am considering timer_t; error checking is omitted in this sample code for brevity):
struct sigevent entryTimerEvent;
struct itimerspec entryTimerTs;
// Allocate a new entry for a given id (struct HTnode *entry)
// ...
memset(&entryTimerEvent, 0, sizeof(entryTimerEvent));
// entryDeleter() deletes the current entry from the hash table; with
// SIGEV_THREAD its signature must be void entryDeleter(union sigval)
entryTimerEvent.sigev_notify = SIGEV_THREAD;
entryTimerEvent.sigev_notify_function = entryDeleter;
entryTimerEvent.sigev_value.sival_ptr = entry; // lets entryDeleter() know which entry to delete
entryTimerTs.it_value.tv_sec = ...; // Set to a certain timeout value
entryTimerTs.it_value.tv_nsec = ...; // Set to a certain timeout value
entryTimerTs.it_interval.tv_sec = ...; // Set to a certain timeout value
entryTimerTs.it_interval.tv_nsec = ...; // Set to a certain timeout value
timer_create(CLOCK_REALTIME, &entryTimerEvent, &(entry->entryTimer));
timer_settime(entry->entryTimer, 0, &entryTimerTs, NULL);
When an entry is updated, instead, I would just rearm the timer with timer_settime.
However, I fear that a solution like this could become problematic performance-wise once I reach more than a few thousand entries, each with its own running timer (some active entries may even be updated with sub-second granularity, causing very frequent calls to timer_settime), and I am currently struggling to find a good alternative.
Are there better and more efficient solutions, in your opinion, maybe not requiring the usage of timers for each entry?
Thank you very much in advance.
What I understand of your requirements:
An element that has timed out should not be returned when you try to get it
Eventually it will be deleted, so that the table does not grow forever
To implement that, you can add a timestamp to each element and:
Change your get function to check whether the current time is before the timestamp; if it is not, do not return the element
Have a watcher thread that deletes expired elements
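A sketch of the get-side check, assuming separate chaining and an `expires_at` timestamp field set on add/update (names are illustrative, not from the question):

```c
#include <stddef.h>
#include <time.h>

/* Illustrative entry type. */
struct entry {
    long id;
    time_t expires_at;      /* absolute expiry time, set on add/update */
    struct entry *next;     /* bucket chain */
};

/* Lookup that treats expired entries as absent.  A watcher thread can walk
   the table with the same test and actually free what it finds. */
struct entry *ht_get(struct entry *bucket, long id, time_t now)
{
    for (struct entry *e = bucket; e != NULL; e = e->next)
        if (e->id == id)
            return (now < e->expires_at) ? e : NULL;  /* expired: pretend gone */
    return NULL;
}
```

In real code `now` would come from `time(NULL)` or `clock_gettime`; it is a parameter here only to keep the sketch deterministic.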
A crude but easy option:
Have several distinct hash tables, e.g. three tables, each storing all entries added during a particular minute; you add new entries into table[active]. To do a lookup or erase, search the active table first (if recently added elements are statistically most likely to be accessed), then try the older tables in turn. When you add an element and find the minute has changed, advance active with ++active %= 3, then clear table[active] before adding the new element to it.
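A toy sketch of that rotation, using flat arrays in place of real hash tables to keep it short, and ignoring the case where whole minutes are skipped (a real version would clear more than one generation then):

```c
#include <string.h>

#define GENERATIONS 3
#define CAP 64

/* Three generations of a toy "table"; only the rotation logic is the point. */
struct gen_table {
    long items[GENERATIONS][CAP];
    int  count[GENERATIONS];
    int  active;            /* generation taking new inserts */
    long active_minute;     /* minute stamp of the active generation */
};

void gt_add(struct gen_table *t, long id, long minute)
{
    if (minute != t->active_minute) {            /* minute changed: rotate */
        t->active = (t->active + 1) % GENERATIONS;
        t->count[t->active] = 0;                 /* drop entries ~3 minutes old */
        t->active_minute = minute;
    }
    t->items[t->active][t->count[t->active]++] = id;
}

int gt_find(const struct gen_table *t, long id)
{
    for (int g = 0; g < GENERATIONS; ++g)        /* a real version would search newest first */
        for (int i = 0; i < t->count[g]; ++i)
            if (t->items[g][i] == id)
                return 1;
    return 0;
}
```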
A more over-engineered way, with a single hash table...
Assuming you want elements older than X units of time to be erased even if they've been accessed recently, then all you need is a double-ended queue off to the side of the hash table that tracks the elements you've added. Each element in the queue should store the expiry time and whatever's convenient for quick deletion: e.g. a pointer to the previous element (or the bucket/root) in the bucket's linked list, if you're using separate chaining, don't have to support resizing of the hash table, and know elements won't be erased by any other mechanism. I say previous because with a singly-linked list you'll need to rewire the previous element's next pointer. If it's a doubly-linked list, it may be more intuitive to store a pointer to the element itself; then you can use prev->next = next to rewire.
When you add an element you add it to the back of that queue. Whenever you've idle time for some housekeeping, you start at the front of the queue and erase however-many elements are expired. You can see whether it's better to have a separate thread handling expiries (but then you need locking around the hash table accesses/updates), or do it from the same thread that otherwise using the hash table.
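A minimal sketch of that queue, assuming the doubly-linked variant and a single bucket to keep it short (a real table would have many buckets; all names are illustrative):

```c
#include <stddef.h>
#include <time.h>

#define QCAP 128

struct hnode {
    long id;
    struct hnode *prev, *next;   /* chain within the bucket */
};

struct expiry_rec {
    time_t expires_at;
    struct hnode *node;
};

/* Fixed-size ring used as the double-ended queue: push at tail, expire from head. */
struct expiry_queue {
    struct expiry_rec rec[QCAP];
    int head, tail;
};

void eq_push(struct expiry_queue *q, struct hnode *n, time_t expires_at)
{
    q->rec[q->tail % QCAP] = (struct expiry_rec){ expires_at, n };
    q->tail++;
}

/* Housekeeping: unlink however many front elements have expired. */
int eq_expire(struct expiry_queue *q, struct hnode **bucket_head, time_t now)
{
    int erased = 0;
    while (q->head != q->tail && q->rec[q->head % QCAP].expires_at <= now) {
        struct hnode *n = q->rec[q->head % QCAP].node;
        if (n->prev) n->prev->next = n->next;    /* rewire around the node */
        else *bucket_head = n->next;             /* node was the bucket head */
        if (n->next) n->next->prev = n->prev;
        q->head++;
        erased++;
    }
    return erased;
}
```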
If you do want recently-accessed elements to hang around longer, then you'd want to use a Least Recently Used cache mechanism to track the expiry times, and that means you need to be able to search in the LRU by key, so you'd might want to use a balanced binary tree. This will be a lot more implementation effort.
If you're using a closed hashing / open addressing hash table, what you do to erase elements will depend on how you're handling collisions: you might be able to just overwrite the bucket with a no-longer-in-use sentinel value, or you might be obliged to shift another element closer to its hashed-to bucket. These kinds of factors may change what indexing data you store in the timeout-tracking double-ended queue or LRU.
It is a major undertaking to do this from scratch in C, so I'd suggest you use existing libraries of data structures. Still, asking for recommendations for third-party resources is off topic on StackOverflow, which may be frustrating if you don't already know your options. I couldn't go back from C++ to C these days (too frustrating), and I'm not aware of current offerings to suggest.
I'll try to explain my problem best I can. I'm having trouble with the final part of a college assignment and got really stuck.
The scenario is this: we have to build a simulation of an airport, in which a thread represents a flight (either departure or arrival), and have to optimize the arrivals and departures to minimize the time that it takes flights to land or depart.
So, onto the problem. I have all structures working, threads being created in the right moment, shared memory fully functional, message queue created and working, etc.
But now, I'm struggling with actually managing the flights. I created two linked lists, one for the arrivals and one for the departures. Each node of a linked list has a pointer to a space in the shared memory, and each of those spaces holds the information for one flight (ETA and fuel for arrivals, desired takeoff time for departures). The arrivals list is sorted by ETA, and the departures list is sorted by desired takeoff time. These linked lists are supposed to act as queues.
Problem is, I have no clue on how to manage them. There can be two departures or two arrivals at the same time, but there can't be both arrivals and departures at the same time.
I'm thinking of using semaphores, but I'm not sure if that's a good approach. I'd greatly appreciate any pointers in the right direction.
Thanks in advance!
EDIT:
We have a couple of specifications, but I thought this post would become too big as I was just looking for general headlines. In short, we have to minimize the number of times we need to order an arrival to hold (wait in the air), and "alternate between arrivals and departures to improve airport efficiency"
The specification is short; I understand this is to keep the question small. But with this limited specification, I can only do some guesswork here.
This problem seems to me similar to the Old Bridge problem:
An old bridge has only one lane and can only hold at most 3 cars at a
time without risking collapse. Create a solution that controls traffic
so that at any given time, there are at most 3 cars on the bridge, and
all of them are going the same direction. A car calls ArriveBridge
when it arrives at the bridge and wants to go in the specified
direction (0 or 1); ArriveBridge should not return until the car is
allowed to get on the bridge. A car calls ExitBridge when it gets off
the bridge, potentially allowing other cars to get on. Don’t worry
about starving cars trying to go in one direction; just make sure cars
are always on the bridge when they can be.
If you think so, you can find a solution to this problem here:
https://codeistry.wordpress.com/2018/04/18/old-bridge/
With some modifications you may be able to use it for your air traffic problem.
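For reference, the direction-switching core of that pattern might look like this with a pthread mutex and condition variable. This is only a sketch: it omits the bridge version's capacity limit and any fairness/alternation policy, which is exactly where the "alternate between arrivals and departures" requirement would go:

```c
#include <pthread.h>

/* Flights in the same direction (0 = arrivals, 1 = departures) may proceed
   together; the direction only changes once the runway is empty. */
pthread_mutex_t runway_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  runway_cond  = PTHREAD_COND_INITIALIZER;
int on_runway = 0;   /* flights currently using the runway */
int direction = -1;  /* -1 = idle, 0 = arrivals, 1 = departures */

void runway_enter(int dir)
{
    pthread_mutex_lock(&runway_mutex);
    while (on_runway > 0 && direction != dir)
        pthread_cond_wait(&runway_cond, &runway_mutex); /* opposite direction in use */
    direction = dir;
    on_runway++;
    pthread_mutex_unlock(&runway_mutex);
}

void runway_exit(void)
{
    pthread_mutex_lock(&runway_mutex);
    if (--on_runway == 0) {
        direction = -1;                  /* runway idle: any direction may go */
        pthread_cond_broadcast(&runway_cond);
    }
    pthread_mutex_unlock(&runway_mutex);
}
```

Each flight thread would call runway_enter() with its direction, do its landing/takeoff, then call runway_exit().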
I was wondering why we should even use a stack, since an array or linked list can do everything a stack can do. Why do we bother naming it a "data structure" separately? In the real world, just using an array would be sufficient to solve the problem; why would one bother to implement a stack that restricts him to only pushing and popping from the top of the collection?
I think it is better to use the term data type to refer to things whose behavior is defined by some interface, or algebra, or collection of operations. Things like
stacks
queues
priority queues
dequeues
lists
sets
hierarchies
maps (dictionaries)
are types because for each, we just list their behaviors. Stacks are stacks because they can only be pushed and popped. Queues are queues because ... (you get the picture).
On the other hand, a data structure is an implementation of a data type defined by the way it arranges its components. Examples include
array (constant time access by index)
linked list
bitmap
BSTs
Red-black tree
(a,b)-trees (e.g. 2-3 trees)
Skip lists
hash tables (many variants)
adjacency matrices
Bloom filters
Judy arrays
Tries
A lot of people do confuse the terms data structures and data types, but it's best not to be too pedantic. Stacks are data types, not data structures, but again, why be too pedantic.
To answer your specific question, we use stacks as data types in situations where we want to ensure our data is modified only by pushing and popping, and we never violate this access pattern (even by accident).
Under the hood, we may use a linked list as the implementation of our stack. But most programming languages provide a way to restrict the interface so our code can readably, and securely, use the data in a LIFO fashion.
TL;DR: Readability. Security.
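A sketch of that idea in C: a linked list underneath, but an interface that only exposes push and pop (the type and function names are illustrative):

```c
#include <stdlib.h>

struct stack_node {
    int value;
    struct stack_node *next;
};

/* The stack as a data type: callers see only stack_push and stack_pop,
   even though the implementation underneath is an ordinary linked list. */
struct stack {
    struct stack_node *top;
};

void stack_push(struct stack *s, int value)
{
    struct stack_node *n = malloc(sizeof *n);
    n->value = value;
    n->next = s->top;
    s->top = n;
}

/* Returns 1 and stores the popped value in *out, or 0 if the stack is empty. */
int stack_pop(struct stack *s, int *out)
{
    struct stack_node *n = s->top;
    if (n == NULL)
        return 0;
    *out = n->value;
    s->top = n->next;
    free(n);
    return 1;
}
```

In C the restriction is by convention (keep the struct definition private to one .c file and expose only the functions in the header); languages with access control can enforce it.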
Stacks can be, and usually are, implemented using an array or a linked list as the underlying structure. There are many benefits to using a stack.
One of the main benefits is the LIFO ordering maintained by the push/pop operations: the stack elements have a natural order and are removed in the reverse of the order in which they were added. Such a data structure can be very useful in many applications where using a bare array or linked list would actually be less efficient and/or more complicated. Some of those applications are:
Balancing of symbols (such as parenthesis)
Infix-to-postfix conversion
Evaluation of postfix expression
Implementing function calls (including recursion)
Finding of spans (finding spans in stock markets)
Page-visited history in a Web browser (the Back button)
Undo sequence in a text editor
Matching Tags in HTML and XML
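As an illustration of the first application in the list (balancing of symbols), a stack used in LIFO fashion makes the check trivial; here the stack is a plain char array, assuming inputs under 256 openers deep:

```c
/* Returns 1 if all (), [], {} in s are properly balanced and nested. */
int balanced(const char *s)
{
    char stack[256];
    int top = 0;
    for (; *s; ++s) {
        if (*s == '(' || *s == '[' || *s == '{') {
            stack[top++] = *s;                       /* push the opener */
        } else if (*s == ')' || *s == ']' || *s == '}') {
            char want = *s == ')' ? '(' : *s == ']' ? '[' : '{';
            if (top == 0 || stack[--top] != want)    /* pop must match */
                return 0;
        }
    }
    return top == 0;                                 /* nothing left open */
}
```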
The two underlying implementations of an array or a linked list give the stack different capabilities and features. Here is a simple comparison:
(Dynamic) Array Implementation:
1. All operations take constant time, except push().
2. push() occasionally triggers an expensive doubling operation.
3. Still, any sequence of n operations (starting from an empty stack) takes time proportional to n, so push() is constant in the "amortized" sense.
Linked List Implementation:
1. Grows and shrinks easily.
2. Every operation takes constant time O(1).
3. Every operation uses extra space and time to deal with references.
After several months of evaluating, reevaluating, and planning different data structures and web/application servers, I'm now at a point where I need to bang my head against implementation details. The (at the moment theoretical) question I'm facing is this:
Say I'm using G-WAN's KV store to store C structs for users and the like (works fine, tested). How should I go about removing these objects from the KV store, and later on from memory, without encountering a race condition?
This is what I'm at at the moment:
Thread A:
grab other objects referencing the one to be deleted
set references to NULL
delete object
Thread B:
try to get object -> kv could return object, as it's not yet deleted
try to do something with the object -> could already be deleted here, so I would access already freed memory?
or something else which could happen:
Thread B:
get thing referencing object
follow reference -> object might not be deleted here
do something with reference -> object might be deleted here -> problem
or
Thread B:
got some other object which could reference the to be deleted object
grab object which isn't yet deleted
set reference to object -> object might be deleted here -> problem
Is there a way to avoid these kinds of conditions, other than using locks? I've found a multitude of documents describing algorithms for different producer/consumer situations, hash tables, etc., sometimes even with wait-free implementations (I haven't yet found a good example showing me the difference between lock-free and wait-free, though I get it conceptually), but I haven't been able to figure out how to deal with these kinds of things.
Am I overthinking this, or is there maybe an easy way to avoid all these situations? I'm free to change the data and storage layout in any way I want, and I can use processor-specific instructions freely (e.g. CAS).
Thanks in advance
Several questions there:
deleting a GWAN KV stored struct
When removing a KV from a persistence pointer or freeing the KV, you have to make sure that nobody is dereferencing freed data.
This is application-dependent. You can introduce some tolerance by using G-WAN memory pools, which will let data survive a KV deletion as long as the memory is not overwritten (or the pool freed).
deleting a GWAN KV key-value pair
G-WAN's KV store does the bookkeeping (using atomic intrinsics) to protect values fetched by threads and unprotects them after the request has been processed.
If you need to keep data for a longer time, make a copy.
Other storage tools, like in-memory SQLite, use locks. In this case, lock granularity is very important.
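As an illustration of one atomic building block you mention (beyond CAS), reference counting is a common way to make sure nobody frees an object while another thread still dereferences it. This is only a sketch with C11 atomics; it assumes the store takes a reference on the caller's behalf before handing an object out, and it does not by itself make the lookup path safe:

```c
#include <stdatomic.h>
#include <stdlib.h>

struct object {
    atomic_int refs;
    /* ... payload ... */
};

struct object *object_new(void)
{
    struct object *o = calloc(1, sizeof *o);
    atomic_store(&o->refs, 1);          /* the creator holds one reference */
    return o;
}

void object_retain(struct object *o)
{
    atomic_fetch_add(&o->refs, 1);      /* a new user keeps the object alive */
}

void object_release(struct object *o)
{
    if (atomic_fetch_sub(&o->refs, 1) == 1)
        free(o);                        /* last reference gone: safe to free */
}
```

"Deleting" then means removing the object from the KV store and releasing the store's reference; the memory is only freed once the last thread using it also releases.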
I have a linked list of around 5000 entries (not inserted simultaneously), and I traverse the list looking for a particular entry on occasion (though not very often). Should I consider a hash table as a better choice for this case, replacing the linked list (which is doubly-linked and linear)? Using C on Linux.
If you have not found the code to be the slow part of the application via a profiler then you shouldn't do anything about it yet.
If it is slow, but the code is tested, works, and is clear, and there are other slower areas that you can work on speeding up do those first.
If it is buggy then you need to fix it anyway; go for the hash table, as it will be faster than the list. This assumes that the order in which the data is traversed does not matter; if you care about insertion order, then stick with the list (you can combine a hash table with something that preserves order, but that will make the code much trickier).
Given that you need to search the list only on occasion, the odds of this being a significant bottleneck in your code are small.
Another data structure to look at is the "skip list", which basically lets you skip over a large portion of the list. It requires the list to be sorted, however, which, depending on what you are doing, may make the code slower overall.
Whether a hash table is the better choice depends on the use case, which you have not described in detail. But more importantly, make sure the performance bottleneck really is in this part of the code. If this code is called only once in a while and is not on a critical path, there is no point bothering to change it.
Have you measured and found a performance hit with the lookup? A hash_map or hash table should be good.
If you need to traverse the list in order (not as a part of searching for elements, but say for displaying them) then a linked list is a good choice. If you're only storing them so that you can look up elements then a hash table will greatly outperform a linked list (for all but the worst possible hash function).
If your application calls for both types of operations, you might consider keeping both structures and using whichever one is appropriate for a particular task. The memory overhead would be small, since you'd only need to keep one copy of each element in memory and have both data structures store pointers to these objects.
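A sketch of that layout: each element is stored once and is reachable from both the list (for ordered traversal) and the hash table (for fast lookup). Field names and the bucket count are assumptions:

```c
#include <stddef.h>

#define NBUCKETS 64

struct element {
    int key;
    /* ... payload ... */
    struct element *list_next;   /* ordered traversal */
    struct element *hash_next;   /* bucket chain for lookup */
};

struct both {
    struct element *head;                 /* list of all elements */
    struct element *buckets[NBUCKETS];
};

void both_insert(struct both *b, struct element *e)
{
    e->list_next = b->head;               /* front of the list */
    b->head = e;
    unsigned h = (unsigned)e->key % NBUCKETS;
    e->hash_next = b->buckets[h];         /* front of the bucket chain */
    b->buckets[h] = e;
}

struct element *both_find(struct both *b, int key)
{
    for (struct element *e = b->buckets[(unsigned)key % NBUCKETS]; e; e = e->hash_next)
        if (e->key == key)
            return e;
    return NULL;
}
```

Deletion has to unlink from both structures, which is where a doubly-linked list (as in the question) pays off.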
As with any optimization step that you take, make sure you measure your code to find the real bottleneck before you make any changes.
If you care about performance, you definitely should. If you're iterating through the thing to find a certain element with any regularity, it will be worth using a hash table. If it's a rare case, though, and the ordinary use of the list is not searching, then there's no reason to worry about it.
If you only traverse the collection, I don't see any advantage in using a hash map.
I advise against hashes in almost all cases.
There are two reasons. First, the size of the hash table is fixed.
Second, and much more importantly, the hashing algorithm: how do you know you've got it right? How will it behave with real data rather than test data?
I suggest a balanced B-tree: always O(log n), no uncertainty with regard to a hash algorithm, and no size limits.