Hashing with large data sets and C implementation - c

I have a large number of values ranging from 0 to 5463458053. To each value, I wish to map a set of strings so that the lookup operation, i.e. finding whether a string is present in that set, takes the least amount of time. Note that this set of values may not contain all values from 0 to 5463458053, but it does contain a large number of them.
My current solution is to hash those values (between 0 and 5463458053) and, for each value, keep a linked list of the strings corresponding to that value. Every time I want to check for a string in a given set, I hash the value (between 0 and 5463458053), get the linked list, and traverse it to find out whether it contains the string.
While this might seem easy, it's a little time-consuming. Can you think of a faster solution? Also, collisions will be dreadful: they'll lead to wrong results.
The other part is about implementing this in C. How would I go about doing this?
NOTE: Someone suggested using a database instead. I wonder if that'll be useful.
I'm a little worried about running out of RAM naturally. :-)

You could have a hash table of hash sets. The outer hash table is keyed by your integers; each value in it is a hash set, i.e. a hash table whose keys are strings.
You could also have a single hashed set whose keys are (integer, string) pairs.
There are many libraries implementing such data structures (in C++, the standard library provides them as std::unordered_map and std::unordered_set; std::map and std::set are ordered trees rather than hash tables). For C, I was thinking of GLib from GTK.
With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept a 30% emptiness rate.
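To make the second suggestion concrete, here is a minimal sketch of a single hashed set keyed by (integer, string) pairs, using separate chaining. The names pair_insert, pair_contains and NBUCKETS are made up for this illustration, and the FNV-1a hash is just one reasonable choice, not something the question prescribes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024u   /* illustrative; size it to the expected number of pairs */

struct pair_node {
    struct pair_node *next;   /* overflow chain */
    uint64_t key;             /* the integer (needs more than 32 bits) */
    char *str;                /* private copy of the string */
};

static struct pair_node *buckets[NBUCKETS];

/* FNV-1a over the 8 key bytes followed by the string bytes */
static uint64_t pair_hash(uint64_t key, const char *s)
{
    assert(s != NULL);
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < 8; i++) {
        h ^= (key >> (8 * i)) & 0xff;
        h *= 1099511628211ULL;
    }
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* returns 1 if (key, s) is in the set, 0 otherwise */
int pair_contains(uint64_t key, const char *s)
{
    struct pair_node *n = buckets[pair_hash(key, s) % NBUCKETS];
    for (; n != NULL; n = n->next) {
        /* comparing the full pair means a bucket collision cannot give a wrong answer */
        if (n->key == key && strcmp(n->str, s) == 0) {
            return 1;
        }
    }
    return 0;
}

/* returns 1 if inserted, 0 if already present, -1 on allocation failure */
int pair_insert(uint64_t key, const char *s)
{
    if (pair_contains(key, s)) {
        return 0;
    }
    size_t len = strlen(s) + 1;
    struct pair_node *n = malloc(sizeof *n);
    char *copy = malloc(len);
    if (n == NULL || copy == NULL) {
        free(n);
        free(copy);
        return -1;
    }
    memcpy(copy, s, len);
    uint64_t b = pair_hash(key, s) % NBUCKETS;
    n->key = key;
    n->str = copy;
    n->next = buckets[b];
    buckets[b] = n;
    return 1;
}
```

With this layout, pair_contains(5463458053ULL, "abc") answers the membership question in one bucket walk, without a separate per-integer set.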

Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.
These data structures allow for prefix compression, so they are able to store a lot of strings (which will almost necessarily share common prefixes) very efficiently. Lookup is also very fast. Due to caching and paging effects that the usual big-O notation does not account for, they can be as fast as or even faster than a hash table, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash table).

A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:
Judy is a C library that provides a state-of-the-art core technology
that implements a sparse dynamic array. Judy arrays are declared
simply with a null pointer. A Judy array consumes memory only when it
is populated, yet can grow to take advantage of all available memory
if desired. Judy's key benefits are scalability, high performance, and
memory efficiency. A Judy array is extensible and can scale up to a
very large number of elements, bounded only by machine memory. Since
Judy is designed as an unbounded array, the size of a Judy array is
not pre-allocated but grows and shrinks dynamically with the array
population. Judy combines scalability with ease of use. The Judy API
is accessed with simple insert, retrieve, and delete calls that do not
require extensive programming. Tuning and configuring are not required
(in fact not even possible). In addition, sort, search, count, and
sequential access capabilities are built into Judy's design.
Judy can be used whenever a developer needs dynamically sized arrays,
associative arrays or a simple-to-use interface that requires no
rework for expansion or contraction.
Judy can replace many common data structures, such as arrays, sparse
arrays, hash tables, B-trees, binary trees, linear lists, skiplists,
other sort and search algorithms, and counting functions.

If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)
EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.
If memory is available, a hash table is easy. If the bunch of strings is not too large, you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings: put pointers to them in a char *array[] and use the index into that array instead. Finding the index given a string probably involves another hash table.
For the "master" hashtable an entry would probably be:
struct entry {
    struct entry *next;      /* for overflow chain */
    unsigned long long key;  /* the 33-bit number */
    struct list *payload;
} entries[big_enough_for_all]; /* if the size is known in advance, preallocation avoids a lot of malloc overhead */
If you have enough memory to store a heads array, you could certainly do that:
struct entry *heads[SOME_SIZE] = {NULL, };
Otherwise you can combine the heads array with the array of entries (as I did in my answer to "Lookups on known set of integer keys").
Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.
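A rough sketch of how the pieces above could fit together, under some assumptions: NHEADS and MAX_ENTRIES are invented sizes, struct list is assumed to be a simple singly linked list of strings, and the "hash" is just the key modulo the table size. The point is the chain walk, which compares the stored key so that collisions never yield a wrong result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NHEADS 4096u      /* hypothetical table size */
#define MAX_ENTRIES 1000u /* hypothetical capacity of the preallocated pool */

struct list {                 /* one string belonging to a key */
    struct list *next;
    const char *str;
};

struct entry {
    struct entry *next;       /* overflow chain */
    unsigned long long key;   /* the 33-bit number */
    struct list *payload;     /* strings mapped to this key */
};

static struct entry entries[MAX_ENTRIES];
static size_t entries_used;
static struct entry *heads[NHEADS];

/* find the entry for key; if absent and create != 0, take one from the pool */
static struct entry *find_entry(unsigned long long key, int create)
{
    struct entry **slot = &heads[key % NHEADS];
    for (struct entry *e = *slot; e != NULL; e = e->next) {
        if (e->key == key) {      /* the stored key settles collisions */
            return e;
        }
    }
    if (!create || entries_used == MAX_ENTRIES) {
        return NULL;
    }
    struct entry *e = &entries[entries_used++];
    e->key = key;
    e->payload = NULL;
    e->next = *slot;
    *slot = e;
    return e;
}

/* link a caller-owned list node holding str into the key's string list */
int add_string(unsigned long long key, struct list *node, const char *str)
{
    struct entry *e = find_entry(key, 1);
    if (e == NULL) {
        return -1;                /* pool exhausted */
    }
    node->str = str;
    node->next = e->payload;
    e->payload = node;
    return 0;
}

/* returns 1 if str is in the set mapped to key, 0 otherwise */
int has_string(unsigned long long key, const char *str)
{
    struct entry *e = find_entry(key, 0);
    if (e == NULL) {
        return 0;
    }
    for (struct list *l = e->payload; l != NULL; l = l->next) {
        if (strcmp(l->str, str) == 0) {
            return 1;
        }
    }
    return 0;
}
```

Keys 7 and 7 + NHEADS land in the same slot here, yet has_string still tells them apart because each entry carries its full key.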

You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:
function compare_nodes (node1, node2) {
if (node1.set_number < node2.set_number) return LESS;
if (node1.set_number > node2.set_number) return GREATER;
if (node1.string < node2.string) return LESS;
if (node1.string > node2.string) return GREATER;
return EQUAL;
}
With such a structure, some common operations are possible (but maybe not straightforward).
To find whether a string s exists in the set set_number, simply lookup (set_number, s) in the tree, for an exact match.
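The same (set_number, string) ordering can be tried out without a tree library. The sketch below swaps the balanced tree for a sorted array searched with the standard bsearch(), which only suits data that is loaded once and then queried; struct item, sort_items and set_contains are names invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct item {
    uint64_t set_number;
    const char *string;
};

/* order by set_number first, then by string */
static int compare_items(const void *p1, const void *p2)
{
    const struct item *a = p1;
    const struct item *b = p2;
    if (a->set_number < b->set_number) return -1;
    if (a->set_number > b->set_number) return 1;
    return strcmp(a->string, b->string);
}

/* sort once after loading all pairs */
void sort_items(struct item *items, size_t n)
{
    qsort(items, n, sizeof items[0], compare_items);
}

/* returns 1 if (set_number, s) is present in the sorted array, 0 otherwise */
int set_contains(const struct item *items, size_t n,
                 uint64_t set_number, const char *s)
{
    struct item probe = { set_number, s };
    return bsearch(&probe, items, n, sizeof items[0], compare_items) != NULL;
}
```

Because all strings of one set are adjacent in this order, iterating a set is a contiguous scan, which mirrors the successor() walk in the tree version.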
To find all strings in the set set_number:
function iterate_all_strings_in_set (set_number) {
// Traverse the tree from root downwards, looking for the given key. Return
// wherever the search ends up, whether it found the value or not.
node = lookup_tree_weak(set_number, "");
// tree empty?
if (node == null) {
return;
}
// We may have gotten the greatest node from the previous set,
// instead of the first node from the set we're interested in.
if (node.set_number != set_number) {
node = successor(node);
}
while (node != null && node.set_number == set_number) {
do_something_with(node.string);
node = successor(node);
}
}
The above requires O((k+1)*log(n)) time, where k is the number of strings in set_number, and n is the number of all strings.
To find all set numbers with at least one string associated:
function iterate_all_sets ()
{
node = first_node_in_tree();
while (node != null) {
current_set = node.set_number;
do_something_with(current_set);
if (cannot increment current_set) {
return;
}
node = lookup_tree_weak(current_set + 1, "");
if (node.set_number == current_set) {
node = successor(node);
}
}
}
The above requires O((k+1)*log(n)) time, where k is the number of sets with at least one string, and n is the number of all strings.
Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.
Additionally, here's some real C code which demonstrates this, using my own generic AVL tree implementation. To compile it, it's enough to copy the misc/ and structure/ folders from the BadVPN source somewhere and add an include path there.
Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes in handy when you have a lot of data to work with. To make it clear: the program below does only a single malloc(), which is the one that allocates the nodes array.
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <assert.h>
#include <string.h>
#include <structure/BAVL.h>
#include <misc/offset.h>
struct value {
uint32_t set_no;
char str[3];
};
struct node {
uint8_t is_used;
struct value val;
BAVLNode tree_node;
};
BAVL tree;
static int value_comparator (void *unused, void *vv1, void *vv2)
{
struct value *v1 = vv1;
struct value *v2 = vv2;
if (v1->set_no < v2->set_no) {
return -1;
}
if (v1->set_no > v2->set_no) {
return 1;
}
int c = strcmp(v1->str, v2->str);
if (c < 0) {
return -1;
}
if (c > 0) {
return 1;
}
return 0;
}
static void random_bytes (unsigned char *out, size_t n)
{
while (n > 0) {
*out = rand();
out++;
n--;
}
}
static void random_value (struct value *out)
{
random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));
for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
out->str[i] = (uint8_t)32 + (rand() % 94);
}
out->str[sizeof(out->str) - 1] = '\0';
}
static struct node * find_node (const struct value *val)
{
// find AVL tree node with an equal value
BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
if (!tn) {
return NULL;
}
// get node pointer from pointer to its value (same as container_of() in Linux kernel)
struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
assert(n->val.set_no == val->set_no);
assert(!strcmp(n->val.str, val->str));
return n;
}
static struct node * lookup_weak (const struct value *v)
{
BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
static struct node * first_node (void)
{
BAVLNode *tn = BAVL_GetFirst(&tree);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
static struct node * next_node (struct node *node)
{
BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
size_t num_found;
static void iterate_all_strings_in_set (uint32_t set_no)
{
struct value v;
v.set_no = set_no;
v.str[0] = '\0';
struct node *n = lookup_weak(&v);
if (!n) {
return;
}
if (n->val.set_no != set_no) {
n = next_node(n);
}
while (n && n->val.set_no == set_no) {
num_found++; // "do_something_with_string"
n = next_node(n);
}
}
static void iterate_all_sets (void)
{
struct node *node = first_node();
while (node) {
uint32_t current_set = node->val.set_no;
iterate_all_strings_in_set(current_set); // "do_something_with_set"
if (current_set == UINT32_MAX) {
return;
}
struct value v;
v.set_no = current_set + 1;
v.str[0] = '\0';
node = lookup_weak(&v);
if (node->val.set_no == current_set) {
node = next_node(node);
}
}
}
int main (int argc, char *argv[])
{
size_t num_nodes = 10000000;
// init AVL tree, using:
// key=(struct node).val,
// comparator=value_comparator
BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);
printf("Allocating...\n");
// allocate nodes (missing overflow check...)
struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
if (!nodes) {
printf("malloc failed!\n");
return 1;
}
printf("Inserting %zu nodes...\n", num_nodes);
size_t num_inserted = 0;
// insert nodes, giving them random values
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
// choose random set number and string
random_value(&n->val);
// try inserting into AVL tree
if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
n->is_used = 0;
continue;
}
n->is_used = 1;
num_inserted++;
}
printf("Looking up...\n");
// lookup all those values
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
struct node *lookup_n = find_node(&n->val);
if (n->is_used) { // this node is the only one with this value
ASSERT(lookup_n == n)
} else { // this node was an insert collision; some other
// node must have this value
ASSERT(lookup_n != NULL)
ASSERT(lookup_n != n)
}
}
printf("Iterating by sets...\n");
num_found = 0;
iterate_all_sets();
ASSERT(num_found == num_inserted)
printf("Removing all strings...\n");
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
if (!n->is_used) { // must not remove it if it wasn't inserted
continue;
}
BAVL_Remove(&tree, &n->tree_node);
}
return 0;
}

Related

How would I convert this recursive function to an iterative one?

I'm having trouble converting this recursive function find_reachable(..) to its iterative equivalent. I have looked around and saw suggestions to use a stack but cannot figure it out. I also recognize that this function is tail recursive, but don't know what to do with this info. The if statement has me particularly stumped. Any help appreciated, thanks.
void find_reachable(struct person *current, int steps_remaining,
bool *reachable){
// mark current root person as reachable
reachable[person_get_index(current)] = true;
// now deal with this person's acquaintances
if (steps_remaining > 0){
int num_known = person_get_num_known(current);
for (int i = 0; i < num_known; i++){
struct person *acquaintance = person_get_acquaintance(current, i);
find_reachable(acquaintance, steps_remaining - 1, reachable);
}
}
}
.....
struct person {
int person_index; // each person has an index 0..(#people-1)
struct person ** known_people;
int number_of_known_people;
};
// return the person's unique index between zero and #people-1
int person_get_index(struct person * p) {
return p->person_index;
}
// return the number of people known by a person
int person_get_num_known(struct person * p) {
return p->number_of_known_people;
}
// get the index'th person known by p
struct person * person_get_acquaintance(struct person * p, int index) {
//fprintf(stderr, "index %d, num_known %d\n", index, p->number_of_known_people);
assert( (index >= 0) && (index < p->number_of_known_people) );
return p->known_people[index];
}
This looks like a depth-first search: it examines each person by examining that person's first acquaintance, then examining that acquaintance's first acquaintance, and so on before eventually backtracking. Recursion is generally a pretty good strategy for the depth-first search, and most iterative implementations do in fact use a stack to record the addresses of nodes higher up in the tree in order to backtrack to them later. The iterative depth-first search goes like this:
1. Let S be a stack, initially containing only the root of the graph being searched.
2. Pop from the stack into variable v.
3. Push to S all children (or, as they are called in this case, "acquaintances") of v.
4. If the stack is empty, terminate; otherwise, go to step 2.
The simplest way to implement stacks in C is to use a singly linked list, like so:
struct person_stack;
struct person_stack {
struct person *who;
struct person_stack *next;
};
struct person *person_stack_pop(struct person_stack **s) {
struct person_stack *old_top = *s;
struct person *who = old_top->who;
*s = old_top->next; /* note: *s->next would parse as *(s->next) */
free(old_top);
return who;
}
struct person_stack *person_stack_push(struct person_stack **s, struct person *p) {
struct person_stack *new_top = malloc(sizeof (struct person_stack));
new_top->next = *s;
new_top->who = p;
*s = new_top;
return *s;
}
There is one complication here, though! Your function only searches to a given depth. This is the reason why that if statement is there in the first place: to terminate the recursion when the search has gone deep enough. The regular DFS backtracks only when it runs out of children to search, so you'll have to add some extra logic to make it aware of its distance from the root.
You may also want to make sure that an acquaintance is only pushed into the stack if it is not already in the stack. This will save you from redundant iterations—think about what would happen if many of these people have mutual acquaintances.
Full transparency, I don't know C too well, so I used pseudocode where I didn't know the syntax, but something like this might work?
The general idea of converting recursion to iteration often requires adding objects (in this case, people) that need processing to the stack. You add these objects on some condition (in this case, if they're not too deep in the web of acquaintances).
You then iterate as long as the stack is not empty, popping off the top element and processing it. The 'processing' step usually also consists of adding more elements to the stack.
You are effectively mimicking the 'call stack' which results from a recursive function.
There's more involved in this if you care about the order in which elements are processed, but in this case, it doesn't seem like you do since they're only marked as 'reachable'.
void find_reachable(struct person *current, int steps_remaining, bool *reachable){
    stack = /* create a new stack which contains (person, int) tuples or structs */
    stack.push(/* current & steps_remaining */)
    while (/* stack is not empty */) {
        currentStruct = /* pop the top of stack */
        current = currentStruct.person
        depth = currentStruct.depth
        reachable[person_get_index(current)] = true;
        if (depth <= 0) {
            // don't add this person's acquaintances b/c we're 'steps_remaining' levels in
            continue;
        }
        int num_known = person_get_num_known(current);
        for (int i = 0; i < num_known; i++){
            struct person *acquaintance = person_get_acquaintance(current, i);
            stack.push(/* acquaintance & depth - 1 */)
        }
    }
}
EDIT: Improved code
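For reference, the pseudocode above might look like this in actual C. This is only a sketch: struct frame and find_reachable_iter are names invented here, the stack is a growable array rather than a linked list, and the struct fields are accessed directly instead of through the person_get_* helpers. Like the recursive original, it does not track visited people, so the same person may be pushed more than once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct person {
    int person_index;               /* each person has an index 0..(#people-1) */
    struct person **known_people;
    int number_of_known_people;
};

struct frame {                      /* one pending "call" of the recursive version */
    struct person *who;
    int steps_remaining;
};

/* Returns 0 on success, -1 if the stack could not be allocated. */
int find_reachable_iter(struct person *start, int steps_remaining, bool *reachable)
{
    size_t cap = 16, top = 0;
    struct frame *stack = malloc(cap * sizeof *stack);
    if (stack == NULL) {
        return -1;
    }
    stack[top++] = (struct frame){ start, steps_remaining };

    while (top > 0) {
        struct frame f = stack[--top];
        reachable[f.who->person_index] = true;
        if (f.steps_remaining <= 0) {
            continue;               /* deep enough: do not expand further */
        }
        for (int i = 0; i < f.who->number_of_known_people; i++) {
            if (top == cap) {       /* grow the stack when it is full */
                struct frame *bigger = realloc(stack, 2 * cap * sizeof *stack);
                if (bigger == NULL) {
                    free(stack);
                    return -1;
                }
                stack = bigger;
                cap *= 2;
            }
            stack[top++] = (struct frame){ f.who->known_people[i],
                                           f.steps_remaining - 1 };
        }
    }
    free(stack);
    return 0;
}
```

People are marked in a different order than in the recursive version, which does not matter here because the only effect is setting entries of reachable to true.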

Managing duplicates in a binary tree with memory efficiency

I have a self balancing key-value binary tree (similar to Tarjan's Zip Tree) where there will be duplication of keys. To ensure O(log N) performance the only thing I can come up with is to maintain three pointers per node; a less than, a greater than, and an "equals". The equals pointer is a pointer to a linked-list of members having the same key.
This seems memory inefficient to me because I'll have an extra 8 bytes per node in the whole tree to handle the infrequent duplicate occurrences. Is there a better way that doesn't involve "cheats" like bit banging the left or right pointers for use as a flag?
When you have a collision insertion, allocate new buffer, copy new data.
Hash the new data pointer down to one or two bytes. You'll need a hash that only returns zero on zero input!
Store the hash value in your node. This field would be zero if there are no collision data, so you are O(log KeyCount) for all keys without extra data elements. Your worst case is log KeyCount plus whatever your hashing algorithm yields on lookups, which might be a constant close to 1 additional step until your table has to be resized.
Obviously, choice of hashing algorithm is critical here. Look for one that is good with pointer values on whatever architecture you are targeting. You may need different hashes for different architectures.
You can carry this even further by using only one byte hash values that get you the hash table that you then use the key hash (can be a larger integer) to find the pointer to the additional data. When a hash table fills up, insert a new one into the parent table. I'll leave the math to you.
Regarding data locality. Since the node data are large, you already don't have good node record to actual data locality anyway. This scheme doesn't change that, except in the case where you have multiple data nodes for a particular key, in which case, you'd likely have cache miss getting to the correct index of a variable array embedded in the node. This scheme avoids having to reallocate the nodes on collisions, and probably won't have a severe impact on your cache miss rate.
I usually use this setup when I build a binary search tree; it skips duplicate values while inserting from an array:
#include <stdio.h>
#include <stdlib.h>
#define SIZE 13
typedef struct Node
{
struct Node * right;
struct Node * left;
int value;
}TNode;
typedef TNode * Nodo;
void bst(int data, Nodo * p )
{
Nodo pp = *p;
if(pp == NULL)
{
pp = (Nodo)malloc(sizeof(struct Node));
pp->right = NULL;
pp->left = NULL;
pp->value = data;
*p = pp;
}
else if(data == pp->value)
{
return;
}
else if(data > pp->value)
{
bst(data, &pp->right);
}
else
{
bst(data, &pp->left);
}
}
void displayDesc(Nodo p)
{
if(p != NULL)
{
displayDesc(p->right);
printf("%d\n", p->value);
displayDesc(p->left);
}
}
void displayAsc(Nodo p)
{
if(p != NULL)
{
displayAsc(p->left);
printf("%d\n", p->value);
displayAsc(p->right);
}
}
int main()
{
int arr[SIZE] = {4,1,0,7,5,88,8,9,55,42,0,5,6};
Nodo head = NULL;
for(int i = 0; i < SIZE; i++)
{
bst(arr[i], &head);
}
displayAsc(head);
exit(0);
}

Any faster methods to find data?

This is an Interview question.
We are developing a k/v system, part of it has been developed, we need you to finish it.
Things already done -
1) Return a hash of any string, you can assume return value is always unique, no collision,
it's up to you to use it or not
int hash(char *string);
Things you have to finish -
int set(char *key, char *value);
char *get(char *key);
And my answer was
struct kv {
    int key;
    char *value;
    struct kv *next;
};
struct kv *top; /* first entry, or NULL when the list is empty */
struct kv *end; /* last entry, so that set() can append without walking */
int set(char *key, char *value) {
    int k = hash(key);
    /* overwrite the value if the key is already present */
    for (struct kv *i = top; i != NULL; i = i->next) {
        if (i->key == k) {
            i->value = value;
            return 0;
        }
    }
    /* otherwise append a new entry */
    struct kv *n = malloc(sizeof *n);
    if (n == NULL) {
        return -1;
    }
    n->key = k;
    n->value = value;
    n->next = NULL;
    if (top == NULL) {
        top = n;
    } else {
        end->next = n;
    }
    end = n;
    return 0;
}
char *get(char *key) {
    int k = hash(key);
    for (struct kv *i = top; i != NULL; i = i->next) {
        if (i->key == k) {
            return i->value;
        }
    }
    return NULL;
}
Q: - Is there any faster way to do it? What do you think is the fastest way?
What you have done is made a linked list to store the key value pairs. But as you can see, the search complexity is O(n). You can make it faster by creating a hash table. You already have a hash function with guaranteed 0 collisions.
What you can do is
char *hash_table[RANGE_OF_HASH] = {NULL}; // Your interviewer should provide you RANGE_OF_HASH
Then your set and get become -
int set(char *key, char *value) {
    hash_table[hash(key)] = value; // Can do this because no collisions are guaranteed.
    return 0;
}
char *get(char *key) {
    return hash_table[hash(key)];
}
In this case since you don't have to iterate over all the keys inserted, the get complexity is O(1) (also set).
But you need to be aware that this usually occupies more space than your approach.
Your method occupies O(n) space but this occupies O(RANGE_OF_HASH). Which might not be acceptable in situations where memory is a constraint.
If RANGE_OF_HASH is very large (like INT_MAX) and you don't have enough memory for hash_table, you can create a multi-level hash table.
For instance, your main hash_table will have only 256 slots. Each entry will point to another hash table of 256 entries, and so on. You will have to do some bit masking to get the hash value for each level. You can allocate each level on demand. This way you will minimize the memory usage.
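One way the multi-level idea could be sketched, assuming the hash fits in 32 bits so that four levels of 256 slots cover it. The names ml_set, ml_get and struct level are invented for this example, and the functions take the hash value directly (in the interview setting you would pass hash(key) cast to uint32_t). It also assumes calloc's zeroed memory reads back as NULL pointers, which holds on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FANOUT 256u
#define LEVELS 4       /* 4 levels x 8 bits cover a 32-bit hash */

/* Inner levels hold pointers to child tables; the last level holds the values. */
struct level {
    void *slot[FANOUT];
};

static struct level root;

/* store value under the 32-bit hash h; returns 0, or -1 if out of memory */
int ml_set(uint32_t h, char *value)
{
    struct level *t = &root;
    for (int shift = 8 * (LEVELS - 1); shift > 0; shift -= 8) {
        unsigned idx = (h >> shift) & 0xffu;   /* mask out this level's byte */
        if (t->slot[idx] == NULL) {
            /* allocate the child table only when it is first needed */
            t->slot[idx] = calloc(1, sizeof(struct level));
            if (t->slot[idx] == NULL) {
                return -1;
            }
        }
        t = t->slot[idx];
    }
    t->slot[h & 0xffu] = value;
    return 0;
}

/* returns the stored value, or NULL if nothing was stored under h */
char *ml_get(uint32_t h)
{
    const struct level *t = &root;
    for (int shift = 8 * (LEVELS - 1); shift > 0; shift -= 8) {
        t = t->slot[(h >> shift) & 0xffu];
        if (t == NULL) {
            return NULL;
        }
    }
    return t->slot[h & 0xffu];
}
```

Each lookup costs a fixed four array accesses, and memory is only spent on the parts of the hash range that are actually used.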
There are lots of great ways of doing this. Here's a small reading list; go through it. There are definitely more out there that I'm not aware of.
Sorted list with binary search - Depending on the usage patterns, can be fast or slow to build, but lookups are guaranteed to be O(log(N)).
Hash table - fast, close to O(1) on average, O(N) in worst case for all operations.
Binary tree - best case O(log(N)), worst case O(N).
AVL tree - guaranteed O(log(N)) for all operations.
Red-black tree - similar to AVL but trades off lookup speed for more inserting speed.
Trie - lookup cost is proportional to the key length and independent of the number of stored keys, at the expense of more memory usage.
After this, take a break, brace yourself, and delve into this article about computer memory. This is already advanced stuff and will show you that sometimes a worse big-O measure can actually perform better in real world scenarios. It's all down to what kind of data will there be and what the usage patterns are.

C Function returning pointer to garbage memory [duplicate]

I am writing a program that, given a set of inputs and outputs, figures out what the equation is. The way the program works is by randomly generating binary trees and putting them through a genetic algorithm to see which is the best.
All the functions I have written work individually, but there is either one or two that do not.
In the program I use two structs, one for a node in the binary tree and the other to keep track of how accurate each tree is given the data (its fitness):
struct node {
char value;
struct node *left, *right;
};
struct individual {
struct node *genome;
double fitness;
};
One function I use to randomly create trees is a subtree crossover function, which randomly merges two trees, returning two trees that are sort of a mixture of each other. The function is as follows:
struct node **subtree_crossover(struct node parent1, struct node parent2) {
struct node *xo_nodes[2];
for (int i = 0; i < 2; i++) {
struct node *parent = (i ? &parent2 : &parent1);
// Find the subtree at the crossover point
xo_nodes[i] = get_node_at_index(&parent, random_index);
}
else {
// Swap the nodes
struct node tmp = *xo_nodes[0];
*xo_nodes[0] = *xo_nodes[1];
*xo_nodes[1] = tmp;
}
struct node **parents = malloc(sizeof(struct node *) * 2);
parents[0] = &parent1;
parents[1] = &parent2;
return parents;
}
Another function I use takes two populations (lists of individuals) and selects the best from both, returning the next population. It is as follows:
struct individual *generational_replacement(struct individual *new_population,
int size, struct individual *old_population) {
int elite_size = 3;
struct individual *population = malloc(sizeof(struct individual) * (elite_size + size));
int i;
for (i = 0; i < size; i++) {
population[i] = new_population[i];
}
for (i; i < elite_size; i++) {
population[i] = old_population[i];
}
sort_population(population);
population = realloc(population, sizeof(struct individual) * size);
return population;
}
Then there is the function that is essentially the main part of the program. This function loops through a population, randomly modifies its individuals, and chooses the best among them across multiple generations. From this, it selects the best individual (the one with the highest fitness) and returns it. It is as follows:
struct individual *search_loop(struct individual *population) {
int pop_size = 10;
int tourn_size = 3;
int new_pop_i = 0;
int generation = 1;
struct individual *new_population = malloc(sizeof(struct individual) * pop_size);
while (generation < 10) {
while (new_pop_i < pop_size) {
// Insert code where random subtrees are chosen
struct node **nodes = subtree_crossover(random_subtree_1, random_subtree_2);
// Insert code to add the trees to new_population
}
population = generational_replacement(new_population, pop_size, population);
// Insert code to sort population by fitness value
}
return &population[0];
}
The issue I am having is that the search_loop function returns a pointer to an individual that is filled with garbage values. To narrow down the causes, I began to comment out code. By commenting out either subtree_crossover() or generational_replacement() the function returns a valid individual. Based on this, my guess is that the error is caused by either subtree_crossover() or generational_replacement().
Obviously, this is a heavily reduced version of the code I am using, but I believe it still will show the error that I am getting. If you would like to view the full source code, look in the development branch of this project: https://github.com/dyingpie1/pony_gp_c/tree/Development
Any help would be greatly appreciated. I have been trying to figure this out for multiple days.
Your subtree_crossover() function is taking two nodes as values. The function will receive copies, which will then live on the stack until the function exits, at which point they will become invalid. Unfortunately, the function later sticks their addresses into an array that it returns. Therefore, the result of subtree_crossover() is going to contain two invalid pointers to garbage data.
You could initialize parents as a struct node * instead of a struct node **, and make it twice the size of a struct node. Then, you could just copy the nodes into the array. This would avoid the issue. Alternatively, you could copy the nodes onto the heap, so that you could return a struct node **. You'd then have to remember to eventually free the copies, though.
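The second option (copying the nodes onto the heap) could look roughly like the following. copy_parents is a hypothetical helper written for this illustration, not part of the asker's project; it makes shallow copies of the two root nodes, so the children are still shared with the originals.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    char value;
    struct node *left, *right;
};

/* Returns a malloc'ed array of two pointers to malloc'ed copies of the
   parents' root nodes, or NULL on allocation failure. The caller must
   free result[0], result[1] and result itself. */
struct node **copy_parents(const struct node *parent1, const struct node *parent2)
{
    struct node **parents = malloc(2 * sizeof *parents);
    struct node *c1 = malloc(sizeof *c1);
    struct node *c2 = malloc(sizeof *c2);
    if (parents == NULL || c1 == NULL || c2 == NULL) {
        free(parents);
        free(c1);
        free(c2);
        return NULL;
    }
    *c1 = *parent1;     /* shallow copy: left/right still point at the originals */
    *c2 = *parent2;
    parents[0] = c1;    /* heap addresses stay valid after this function returns */
    parents[1] = c2;
    return parents;
}
```

Unlike &parent1 and &parent2 in the original function, the returned pointers refer to heap storage, so they remain valid until the caller frees them.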

Is it possible to build a linked list without the help of self referential structure?

Is it possible to build a linked list without the help of self referential structure? I.e. like this just using a pointer:
struct list{
int data;
int *nxt;
};
instead of
struct list{
int data;
struct list *nxt;
};
Yes, it's possible.
What you're proposing is type punning, and you'll probably get away with it with most compilers on most platforms, but there's not a single good reason to do it in the first place and many good reasons not to.
You might have a look at how linked lists are implemented in the Linux kernel.
Compared to the classical linked lists:
struct my_list{
void *myitem;
struct my_list *next;
struct my_list *prev;
};
using a linked list in the linux kernel looks like:
struct my_cool_list{
struct list_head list; /* kernel's list structure */
int my_cool_data;
};
Sure. You could use void * instead of struct list * as your nxt member. For example:
struct list {
int data;
void *nxt;
};
Alternatively, if you're prepared to build in a guarantee that the nodes of the list will be stored into an array (which is great for cache locality and insertion speed), you could use a size_t.
struct list {
int data;
size_t nxt;
};
If your implementation provides uintptr_t (from <stdint.h>), you could merge these two types into one:
struct list {
int data;
uintptr_t nxt;
};
Another approach is to use parallel arrays as your "heap" [1], with one array containing your data (whether a scalar or a struct) and another to store the index of the next "node" in the list, like:
int data[N];
int next[N];
You'll need two "pointers" - one to point to the first element in the list, and one to point to the first available "node" in your "heap".
int head;
int free;
You'll initialize all the elements in the "next" array to "point" to the next element, except for the last element which points to "null":
for ( int i = 1; i < N; i++ ) // sets next[0] .. next[N-2]
next[i-1] = i;
next[N-1] = -1;
free = 0; // first available "node" in the heap
head = -1; // list is empty, head is "null"
Inserting a new element in the list looks something like this (assuming an ordered list):
if ( free != -1 )
{
int idx = free; // get next available "node" index;
free = next[free]; // update the free pointer
data[idx] = newValue;
next[idx] = -1;
if ( head == -1 )
{
head = idx;
}
else if ( data[idx] < data[head] )
{
next[idx] = head;
head = idx;
}
else
{
int cur = head;
while ( next[cur] != -1 && data[idx] >= data[next[cur]] )
cur = next[cur];
next[idx] = next[cur];
next[cur] = idx;
}
}
Deleting an element at idx is pretty straightforward:
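if ( head == idx )
{
head = next[idx]; // idx is the first element: advance head
next[idx] = free; // add idx to the head of the free list
free = idx;
}
else if ( head != -1 )
{
int cur = head;
while ( next[cur] != idx && next[cur] != -1 )
cur = next[cur];
if ( next[cur] == idx )
{
next[cur] = next[idx]; // point to the element following idx
next[idx] = free; // add idx to the head of the free list
free = idx;
}
}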
Advantages of this approach:
No dynamic memory management;
No self-referential structures;
Data items are stored contiguously in memory, which may lead to better caching performance.
Disadvantages:
In this simple approach, list sizes are fixed;
Any usefully large arrays will have to be declared static, meaning your binary may have a large footprint;
The convention of using -1 as "null" means you have to use a signed type for your indices, which reduces the number of available elements for a given array type.
You could certainly use 0 as "null" and count all your indices from 1; this allows you to use unsigned types like size_t for your indices, meaning you can make your arrays pretty much as big as the system will allow. It's just that 0 is a valid array index whereas -1 (usually) isn't, which is why I chose it for this example.
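A small sketch of that 0-as-"null" variant, with size_t indices counted from 1. The names next_idx, free_head, list_init, list_push_front and list_length are invented here (free_head also avoids clashing with the standard free() function), and only front insertion is shown to keep it short.

```c
#include <assert.h>
#include <stddef.h>

#define N 8                    /* usable nodes are 1..N; index 0 means "null" */

static int data[N + 1];
static size_t next_idx[N + 1];
static size_t head;            /* 0: list is empty */
static size_t free_head;       /* first unused node, 0 when the heap is full */

/* chain all nodes 1..N into the free list and empty the list itself */
void list_init(void)
{
    for (size_t i = 1; i < N; i++) {
        next_idx[i] = i + 1;
    }
    next_idx[N] = 0;
    free_head = 1;
    head = 0;
}

/* push value at the front; returns its node index, or 0 if the heap is full */
size_t list_push_front(int value)
{
    size_t idx = free_head;
    if (idx == 0) {
        return 0;
    }
    free_head = next_idx[idx];
    data[idx] = value;
    next_idx[idx] = head;
    head = idx;
    return idx;
}

/* number of elements currently in the list */
size_t list_length(void)
{
    size_t n = 0;
    for (size_t i = head; i != 0; i = next_idx[i]) {
        n++;
    }
    return n;
}
```

Slot 0 of both arrays is deliberately wasted so that 0 can serve as the end-of-list marker.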
1. Welcome to my Data Structures class, ca. 1987, which was taught using Fortran 77. Parallel arrays were pretty much the answer for everything (lists, trees, queues, etc.)
