How should I change my Graph structure (very slow insertion)?

The program I'm working on is about a social network, which means there are users and their profiles. The profile structure is UserProfile.
Now, there are various possible Graph implementations and I don't think I'm using the best one. I have a Graph structure and inside there's a pointer to a linked list of type Vertex. Each Vertex element has a value, a pointer to the next Vertex and a pointer to a linked list of type Edge. Each Edge element has a value (so I can define weights and whatever is needed), a pointer to the next Edge and a pointer to the owning Vertex.
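For reference, here is a minimal sketch of that layout in C (the names and field types are my guesses, not the actual project code):
typedef struct Edge {
    int value;                 /* weight or whatever is needed */
    struct Edge *next;         /* next edge in this vertex's list */
    struct Vertex *owner;      /* the Vertex this edge refers to */
} Edge;

typedef struct Vertex {
    void *value;               /* points to the UserProfile */
    struct Vertex *next;       /* next vertex in the graph's list */
    Edge *edges;               /* head of this vertex's edge list */
} Vertex;

typedef struct Graph {
    Vertex *vertices;          /* head of the vertex linked list */
} Graph;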
I have two sample files with data to process (CSV style) and insert into the Graph. The first one is the user data (one user per line); the second one is the user relations (for the graph). The first file is inserted into the graph quickly because I always insert at the head and there are about ~18,000 users. The second file takes ages, even though I still insert the edges at the head. The file has about ~520,000 lines of user relations and takes between 13-15 minutes to insert into the Graph. I made a quick test and reading the data is very quick, practically instantaneous. The problem is in the insertion.
This problem exists because I have a Graph implemented with linked lists for the vertices. Every time I need to insert a relation, I need to look up 2 vertices so I can link them together. This is the problem... Doing this for ~520,000 relations takes a while.
How should I solve this?
Solution 1) Some people recommended implementing the Graph (the vertices part) as an array instead of a linked list. This way I have direct access to every vertex and the insertion time is probably going to drop considerably. But I don't like the idea of allocating an array with [18000] elements. How practical is this? My sample data has ~18,000 users, but what if I need much less or much more? The linked list approach has that flexibility: I can have whatever size I want as long as there's memory for it. But the array doesn't, so how am I going to handle such a situation? What are your suggestions?
Using linked lists is good for space complexity but bad for time complexity. And using an array is good for time complexity but bad for space complexity.
Any thoughts about this solution?
Solution 2) This project also demands that I have some sort of data structure that allows quick lookup based on a name index and an ID index. For this I decided to use Hash Tables. My tables are implemented with separate chaining as collision resolution, and when a load factor of 0.70 is reached, I normally recreate the table. I base the next table size on this Link.
Currently, both Hash Tables hold a pointer to the UserProfile instead of duplicating the user profile itself. Duplicating it would be stupid: changing data would require 3 changes, and it's really dumb to do it that way. So I just save the pointer to the UserProfile. The same user profile pointer is also saved as the value in each Graph Vertex.
So, I have 3 data structures, one Graph and two Hash Tables, and every single one of them points to the same exact UserProfile. The Graph structure will serve the purpose of finding the shortest path and things like that, while the Hash Tables serve as quick indexes by name and ID.
What I'm thinking of doing to solve my Graph problem is, instead of having the Hash Table values point to the UserProfile, to point them to the corresponding Vertex. It's still a pointer, no more and no less space is used, I just change what I point to.
Like this, I can easily and quickly look up each Vertex I need and link them together. This will insert the ~520,000 relations pretty quickly.
I thought of this solution because I already have the Hash Tables and I need to have them, so why not take advantage of them for indexing the Graph vertices instead of the user profile? It's basically the same thing: I can still access the UserProfile pretty quickly, just go to the Vertex and then to the UserProfile.
But do you see any cons of this second solution compared to the first one? Or only pros that outweigh the pros and cons of the first solution?
Other Solution) If you have any other solution, I'm all ears. But please explain the pros and cons of that solution over the previous 2. I really don't have much time to be wasting with this right now, I need to move on with this project, so, if I'm going to make such a change, I need to understand exactly what to change and if that's really the way to go.
Hopefully no one fell asleep reading this and closed the browser, sorry for the long write-up. But I really need to decide what to do about this and I really need to make a change.
P.S.: When answering my proposed solutions, please enumerate them as I did so I know exactly which one you are talking about and don't confuse myself more than I already am.

Since the main issue here is speed, I would prefer the array approach.
You should, of course, maintain the hash table for the name-index lookup.
If I understood correctly, you only process the data once, so there is no dynamic data insertion.
To deal with the space allocation problem, I would recommend:
1 - Read the file once to get the number of vertices.
2 - Allocate that space.
If your data is dynamic, you could implement some simple method to increase the array size in steps of 50%.
3 - For the Edges, substitute your linked list with an array. This array should be grown dynamically in steps of 50%, as sketched below.
Even with the "extra" space allocated, when you increase the size in steps of 50%, the total space used by the array should only be marginally larger than the size of the linked list.
I hope I could help.


How to manage memory for inserting and deleting content? [closed]

For convenience, I'll just use plain text as an example. For the sentence I have a cat, for example, I need to malloc 13 char slots so that it stores all the letters plus the final \0.
However, what if now I want to insert lovely before cat? It seems that I have to create a new array that is large enough and copy everything over.
Worse, since the computer cannot predict how much stuff will be added, it seems that I have to do this re-malloc-and-copy routine each time a new letter is added, that is, repeat the whole thing for each letter l o v e l y, which is not a smart solution. (The computer does not know the word 'lovely' ahead of time, eh?)
A "better" solution seems to be creating a large enough array in the first place so that every time a new letter is inserted, the program only shifts everything after it back. However, this is still inefficient, especially when the document is long and I'm adding stuff near the beginning.
The same applies to 'delete': every time a letter is deleted I have to copy everything after it over and shrink the array, it seems.
Using nodes instead of arrays for storing content seems an equally awful solution, as now every time I want to do something in the middle of the content I have to traverse all the way from the beginning.
So what is the correct, or efficient, way to manage the memory in this case? I want answers for programming at a low level such as C, which requires direct memory allocation and de-allocation without "magic" functions or libraries that handle everything for you already.
Using a linked list of chunks of memory sounds like a good intermediate solution. Each node would be a "page" of memory of a certain size. To speed up modifying content in middle pages, you could keep an index array that maps absolute positions in the whole document to page pointers.
Deletion of a page should only be performed when the entire page is empty. At that moment you would do something like:
prevPage->next = nextPage;
pageFree(page_to_delete);
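A minimal sketch of such a page node (PAGE_SIZE and the names are assumptions of mine):
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef struct Page {
    char   text[PAGE_SIZE];   /* characters held by this page */
    size_t used;              /* how many slots are filled */
    struct Page *prev;
    struct Page *next;
} Page;

/* Unlink and free a page once it has become empty. */
void page_remove_if_empty(Page *page)
{
    if (page->used != 0)
        return;
    if (page->prev) page->prev->next = page->next;
    if (page->next) page->next->prev = page->prev;
    free(page);
}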
If you want to handle character insertion and deletion easily without re-mallocing over and over, I think the best solution is a doubly linked list.
Check this out: DoublyLinkedListExample (I learned it at school, but I think this tutorial explains quite simply how it works and how to use it).
The nodes are just structs with your data, a pointer to the previous element and a pointer to the next element. If you don't understand how it works, check a tutorial on singly linked lists first and then it will be easier for you.
Just practice it, because it's quite hard to understand at the beginning. Keep training and you will get it :)
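To make that concrete, a minimal node and insert-after operation could look like this (one character per node, which is simple but memory-hungry; the names are mine):
#include <stdlib.h>

typedef struct CharNode {
    char c;
    struct CharNode *prev;
    struct CharNode *next;
} CharNode;

/* Insert a new character right after 'pos' (pos must not be NULL). */
CharNode *insert_after(CharNode *pos, char c)
{
    CharNode *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->c = c;
    node->prev = pos;
    node->next = pos->next;
    if (pos->next)
        pos->next->prev = node;
    pos->next = node;
    return node;
}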
One efficient solution is to use a circular array list.
http://en.wikipedia.org/wiki/Circular_buffer
After pre-allocating some size of array, you also keep track of a pointer to the 'beginning' of the list (at first the index of 'c', then the index of 'l'). This way, to insert or delete at the beginning you can add to the physical end of the memory and just change the pointer.
To index into the array, you simply index into array[(beginning pointer + index) % size].
If the number of letters becomes too large, you still have to copy to a new array.
In terms of how much to pre-allocate, a scheme that doesn't take too much time is to double the size of the array each time it becomes full. This doesn't add too much overhead.
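A sketch of the index translation and of prepending by moving the start pointer (assuming a fixed-size, pre-allocated buffer; the names are mine):
#include <stddef.h>

typedef struct {
    char  *data;    /* pre-allocated storage */
    size_t size;    /* total capacity */
    size_t head;    /* physical index of the logical first element */
    size_t count;   /* characters currently stored */
} CircBuf;

/* Logical index -> physical slot. */
char circbuf_get(const CircBuf *b, size_t i)
{
    return b->data[(b->head + i) % b->size];
}

/* Prepend by moving the head backwards (wrapping around).
   The caller must ensure count < size, or grow the buffer first. */
void circbuf_push_front(CircBuf *b, char c)
{
    b->head = (b->head + b->size - 1) % b->size;
    b->data[b->head] = c;
    b->count++;
}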
Edit:
A circular array list won't be useful if you need to insert data into the middle of the list. However, it is useful for adding data to the beginning and end of the list and modifying or accessing the middle.
Given what you responded to in the comments clarifying your use case, my suggestion would be to consider a linked list of content, where in the metaphor of your plain text example, the elements of the linked list are words or paragraphs or pages, and the words themselves are contiguous arrays.
While the navigation between them isn't super fast, it seemed that your performance imperative was quick insertion and deletion. By having small contiguous words, the O(n) cost of reallocing/shrinking and copying stuff over is minimized by keeping n small. This is achieved by having many small pieces, which are the linked list elements.
This blends the performance benefit of spatial locality within the 'individual' pieces of content with the freedom to pick an upper-level list/tree structure that helps with temporal locality.
The one thing this really doesn't address is what needs to be done to this data after the fact for processing it, and what level of performance is truly tolerable. Constant malloc calls will be bad for latency because it's a blocking call; so you could further consider using another solution already mentioned, such as circular buffers, or managing your own bigger chunks of memory to distribute yourself to these elements. That way you'd only have to malloc when you needed a much larger chunk of memory to work with, and you still wouldn't necessarily have to recopy everything from page to page, just a smaller chunk that didn't fit.
Again as I said in my comment, people write dissertations about this kinda thing, and it's a major component of OS design and systems understanding. So take this all with a grain of salt. There are a very large number of things to consider that can't be covered here.
It is not completely clear what your use case is.
Since you mention text manipulation and needing efficient insert, delete and random access operations, I guess you could use a rope data structure, which is a binary tree that basically stores short string fragments in its nodes (roughly). For the details see the linked article.

how to remember multiple indexes in a buffer to later access them for modification one by one...keeping optimization in mind

I have a scenario where I have to set a few records with field values to a constant and then later access them one by one sequentially.
The records can be random records.
I don't want to use a linked list as it will be costly, and I don't want to traverse the whole buffer.
Please give me some idea of how to do that.
When you say "set few records with field values to a constant" is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "one-by-one sequentially" and "don't want to traverse the whole buffer" seems to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
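A bare-bones sketch of that array-of-linked-lists idea, keyed on a numeric record id (NBUCKETS, Bucket and ht_insert are illustrative names, not anything from the question):
#include <stdlib.h>

#define NBUCKETS 1024

typedef struct Bucket {
    unsigned long key;        /* the record's numeric key */
    void *record;             /* pointer to the actual record */
    struct Bucket *next;      /* next entry in this chain */
} Bucket;

static Bucket *table[NBUCKETS];

int ht_insert(unsigned long key, void *record)
{
    size_t slot = key % NBUCKETS;          /* "mod the key down" into range */
    Bucket *b = malloc(sizeof *b);
    if (b == NULL)
        return -1;
    b->key = key;
    b->record = record;
    b->next = table[slot];                 /* push onto the head of the chain */
    table[slot] = b;
    return 0;
}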
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

Iterating and storing thousands/millions of objects efficiently

I am working on a simulation where I need to be able to handle thousands, potentially millions, of objects updating every loop.
All of the objects need to have their logic function called (AI).
But the location of the object determines how detailed the logic will be. For example:
[working with 100 objects to keep it simple]
All objects have a location (x,y).
20 objects are 500 points away from a 'point of interest' location.
50 objects are 500 points from the 20 objects (1,000 points away).
30 objects are within 100 points of the point of interest.
Now say this was a detailed city simulation with the objects being virtual citizens.
At 6pm it's time for everyone to go home from their jobs and sleep.
So we iterate through all citizens, but I want them to do different things.
The furthest away objects (50): go home from their job and sleep until morning.
The closer objects (20): go home from their job, have a bite to eat, then sleep until morning.
The closest objects (30): go home from their job, have a bite to eat, brush teeth, then sleep until morning.
As you can see the closer they are to the point of interest the more detailed the logic becomes.
I am trying to work out the best and most performance-efficient way to iterate through all the objects.
This would be relatively easy with a handful of objects, but as this needs to handle at least 500,000 objects efficiently, I need some advice.
Also, I'm not sure if I should iterate through all objects every loop, or whether it would be better to iterate through the closest objects every loop but only iterate through the further away objects every 10 loops?
With the additional requirement of needing the objects to interact with other objects close to them, I have been thinking the best way to do this might be to organise them in a quadtree, but I'm not sure. It seems as though quadtrees are more for static content, but the objects I'm dealing with, as mentioned, have a location and are required to move to other locations.
Am I going down the right track, or is there a 'better' way?
I am also working in C++ if anyone thinks it's relevant.
Any advice would be greatly appreciated.
NOTE:
The point of interest changes regularly; think of it as a camera view.
Objects are created and destroyed dynamically.
If you want to quickly select objects within a certain radius of a particular point, then a quad-tree or just a simple square grid will help.
If your problem is how to store millions of objects to make iteration through them efficient, then you could probably use a column-based technique where, instead of having 1 million objects each with 5 fields, you have 5 arrays of 1 million elements each. In this case each object is just an index in the range 0 .. 999999. So, for example, you want to store 1 million objects of the following structure:
struct resident
{
    int x;
    int y;
    int flags;
    int age;
    int health; // This is a computer game, right?
};
Then, instead of declaring resident residents [1000000] you declare 5 arrays:
int resident_x [1000000];
int resident_y [1000000];
int resident_flags [1000000];
int resident_age [1000000];
int resident_health [1000000];
And then, instead of, say, residents[n].x you use resident_x[n]. This way of storing objects may be faster when you need to iterate through all objects of the same type and do something with a couple of fields in each object (with the same set of fields in each object).
You need to break the problem down into "classes", just like in the real world. Each person's class is computed from the distance. So lower class people are far away and upper class are close. Or more correctly "far class", nearish class and "here class" or whatever you want to name them.
1) Make an array with one slot for each class. This slot will hold a "linked list" of the people in that class. When a person's class changes (social climbers), it is very quick to move the object to another list.
2) So put everybody into the proper classes and iterate only the classes close to you. In a proper scenario there are objects which are too far away to care about, so you can put those back to disk and only reload them when you get nearer.
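A rough sketch of those class buckets, assuming one list head per distance class and intrusive prev/next links in each object (the names are mine):
enum { CLASS_HERE, CLASS_NEAR, CLASS_FAR, CLASS_COUNT };

typedef struct Citizen {
    float x, y;
    struct Citizen *prev, *next;   /* links for the class list */
} Citizen;

static Citizen *class_head[CLASS_COUNT];

/* Remove a citizen from its current class list. */
void class_unlink(Citizen *c, int cls)
{
    if (c->prev) c->prev->next = c->next;
    else class_head[cls] = c->next;
    if (c->next) c->next->prev = c->prev;
}

/* Push a citizen onto the front of another class list. */
void class_push(Citizen *c, int cls)
{
    c->prev = NULL;
    c->next = class_head[cls];
    if (class_head[cls]) class_head[cls]->prev = c;
    class_head[cls] = c;
}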
There are a few questions embedded in there:
-How to deal with large quantities of objects? If there is a constant number of fixed objects, you may be able to simply create an array of them, as long as you have sufficient memory. If you need to dynamically create and destroy them, you put yourself at risk for memory leaks without careful handling of destroyed objects. At a certain point, you may ask yourself whether it is better to use another application, such as a database, to store your objects, and perform just the logic in your C++ code. Databases will provide additional functionality that I will highlight.
-How to find objects in a given distance from others. This is a classic problem for geographic information systems (GIS); it sounds like you are trying to operate a simple GIS to store your objects and your attributes, so it is applicable. It takes computation power to test SQRT((X-x)^2+(Y-y)^2), the distance formula, on every point. Instead, it is common to use a 'windowing function' to extract a square containing all the points you want, and then search within this to find points that lie specifically in a given radius. Some databases are optimized to perform a variety of GIS functions, including returning points within a given radius, or returning points within some other geometry like a polygon. Otherwise you'll have to program this functionality yourself.
-Tree storage of objects. This can improve speed, but you will hit a tradeoff if the objects are constantly moving around, wherein the tree has to be restructured often. It all depends on how often things move versus how often you want to do calculations on them.
-AI code. If you're trying to do AI on millions of objects, this may be your biggest use of performance, instead of the methodology used to store and search the objects. You're right in that simpler code for points farther away will increase performance, as will executing the logic less often for far away points. This is sometimes handled using Monte Carlo analysis, where the logic will be performed on a random subset of points during any given iteration, and you could have the probability of execution decrease as distance from the point of interest increases.
I would consider using a Linear Quadtree with Morton Encoding / Z-Order indexing. You can further optimize this structure by using a Bit Array to represent nodes that contain data and very quickly perform calculations.
I've done this extremely efficiently in the browser using Javascript and I can traverse through 67 million nodes in sub-seconds. Once I've narrowed it down to the region of interest, I look up the data in a different structure. All of it still in milliseconds. I'm using this for spatial vector animation.
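For reference, the bit-interleaving behind a Morton / Z-order key for 16-bit coordinates looks like this in C (a standard trick; the function names are mine):
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Interleave x and y into one 32-bit Morton (Z-order) key. */
uint32_t morton_encode(uint16_t x, uint16_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}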

which one to use linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the returned records from the select query in an array of my structure's data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each record in my structure in a loop.
Method 2: use a singly linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records, I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
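A sketch of that two-level scheme (fixed-size blocks plus a top-level array of block pointers; all names, sizes and the record_struct placeholder are illustrative):
#include <stddef.h>

#define BLOCK_SIZE 1024
#define MAX_BLOCKS 1024

typedef struct { int id; /* ... your record fields ... */ } record_struct;

typedef struct {
    record_struct *blocks[MAX_BLOCKS];  /* each entry points to one fixed-size block */
    size_t count;                       /* total records stored */
} RecordStore;

/* Random access: split the index into (block, offset) - the "two parts". */
record_struct *store_get(RecordStore *s, size_t i)
{
    return &s->blocks[i / BLOCK_SIZE][i % BLOCK_SIZE];
}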
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
while ((record = get_record()) != NULL) {
    records++;
    records_array = realloc(records_array, sizeof(record_struct) * records);
    records_array[records - 1] = *record;
}
This is strictly an example; please don't use realloc() like this in production.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.

Best data structure in C for these two situations?

I kind of need to decide on this to see if I can achieve it in the couple of hours before my school project's deadline, but I don't understand much about data structures and I need suggestions...
There are 2 things I need to do, and they will probably use different data structures.
I need a data structure to hold profile records. The profiles must be searchable by name and social security number. The SSN is unique, so I can probably use that to my advantage? I suppose hash maps are the best bet here? But how do I use the SSN in a hash map to my advantage when looking for a specific profile? A basic and easy to understand explanation would be much appreciated.
I need a data structure to hold records about cities. I need to know which cities have the most visitors, which cities are least visited, and the clients (the profile is pulled from the data structure in #1 for data about the clients) that visit a specific city.
This is the third data structure I need for my project and it's the data structure that I have no idea where to begin with. Suggestions as to which type of data structure to use are appreciated, if possible with examples of how to hold the data above in bold.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is posted here as #1 and, although the other group members are taking care of that, I just need to know if what we are trying to do is the "best" approach. The third one is #2, the one I need the most help with.
The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.
For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
typedef
struct student_s {
    char ssn[10]; // nul terminated so we can use str* functions
    char name[100];
    float GPA;
    /* ... */
} student;

student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
typedef
struct str_map {
    char *key;
    student *data;
} smap;

smap skey[MAX_STUDENTS];

and maintain skey sorted on the key, so that I can do fast lookups. (An array is a hassle to keep sorted, though, so we would probably prefer a tree or a hashmap.)
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.
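If skey is kept as a sorted array, lookups can go through the standard bsearch(); here is a sketch under that assumption (nstudents is assumed to hold the number of entries):
#include <stdlib.h>
#include <string.h>

/* Comparator shared by qsort() and bsearch() over the smap index. */
static int smap_cmp(const void *a, const void *b)
{
    const smap *ma = a, *mb = b;
    return strcmp(ma->key, mb->key);
}

/* Look a student up by key (e.g. SSN); returns NULL if not found.
   Assumes skey[0 .. nstudents-1] is kept sorted on key. */
student *find_student(const char *key, smap *skey, size_t nstudents)
{
    smap probe = { (char *)key, NULL };
    smap *hit = bsearch(&probe, skey, nstudents, sizeof *skey, smap_cmp);
    return hit ? hit->data : NULL;
}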
Outside of a homework question, you'd use a relational database for this. But that probably doesn't help you…
The first thing you need to figure out, as others have already pointed out, is how much data you're handling. An O(n) brute-force search is plenty fast as long as n is small. Since a trivial amount of data would make this a trivial problem (put it in an array, and just brute-force search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in multiple ways:
Some city unique identifier (name?)
Number of visitors
This actually isn't too hard to satisfy. (1) is easiest. Store the cities in some array. The array index becomes the unique identifier (assumption: we aren't deleting cities, or if we do delete cities we can just leave that array spot unused, wasting some memory. Adding is OK).
Now, we also need to be able to find the most & fewest visits. Assuming modifications may happen (e.g., adding cities, changing the number of visitors, etc.) and borrowing from relational databases, I'd suggest creating an index using some form of balanced tree. Databases would commonly use a B-Tree, but different ones may work for you: check Wikipedia's article on trees. In each tree node, I'd just keep a pointer (or array index) to the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very easily do a preorder or reverse-order traversal to find the top or bottom N items. A hash can't do that.
Of course, if modifications may not happen, just use another array (of pointers to the items; once again, don't duplicate them).
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form it can take. The most general case is that each profile can be associated with multiple cities and each city can be associated with multiple profiles. Further, we want to be able to efficiently query from either direction: ask both "who visits Phoenix?" and "which cities does Bob visit?".
Shamelessly lifting from databases again, I'd create another data structure, a fairly simple one along the lines of:
struct profile_city {
    /* btree pointers here */
    size_t profile_idx; /* or use a pointer */
    size_t city_idx;    /* for both indices */
};
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city 1) as well, you'd add another one, so you'd have two of them for Bob.
Now, you have a choice: you can store these either in a tree or a hash. Personally, I'd go with the tree, since that code is already written. But a hash would be O(1) instead of O(log n) for lookups.
Also, just like we did for the city visit count, create an index on city_idx so the lookup can be done from that side too.
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order traversal of the city visit count index), and to find out who visits those cities, by searching for each city in the city_idx index to get the profile_idx. Grab only the unique items, and you have your answer.
Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!
