Need a fast way to write large blocks of data to file in C

I am not at all good when it comes to writing large chunks of data to file. I have a simulation which has structs like so
typedef struct
{
    int   age;
    float height;
    float weight;
    int   friends[250000];
} Person;
And I can have as many as 250,000 persons, each with 250,000 friends (a clique). Obviously this is a great deal of data. If I want to save each struct so I can later load them, what is the most efficient way in C? Here is what I have considered so far:
I don't want to create a HUGE string with 250,000 groups of data and then do a single write as this will use a great deal of memory
I also don't want to create 250,000 different files as doing so may be slow.
Appending to the file based on index (i.e. person 1, then person 2, ...), but this might be slow too.
Saving the data as binary (is this more efficient?)
EDIT: I am looking for efficient approaches to using fwrite(), namely whether it's faster to collect all the data and write it to a single file, or to create multiple files and avoid the overhead of collecting all the data beforehand.

You can loop over the people and just store the age, height and weight members (3 fwrites), then a friend_count and then loop over the friends and write them one by one. All of this with fwrite. You don't need to care about optimizing I/O, as the C library will buffer for you and do a big "write" when needed.
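As a minimal sketch of that approach, assuming the Person struct from the question and that you track how many friends each person actually has (the friend_count below is not a field of the original struct):

#include <stdio.h>

/* sketch: write one person's data to an already-opened file; friend_count
   is assumed to be tracked somewhere, since the question's struct has no
   such field */
int write_person(FILE *fp, const Person *p, int friend_count)
{
    if (fwrite(&p->age,    sizeof p->age,    1, fp) != 1) return -1;
    if (fwrite(&p->height, sizeof p->height, 1, fp) != 1) return -1;
    if (fwrite(&p->weight, sizeof p->weight, 1, fp) != 1) return -1;
    if (fwrite(&friend_count, sizeof friend_count, 1, fp) != 1) return -1;
    for (int i = 0; i < friend_count; i++)
        if (fwrite(&p->friends[i], sizeof p->friends[i], 1, fp) != 1)
            return -1;
    return 0;   /* stdio buffers these small writes and flushes in big chunks */
}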

I think you are trying to [partially] reinvent an RDBMS (database). Reinventing is usually a bad idea. Consider storing your data in a free database system (e.g. Postgres). It will have other benefits -- you'll be able to interrogate your data without writing C code.
If a database sounds like overkill, use a simpler, file-based database storage library such as BerkeleyDB or SQLite.

I am not very clear about your structure.
You have an array of Person structures, and friends[] contains indexes into that same array?
The best approach would be to separate a Person from his friends.
That way each Person has a fixed size, so you can store all Persons in a single file and quickly read back the data of Person 12345: it sits at file position 12345*sizeof(Person) from the beginning of the file.
The friends arrays can be kept in memory through an
int *Friends[MAXFRIENDS]
array -- you need MAXFRIENDS*sizeof(int *) more bytes of memory; for 250,000 entries that is 2 megabytes on a 64-bit system, small change. Each pointer holds the friend[] array for that person.
The friends of a Person then go into a file in a directory called, say, /dd/cc/aabbccdd, where aabbccdd is obtained by sprintf("%08x", PersonIndex). Using the dd/cc subdirectories gives a slightly more balanced directory tree. To write the friends file, just point to Friends[PersonIndex] and write as many friend indexes as needed (I'd store FriendsNumber in the Person struct).
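A rough sketch of the fixed-size-record idea, assuming Person here is the trimmed, fixed-size variant described above (with FriendsNumber instead of the huge friends array):

#include <stdio.h>

/* minimal sketch: read back Person number idx from a file of fixed-size
   records; assumes the same (trimmed, fixed-size) Person layout was used
   when the file was written */
int load_person(const char *path, long idx, Person *out)
{
    FILE *fp = fopen(path, "rb");          /* binary mode */
    if (!fp)
        return -1;
    if (fseek(fp, idx * (long)sizeof(Person), SEEK_SET) != 0 ||
        fread(out, sizeof(Person), 1, fp) != 1) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return 0;
}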

I'd look at a library like HDF5 so you can not only read the file back on this machine, but give the file to someone else and have the platform portability problem taken care of for you.

Related

Multiple JSON files or a single file with multiple arrays

I have a large blob (Azure) file with 10k JSON objects in a single array. This does not perform well because of its size. As I look to re-architect it, I can either create multiple files, each containing a single array of 500-1000 objects, or I could keep the one file but burst the single array into an array of arrays -- maybe 10 arrays of 1000 objects each.
For simplicity, I'd rather break into multiple files. However, I thought this was worth asking the question and seeing if there was something to be learned in the answers.
I would think this depends strongly on your use-case. The multiple files or multiple arrays you create will partition your data somehow: will the partitions be used mostly together or mostly separate? I.e. will there be a lot of cases in which you only read one or a small number of the partitions?
If the answer is "yes, I will usually only care about a small number of partitions" then creating multiple files will save you having to deal with most of your data on most of your calls. If the answer is "no, I will usually need either 1.) all/most of my data or 2.) data from all/most of my partitions" then you probably want to keep one file just to avoid having to open many files every time.
I'll add: in this latter case, it may well turn out that the file structure (one array vs an array-of-arrays) doesn't change things very much, since a full scan is a full scan however the data is arranged. If that's the case, then you may need to start thinking about how to move to the former case, where you partition your data so that your calls fall neatly within a few partitions, or how to move to a different data format.

How do structures affect efficiency?

Pointers make the program more efficient and faster. But how do structures affect the efficiency of the program? Do they make it faster? Is it just for the readability of the code, or what? And may I have an example of how they do so?
Pointers just hold a memory address; by themselves they have nothing to do with efficiency or speed (a pointer is simply a variable that stores an address needed by some instruction, nothing more).
Data structures, however, do affect the efficiency of your program/code in multiple ways: they can increase or decrease the time complexity and space complexity of your algorithm, and ultimately of the code in which you implement it.
For example, compare an array and a linked list:
Array: a block of space allocated sequentially in memory.
Linked list: blocks of space allocated wherever the allocator finds room, connected via pointers.
In most cases both can be used (assuming no very heavy space allocation), but because an array is a contiguous allocation, retrieval is faster than in a linked list, where every access means fetching the address of the next allocated block and then fetching its data.
Choosing the right structure therefore improves the speed/efficiency of your code.
There are many such examples that show why data structures matter (if they were not so important, why would new ones keep being designed, and why would you be learning them?).
Link to refer:
What are the lesser known but useful data structures?
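To make the array vs. linked list comparison above concrete, here is a small sketch; both functions compute the same sum, but the array walks adjacent memory while the list chases pointers:

#include <stddef.h>

typedef struct node {
    int          value;
    struct node *next;
} node;

/* contiguous array: the next element is always adjacent in memory */
long sum_array(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* linked list: every step chases a pointer to a possibly distant address */
long sum_list(const node *head)
{
    long s = 0;
    for (const node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}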
Structures have little to do with efficiency, they're used for abstraction. They allow you to keep all related data together, and refer to it by a single name.
There are some performance-related features, though. If you have all your data in a structure, you can pass a pointer to that structure as one argument to a function. This is better than passing lots of separate arguments to the function, for each value that would have been a member of the structure. But this isn't the primary reason we use structures, it's mainly an added benefit.
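For example, a small sketch of that idea; the PersonInfo type and print_person function here are made up for illustration:

#include <stdio.h>

/* hypothetical example type */
typedef struct {
    int   age;
    float height;
    float weight;
} PersonInfo;

/* one pointer argument instead of passing age, height and weight separately */
void print_person(const PersonInfo *p)
{
    printf("%d %.1f %.1f\n", p->age, p->height, p->weight);
}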
Pointers by themselves do not contribute anything to a program's efficiency or execution speed. A structure provides a way of storing different variables under the same name. These variables can be of different types, and each has a name which is used to select it from the structure. For example, if you want to store data about a student, it may consist of Student_Id, Name, Sex, School, Address etc., where Student_Id is an int, Name is a string, Sex is a char (M/F) and so on, but all the variables are grouped together as a single structure 'Student' in a single block of memory. So every time you need to fetch or update a student's data, you deal with that structured data only. Now imagine how many problems you would face if you stored all those int, char and char[] variables separately and updated them individually, because then you would have to update data at different memory locations for each student's record.
But if you use data structures to organize your whole data set as abstract data types, where you may choose between different kinds of linked lists, trees, graphs, arrays and so on, then that choice and your algorithm play a vital role in deciding the time and space complexity of the program. In that sense you can make your program more efficient.
When you want to optimize your memory/cache usage, structures can increase the efficiency of your code (make it faster). This is because data is loaded from memory into the cache in whole words (32-64 bits); by fitting your data to these word boundaries you can ensure that when your first int is loaded, so is your second, for a two-int structure (maybe a series of coordinates).
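A tiny illustration of that point (the Point type here is just an example):

#include <stdint.h>

/* two 32-bit values side by side: loading path[i].x also pulls path[i].y
   into the cache in the same transfer */
typedef struct {
    int32_t x;
    int32_t y;
} Point;

Point path[1000];   /* coordinates laid out contiguously in memory */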

Practical to save thousands of data structures in a file and do specific lookups?

There's been a discussion between me and some colleagues that are taking the same class as me (and thus have the same project) about saving data to files and read from those files only when we need that specific data.
For instance, the project is something about managing a social network. I'm not going into specifics because it doesn't matter, but the idea is to use the best data structures to manipulate this data.
Let's say I'm using a hash table to save the users' profile data. Some of them argue that only some specific information should be saved in the data structures, like an ID that represents a user. Everything else should be put in files, and we should access the files to get the data we want, when we want it.
I don't think this is practical... It could be if we were using some database library like SQLite or something, but we are not, and I don't think we are supposed to. We are only supposed to code everything ourselves and use C functions, like these. Nor do I think we are supposed to do perfect memory management. The requirements of the project are not for us to code a database, or even a pseudo-database. What this project demands of us are the best data structures (as long as we know how to justify why we picked those instead of others) to store the type of data, and all the data, specified for the project.
I should let you know that we had 2 earlier classes whose knowledge is to be applied on this project. One of those dealt with the basics of C: functions, structures, arrays, strings, file IO, recursion, pointers and simple data structures like binary trees and linked lists, stuff like that. The other one was about more complex data structures: hash tables, AVL trees, heaps, graphs, etc... It also covered time complexity, big O notation and stuff like that.
For instance, let's say all I have in memory is the IDs of the users and then I need to find all friends of a specific user. I'll have to process the whole file (or files) finding out the friends of that user. It would be much easier if I could have all that data in memory already.
It makes no sense to me that we need to pick (and justify) the data structures that we best see fit for the project and then only use them to look up an ID. We will then need to do a second lookup, to get the real data we need, which will take its time, won't it? Why did we bother with the data structures in the first place if we still need to search a bunch of files on the hard drive?
How could it be possible, using standard C functions, coding everything manually and still simulate some kind of database? Is this practical at all?
Am I missing something here?
It sounds like the project might be more about how you design the relationships between your data "entities," and not as much about how you store them. I don't think storing data off in files would be a good solution - file IO will be much slower than accessing things in memory. If you had the need to persist data on the disk, you'd probably want to just use a database, rather than files (I know it's an academic course though, so who knows).
I think you should focus more on how you design your data types, and their relationships, to maximize the speed of lookups, searches, etc. For example, you could store all the users in a linked list, or store them in a tree, or a graph, but each will have its implications on how fast you can find users, etc. Depending on what features you want in your social networking site, there will be different designs that will allow different types of behavior to perform better than it would in other designs.
From what you're saying I doubt that you need to store anything on disk.
One thing that I would ask the teacher is if you're optimizing for time or space complexity (there will be a trade off between these two depending on what you're trying to achieve).
That can certainly be done. The resource forks in Mac System 5-8 files were stored as binary indexed databases (general use of the term, don't think SQL!). (I think the interface was actually written in assembly, but I could do it in C).
The only thing is: it's a pain in the butt. Such files typically need to start with some kind of index or header, and then hold a bunch of records at predictable locations. (OK, sometimes the first index just points at some more indexes. How many layers of indirection do you care to manage?)
If you're going to do it, just remember: binary mode access.
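As a rough sketch of such a file layout (the header and record types here are invented for illustration; real formats vary):

#include <stdio.h>
#include <stdint.h>

/* invented layout: a small header followed by fixed-size records, so
   record i lives at a predictable offset */
typedef struct {
    uint32_t magic;          /* identifies the file format */
    uint32_t record_count;
} FileHeader;

typedef struct {
    char     name[64];
    uint32_t friend_count;
} Record;

/* read record i; note the "rb": binary mode access, as mentioned above */
int read_record(const char *path, uint32_t i, Record *out)
{
    FILE *fp = fopen(path, "rb");
    if (!fp)
        return -1;
    FileHeader h;
    int ok = fread(&h, sizeof h, 1, fp) == 1 &&
             i < h.record_count &&
             fseek(fp, (long)(sizeof h + (size_t)i * sizeof(Record)), SEEK_SET) == 0 &&
             fread(out, sizeof *out, 1, fp) == 1;
    fclose(fp);
    return ok ? 0 : -1;
}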
Hmm... what about persistent storage?
If your project requires you to remember friend data between two restarts of the app, then doesn't file storage (or whatever else you use to persist the data) become an issue you have to deal with anyway?
I'm having a very hard time figuring out what you are trying to ask here.
But there is a general rule that may apply:
If all of your data will fit in memory at once, it is usually best to load all of it into memory at once and keep it there. You write out to a file only to save, to exit, or for backup.
There are lots of exceptions to this rule, but for a class project where this is going to be the only major application running on the machine, you may as well store everything in memory. After all, you have already paid for the memory; you don't want it just sitting there idle.
I may have completely misunderstood the question you are trying to ask...

Which one to use: linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the returned records from the select query in an array of my structure data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each record in my structure in a loop.
Method 2: use single linked list
while (records returned)
{
    create new node
    store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or having a higher level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access, you just need to break your index into two parts.
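A minimal sketch of that two-level scheme, assuming some record type record_struct is defined elsewhere; names and sizes are illustrative:

#include <stdlib.h>

#define CHUNK_SIZE 1024                      /* records per chunk; arbitrary */

typedef struct {
    record_struct **chunks;                  /* top-level array of chunk pointers */
    size_t          nchunks;
    size_t          count;                   /* total records stored */
} chunked_array;

/* append one record; returns 0 on success, -1 on allocation failure.
   Random access is chunks[i / CHUNK_SIZE][i % CHUNK_SIZE]. */
int chunked_append(chunked_array *ca, const record_struct *rec)
{
    if (ca->count == ca->nchunks * CHUNK_SIZE) {             /* last chunk is full */
        record_struct **t = realloc(ca->chunks, (ca->nchunks + 1) * sizeof *t);
        if (t == NULL)
            return -1;
        ca->chunks = t;
        t[ca->nchunks] = malloc(CHUNK_SIZE * sizeof **t);
        if (t[ca->nchunks] == NULL)
            return -1;
        ca->nchunks++;
    }
    ca->chunks[ca->count / CHUNK_SIZE][ca->count % CHUNK_SIZE] = *rec;
    ca->count++;
    return 0;
}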
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3, it's a little more complicated but it might be best for your particular case. This is the way it's typically done in C++ std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
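A sketch of that growth scheme with a 2x factor, again assuming a record_struct type; the important property is that realloc is only called O(log n) times, so the total copying cost stays linear:

#include <stdlib.h>

/* grow-by-doubling append: returns 0 on success, -1 on allocation failure */
int append_record(record_struct **arr, size_t *count, size_t *capacity,
                  const record_struct *rec)
{
    if (*count == *capacity) {
        size_t newcap = *capacity ? *capacity * 2 : 16;
        record_struct *t = realloc(*arr, newcap * sizeof *t);
        if (t == NULL)
            return -1;             /* allocation failed; old array still intact */
        *arr = t;
        *capacity = newcap;
    }
    (*arr)[(*count)++] = *rec;
    return 0;
}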
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you allocate your array in chunks. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away by a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do this. Internally, this is even how Perl is implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
while ((record = get_record()) != NULL) {
    record_struct *tmp = realloc(records_array, sizeof(record_struct) * (records + 1));
    if (tmp == NULL)
        break;                               /* allocation failed; keep what we have */
    records_array = tmp;
    records_array[records++] = *record;      /* assuming get_record() returns a pointer */
}
This is strictly an example; please don't realloc() one record at a time like this in production.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.

Best data structure in C for these two situations?

I kinda need to decide on this to see if I can achieve it in the couple of hours before my school project's deadline, but I don't understand much about data structures and I need suggestions...
There are 2 things I need to do, and they will probably use different data structures.
I need a data structure to hold profile records. The profiles must be searchable by name and social security number. The SSN is unique, so I can probably use that to my advantage? I suppose a hash map is the best bet here? But how do I use the SSN in a hash map to my advantage when looking for a specific profile? A basic and easy to understand explanation would be much appreciated.
I need a data structure to hold records about cities. I need to know which cities have the most visitors, which are the least visited, and which clients (their profiles are pulled from the data structure in #1) visit a specific city.
This is the third data structure I need for my project and it's the one where I have no idea where to begin. Suggestions as to which type of data structure to use are appreciated, if possible with examples of how to hold the data described above.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is posted here as #1, and although the other group members are taking care of that I just need to know if what we are trying to do is the "best" approach. The third one is #2, the one I need the most help with.
The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.
For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
typedef struct student_s {
    char  ssn[10];     // nul terminated so we can use str* functions
    char  name[100];
    float GPA;
    ...
} student;

student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
typedef struct str_map {
    char    *key;
    student *data;
} smap;

smap skey[MAX_STUDENTS];
and maintain skey sorted on the key, so that I can do fast lookups. (Only an array is a hassle to keep sorted, so we probably prefer a tree, or a hashmap.)
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.
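To make the fast lookup concrete, here is a minimal sketch using the standard bsearch() over the skey array, assuming it is kept sorted by key (find_by_ssn is a hypothetical helper):

#include <stdlib.h>
#include <string.h>

/* comparator for bsearch(): compares a key string against an smap entry */
static int smap_cmp(const void *k, const void *elem)
{
    return strcmp((const char *)k, ((const smap *)elem)->key);
}

/* returns the student with the given SSN, or NULL if not present */
student *find_by_ssn(smap *skey, size_t n, const char *ssn)
{
    smap *hit = bsearch(ssn, skey, n, sizeof skey[0], smap_cmp);
    return hit ? hit->data : NULL;
}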
Outside of a homework question, you'd use a relational database for
this. But that probably doesn't help you…
The first thing you need to figure out, as others have already pointed
out, is how much data you're handling. An O(n) brute-force search is
plenty fast as long a n is small. Since a trivial amount of data would
make this a trivial problem (put it in an array, and just brute-force
search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in
multiple ways:
Some city unique identifier (name?)
Number of visitors
This actually isn't too hard to satisfy. (1) is easiest. Store the
cities in some array. The array index becomes the unique identifier
(assumption: we aren't deleting cities, or if we do delete cities we can
just leave that array spot unused, wasting some memory. Adding is OK).
Now, we also need to be able to find most & fewest visits. Assuming
modifications may happen (e.g., adding cities, changing number of
visitors, etc.) and borrowing from relational databases, I'd suggest
creating an index using some form of balanced tree. Databases would
commonly use a B-Tree, but different ones may work for you: check Wikipedia's
article on trees. In each tree node, I'd just keep a pointer (or
array index) of the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very
easily do an in-order or reverse-order traversal to find the top or
bottom N items. A hash can't do that.
Of course, if modifications may not happen, just use another array (of
pointers to the items, once again, don't duplicate them).
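A small sketch of such an index of pointers, using qsort(); the City type and its visitors field are assumptions made for illustration:

#include <stdlib.h>

/* hypothetical city record; only the field relevant here is shown */
typedef struct {
    char name[64];
    long visitors;
} City;

/* comparator: sort city pointers by descending visitor count */
static int by_visitors_desc(const void *a, const void *b)
{
    const City *ca = *(const City * const *)a;
    const City *cb = *(const City * const *)b;
    return (cb->visitors > ca->visitors) - (cb->visitors < ca->visitors);
}

/* index[] holds pointers into the city array; after sorting, index[0..N-1]
   gives most-visited first, so the "top N" is just the first N entries */
void build_visit_index(City *cities, City **index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        index[i] = &cities[i];
    qsort(index, n, sizeof index[0], by_visitors_desc);
}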
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form
it can take. The most general is that each profile can be associated
with multiple cities and each city can be associated with multiple
profiles. Further, we want to be able to efficiently query from either
direction — ask both "who visits Phoenix?" and "which cities does Bob
visit?".
Shamelessly lifting from databases again, I'd create another data
structure, a fairly simple one along the lines of:
struct profile_city {
    /* btree pointers here */
    size_t profile_idx;   /* or use a pointer */
    size_t city_idx;      /* for both indices */
};
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have
profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city
1) as well, you'd add another one, so you'd have two of them for Bob.
Now, you have a choice: you can store these either in a tree or a
hash. Personally, I'd go with the tree, since that code is already
written. But a hash would be O(1) instead of O(log n) for lookups.
Also, just like we did for the city visit count, create an index for
city_idx so the lookup can be done from that side too.
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order
traversal on the city visit count index), and find out who visits those
cities, by searching for each city in the city_idx index to get the
profile_idx. Grab only unique items, and you have your answer.
Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!
