Optimize I/O to a large file in C

I have written a C program that reads data from a huge file (>3 GB). Each record in the file is a key-value pair. Whenever a query comes in, the program searches for the key and retrieves the corresponding value; updates to a value work the same way.
The queries arrive at a fast rate, so this technique will eventually fail.
The worst-case access time is too large. Loading everything into an in-memory object is also a bad idea because of the size.
Is there any way this problem can be solved?

Sure seems to me a file of that size wrapping a series of name-value pairs is begging to be migrated to an actual database; failing that, I'd probably at least explore the idea of a memory-mapped file, with only portions resident at any given time...
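For what it's worth, a minimal sketch of the memory-mapped route on a POSIX system could look like the following; the file name data.bin and the fixed 16-byte-key / 48-byte-value record layout are assumptions made purely for illustration, since the question doesn't give the record format.

/* Minimal sketch of a memory-mapped key-value file on a POSIX system.
   The file name and the fixed 64-byte record layout are assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define KEY_LEN 16
#define VAL_LEN 48

typedef struct {
    char key[KEY_LEN];
    char value[VAL_LEN];
} record_t;                     /* 64 bytes per record, fixed size */

int main(void)
{
    int fd = open("data.bin", O_RDWR);              /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file; the kernel pages portions in and out on demand,
       so only the touched pages are resident at any given time. */
    record_t *recs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (recs == MAP_FAILED) { perror("mmap"); return 1; }

    size_t nrec = st.st_size / sizeof(record_t);

    /* Linear scan just to show the access pattern; a real lookup would
       use a sorted file plus binary search, or a separate index. */
    for (size_t i = 0; i < nrec; i++) {
        if (strncmp(recs[i].key, "some-key", KEY_LEN) == 0) {
            printf("value: %.*s\n", VAL_LEN, recs[i].value);
            break;
        }
    }

    munmap(recs, st.st_size);
    close(fd);
    return 0;
}

Writes made through the mapping go back to the file because of MAP_SHARED, which covers the update case as well.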

How large are the keys, in comparison to their corresponding values? If they are significantly smaller, you might try creating a table in memory between the keys and the corresponding locations within the file of their values.
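As a rough sketch of that idea in C (again assuming a fixed 16-byte-key / 48-byte-value layout, which is an assumption rather than something stated in the question): scan the file once, build a sorted in-memory array of key/offset pairs, then bsearch it on every query and seek straight to the value.

/* Sketch of an in-memory key -> file-offset index over a file of
   fixed-size records (16-byte key + 48-byte value, assumed layout).
   Error handling is kept minimal for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16
#define VAL_LEN 48

typedef struct {
    char key[KEY_LEN];
    long offset;                /* byte offset of the record in the file */
} index_entry;

int cmp_entry(const void *a, const void *b)
{
    return memcmp(((const index_entry *)a)->key,
                  ((const index_entry *)b)->key, KEY_LEN);
}

/* Build the index with one sequential pass over the file. */
index_entry *build_index(FILE *f, size_t *count)
{
    size_t cap = 1024, n = 0;
    index_entry *idx = malloc(cap * sizeof *idx);
    char key[KEY_LEN];

    if (!idx)
        return NULL;
    for (;;) {
        long pos = ftell(f);
        if (fread(key, 1, KEY_LEN, f) != KEY_LEN)
            break;                                  /* end of file */
        if (n == cap) {
            index_entry *tmp = realloc(idx, (cap *= 2) * sizeof *idx);
            if (!tmp) { free(idx); return NULL; }
            idx = tmp;
        }
        memcpy(idx[n].key, key, KEY_LEN);
        idx[n].offset = pos;
        n++;
        fseek(f, VAL_LEN, SEEK_CUR);                /* skip the value */
    }
    qsort(idx, n, sizeof *idx, cmp_entry);
    *count = n;
    return idx;
}

/* Look up a key; on a hit, seek straight to its value in the file. */
int lookup(FILE *f, index_entry *idx, size_t n, const char *key, char *val_out)
{
    index_entry probe;
    size_t klen = strlen(key);

    memset(&probe, 0, sizeof probe);
    memcpy(probe.key, key, klen > KEY_LEN ? KEY_LEN : klen);
    index_entry *hit = bsearch(&probe, idx, n, sizeof *idx, cmp_entry);
    if (!hit)
        return -1;
    fseek(f, hit->offset + KEY_LEN, SEEK_SET);
    return fread(val_out, 1, VAL_LEN, f) == VAL_LEN ? 0 : -1;
}

int main(void)
{
    FILE *f = fopen("data.bin", "rb");              /* hypothetical file */
    if (!f) { perror("fopen"); return 1; }

    size_t n;
    index_entry *idx = build_index(f, &n);
    if (!idx) { fclose(f); return 1; }

    char value[VAL_LEN + 1] = {0};
    if (lookup(f, idx, n, "some-key", value) == 0)
        printf("value: %s\n", value);

    free(idx);
    fclose(f);
    return 0;
}

With the assumed 64-byte records, a 3 GB file holds roughly 50 million of them, so 24-byte index entries still need on the order of 1 GB of RAM; the idea only pays off when the keys are much smaller than the values.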

Related

Appending + Removing Postgres Text Array elements. Resulting in massive table sizes

UPDATE db.table SET logs = array_prepend('some things happened', logs[0:1000]) WHERE id = 'foo';
This query simply prepends text to a text array, removes elements from the array, and limits the array to 1,000 elements. It works, but the table size on disk rapidly swells to multiple GB (the table should only be around 150 MB). Am I doing something wrong? Is this a bug in PostgreSQL? I'm running PostgreSQL 11.9. If I don't run a full vacuum, PostgreSQL will eventually use up all available disk space.
This query is for a turn-based game, and it stores logs about what's happening to the player for debugging purposes.
This is expected behavior. The space is only cleared by vacuum/autovacuum. However, there's actually not a huge cost to having that used space around, as Postgres will reuse it if it runs short on disk space.
Part of the issue is that modifying a column value requires rewriting the entire row (or in this case, your column is probably getting TOASTed, so rewriting the pointer to the TOAST table and writing a new value in the TOAST table), so each update you do rewrites everything you have stored. For large values this adds up quickly.
If you're really worried about it, I think normalizing this might be a good choice, or you could switch to storing this data in something better designed for append-only data. Or you could use an FDW designed for storing append-only data like this outside the normal storage mechanisms, usually as a file on disk.

Is there a Database engine that implements Random-Access?

By random access I do not mean selecting a random record; random access is the ability to fetch all records in equal time, the same way values are fetched from an array. From Wikipedia: http://en.wikipedia.org/wiki/Random_access
My intention is to store a very large array of strings, one that is too big for memory, but still have the benefit of random access to the array.
I usually use MySQL, but it seems it has only B-tree and hash index types. I don't see a reason why it isn't possible to implement such a thing. The indexes will be like in an array, starting from zero and incrementing by 1. I want to simply fetch a string by its index, not get the index according to the string.
The goal is to improve performance. I also cannot control the order in which the strings will be accessed; it'll be a remote DB server which will constantly receive indexes from clients and return the string for each index.
Is there a solution for this?
P.S. I don't think this is a duplicate of "Random-access container that does not fit in memory?", because in that question he has other demands besides random access.
Given your definition, if you just use an SSD for storing your data, it will allow for what you call random access (i.e. uniform access speed across the data set). The fact that sequential access is less expensive than random access comes from the fact that sequential access to disk is much faster than random access (and any database tries its best to make up for this, btw).
That said, even RAM access is not uniform, since sequential access is faster due to caching and NUMA. So uniform access is an illusion anyway, which raises the question of why you insist on having it in the first place - i.e. what do you think will go wrong with slow random access? It might still be fast enough for your use case.
You are talking about constant time, but you mention a unique incrementing primary key.
Unless such a key is gapless, you cannot use it as an offset, so you still need some kind of structure to look up the actual offset.
Finding a record by offset isn't usually particularly useful, since you will usually want to find it by some more friendly method, which will invariably involve an index. Searching a B-Tree index is worst case O(log n), which is pretty good.
Assuming you just have an array of strings - store it in a disk file of fixed length records and use the file system to seek to your desired offset.
Then benchmark against a database lookup.
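As a small C sketch of that suggestion, assuming each string lives in a fixed 256-byte slot (the slot size is an arbitrary assumption), fetching string i is one seek plus one read:

/* Sketch of an "array of strings" stored as fixed-length records in a
   plain file; string i is fetched with a single seek + read. */
#include <stdio.h>

#define RECORD_SIZE 256         /* assumed fixed slot size per string */

/* Read the string at index i into buf (must hold RECORD_SIZE bytes).
   Returns 0 on success, -1 on error. */
int fetch_string(FILE *f, long i, char *buf)
{
    if (fseek(f, i * (long)RECORD_SIZE, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, RECORD_SIZE, f) != RECORD_SIZE)
        return -1;
    buf[RECORD_SIZE - 1] = '\0';                    /* ensure termination */
    return 0;
}

int main(void)
{
    FILE *f = fopen("strings.dat", "rb");           /* hypothetical file */
    if (!f) { perror("fopen"); return 1; }

    char buf[RECORD_SIZE];
    if (fetch_string(f, 12345, buf) == 0)           /* fetch string #12345 */
        printf("%s\n", buf);

    fclose(f);
    return 0;
}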

How to instantly query a 64 GB database

Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The data model is very simple.
My problem is as follows:
I have 64 GB+ of data as defined above, and in a real-time application I need to query my database and retrieve the data instantly. I was thinking about two solutions, but neither seems feasible to set up.
First, use SQLite or MySQL. One table is needed, with one index on the ID column. The problem is that the database will be too large to get good performance, especially with SQLite.
Second, store everything in memory in a huge hashtable. RAM is the limit.
Do you have another suggestion? How about serializing everything on the filesystem and then, at each query, storing the queried data in a cache system?
When I say real-time, I mean about 100-200 queries/second.
A thorough answer would take into account the data access patterns. Since we don't have these, we just have to assume an equal probability that any given row will be accessed next.
I would first try using a real RDBMS, either embedded or a local server, and measure the performance. If this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, you could create a memory-mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, so you get free use of caching for frequently accessed pages.
Cache use can be optimized more by creating a separate index, and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other (e.g. in succession.) This will ensure that you get the most back for a cache miss.
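A hedged sketch of that combination, assuming (purely for illustration) fixed 128-byte binary rows sorted by a 64-bit ID stored at the start of each row:

/* Sketch of a memory-mapped file of fixed-size rows sorted by ID,
   queried with a binary search.  Row size and file name are assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define ROW_SIZE 128            /* assumed fixed row size in bytes */

/* Return a pointer to the row with the given id, or NULL if not found. */
const char *find_row(const char *base, size_t nrows, uint64_t id)
{
    size_t lo = 0, hi = nrows;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint64_t mid_id;
        memcpy(&mid_id, base + mid * ROW_SIZE, sizeof mid_id);
        if (mid_id == id)
            return base + mid * ROW_SIZE;
        if (mid_id < id)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

int main(void)
{
    int fd = open("table.bin", O_RDONLY);           /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const char *row = find_row(base, st.st_size / ROW_SIZE, 42);
    if (row)
        printf("found row for id 42\n");

    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}

Each lookup touches at most around 30 pages for a 64 GB file, and the hot pages stay in the page cache, so repeated queries end up close to memory speed.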
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in byte) and read the record directly from that offset in the file.
Since you're in read-only mode and assuming you're storing your file on random-access media, this scales very well and doesn't depend on the size of the data: each fetch is a single read from the file. You could try some fancy caching system, but I doubt it would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
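A minimal sketch of step 3, assuming a 512-byte record (the size is an assumption; use whatever your fixed-size export produces):

/* Sketch of direct access to record N in a flat file of fixed-size rows.
   pread() avoids seek bookkeeping and leaves caching to the OS page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define RECORD_SIZE 512         /* assumed size of one exported row */

/* Read record n (0-based) into buf, which must hold RECORD_SIZE bytes. */
int read_record(int fd, long n, char *buf)
{
    ssize_t got = pread(fd, buf, RECORD_SIZE, (off_t)n * RECORD_SIZE);
    return got == RECORD_SIZE ? 0 : -1;
}

int main(void)
{
    int fd = open("export.dat", O_RDONLY);          /* hypothetical flat file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[RECORD_SIZE];
    if (read_record(fd, 1000, buf) == 0)            /* fetch record #1000 */
        printf("record 1000 read\n");

    close(fd);
    return 0;
}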

Detecting when data has changed

OK, so the story is like this:
-- I have lots of files (pretty big, around 25 GB) that are in a particular format and need to be imported into a datastore
-- these files are continuously updated with data, sometimes new, sometimes the same data
-- I am trying to figure out an algorithm to detect whether something has changed for a particular line in a file, in order to minimize the time spent updating the database
-- the way it currently works is that I drop all the data in the database each time and then reimport it, but this won't work anymore, since I'll need a timestamp for when an item changed
-- the files contain strings and numbers (titles, orders, prices, etc.)
The only solutions I could think of are:
-- compute a hash for each row in the database, compare it against the hash of the row from the file, and if they differ, update the database
-- keep two copies of the files, the previous one and the current one, and diff them (which is probably faster than updating the DB), then update the DB based on the diffs
Since the amount of data is huge, I am kind of out of options for now. In the long run, I'll get rid of the files and data will be pushed straight into the database, but the problem still remains.
Any advice will be appreciated.
Problem definition as understood.
Let’s say your file contains
ID,Name,Age
1,Jim,20
2,Tim,30
3,Kim,40
As you stated, rows can be added / updated, hence the file becomes
ID,Name,Age
1,Jim,20 -- to be discarded
2,Tim,35 -- to be updated
3,Kim,40 -- to be discarded
4,Zim,30 -- to be inserted
Now the requirement is to update the database by inserting / updating only the above two records, either in two SQL queries or in one batch containing two SQL statements.
I am making the following assumptions here:
You cannot modify the existing process that creates the files.
You are using some batch processing [reading from file - processing in memory - writing to DB] to upload the data into the database.
Store the hash values of the record [Name, Age] against the ID in an in-memory map, where the ID is the key and the value is the hash [if you require scalability, use Hazelcast].
Your batch framework that loads the data [again assuming it treats one line of the file as one record] needs to check the computed hash value against the ID in the in-memory map. The initial population of the map can also be done using your batch framework while reading the files.
If the ID is present:
-- compare the hashes
-- if they are the same, discard the record
-- if they differ, create an UPDATE SQL statement
If the ID is not present in the in-memory map, create an INSERT SQL statement and insert the hash value.
You might go for parallel processing, chunk processing and in-memory data partitioning using spring-batch and hazelcast.
http://www.hazelcast.com/
http://static.springframework.org/spring-batch/
Hope this helps.
Instead of computing the hash for each row from the database on demand, why don't you store the hash value instead?
Then you could just compute the hash value of the file in question and compare it against the database stored ones.
Update:
Another option that came to my mind is to store the Last Modified date/time information in the database and then compare it against that of the file in question. This should work, provided the information cannot be changed either intentionally or by accident.
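To keep things concrete in C (the language of the original question), the per-row hash could be something as simple as 64-bit FNV-1a computed over each line; the input file name and line format here are assumptions.

/* Sketch of per-row change detection with a 64-bit FNV-1a hash: hash each
   line on import, store the hash against the row's ID, and on the next
   import compare hashes to decide insert / update / skip. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* 64-bit FNV-1a over an arbitrary byte string. */
uint64_t fnv1a64(const char *s, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;             /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 0x100000001b3ULL;                      /* FNV prime */
    }
    return h;
}

int main(void)
{
    char line[4096];
    FILE *f = fopen("import.csv", "r");             /* hypothetical input file */
    if (!f) { perror("fopen"); return 1; }

    while (fgets(line, sizeof line, f)) {
        size_t len = strcspn(line, "\r\n");
        uint64_t h = fnv1a64(line, len);
        /* Here you would look up the row's ID, compare h against the stored
           hash, and emit an INSERT or UPDATE only when it differs. */
        printf("%.*s -> %016llx\n", (int)len, line, (unsigned long long)h);
    }
    fclose(f);
    return 0;
}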
Well, regardless of what you use, your worst case is going to be O(n), which with n ~ 25 GB of data is not so pretty - unless you can modify the process that writes the files.
Since you are not updating all 25 GB all of the time, that is your biggest potential for saving cycles.
1. Don't write randomly
Why don't you make the process that writes the data append only? This way you'll have more data, but you'll have full history and you can track which data you already processed (what you already put in the datastore).
2. Keep a list of changes if you must write randomly
Alternatively, if you really must do the random writes, you could keep a list of updated rows. This list can then be processed as in #1, and you can track which changes you have processed. If you want to save some space, you can keep a list of blocks in which the data changed (where a block is a unit that you define).
Furthermore, you can keep checksums/hashes of the changed blocks/lines. However, this might not be very interesting - it is not so cheap to compute, and direct comparison might be cheaper (if you have free CPU cycles during writing, it might save you some reading time later; YMMV).
Note(s)
Both #1 and #2 are interesting only if you can make adjustments to the process that writes the data to disk.
If you cannot modify the process that writes the 25 GB of data, then I don't see how checksums/hashes can help - you have to read all the data anyway to compute the hashes (since you don't know what changed), so you can directly compare while you read and come up with a list of rows to update/add (or update/add directly).
Using diff algorithms might be suboptimal: a diff algorithm will not only look for the lines that changed, but also check for the minimal edit distance between two text files given certain formatting options (in diff this can be controlled with --minimal or -H, i.e. search for the exact minimal solution, which is slower, or use a heuristic, which is faster; IIRC the heuristic version is O(n log n), which is not bad, but still slower than the O(n) that is available to you if you do a direct line-by-line comparison).
Practically, this is the kind of problem that backup software has to solve, so why not use one of their standard solutions?
The best one would be to hook the WriteFile calls so that you receive a callback on each update. This would work pretty well with binary records.
Something I cannot understand: the files are actually text files that are not just appended to, but updated? That is highly inefficient (together with the idea of keeping two copies of the files, because it will make file caching work even worse).

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting huge amounts of data that do not fit in memory.
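To make that concrete, here is a hedged sketch of the merge phase in C: each run file is assumed to contain 64-bit keys that were already sorted in memory (e.g. with qsort) and written to disk, and the runs are merged into one sorted output without ever holding them all in memory.

/* Sketch of the k-way merge step of an external (multiway merge) sort.
   Inputs are sorted run files of int64 keys; output is one sorted file.
   File names and the plain-integer record format are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define MAX_RUNS 16

int main(int argc, char **argv)
{
    /* argv[1 .. argc-2] are sorted run files, argv[argc-1] is the output. */
    if (argc < 3 || argc - 2 > MAX_RUNS) {
        fprintf(stderr, "usage: %s run1 [run2 ...] out\n", argv[0]);
        return 1;
    }
    int nruns = argc - 2;
    FILE *run[MAX_RUNS];
    FILE *out = fopen(argv[argc - 1], "wb");
    int64_t head[MAX_RUNS];     /* current smallest unread key of each run */
    int alive[MAX_RUNS];        /* 1 while the run still has data */

    if (!out) { perror("fopen"); return 1; }
    for (int i = 0; i < nruns; i++) {
        run[i] = fopen(argv[i + 1], "rb");
        alive[i] = run[i] && fread(&head[i], sizeof head[i], 1, run[i]) == 1;
    }

    for (;;) {
        /* Pick the run with the smallest head key (linear scan; a real
           implementation would use a min-heap for large k). */
        int best = -1;
        for (int i = 0; i < nruns; i++)
            if (alive[i] && (best < 0 || head[i] < head[best]))
                best = i;
        if (best < 0)
            break;                                  /* all runs exhausted */
        fwrite(&head[best], sizeof head[best], 1, out);
        alive[best] = fread(&head[best], sizeof head[best], 1, run[best]) == 1;
    }

    for (int i = 0; i < nruns; i++)
        if (run[i])
            fclose(run[i]);
    fclose(out);
    return 0;
}

The run-generation phase (read a memory-sized chunk, qsort it, write it out) is omitted here; the point is that the merge only ever needs one key per run in memory.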
As far as I know, most indexes use some form of B-tree, which does not need to have everything in memory. You can simply put the nodes of the tree in a file and then jump to various positions in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disc-based database system back in the mid '90s.
Fixed size records are the easiest to work with because your file offset for locating a record can be easily calculated as a multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open source in-memory binary search tree implementation and converted it to do random access of a disc file. Sorted reads from each index file were easy and then reading each data record from the main data file according to the indexed order was also easy. I didn't need to do any in-memory sorting and the system was way faster than any of the available RDBMS systems that would run on a client machine at the time.
For fixed record size data, the index can just keep track of the record number. For variable-length data records, the index just needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
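As a small illustration of the variable-length case, assuming a 4-byte length prefix per record (the exact prefix format is an assumption): the index hands you an offset, and reading the record is one seek, one length read, and one payload read.

/* Sketch of reading a variable-length record at a known file offset,
   assuming each record starts with a 4-byte length prefix. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns a malloc'd buffer with the record payload (caller frees) and
   stores its length in *len_out; returns NULL on error. */
char *read_record_at(FILE *f, long offset, uint32_t *len_out)
{
    uint32_t len;
    if (fseek(f, offset, SEEK_SET) != 0)
        return NULL;
    if (fread(&len, sizeof len, 1, f) != 1)
        return NULL;
    char *buf = malloc(len);
    if (!buf || fread(buf, 1, len, f) != len) {
        free(buf);
        return NULL;
    }
    *len_out = len;
    return buf;
}

int main(void)
{
    FILE *f = fopen("records.dat", "rb");           /* hypothetical data file */
    if (!f) { perror("fopen"); return 1; }

    uint32_t len;
    char *rec = read_record_at(f, 4096, &len);      /* offset from the index */
    if (rec) {
        printf("read %u bytes\n", (unsigned)len);
        free(rec);
    }
    fclose(f);
    return 0;
}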
You would have to partition your data set in some way and spread the partitions across separate servers' RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there. And that's only your index.
For low-cardinality data, such as gender (only two values: male, female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
Hmm... Interesting question.
I think most commonly used database management systems rely on the operating system's mechanisms for memory management, and when physical memory runs out, in-memory tables go to swap.
