What if the db size exceeds memory in sqlite? - database

I'm working with a heavy dataset, which I chose to store in plain binary format and load into memory in chunks. However, even the smallest chunk is about to exceed my computer memory (16GB), so i'll have to fragment them further or find another solution. The whole dataset is around half a terabyte.
I've seen on wiki that sqlite can work with up to 32TB of data, which is cool. However, I could not figure out whether you have to have 32TB of memory to use it, or you can have smaller memory and store the thing on a hard drive. My understanding is that it should be possible, I need very simple operations like add line, read line, pick all lines with a given value e.t.c
I would appreciate if you folks help me, because I would hate to invest into studying sqlite only to learn that it does not work the way I thought. Also if you have any insight that you think might help, please share.

Most databases, including SQLite, can work on the data in chunks, i.e., they need to load only a single record into memory at once. (But using more memory for caching makes things much faster.) Furthermore, if intermediate data becomes too large, it can be moved into temporary files. This is the default behaviour, so it is not explicitly advertized.
Having the entire database in memory is possible, but would make sense only for temporary data that is not to be saved anyway.

Related

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block
header line telling you how many data elements are in the next
block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it in a txt file, press ctrl+end substract header to the rows total and there it is. Although you may waste time opening it if it's very large.

Better to store data in RAM, text file, or database

I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the overhead requirement. Also, any other ideas would be welcome and I have had others myself (i.e. caching, using a server that has everything loaded into RAM etc.). I might benchmark a database, but I thought I would post here to see what other had to say.
Thanks!
UPDATE
I used Tyler's suggestion. Although in my case I did not think a BTree was necessary. I just hashed the words and their offset. I then could look up a word and read in its vector at runtime. I cached the words as they occurred in text so at most each vector is read in only once, however this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RamdomAccessFile class and made use of the readLine(), getFilePointer(), and seek() functions.
Thanks to all who contributed to this thread.
UPDATE 2
For more performance improvement check out buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C-libraries to solve this problem using B-trees. In the old days there was a famous library called "B-trieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory, just store the B-tree (or suffix tree) with an offset to the data in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forwards to the offset and read the vector off disk.
You could use a simple text based index file just mapping the words to indices, and another file just containing the raw vector data for each word. Initially you just read the index to a hashmap that maps each word to the datafile index and keep it in memory. If you need the data for a word, you calculate the offset in the data file (2000 * 32 * index) and read it as needed. You probably want to cache this data in RAM (if you are in java perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoidy database setup / deployment complexity.

Is there a way to read HD data past EOF?

Is there a way to read a file's data but continue reading the data on the hard drive past the end of file? For normal file I/O I could just use fread(), but, obviously, that will only read to the end of the file. And it might be beneficial if I add that I need this on a Windows computer.
All my Googling for a way to do this is instead coming up with results about unrelated topics concerning EOF, such as people having problems with normal I/O.
My reasoning for this is that I accidentally deleted part of the text in a text file I was working on, and it was an entire day's worth of work. I Googled up a bunch of file recovery stuff, but it all seems to be about recovering deleted files, where my problem is that the file is still there but without some of its information, and I'm hoping some of that data still exists directly after the currently marked end of file and is neither fragmented elsewhere or already claimed or otherwise overwritten. Since I can't find a program that helps with this specifically, I'm hoping I can quickly make something up for it (I understand that, depending on what is involved, this might not be as feasible as just redoing the work, but I'm hoping that's not the case).
As far as I can foresee, though I might not be correct (not sure, which is why I'm asking for help), there are 3 possibilities.
Worst of the three: I have to look up Windows API functions that allow direct access to the entire hard drive (similar to its functions for memory, perhaps? those I have experience with) and scan the entire thing for the data that I still have access to from the file and then just continue looking at what's after it.
Second: I can get a pointer to the file, then I still have to get raw access to HD but at least have a pointer to the file in it?
Best of the three: Just open the file for write access, seek to the end, then write a ways past EOF to claim more space, but first hope that Windows won't clean the data before it hands it over to me so that I get garbage data which was the previous data in that spot which would actually be what I'm looking for? This would be awesome if it were that simple, but I'm afraid to test it out because I'd lose the data if it failed, so hopefully someone else already knows. The PC in question is running Vista Home Premium if that matters to anyone that knows the gory details of Windows.
Do either of those three seem plausible? Whether yea or nay, I'm also open (and eager) for other suggestions, especially those which are better than my silly ideas, and especially if they come with direction toward specific functions to use to get the job done.
Also, if anyone else actually has heard of a recovery program that doesn't just recover deleted files but which would actually work for a situation like this, and which is free and trustworthy, that works too.
Thanks in advance for any assistance.
You should get a utility for scanning the free space of a hard drive and recovering data from it, for example PhotoRec or foremost. Note however that if you've been using the machine much at all (even web browsing, which will create files in your cache), the data has likely already been overwritten. Do not save your recovery tools on the same hard drive, or even use the same PC to download them; get them from another computer and save them to a USB device, then run them from that device.
As for the conceptual content of your question, files are abstract objects. There is no such thing as data "past eof" except (depending on the implementation) perhaps up to the next multiple of the filesystem/disk "blocksize". Also it's possible (very likely) that your editor "saved" the file by truncating it and writing everything newly from the beginning, meaning there's not necessarily any correspondence between the old and new storage.
Your question doesn't make a lot of sense -- by definition there is nothing in the file after the EOF. By your further description, it appears that you want to read whatever happens to be on the disk after the last byte that is used by the file, which might be random garbage (unused space) or might be some other file. But in either case, this isn't 'data after the EOF' its just data on the disk that's not part of the file. Its even possible that it might be some other part of the same file, if the filesystem happens to lay out its data that way -- some filesystems scatter blocks in seemingly random ways across the disk and figuring out what bytes belong to which files requires understanding the filesystem metadata.

Performance issues in writing to large files?

I have been recently involved in handling the console logs for a server and I was wondering, out of curiosity, that is there a performance issue in writing to a large file as compared to small ones.
For instance is it a good idea to keep the log file size small instead of letting them grow bulky, but I was not able to argue much in favor of either approach.
There might be problems in reading or searching in the file, but right now I am more interested in knowing if writing can be affected in any way.
Looking for an expert advice.
Edit:
The way I thought it was that the OS only has to open a file handle and push the data to the file system. There is little correlation to the file size, since you have to keep on appending the data to the end of the file and whenever a block of data is full, OS will assign another block to the file. As I said earlier, there can be problems in reading and searching because of defragmentation of file blocks, but I could not find much difference while writing.
As a general rule, there should be no practical difference between appending a block to a small file (or writing the first block which is appending to a zero-length file) or appending a block to a large file.
There are special cases (like trying to fault in a triple-indirect block or the initial open having to read all mapping information) which could add additional I/O's. but the steady-state should be the same.
I'd be more worried about the manageability of having huge files: slow to backup, slow to copy, slow to view, etc.
I am not an expert, but I will try to answer anyway.
Larger files may take longer to write on disk and in fact it is not a programming issue. It is file system issue. Perhaps there are file systems, which does not have such issues, but on Windows large files cannot be write down in one piece so fragmenting them will take time (for the simple reason that head will have to move to some other cylinder). Assuming that we are talking about "classic" hard drives...
If you want an advice, I would go for writing down smaller files and rotating them either daily or when they hit some size (or both actually). That is rather common approach I saw in an enterprise-grade products.

Is it possible to delete both ends of a large file without copying?

I would like to know if it is possible, using Windows and c++, to take a large video file (several gigabytes in length) and delete the first and last few hundred megabytes of it “in-place”.
The traditional approach of copying the useful data to a new file often takes upwards of 20 minutes of seemingly needless copying.
Is there anything clever that can be done low-level with the disk to make this happen?
Sure, it's possible in theory. But if your filesystem is NTFS, be prepared to spend a few months learning about all the data structures that you'll need to update. (All of which are officially undocumented BTW.)
Also, you'll need to either
Somehow unmount the volume and make your changes then; or
Learn how to write a kernel filesystem driver, buy a license from MS, develop the driver and use it to make changes to a live filesystem.
It's a bit easier if your filesystem is something simpler like FAT32. But either way: in short, it might be possible, but even if it is it'll take years out of your life. My advice: don't bother.
Instead, look at other ways you could solve the problem: e.g. by using an avisynth script to serve just the frames from the region you are interested in.
Are you hoping to just fiddle around with sector addresses in the directory entry? It's virtually inconceivable that plan would work.
First of all, it would require that the amount of data you wish to delete be exactly a sector size. That's not very likely considering that there is probably some header data at the very start that must remain there.
Even if it mets those requirements, it would take a low-level modification, which Windows tries very hard to prevent you from doing.
Maybe your file format allows to 'skip' the bytes, so that you could simply write over (i.e. with memory mapping) the necessary parts. This would of course still use up unnecessarily much disk space.
Yes, you can do this, on NTFS.
The end you remove with SetFileLength.
The beginning, or any other large consecutive region of the file, you overwrite with zeros. You then mark the file "sparse", which allows the file system to reclaim those clusters.
Note that this won't actually change the offset of the data relative to the beginning of the file, it only prevents the filesystem from wasting space storing unneeded data.
Even if low level filesystem operations were easy, editing a video file is not simply a matter of deleting unwanted megabytes. You still do have to consider concepts such as compression, frames, audio and video muxing, media file containers, and many others...
Your best solution is to simply accept your idle twenty minutes.

Resources