Postgres: how to calculate temp file space requirement - database

During execution of a query I get this error:
temporary file size exceeds temp_file_limit (<some num>kB).
How can I calculate the temp file storage a query requires (using EXPLAIN ANALYZE)?
I need a method of calculating temp file disk usage so that I can tune my query to lower the requirement.

The way to go would be to disable temp_file_limit (set it to -1) and set log_temp_files = 0. Then you would see the size of the temporary files in the log (and in the EXPLAIN (ANALYZE) output).
But if I were you, I'd look at the EXPLAIN output, find the Sort that promises to be very big, and see what I can do to avoid it. The optimizer will avoid plans with very large hashes, and bitmaps won't ever exceed work_mem, so it must be a large sort that is creating temporary files.

Related

File Browser in C for POSIX OS

I have created a file browsing UI for an embedded device. On the embedded side, I am able to get all files in a directory off the hard disk and return stats such as name, size, modified, etc. This is done using opendir and closedir and a while loop that goes through every file until no files are left.
This is cool, until file counts reach large quantities. I need to implement pagination and sorting. Suppose I have 10,000 files in a directory - how can I possibly go through this many files and sort based on size, name, etc., without easily busting the RAM (about 1 MB of RAM!)? Perhaps something already exists within the Hard Drive OS or drivers?
Here are two suggestions, both of which have a small memory footprint. The first will use no more memory than the number of results you wish to return for the request. Its memory use is constant with respect to the number of files - it depends only on the size of the result set - but the total time is quadratic (or worse) if the user really does page through all results:
You are only looking for a small paged result (e.g. r = 25 entries). You can generate these by scanning through all filenames and maintaining a sorted list of the items you will return, using an insertion sort of length r and, for each file inserted, retaining only the first r results. (In practice you would not insert a file F at all if it sorts after the current rth entry.)
How would you generate the 2nd page of results? You already know the 25th file from the previous request - so during the scan ignore all entries that sort at or before it. (You'll need to work harder if sorting on fields with duplicates.)
The upside is the minimum memory required - the memory needed is not much larger than the r results you wish to return (and can even be less if you don't cache the names). The downside is generating the complete result will be quadratic in time in terms of the number of total files you have. In practice people don't sort results then page through all pages, so this may be acceptable.
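As a rough illustration of the paging scan described above, here is a minimal Python sketch (the embedded target would need the same idea in C with readdir). The function name `page_of_names` and its parameters are made up for this example. It sorts by name only, which sidesteps the duplicate-key problem because names within one directory are unique.

```python
import heapq
import os

def page_of_names(path, r=25, after=None):
    """Return the next r entry names in sorted order, scanning the directory once.

    `after` is the last name of the previous page (None for the first page).
    Only about r names are held in memory at any time.
    """
    with os.scandir(path) as entries:
        # Skip everything that belongs to an earlier page.
        candidates = (e.name for e in entries if after is None or e.name > after)
        # For a generator input, nsmallest keeps a bounded heap of at most r items.
        return heapq.nsmallest(r, candidates)
```

Each page costs one full directory scan, so paging through everything is quadratic overall, as noted above. To fetch page two, pass the last name of page one as `after`.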
If your memory budget is larger (e.g. fewer than 10000 files) but you still don't have enough space to perform a simple in-memory sort of all 10000 filenames, then seekdir/telldir is your friend: create an array of longs by streaming readdir and using telldir to capture the position of each entry (you might even be able to compress the delta between consecutive telldir values to a 2-byte short). As a minimal implementation you can then sort them all with the C library's qsort function, writing your own callback to convert a location into a comparable value. Your callback will use seekdir twice to read the two filenames.
The above approach is overkill - you just sorted all entries and you only needed one page of ~25, so for fun why not read up on Hoare's QuickSelect algorithm and use a version of it to identify the results within the required range. You can recursively ignore all entries outside the required range and only sort the entries between the first and last entry of the results.
What you want is an external sort: a sort done with external resources, usually on disk. The Unix sort command does this. Typically this is done with an external merge sort.
The algorithm is basically this. Let's assume you want to dedicate 100k of memory to this (the more you dedicate, the fewer disk operations, the faster it will go).
1. Read 100k of data into memory (i.e. call readdir a bunch).
2. Sort that 100k hunk in memory.
3. Write the hunk of sorted data to its own file on disk. (You can also use offsets in a single file.)
4. GOTO 1 until all hunks are sorted.
Now you have X hunks of 100k on disk, and each of them is sorted. Let's say you have 9 hunks. To keep within the 100k memory limit, we'll divide the work up into the number of hunks + 1. 9 hunks, plus 1, is 10. 100k / 10 is 10k. So now we're working in blocks of 10k.
1. Read the first 10k of each hunk into memory.
2. Allocate another 10k (or more) as a buffer.
3. Do a K-way merge on the hunks:
   - Write the smallest item in any hunk to the buffer. Repeat.
   - When the buffer fills, append it to a file on disk.
   - When a hunk's in-memory block empties, read the next 10k from that hunk.
4. When all hunks are empty, read the resulting sorted file.
You might be able to find a library to perform this for you.
Since this is obviously overkill for the normal case of small lists of files, only use it if there are more files in the directory than you care to have in memory.
Both in-memory sort and external sort begin with the same step: start calling readdir and writing to a fixed-sized array. If you run out of files before running out of space, just do an in-memory quicksort on what you've read. If you run out of space, this is now the first hunk of an external sort.
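The two phases above (sort fixed-size hunks, then K-way merge them) can be sketched in a few lines of Python. This is only an illustration, not a drop-in for the embedded device. The function name `external_sort` and the `chunk_size` parameter are invented here, the hunk size is counted in items rather than bytes, and the input is assumed to be newline-terminated strings.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size=1000):
    """Yield newline-terminated strings from `lines` in sorted order.

    At most chunk_size items are sorted in memory at once; each sorted
    hunk is spilled to its own temporary file, then the hunks are merged.
    """
    lines = iter(lines)
    hunk_paths = []
    while True:
        # Phase 1: read one hunk, sort it in memory, write it to disk.
        hunk = sorted(itertools.islice(lines, chunk_size))
        if not hunk:
            break
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(hunk)
        hunk_paths.append(path)
    files = [open(p) for p in hunk_paths]
    try:
        # Phase 2: K-way merge. heapq.merge reads lazily, so only one line
        # per hunk is held (plus each file object's own read buffer).
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for p in hunk_paths:
            os.remove(p)
```

Here the file objects' read buffers play the role of the 10k per-hunk blocks, and the consumer of the generator decides where the merged output goes.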

Better to store data in RAM, text file, or database

I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the overhead requirement. Also, any other ideas would be welcome, and I have had others myself (e.g. caching, using a server that has everything loaded into RAM, etc.). I might benchmark a database, but I thought I would post here to see what others had to say.
Thanks!
UPDATE
I used Tyler's suggestion. Although in my case I did not think a BTree was necessary. I just hashed the words and their offset. I then could look up a word and read in its vector at runtime. I cached the words as they occurred in text so at most each vector is read in only once, however this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RandomAccessFile class and made use of the readLine(), getFilePointer(), and seek() functions.
Thanks to all who contributed to this thread.
UPDATE 2
For more performance improvement check out buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C-libraries to solve this problem using B-trees. In the old days there was a famous library called "B-trieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory; just keep the B-tree (or suffix tree) in memory, with each entry holding an offset into the data file. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forward to the offset and read the vector off disk.
You could use a simple text-based index file just mapping the words to indices, and another file just containing the raw vector data for each word. Initially you just read the index into a hashmap that maps each word to the data file index and keep it in memory. If you need the data for a word, you calculate the byte offset in the data file (2000 * 4 * index, for 2000 32-bit floats per vector) and read it as needed. You probably want to cache this data in RAM (if you are in Java, perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoids database setup / deployment complexity.
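A minimal sketch of that index-plus-data-file layout, in Python rather than Java for brevity. The names `write_store`, `read_vector`, `DIM` and `RECORD` are hypothetical, and the on-disk format (fixed-size records of little-endian 32-bit floats) is an assumption made for the example, not something specified in the thread.

```python
import struct

DIM = 2000                              # floats per vector, as in the question
RECORD = struct.Struct("<%df" % DIM)    # little-endian 32-bit floats: 8000 bytes

def write_store(vectors, data_path):
    """Write {word: vector} to a flat binary file; return {word: record index}."""
    index = {}
    with open(data_path, "wb") as f:
        for i, (word, vec) in enumerate(vectors.items()):
            f.write(RECORD.pack(*vec))
            index[word] = i
    return index

def read_vector(word, index, data_file):
    """Seek to the word's record in an open binary file and decode it."""
    data_file.seek(index[word] * RECORD.size)
    return RECORD.unpack(data_file.read(RECORD.size))
```

Only the word-to-index map stays in memory. Each lookup is one seek and one 8000-byte read, and a cache can be layered on top of `read_vector` if the same words recur.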

Strategy for mass storage of small files

What is the good strategy for mass storage for millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during delete (scheduled in cron) HDD usage spikes up to 100% with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, max HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass-removing during peak load seems to be slower than the rate of insertions, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop), it didn't help. I have also tried setting ionice to removing script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass delete of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both delete and insert (w=0) and it didn't help. It only works fast and smooth when there are no deletes going on.
I have also tried storing data in MySQL 5.5, in a BLOB column, in an InnoDB table, with the InnoDB engine set to use innodb_buffer_pool_size=2GB, innodb_log_file_size=1GB, innodb_flush_log_at_trx_commit=2, but the performance was worse: HDD load was always at 80%-100% (expected, but I had to try). The table was only using a BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries were using indexes.
I have looked into pdflush settings (Linux flush process that creates the load during mass removal), but changing the values didn't help anything so I reverted to default.
It doesn't matter how often I run auto-pruning script, each 1 second, each 1 minute, each 5 minutes, each 30 minutes, it is disrupting server significantly either way.
I have tried to store inode value and when removing, remove old files sequentially by sorting them with their inode numbers first, but it didn't help.
Using CentOS 6. HDD is SSD RAID 1.
What would be a good and sensible solution for my task that will solve the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents. Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files results in a performance problem, you can resolve it by "removing" all files at once. Instead of using any filesystem operation (like "remove" or "truncate") you could just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes) you start writing to the second partition while using the first one for reading only. After another 20 minutes you unmount the first partition, create an empty filesystem on it, mount it again, then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to files on each partition. This still requires mass-removing millions of links from tmpfs, but it alleviates the performance problem because only links have to be removed, not file contents; also, these links are removed only from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file size properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
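The big-file approach can be sketched as a circular log with an in-memory table of offsets and sizes. This is a simplified illustration under stated assumptions: the class name `RingStore` and its methods are invented for the example, the index lives only in memory (it is lost on restart), and invalidation does a linear scan that a real implementation would replace with an ordered structure.

```python
class RingStore:
    """Fixed-size file used as a circular log of small blobs."""

    def __init__(self, path, capacity):
        self.capacity = capacity
        self.f = open(path, "w+b")
        self.f.truncate(capacity)      # preallocate the big file
        self.pos = 0                   # next write offset
        self.index = {}                # key -> (offset, size)

    def put(self, key, data):
        if len(data) > self.capacity:
            raise ValueError("blob larger than the store")
        if self.pos + len(data) > self.capacity:
            self.pos = 0               # wrap around to the beginning
        start, end = self.pos, self.pos + len(data)
        # Invalidate every record whose bytes are about to be overwritten.
        for k, (off, size) in list(self.index.items()):
            if off < end and start < off + size:
                del self.index[k]
        self.f.seek(start)
        self.f.write(data)
        self.index[key] = (start, len(data))
        self.pos = end

    def get(self, key):
        off, size = self.index[key]    # KeyError if pruned or never stored
        self.f.seek(off)
        return self.f.read(size)
```

Old data is never deleted explicitly. It simply disappears from the index when newer writes reach its region, so the retention time is roughly capacity divided by write rate.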

I'm trying to rebuild the indexes on a Progress database after a huge binary load, getting this error

Just imported 655 tables via binary load, using a batch script, into a newly created database on a 650 GB hard drive.
Idxbuild is running with threads, the maximum number of threads is 1. (13942)
TMB value is 8, TM value is 8, SG value is 48, packing factor is 100. (16141)
Temporary sort file at: C:\Progress\OpenEdge\bin will use the available disk space. (11443)
SYSTEM ERROR: Unable to extend database within area Schema Area. (8897)
I can't find any solution to this in the documentation.
Tom Bascom -- I know you know a solution to this.
Thank You community!
What version of Progress?
As Tim says, that's a very odd place for your temp files. How did that happen? My guess is that your working directory is %DLC%\bin.
You're extending the schema area? Why? Did you forget to move all of your data, indexes and LOBs to type 2 storage?
By eliminating all of the AREA information from the structure file you put everything in the schema area.
You probably also did not create a structure file with multiple extents and so forth? Thus there is just the single initial extent.
It also seems likely that you did not enable large files. Which means that once that extent hits 2GB it cannot grow.
So the quick and easy solution is probably:
proutil dbName -C enablelargefiles
Note: this is a terrible way to set up a database -- don't do it for a real system. But, as I understand it, you are just trying to do a one-time load of this data so that you can export it as CSV data.
From the KB:
This situation arises when the database Storage Area is either:
a.) composed entirely of fixed-length extents and the last extent has become filled up,
b.) a variable length extent needs to exceed the 2Gig file size limit to accommodate writes and LargeFiles have not been enabled, or
c.) when the user hits their ulimit as defined in their .profile (UNIX) or disk quota limit (Windows)
In any of the above cases, the PROGRESS run is aborted and recovery must be run.
Also:
More extents for the database to grow must be made available. The
prostrct utility must be used to add additional space. In future, the
highwater mark of the last extent in an area can be monitored to
forewarn of this occurrence. Once this is done, restart the database
and allow crash recovery to take place.
and
References to Written Documentation
References to Progress Manuals: Database Administration Guide and Reference - Chapter 9 "Maintaining the Database Structure"
Also, why are you pointing your temp space for sorting at the progress bin directory? Point it at an empty temp directory instead.
Use the Temporary Directory (-T) startup parameter to identify or redirect temporary files created by the PROUTIL utility to a specified directory when sorting and handling space issues.

Dealing with large flat data files with a very big record length

I have a large data file that is created from a shell script. The next script processes it by sorting and reading several times. That takes more than 14 hours; it is not viable.
I want to replace this long-running script with a program, probably in Java, C, or COBOL, that can run on Windows or on Sun Solaris. I have to read a group of records every time, sort and process them, write to the sorted output file, and at the same time insert into db2/sql tables.
If you insert them into a database anyway it might be much simpler to not do the sorting yourself, but just receive the data ordered from the database once you've inserted it all.
Something that might speed up your sorting is altering your data-producing script to place the data into different files based on all of, or a prefix of, the key you will use to sort the entries.
Then when you actually sort the entries you can limit your sort to only work on the smaller files, which will (pretty much) turn your sort time from O( f(N) ) to O( f(n0) + f(n1) + ... ), which for any f() more complex than f(x)=x should be smaller (faster).
This will also open up the possibility of sorting your files concurrently because the disk IO wait time for one sorting thread would be a great time for another thread to actually sort the records that it has loaded.
You will need to find a happy balance between too many files and files that are too big. 256 files is a good starting point.
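As an illustration of the bucketing idea, here is a small Python sketch that splits a file by key prefix, sorts each bucket in memory, and concatenates the buckets in prefix order. The function name `sort_by_buckets` and its parameters are hypothetical. It assumes newline-terminated records whose sort key starts at the beginning of the line, and that each individual bucket fits in memory.

```python
import os

def sort_by_buckets(in_path, out_path, work_dir, prefix_len=1):
    """Sort the lines of in_path by splitting them into per-prefix bucket files."""
    buckets = {}
    with open(in_path) as src:
        for line in src:
            prefix = line[:prefix_len]
            if prefix not in buckets:
                # Hex-encode the prefix so it is always a safe file name.
                name = prefix.encode().hex() or "empty"
                buckets[prefix] = open(os.path.join(work_dir, name), "w+")
            buckets[prefix].write(line)
    with open(out_path, "w") as out:
        # Visiting buckets in prefix order makes the concatenation globally sorted.
        for prefix in sorted(buckets):
            f = buckets[prefix]
            f.seek(0)
            out.writelines(sorted(f))   # one bucket in memory at a time
            f.close()
```

With a one-byte prefix this yields at most 256 buckets for single-byte data, which matches the starting point suggested above. Because the buckets are independent, they could also be sorted concurrently.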
Another thing you might want to investigate is your sorting algorithm. Merge sort is good for secondary storage sorting. Replacement selection sort is also a good algorithm to use for secondary storage sorting.
http://www.cs.auckland.ac.nz/software/AlgAnim/niemann/s_ext.htm
Doing your file IO in large chunks (chunks aligned to the file system block size are best) will also help in most cases.
If you do need to use a relational database anyway you might as well just go ahead and put everything in there to start with, though. RDBMSes typically have very good algorithms to handle all of this tricky stuff.
