Libpq speed on windows - database

I have a problem with libpq on Windows. Connecting to a db and running a "select * from some_table;" is very slow.
The table has only 1800 rows, 7 columns. No blobs etc.
The query takes around 3500ms; on Linux it takes around 800ms. (About 500ms is network time, the server is on the opposite side of the world from my location.)
The hardware is identical (dual boot).
Why does this take so long on Windows? I tested in psql and pgAdmin to rule out errors in the app code.
Any advice or clues?

I would be willing to bet that the real problem is antivirus software acting up. It is true that PostgreSQL on Windows may not perform quite as well as on Linux, but the differences you are seeing cannot be simply in relation to the differences between multiple processes and multiple threads (copy on write, etc).
The very first thing to do is to rule out causes like antivirus software. Because this software sits in between reads and writes of disk I/O it has the capability of making your disk I/O significantly slower. Additionally if it is slow enough it may render sequential disk I/O performance more like random disk I/O which is not a good thing. So try with your antivirus switched off (and preferably not connected to a network).
A second thing I would look at is filesystem fragmentation. Are these files heavily fragmented? If so, disk I/O will be more expensive as well. Beyond this, doing a clean boot, starting the service manually, and trying this again may rule out other programs interfering with disk I/O.
Once you have the problem narrowed down, it should be simple to come up with a solution.

Related

Embedded File System and power-off

I am working on an embedded application without any OS that needs the use of a File System. I've been over this many times with the people in the project and some agree with me that the system must make a proper shut down of the system whenever there is a power failure or else the file system might go crazy.
Some people say that it doesn't matter if you simply power off the system and let nature run its course, but I think that's one of the worst things to do, especially if you know this will bring you a problem and probably shorten your product's life span.
In the last paragraph I just assumed that it is a problem, but my question remains:
Does a power down have any effect on the file system?
Here is a list of various techniques to help an embedded system tolerate a power failure. These may not be practical for your particular application.
1. Use a Journaling File System - Can tolerate incomplete writes due to power failure, OS crash, etc. Most modern filesystems are journaled, but do your homework to confirm.
2. Unless your application needs the write performance, disable all write caching. Check your disk drivers for caching options. Under Linux/Unix, consider mounting the filesystem in sync mode.
3. Unless it must be writable, make it read-only. Try to keep your application executables and operating system files on their own partition(s), with write protections in place (e.g. mount read only in Linux). Your read/write data should be on its own partition. Even if your application data gets corrupted, your system should still be able to boot (albeit with a fail safe default configuration).
3a. For data that is only written once (e.g. Configuration Settings), try to keep it mounted as read-only most of the time. If there is a settings change, mount it as R/W temporarily, update the data, and then unmount/remount it as read-only.
3b. Use a technique similar to 3a to handle application/OS updates in the field.
3c. If it is impractical for you to mount the FS as read-only, at least consider opening individual files as read-only (e.g. fp=fopen("configuration.ini", "r")).
4. If possible, use separate devices for your storage. Keeping things in separate partitions provides some protection, but there are still edge cases where a partition table may become corrupt and render the entire drive unreadable. Using physically separate devices further isolates against one corrupt device bringing down the whole system. In a perfect world, you would have at least 4 separate devices:
4a. Boot Loader
4b. Operating System & Application Code
4c. Configuration Settings
4d. Application Data
5. Know the characteristics of your storage devices, and control the brand/model/revision of devices used. Some hard disks ignore cache flush commands from the OS. We had cases where some models of CompactFlash cards would corrupt themselves during a power failure, but the "industrial" models did not have this problem. Of course, this information was not published in any datasheet, and had to be gathered by experimental testing. We developed a list of approved CF cards, and kept inventory of those cards. We periodically had to update this list as older cards became obsolete, or the manufacturer would make a revision.
6. Put your temporary files in a RAM Disk. If you keep those writes off-disk, you eliminate them as a potential source of corruption. You also reduce flash wear and tear.
7. Develop automated corruption detection and recovery methods. All of the above techniques will not help you if the application simply hangs because of a missing config file. You need to be able to recover as gracefully as possible:
7a. Your system should maintain at least two copies of its configuration settings, a "primary" and a "backup". If the primary fails for some reason, switch to the backup. You should also consider mechanisms for making backups whenever the configuration is changed, or after a configuration has been declared "good" by the user (testing vs production mode).
7b. Did your Application Data partition fail to mount? Automatically run chkdsk/fsck.
7c. Did chkdsk/fsck fail to fix the problem? Automatically re-format the partition and get it back to a known state.
7d. Do you have a Boot Loader or other method to restore the OS and application after a failure?
7e. Make sure your system will beep, flash an LED, or something to indicate to the user what happened.
8. Power Failures should be part of your system qualification testing. The only way you will be sure you have a robust system is to test it. Yank the power cord from the system and document what happens. Try yanking the power at multiple points in the system operation (during runtime, while booting, mid configuration, etc). Repeat each test multiple times.
9. If you cannot mitigate all power failure problems, incorporate a battery or Supercapacitor into the system - Keep in mind that you will need a background process in your OS to initiate a graceful shutdown when power gets low. Also, batteries will require periodic testing and replacement with age.
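As an illustration of the "primary plus backup" configuration idea in the list above, here is a minimal sketch in Python. It assumes a host with an OS and a POSIX-like file system where renames are atomic. The file layout (a SHA-256 line followed by JSON), the ".bak" and ".tmp" suffixes, and the function names are all assumptions made for this example, not something the answer prescribes.

```python
import hashlib
import json
import os


def save_config(path, settings):
    """Write settings via a temp file, keeping the previous primary as path + '.bak'."""
    payload = json.dumps(settings, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(digest + "\n" + payload)
        f.flush()
        os.fsync(f.fileno())             # push the new copy to the device first
    if os.path.exists(path):
        os.replace(path, path + ".bak")  # the last primary becomes the backup
    os.replace(tmp, path)                # rename the finished file into place


def _load_one(path):
    """Load one copy and verify its checksum; raise ValueError if it is damaged."""
    with open(path, "r", encoding="utf-8") as f:
        digest, _, payload = f.read().partition("\n")
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
        raise ValueError("checksum mismatch in " + path)
    return json.loads(payload)


def load_config(path, defaults):
    """Try the primary, then the backup, then fall back to fail-safe defaults."""
    for candidate in (path, path + ".bak"):
        try:
            return _load_one(candidate)
        except (OSError, ValueError):
            continue
    return dict(defaults)
```

If power is lost between the two renames, the primary is briefly missing and load_config falls back to the backup. Whether fsync actually reaches the medium still depends on the device, as item 5 warns.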
In addition to msemack's response (unfortunately my rating is too low to post this as a comment on his answer, so it is a separate answer):
Does a power down have any effect on the file system?
Yes, if proper measures aren't put in place to prevent corruption. See previous answers for file system options to help mitigate. However if ATA flush/sleep aren't properly implemented on your device you may run into the scenario we did. In our scenario the device was corrupt beyond the file system, and fdisk/format would not recover the device.
Instead, an ATA security-erase was required to recover the device once corruption had occurred. To avoid this, we implemented an ATA sleep command prior to power loss. This required a hold-up time of 400ms to cover the 160ms that the ATA sleep took and to leave some headroom for degradation of the caps over the life of the product.
Notes from our scenario:
fdisk/format failed to repair/recover the drive.
Our power-safe file system's check disk utility returned that the device had bad blocks, but there really weren't any.
flush/sync returned success, quickly, and most likely weren't implemented.
Once corrupt, dd could not read the device beyond the 1st partition boundary and returned i/o errors after.
hdparm was used to issue an ATA security-erase, as the only method of recovery for some corruption scenarios.
For a non-journalling filesystem, an unexpected power-off can mean corruption of certain data, including the directory structure. This happens if there is unsaved data in the cache, or if the FS is in the middle of writing a multi-block update and the interruption happens when only some of the blocks have been written.
Journalling mostly addresses this problem: if there is an interruption in the middle, a recovery routine or check-and-repair operation done by the FS (usually implicitly) brings the filesystem back to a consistent state. However, this state is not always the latest one, i.e. if there was some data in the memory cache, it can be lost even with journalling. This is because journalling saves you from corruption of the filesystem but doesn't do magic.
Write-through mode (no write caching) reduces the possibility of data loss but doesn't solve the problem completely, as the journal itself works as a cache (for a very short time).
So unfortunately backup or data duplication are the main ways to prevent data loss.
It totally depends on the file system you are using and on whether it is acceptable to lose some data at power-off, based on your project requirements.
One could imagine using a file system that is secured against unattended power-off and is able to recover from a partial write sequence. So on the application side, if you don't have critical data that absolutely needs to be written before shutting down, there is no need for a specific power-off detection procedure.
Now if you want a more specific answer for your project you will have to give more information on the file system you are using and your project requirements.
Edit: As you have critical application data to save before power-off, I think you have answered the question yourself. The only way to secure unattended power-off is to have brown-out detection that alerts your embedded device, coupled with some hardware circuitry that keeps delivering enough power to the device to perform the shutdown procedure.
The FAT file-system is particularly prone to corruption if a write is in progress or a file is open on shutdown, specifically if there is a buffered operation that has not been flushed. On one project I worked on, the solution was to run a file system integrity check and repair (essentially chkdsk/scandisk) on start-up. This strategy did not prevent data loss, but it did prevent the file system becoming unusable.
A number of vendors provide journalling add-on components for FAT to counter exactly this problem. These include Segger, Quadros and Micrium for example.
Either way, your system should generally adopt an open-write-close approach to file access, or open-write-flush if you feel the need to keep the file open.
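A minimal sketch of the open-write-close pattern, written in Python for brevity. The function name append_record is hypothetical, and the sketch assumes a host with an OS. On a bare-metal target the same sequence would be expressed with whatever file API your file system library exposes.

```python
import os


def append_record(path, line):
    """Open the file, append one line, force it out, and close it again."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()             # drain the user-space buffer
        os.fsync(f.fileno())  # ask the OS to write the data to storage
    # Leaving the with-block closes the file, so it is never open while idle.
```

The file is only open for the duration of one write, which narrows the window in which a power loss can catch a buffered, unflushed operation.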

After how many seconds are file system write buffers typically flushed?

Before overwriting data in a file, I would like to be pretty sure the old data is stored on disk. It's potentially a very big file (multiple GB), so in-place updates are needed. Usually writes will be 2 MB or larger (my plan is to use a block size of 4 KB).
Instead of (or in addition to) calling fsync(), I would like to retain (not overwrite) old data on disk until the file system has written the new data. The main reason why I don't want to rely on fsync() is that most hard disks lie to you about doing an fsync.
So what I'm looking for is what is the typical maximum delay for a file system, operating system (for example Windows), hard drive until data is written to disk, without using fsync or similar methods. I would like to have real-world numbers if possible. I'm not looking for advice to use fsync.
I know there is no 100% reliable way to do it, but I would like to better understand how operating systems and file systems work in this regard.
What I found so far is: 30 seconds is / was the default for /proc/sys/vm/dirty_expire_centisecs. Then "dirty pages are flushed (written) to disk ... (when) too much time has elapsed since a page has stayed dirty" (but there I couldn't find the default time). So for Linux, 40 seconds seems to be on the safe side. But is this true for all file systems / disks? What about Windows, Android, and so on? I would like to get an answer that applies to all common operating systems / file system / disk types, including Windows, Android, regular hard disks, SSDs, and so on.
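On Linux, the two relevant kernel settings can be read directly rather than assumed. The sketch below (Python, hypothetical function names) reads dirty_expire_centisecs and dirty_writeback_centisecs and adds them as a rough upper estimate of how long a dirty page may wait before the kernel starts writing it. This is only an estimate under stated assumptions: it ignores I/O queueing, the drive's own cache, and any file-system-specific commit interval.

```python
import os


def read_centisecs(name, root="/proc/sys/vm"):
    """Return a vm.* centisecond setting in seconds, or None if it is unavailable."""
    try:
        with open(os.path.join(root, name), "r") as f:
            return int(f.read().strip()) / 100.0
    except (OSError, ValueError):
        return None


def estimated_max_dirty_age(root="/proc/sys/vm"):
    """Rough estimate: page expiry age plus one flusher wake-up interval."""
    expire = read_centisecs("dirty_expire_centisecs", root)
    writeback = read_centisecs("dirty_writeback_centisecs", root)
    if expire is None or writeback is None:
        return None   # not Linux, or the settings are not readable
    if writeback == 0:
        return None   # periodic writeback is disabled; no time bound applies
    return expire + writeback
```

The root parameter exists only so the functions can be pointed at a test directory. On other operating systems the function returns None, which reflects the fact that there is no portable equivalent.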
Let me restate your problem in only slightly uncharitable terms: you're trying to control the behavior of a physical device which its driver in the operating system cannot control. What you're trying to do seems impossible, if what you want is an actual guarantee rather than a pretty good guess. If all you want is a pretty good guess, fine, but beware of this and document accordingly.
You might be able to solve this with the right device driver. The SCSI protocol, for example, has a Force Unit Access (FUA) bit in its READ and WRITE commands that instructs the device to bypass any internal cache. Even if the data were originally written buffered, reading unbuffered should be able to verify that it was actually there.
The only way to reliably make sure that data has been synced is to use the OS-specific syncing mechanism. As PostgreSQL's reliability docs put it:
When the operating system sends a write request to the storage
hardware, there is little it can do to make sure the data has arrived
at a truly non-volatile storage area. Rather, it is the
administrator's responsibility to make certain that all storage
components ensure data integrity.
So no, there are no truly portable solutions, but it is possible (but hard) to write portable wrappers and deploy a reliable solution.
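A minimal sketch of such a wrapper in Python, under the assumption that the strongest commonly available call is wanted on each platform. The function name durable_flush is hypothetical. On macOS it tries the F_FULLFSYNC fcntl if the constant is available and falls back to fsync otherwise. None of this can force a drive that ignores flush commands to behave.

```python
import os
import sys


def durable_flush(fileobj):
    """Flush Python's buffer, then ask the OS to push the data to the device.

    Returns the name of the mechanism that was used.
    """
    fileobj.flush()
    fd = fileobj.fileno()
    if sys.platform == "darwin":
        import fcntl
        full_sync = getattr(fcntl, "F_FULLFSYNC", None)
        if full_sync is not None:
            try:
                fcntl.fcntl(fd, full_sync)  # also asks the drive to flush its cache
                return "F_FULLFSYNC"
            except OSError:
                pass                        # not supported on this file system
    os.fsync(fd)
    return "fsync"
```

Callers would invoke durable_flush(f) after writing and before treating the old data as safe to overwrite.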
First of all thanks for the information that hard disks lie about flushing data, that was new to me.
Now to your problem: you want to be sure that all data that you write has been written to the disk (lowest level). You are saying that there are two parts which need to be controlled: the time when the OS writes to the hard drive and the time when the hard drive writes to the disk.
Your only solution is to use a fuzzy logic timer to estimate when the data will be written.
In my opinion this is the wrong way. You have control over when the OS writes to the hard drive, so use that possibility and control it! Then only the lying hard drive is your problem. This problem can't be solved reliably. I think you should tell the user/admin that they must take care when choosing the hard drive. Of course it might be a good idea to implement the additional timer you proposed.
I believe it's up to you to run a series of tests with different hard drives and Brad Fitzpatrick's tool to get a good estimate of when hard drives will have written all data. But of course, if the hard drive wants to lie, you can never be sure that the data really has been written to the disk.
There are a lot of caches involved in giving users a responsive system.
There is cpu cache, kernel/filesystem memory cache, disk drive memory cache, etc. What you are asking is how long does it take to flush all the caches?
Or, another way to look at it is, what happens if the disk drive goes bad? All the flushing is not going to guarantee a successful read or write operation.
Disk drives do go bad eventually. The solution you are looking for is how can you have a redundant cpu/disk drive system such that the system survives a component failure and still keeps working.
You could improve the likelihood that system will keep working with aid of hardware such as RAID arrays and other high availability configurations.
As far as a software solution goes, I think the answer is: trust the OS to do the optimal thing. Most of them flush buffers out routinely.
This is an old question but still relevant in 2019. For Windows, the answer appears to be "at least after every one second" based on this:
To ensure that the right amount of flushing occurs, the cache manager spawns a process every second called a lazy writer. The lazy writer process queues one-eighth of the pages that have not been flushed recently to be written to disk. It constantly reevaluates the amount of data being flushed for optimal system performance, and if more data needs to be written it queues more data.
To be clear, the above says the lazy writer is spawned every second, which is not the same as writing out data every second, but it's the best I can find so far in my own search for an answer to a similar question. (In my case, I have an Android app which lazy-writes data back to disk, and I noticed some data loss when using an interval of 3 seconds, so I am going to reduce it to 1 second and see if that helps. It may hurt performance, but losing data kills performance a whole lot more if you consider the hours it takes to recover it.)

implementing high performance distributed filesystem/database

I need to implement the fastest possible way to store a key/value pair in a distributed system on Linux. Records of the database are tiny, 256 bytes on average.
I am thinking of using the open(), write() and read() system calls and writing the key-value pairs directly at some offset in the file. I can omit the fdatasync() system call since I will be using an SSD with a battery, so I don't have to worry about ACID compliance if an unexpected shutdown of the system happens.
Linux already provides a disk cache implementation, so no reads/writes will happen on sectors that were already loaded in memory. This (I think) would be the fastest way to store data, much faster than any other cache-capable database engine such as GT.M or InterSystems Globals.
However the data is not replicated, and to achieve replication, I can mount a filesystem of another Linux server with NFS and copy the data there, so for example, if I have 2 data servers (1 local and 1 remote), I would issue 2 open(), 2 write() and 2 close() calls. If a transaction fails on remote server, I would mark it as "out of sync" and simply copy the good file again when the remote server comes back.
What do you think of this approach? Will it be fast? I can use NFS over UDP so I will avoid the TCP Stack overhead.
Advantage list so far goes like this:
Linux disk cache reused
Few lines of code
High performance
I will be coding this in C. To locate the record in the file I will keep a btree in memory with a pointer to physical location.
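The design described above can be sketched as follows. The question plans to use C; this sketch uses Python's thin wrappers over the same system calls (os.pwrite and os.pread, available on Linux) to keep it short. The fixed slot size, the two-byte length prefix, the class name, and the use of a dict in place of the in-memory B-tree are all assumptions made for illustration.

```python
import os

SLOT_SIZE = 512  # assumed fixed slot size: room for a record plus a 2-byte length


class SlotStore:
    """Key/value records stored in fixed-size slots at computed file offsets."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.index = {}     # key -> slot number (a dict stands in for the B-tree)
        self.next_slot = 0

    def put(self, key, value):
        if len(value) > SLOT_SIZE - 2:
            raise ValueError("value too large for one slot")
        slot = self.index.get(key)
        if slot is None:                     # new key: take the next free slot
            slot = self.next_slot
            self.next_slot += 1
            self.index[key] = slot
        record = len(value).to_bytes(2, "big") + value
        # One positioned write per record; no seek and no fdatasync, as in the question.
        os.pwrite(self.fd, record.ljust(SLOT_SIZE, b"\0"), slot * SLOT_SIZE)

    def get(self, key):
        slot = self.index.get(key)
        if slot is None:
            return None
        raw = os.pread(self.fd, SLOT_SIZE, slot * SLOT_SIZE)
        length = int.from_bytes(raw[:2], "big")
        return raw[2:2 + length]

    def close(self):
        os.close(self.fd)
```

Note that the index lives only in memory here, so it would have to be rebuilt or persisted separately after a restart. That is one of the places where "few lines of code" tends to stop being true.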
A few suggestions come to mind.
is it necessary to open()/write()/close() for every transaction? the system call overhead of open() in particular is probably non-trivial
could you use mmap() instead of explicit write()s?
if you're doing 2 write() calls (1 local, 1 NFS) for each transaction, it seems like any kind of network problem (latency, dropped packets, etc.) has the potential to bring your application to a screeching halt if you're waiting for the NFS write() call to succeed. And if you're not waiting, for example by doing the NFS writes from a separate thread, your complexity will rapidly grow (I don't think "Few lines of code" will remain true.)
In general, I would suggest that you really prove to yourself that the available tools don't meet your performance requirements before choosing to re-invent this particular wheel.
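To illustrate the mmap() suggestion, here is a minimal sketch in Python, whose mmap module wraps the same system call. The function names and the fixed file size are assumptions made for the example. With a shared mapping, storing a record is a memory copy into the page cache rather than a write() system call per record.

```python
import mmap
import os


def open_mapped(path, size):
    """Create (or extend) a file to `size` bytes and map it read/write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)
        return mmap.mmap(fd, size)  # the mapping keeps its own handle on the file
    finally:
        os.close(fd)


def write_record(mapped, offset, data):
    """Copy a record into the mapping; the kernel writes the page back later."""
    mapped[offset:offset + len(data)] = data
```

Calling mapped.flush() asks the OS to write the dirty pages back, and mapped.close() releases the mapping. Measuring this against the explicit write() version on the real workload is the only way to know whether it helps.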
You might look into a real distributed filesystem rather than using NFS, which as you point out, still provides a single point of failure and no replication.
The Andrew File System (AFS) originally developed by CMU may be a solution for you. It's a commercial product, but you might check out OpenAFS which works on linux (and other systems).
Warning though: AFS has a learning curve.

What is the best way to avoid overloading a parallel file-system when running embarrassingly parallel jobs?

We have a problem which is embarrassingly parallel - we run a large number of instances of a single program with a different data set for each; we do this simply by submitting the application many times to the batch queue with different parameters each time.
However with a large number of jobs, not all of them complete. It does not appear to be a problem in the queue - all of the jobs are started.
The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.
The issue then seems to be that either the program is unable to write to the file-system and crashes in some manner, or just sits there waiting to write and the batch queue system kills the job after it's been sat waiting too long. (From what I have gathered on the problem, most of the jobs that fail to complete, if not all, do not leave core files)
What is the best way to schedule disk-writes to avoid this problem? I mention our program is embarrassingly parallel to highlight the fact the each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.
Although I have the source-code for the program, we'd like to solve the problem without having to modify this if possible as we don't maintain or develop it (plus most of the comments are in Italian).
I have had some thoughts on the matter:
Each job writes to the local (scratch) disk of the node at first. We can then run another job which checks every now and then which jobs have completed and moves the files from the local disks to the parallel file-system.
Use an MPI wrapper around the program in master/slave system, where the master manages a queue of jobs and farms these off to each slave; and the slave wrapper runs the applications and catches the exception (could I do this reliably for a file-system timeout in C++, or possibly Java?), and sends a message back to the master to re-run the job
In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).
In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.
Most parallel file systems, particularly those at supercomputing centres, are targeted at HPC applications rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPs (I/O operations per sec). That is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny little files. It is all too easy for users to take something that runs fine(ish) on their desktop, naively scale it up to hundreds of simultaneous jobs, and starve the system of IOPs, hanging their jobs and typically others on the same system.
The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:
If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have local ramdisk, you can do something as simple as writing them to ramdisk, then tar-gzing them out to the real filesystem.
Write in binary, not in ascii. Big data never goes in ascii. Binary formats are ~10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
Big writes are better than little writes. Every IO operation is something the file system has to do. Make few, big, writes rather than looping over tiny writes.
Similarly, don't write in formats which require you to seek around to write in different parts of the file at different times. Seeks are slow and useless.
If you're running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to tar up all the jobs' outputs and send them all out to the parallel file system at once.
The above suggestions will benefit the I/O performance of your code everywhere, not just on parallel file systems. IO is slow everywhere, and the more you can do in memory and the fewer actual IO operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.
Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.
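The "write locally, then ship one archive" strategy from the list above can be sketched in a few lines of Python, which could run as a wrapper step after the unmodified program finishes. The function name, the ".part" suffix, and the directory layout are assumptions for illustration.

```python
import os
import tarfile


def bundle_outputs(src_dir, archive_path):
    """Pack everything under src_dir into one gzip-compressed tar archive."""
    tmp_path = archive_path + ".part"
    with tarfile.open(tmp_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            # Directories are added recursively; arcname keeps paths relative.
            tar.add(os.path.join(src_dir, name), arcname=name)
    os.replace(tmp_path, archive_path)  # publish only a complete archive
    return archive_path
```

Here src_dir would be the job's ramdisk or local scratch directory and archive_path a location on the parallel file system, so the shared file system sees one large sequential write per job instead of many small ones.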
It is hard to decide if you don't know what exactly causes the crash. If you think it is an error related to the filesystem performance, you can try a distributed filesystem: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html
If you want to implement Master/Slave system, maybe Hadoop can be the answer.
But first of all I would try to find out what causes the crash...
OSes don't always behave nicely when they run out of resources; sometimes they simply abort the process that asks for the first unit of resource the OS can't provide. Many OSes have file handle resource limits (Windows I think has a several-thousand handle resource, which you can bump up against in circumstances like yours), and failure to find a free handle usually means the OS does bad things to the requesting process.
One simple solution requiring a program change, is to agree that no more than N of your many jobs can be writing at once. You'll need a shared semaphore that all jobs can see; most OSes will provide you with facilities for one, often as a named resource (!). Initialize the semaphore to N before you launch any job.
Have each writing job acquire a resource unit from the semaphore when the job is about to write, and release that resource unit when it is done. The amount of code to accomplish this should be a handful of lines inserted once into your highly parallel application. Then you tune N until you no longer have the problem. N==1 will surely solve it, and you can presumably do lots better than that.
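Python's standard library has no named semaphore that unrelated processes can share, so the sketch below substitutes a different technique with the same effect: N lock files, each guarded by an advisory flock, where holding any one of them counts as holding one unit of the semaphore. The function names and the lock directory are hypothetical. It assumes a Unix-like system and a lock directory on a file system where flock works across all the jobs involved, which needs to be checked for network file systems such as NFS.

```python
import fcntl
import os
import time
from contextlib import contextmanager


def try_acquire_slot(lock_dir, n_slots):
    """Try to lock one of n_slots lock files; return its fd, or None if all are busy."""
    os.makedirs(lock_dir, exist_ok=True)
    for i in range(n_slots):
        path = os.path.join(lock_dir, "slot%d.lock" % i)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError:
            os.close(fd)  # this slot is held by another writer
    return None


def release_slot(fd):
    """Give the slot back so another job may start writing."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)


@contextmanager
def write_slot(lock_dir, n_slots, poll_seconds=0.5):
    """Block until one of the N slots is free, then hold it for the with-block."""
    fd = try_acquire_slot(lock_dir, n_slots)
    while fd is None:
        time.sleep(poll_seconds)
        fd = try_acquire_slot(lock_dir, n_slots)
    try:
        yield
    finally:
        release_slot(fd)
```

A job would wrap its output phase in "with write_slot(lock_dir, N):", and N is then tuned as described above. Because the kernel drops the lock when a process exits, a crashed job does not leak its slot.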

Getting CPU time in OS X

I have an Objective-C application for OS X that compares two SQLite DBs and produces a diff in JSON format. The DBs are quite large (10,000 items with many fields). Sometimes this application runs in about 55 sec (using 95% of the CPU). Sometimes it takes around 8 min (using 12% of the CPU). This is with the same DBs. When it is only using a small portion of the CPU, the rest is available. There does not appear to be anything taking priority over the process. Adding "nice -20" on the command seems to assure I get the CPU usage. My questions are:
1. If nothing else is using the CPU, why does my app not take advantage of it?
2. Is there something I can do programmatically to change this?
3. Is there something I can do to OS X to change this?
Question 1:
Since, I assume, you have to read in the databases from disk, you aren't making full use of the CPU because your code is blocking on disk reads. On Mac OS X there is a lot of stuff running in the background that doesn't use a lot of CPU time but does send out a lot of disk reads, like Spotlight.
Question 2:
Probably not, other than make the most efficient use of disk access possible.
Question 3:
Shut down any other processes that are accessing the disk. This includes many system processes that you really shouldn't shut down, so I don't think there's much you can do here other than try running it on Darwin without all the Mac OS X fanciness.
It sounds like you're IO bound in the long cases. Are you doing anything else on the machine? The CPU isn't throttling itself - it's definitely waiting for something.
You can use some of the developer tools to look at your app while it's running - perhaps most useful would be "Instruments", which is a GUI on top of dtrace. You should have this installed if you're running the most recent Xcode. You can also use Shark, which is somewhat easier to use at first glance, but less informative in the long run.
Usually you get all the performance that's available. If the CPU is not at 100% there's something blocking it. In case of databases it's often locking. Use Shark to find out what's going on in your application.
When your program uses little CPU, it is probably because it is waiting for the disk, especially when other processes are accessing the disk at the same time. Another possibility is that your program uses too much memory and the OS begins to use swap space.
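One way to confirm that a slow run is waiting rather than computing is to compare CPU time with wall-clock time around the work. The question's application is Objective-C; the sketch below shows the idea in Python with a hypothetical helper named measure. A ratio near 1 suggests the work is CPU-bound, while a ratio near 0 means the process spent most of its time blocked (on disk, locks, or swap).

```python
import time


def measure(fn, *args):
    """Run fn(*args) and return (result, wall_seconds, cpu_seconds, cpu/wall ratio)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time used by this process so far
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else 0.0
    return result, wall, cpu, ratio
```

For example, measure(time.sleep, 0.2) reports a wall time of about 0.2 seconds with almost no CPU time, which is the same signature as the 8-minute runs at 12% CPU. In a multi-threaded program the ratio can exceed 1, because process_time sums over all threads.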
