How Database handles the Disk Block while it is control by OS - database

I am not clear how database system control the disk blocks. I have read that the data which are related, put them in the same or nearby block. How it is possible for a database system(an application software). What I know is disk block is maintained by the OS so whatever algorithm is implemented in the OS it would allocate the disk block according to that.
If I am wrong, please correct me. I am novice in that.

Related

Which operating systems, and how, can pin pages in a database buffer pool?

Most relational database construction textbooks talk about the concept of being able to pin a page, i.e. prevent the operating system from swapping it out of memory. The concept is so that the database software can use it's own buffer replacement algorithm, which might be a better fit than whatever the OS virtual memory policy provides.
It is unclear to me whether typical desktop operating systems actually provide the programmer with the capability to pin pages. The best I can find on OS X, for example, refers to wired pages, but these seem to be only usable by the superuser.
Is the concept of pinning pages, and of defining appropriate buffer replacement strategies that supersede that of the OS, only of theoretical interest and not really implemented by real relational database systems? Or is it the case that typical desktop OS'es (Linux, Windows, OS X) do include hooks for pinning, and typical relational DB software (Oracle, SQL Server, PostgreSQL, MySQL, etc) uses them?
In PostgreSQL, the database server copies the pages from the file (or from the OS, really) into a shared memory segment which PostgreSQL controls. The OS doesn't know what the mapping is between the file system blocks and the shared memory blocks, so the OS couldn't write those pages back out to their disk locations even if it wanted to, until PostgreSQL tells it to do so by issuing a seek and a write.
The OS could decide to swap parts of shared memory out to disk into a swap partition (for example, if it were under severe memory stress), but it can't write them back to their native location on disk since it doesn't know what that location is.
There are ways to tell the OS not to page out certain parts of memory, such as shmctl(shmid,SHM_LOCK,NULL). But these are mostly intended for security purposes, not performance purposes. For example, you use it to prevent very sensitive information (like the decrypted copy of a private key) from accidentally getting written to swap partitions, from which it might be recovered by the bad guys.
#jjanes is correct to say that the OS can't really write out Pg's shared memory buffer, and can't control what PostgreSQL reads into it, so it doesn't make sense to "pin" it. But that's only half the story.
PostgreSQL does not offer any feature for pinning pages from tables in its shared memory segment. It could do so, and it might arguably be useful, but nobody has implemented it. In most cases the buffer replacement algorithm does a pretty good job by its self.
Partly this is because PostgreSQL relies heavily on the operating system's buffer caches, rather than trying to implement its own. Data might be evicted from shared_buffers, but it's usually still cached in the OS. It's not unreasonable to think of shared_buffers as a first-level cache, and the OS disk cache as the second-level cache.
The features available to control what's kept in the operating system's disk cache are whatever the OS provides. In general, that's not much, because again modern OSes tend to do a better job if you leave them alone and let them manage things themselves.
The idea of manual buffer management, etc, is IMO largely a relic of times when systems had simpler and less effective algorithms for managing caches and buffers automatically.
The main time that automation falls down is if you have something that's used only intermittently, but you want to ensure is available with extremely good response times when it is used; i.e. you wish to degrade the overall system's throughput to make one part of it more responsive. PostgreSQL doesn't offer much control over that; most people simply ensure that they have something regularly querying the data of interest to keep it warm in the cache.
You could write a relatively simple extension to mmap() a file and mlock() its range, but it'd be pretty wasteful and you'd have to fiddle with the default OS limits designed to stop you from locking too much memory.
(FWIW, I think Oracle offers quite a bit of control over pinning relations, indexes, etc, in tune with its "manually control everything whether you want to or not" philosophy, and it bypasses much of the operating system in the process.)
Speaking for SQL Server (on Windows, obviously), there's an OS setting that allows the SQL engine to ignore requests from the OS in response to memory pressure. That setting is called Lock Pages in Memory (LPIM). That permissions is granted on a per-account basis and needs to be granted to the account running your SQL service when the service is started.
Keep in mind that this isn't always a good idea. For example, in a virtualized environment, the hypervisor communicates its memory needs via a balloon driver process in the guest. If the hypervisor needs more memory, it inflates the memory needs of the balloon in the guest. If your SQL process has LPIM turned on, it won't respond and the hypervisor can start flagging as a result. And if the hypervisor isn't happy, ain't nobody happy.

Can I check for disk access at runtime?

I'm running some very specialized experiments for a research project. These experiments call for controlling memory accesses: my application should not, under any circumstances, swap information with the disk. That is, all information the application needs must stay in RAM for the duration of the execution, but it should use as much RAM as possible.
My question is: is there any way I can control disk access by my application, or at least count disk accesses for later analysis?
This is using C and Linux.
Please let me know if I can clarify the question... been working on this for so long I think everybody knows exactly what I'm talking about.
One thing you can do is actually create a ramfs or RAM file system. Are you working on a unix platform? If so you can check out mount and umount on how to create them.
http://linux.die.net/man/8/mount
http://linux.die.net/man/8/umount
Basically what you do is you create a file system stored in your RAM. You don't have to deal with all the disk read/write time anymore. If i read your question correctly you want to try avoiding disk access if you can. It's very simple to do really since you can have multiple file systems located on both a hard drive and memory.
http://www.cyberciti.biz/faq/howto-create-linux-ram-disk-filesystem/
http://www.alper.net/linuxunix/linux-ram-based-filesystem/
Hope this all helped.
The mlock system call allows you to lock part or all of your process's virtual memory to RAM, thus preventing it from being written to swap space. Notice that another process with root priviledges can still that memory area.

How does a file-system block gets translated to lba?

I understand a file-system can choose the size of blocks it uses on the disk.
On the other hand i understand that the disk is divided into LBA's.
The LBA is an address of a sector on the disk.
So whats the connection between the block used by the file system and the disk sectors (lba)?
Is there some kind of translation from a fs block and lba?
Is it different from fs to fs?
where can i read more about this?
thanks
Yes. File system usually sees a a continuous logical space without knowledge of the spindles underneath, thus it doesn't know disk LBA either. The translation work is usually done in a layer called volume, which is to hide the disk detail and present the file system a logically continuous space. For example, in Linux there's LVM (Logical Volume Manager) playing such roles.
The volume exposed to fs might not be disks. It could be constructed upon other volumes, thus sometimes come up with a very large disk.
The volume could also provide the functionality of RAID, which put several disks together that could relieve you from disk failure in some extent at the expense of performance and space efficiency.
Some file systems can manage disks directly and operate on raw disks, thus no layer of volume. As far as I know, NETAPP's WAFL is doing in that way.

Need data on disk drive management by OS: getting base I/O unit size, “sync” option, Direct Memory Access

I want to ensure I have done all I can to configure a system's disks for serious database use. The three areas I know of (any others?) to be concerned about are:
I/O size: the database engine and disk's native size should either match, or the database's native I/O size should be a multiple of the disk's native I/O size.
Disks that are capable of Direct Memory Access (eg. IDE) should be configured for it.
When a disk says it has written data persistently, it must be so! No keeping it in cache and lying about it.
I have been looking for information on how to ensure these are so for CENTOS and Ubuntu, but can't seem to find anything at all!
I want to be able to check these things and change them if needed.
Any and all input appreciated.
PLEASE NOTE: The actual hardware involved is VERY modest. The point is to get the most out of what hardware we do have, even though it's "not very serious hardware" from a broader perspective.
MORE:
I appreciate the time taken to read and reply, but I'm hoping to get "answers" that aren't just good database / hardware advice but answers that actually address the specific things I asked about. Namely:
1) What's a good easy way to tell what the I/O unit size is that the OS wants to do? How can I change it? (IOW: If this exclusively a file-system-format issue, how can I tell what was used on an already-created file system? I know /etc/fstab will tell me the file system format... In this case, it's ext3.
2) How can I tell if a disk drive has DMA? If so, how can I turn it on? (I've been told that some drives have this capability, but now I want to follow up and ensure that if these drives have it, it's turned on.)
And, finally;
3) How can I tell if a drive is merely telling the writer that their material is written when it's actually still in cache? And, more importantly, how can I set the system to NOT use such features if / when they exist?
Thank you for your insights.
RT
1) Check /sys/block/sdX/queue/{max_hw_sectors_kb,max_sectors_kb}. The first is that max transfer size the hw allows, the other is the current maximum which can be set to any value <= max_hw_sectors_kb
2) hdparm -i /dev/sdX
3) Turn off write-back caching (hdparm can do it), or make sure that the filesystem issues barriers when synchronizing (as in fsync(), or journal commit).
"serious database use" and you mention IDE in the same sentence?
SSDs or 15k SCSI in a many spindle RAID 1+0 array with separate arrays for data, log and backup. Consider a separate array for tempdb too.
You'd also switch the controller cache to 100% read too to avoid caching issues
Of course, if it's "serious" then you'd consider clustering etc: so a SAN comes in useful here but you may not be as quick as local spindles
You didn't include any info on filesystem or database, so here are some misc pointers.
It is inevitable that you will lose a disk eventually, so its equally important to put a good backup and recovery strategy in place, and mirror your transaction logs, so you can handle a disk failure or even full datafile loss.
1) If possible, put at least one copy of your transaction log on a fixed disk. Don't put your sole transaction log to an external storage subsystem. (Assuming you use a db that supports log mirroring).
2) I agree with gbn, in practice, don't use write caching. I've lost databases on RAID arrays with battery backup. Configure the storage controller card for write-through.
3) Raw devices provide guaranteed writes, but its not worth the hassle. Some filesystems provide synchronous write options too, use one if possible. I am partial to VxFS, but I'm from the Sun world. On Linux, btrfs is eminent at least, but for now, Ext3 works fine if you setup your db properly.

Accessing same resource across restarts in Windows

I will write some thing in a file/memory just before system shutdown or a service shutdown. In the next restart of system, Is it possible to access same file or same memory on the disk, before filesystem loads? Actual requirement is like this, we have a driver that sits between volume level drivers and filesystem driver...in that part of the driver code, I want to access some memory or file.
Thanks & Regards,
calvin
The logical thing here is to read/write this into the registry if it is not too big. Is there a reason you do not want to use the registry?
If you need to access large data and you are writing a volume or device filter and cannot rely on ZwOpen/Read/Write/Close functions in the kernel an approach would be to create the file in user mode, get its device name and cluster chain and store them in the registry. On the next boot, you can get the device and clusters from registry, and do direct I/O on them.
Since you want to access this before the filesystem loads, my first thought is to allocate and use a block of storage space on the hard drive outside of the filesystem. You can create a hidden mini-partition on the drive and use low-level I/O commands to read and write your data.
This is a common task in the world of embedded systems, and we often implement it by adding some sort of non-volatile memory device into the system (flash, battery-backed DRAM, etc) and reading and writing to that device. Since you likely don't have the same level of control over the available hardware as embedded developers do, the closest analogue I can think of would be to reserve a chunk of space on a physical disk that you can read from without having to mount as a filesystem. A dedicated mini-partition might work the best because if you know the size of it, you can treat it as one big raw-access buffer and can avoid having to hassle with filenames, filesystems, etc.

Resources