An explanation of DRBD Protocol C

What protocol does DRBD employ to guarantee that it can maintain 2 disks in sync with each other?
Does it use 2-phase commits (or a variant similar to 2PC)?
Does DRBD have an asynchronous/offline reconciler that is constantly checking to see if the disks have deviated?

By default, DRBD uses protocol C (fully synchronous) replication. It uses its own internal protocol when replicating writes to its peer, and it is typically deployed in an Active/Passive manner.
DRBD keeps a bitmap in memory to track what has been replicated and what is still "in flight". If DRBD becomes disconnected from its peer, that bitmap is pushed down to disk (into DRBD's metadata). When the peers reconnect, they exchange bitmaps and generation identifiers to determine which direction and which blocks to sync.
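For illustration only (this is not DRBD's actual code), here is a minimal C sketch of that bookkeeping: one bit per block, set when a local write is submitted, cleared when the peer acknowledges it, and persisted on disconnect. All names and sizes are invented.

    #include <stdint.h>
    #include <unistd.h>

    #define BLOCKS (1 << 20)            /* blocks tracked, one bit each  */
    #define WORDS  (BLOCKS / 64)

    static uint64_t dirty[WORDS];       /* in-memory "out of sync" map   */

    /* A local write was submitted but not yet acknowledged by the peer. */
    void mark_in_flight(uint64_t block)
    {
        dirty[block / 64] |= 1ULL << (block % 64);
    }

    /* The peer acknowledged the replicated write; block is in sync again. */
    void peer_acked(uint64_t block)
    {
        dirty[block / 64] &= ~(1ULL << (block % 64));
    }

    /* On disconnect, persist the bitmap into on-disk metadata so that the
     * peers can later exchange maps and resync only the marked blocks.   */
    void on_disconnect(int meta_fd)
    {
        pwrite(meta_fd, dirty, sizeof dirty, 0);
        fsync(meta_fd);
    }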
[Diagram: where DRBD sits in the Linux kernel's storage stack.] Hope that helps!

Related

How do I increase the speed of my USB CDC device?

I am upgrading the processor in an embedded system for work. This is all in C, with no OS. Part of that upgrade involves migrating the processor-PC communications interface from IEEE-488 to USB. I finally got the USB firmware written and have been testing it. It was going great until I tried to push through lots of data, only to discover my USB connection is slower than the old IEEE-488 connection. I have the USB device enumerating as a CDC device with a baud rate of 115200 bps, but it is clear that I am not even reaching that throughput. I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong. I control every aspect of this, from the front end on the PC to the firmware on the embedded system.
I am assuming my issue is how I write to the USB on the embedded system side. Right now my USB_Write function is run in free time, and is just a while loop that writes one char to the USB port until the write buffer is empty. Is there a more efficient way to do this?
One concern I have is that in the old system we had a board dedicated to communications. The CPU would just write data across a bus to this board, and the board would handle communications, meaning the CPU didn't have to waste free time handling the actual communications but could offload them to a "coprocessor" (not a CPU, but functionally the same here). Even with this concern, though, I figured I should be getting faster speeds, given that full-speed USB is on the order of MB/s while IEEE-488 is on the order of kB/s.
In short: is this more likely a fundamental system constraint or a software optimization issue?
I thought that number was a dummy value that is a holdover from RS232 days, but I might be wrong.
You are correct: the baud rate is a dummy value. If you were creating a CDC/RS232 adapter you would use it to configure your RS232 hardware; in this case it means nothing.
Is there a more efficient way to do this?
Absolutely! You should be writing chunks of data the same size as your USB endpoint for maximum transfer speed. Depending on the device you are using, your stream of single-byte writes may be gathered into a single packet before sending, but from my experience (and your results) this is unlikely.
Depending on your latency requirements, you can buffer outgoing data in a circular buffer and only hand it to the USB_Write function once you have ENDPOINT_SZ bytes, as in the sketch below. If this results in excessive latency, or your interface is not always communicating, you may want to implement Nagle's algorithm.
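A minimal bare-metal C sketch of that approach: a ring buffer filled one byte at a time and drained in full endpoint-sized packets from the free-time loop. ENDPOINT_SZ is taken from the discussion above; the ring size and USB_Write's signature are assumptions.

    #include <stdint.h>

    #define ENDPOINT_SZ 64                  /* full-speed bulk/CDC packet */
    #define RING_SZ     1024                /* must be a power of two     */

    /* Assumed vendor call: queues one packet on the IN endpoint. */
    extern void USB_Write(const uint8_t *buf, uint16_t len);

    static uint8_t  ring[RING_SZ];
    static volatile uint16_t head, tail;    /* producer / consumer index  */

    /* Producer: queue one byte (silently drops data if the ring is full). */
    void tx_enqueue(uint8_t c)
    {
        uint16_t next = (head + 1) & (RING_SZ - 1);
        if (next != tail) { ring[head] = c; head = next; }
    }

    /* Consumer, run in free time: send only full endpoint-sized chunks
     * instead of one byte per call.                                      */
    void tx_service(void)
    {
        while ((((uint16_t)(head - tail)) & (RING_SZ - 1)) >= ENDPOINT_SZ) {
            uint8_t pkt[ENDPOINT_SZ];
            for (int i = 0; i < ENDPOINT_SZ; i++) {
                pkt[i] = ring[tail];
                tail = (tail + 1) & (RING_SZ - 1);
            }
            USB_Write(pkt, ENDPOINT_SZ);
        }
    }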
One concern I have is that in the old system we had a board dedicated to communications.
The NXP part you mentioned in the comments is without a doubt fast enough to saturate a USB full speed connection.
In short: is this more likely a fundamental system constraint or a software optimization issue?
I would consider this a software design issue rather than an optimisation one, but no, it is unlikely you are fundamentally stuck.
Do take care to figure out exactly what sort of USB connection you are using, though: with USB 1.1 you will be limited to 64 KB/s, and with USB 2.0 full speed to 512 KB/s. If you require higher throughput, you should migrate to a separate bulk endpoint for the data transfer.
I would recommend reading through the USB made simple site to get a good overview of the various USB speeds and their capabilities.
One final issue: vendor CDC libraries are not always the best, and implementations of the CDC standard can vary. You can theoretically get more data through a CDC endpoint by using larger endpoints, but I have seen this bring host-side drivers to their knees - if you go this route, create a custom driver using bulk endpoints.
Try testing your device on multiple systems; you may find you get quite different results between Windows and Linux. This will help point the finger at the host end.
And finally, make sure you are doing big buffered reads on the host side; USB will stop transferring data once the host-side buffers are full.

fflush, fsync and sync vs memory layers

I know there are already similar questions, and I gave them a look, but I couldn't find an explicit, unequivocal answer to my question. I was investigating these functions and their relationship with memory layers, and found this beautiful article that gave me a good insight into memory layers.
It seems that fflush() moves data from the application to the kernel filesystem buffer, and that's fine; everyone seems to agree on this point. The only thing that left me puzzled was that the same article assumed a write-back cache, saying that with fsync() "the data is saved to the stable storage layer" and then adding that "the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage".
Reading here and there, it seems the truth is that fsync() and sync() push the data into the storage device, but if the device has caching layers the data merely lands there, not immediately on permanent storage, and may even be lost on a power failure, unless we have a filesystem with barriers enabled, in which case "sync()/fsync() and some other operations will cause the appropriate CACHE FLUSH (ATA) or SYNCHRONIZE CACHE (SCSI) commands to be sent to the device" [from your website answer].
Questions:
1. If the data to be updated is already in the kernel buffers and my device has a volatile cache layer in write-back mode, is it true, as the article says, that operations like fsync() [and sync(), I suppose] synchronize data to the stable memory layer, skipping the volatile one? I think that is what happens with a write-through cache, not a write-back one. From what I read, I understood that with a write-back cache fsync() can only send the data to the device, which will put it in the volatile cache, and it will reach permanent memory only afterwards.
2. I read that fsync() works with a file descriptor, and therefore with a single file, while sync() flushes all the kernel buffers, so it applies to every piece of data waiting to be written. And from this page, also that fsync() waits for the write to disk to finish while sync() doesn't. Are there other differences connected to memory data transfers between the two?
Thanks to those who will try to help
1. As you correctly concluded from your research, fflush synchronizes the user-space buffered data with the kernel-level cache (since it works with FILE objects, which live at user level and are invisible to the kernel), whereas fsync or sync (working directly with file descriptors) synchronize the kernel's cached data with the device. However, the latter comes without a guarantee that the data has actually been written to the storage device, as these usually come with their own caches as well. I would expect the same to hold for msync called with the MS_SYNC flag.
Relatedly, I find the distinction between synchronized and synchronous operations very useful when talking about the topic. Here's how Robert Love puts it succinctly:
A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. [...] A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers.
With that in mind, you can call open with the O_SYNC flag (together with a flag that opens the file with write permission) to enforce synchronized write operations, as sketched below. Again, as you correctly assumed, this will work only with a WRITE THROUGH disk caching policy, which effectively amounts to disabling disk caching.
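To make the layers concrete, here is a minimal sketch (error handling omitted, file names invented) of the calls discussed so far:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Layer 1: user-space stdio buffer -> kernel page cache. */
        FILE *fp = fopen("data.txt", "w");
        fputs("hello\n", fp);
        fflush(fp);              /* data is now in the kernel, not on disk */

        /* Layer 2: kernel page cache -> storage device. */
        fsync(fileno(fp));       /* may still sit in the disk's own cache  */
        fclose(fp);

        /* Alternative: O_SYNC makes every write() synchronized (the kernel
         * waits for the device to report completion). As noted above, this
         * truly reaches stable storage only with a write-through disk
         * caching policy.                                                 */
        int fd = open("log.txt", O_WRONLY | O_CREAT | O_SYNC, 0644);
        write(fd, "entry\n", 6);
        close(fd);
        return 0;
    }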
You can read this answer about how to disable disk caching on Linux. Be sure to also check this website which also covers SCSI-based in addition to ATA-based devices (to read about different types of disks see this page on Microsoft SQL Server 2005, last updated: Apr 19, 2018).
Speaking of which, it is very informative to read about how the issue is dealt with on Windows machines:
To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
Apparently, this is how Microsoft SQL Server 2005 family ensures data integrity:
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server. [...]
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
I'm saying this is informative in particular because of this blog post from 2012 showing that some SATA disks ignore the FILE_FLAG_WRITE_THROUGH! I don't know what the current state of affairs is, but it seems that in order to ensure that writing to a disk is truly synchronized, you need to:
Disable disk caching using your device drivers.
Make sure that the specific device you're using supports write-through/no-caching policy.
However, if you're looking for a guarantee of data integrity, you could just buy a disk with its own battery-based power supply, one that goes beyond capacitors (which are usually only enough to complete the in-flight writes). As the conclusion of the blog article mentioned above puts it:
Bottom-line, use Enterprise-Class disks for your data and transaction log files. [...] Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement.
2. To (partially) answer the second question, this is from the man page sync(2):
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
This would imply that fsync and sync work differently; note, however, that they're both declared in unistd.h, which suggests some consistency between them. That said, I would follow Robert Love, who does not recommend using the sync syscall when writing your own code:
The only real use for sync() is in the implementation of the sync utility. Applications should use fsync() and fdatasync() to commit to disk the data of only the requisite file descriptors. Note that sync() may take several minutes or longer to complete on a busy system.
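A minimal sketch of that advice; the commit() helper and its choice of fdatasync over fsync are my own illustration, not Love's code:

    #include <unistd.h>

    /* fdatasync() flushes the file's data but skips metadata (such as
     * mtime) unless that metadata is needed to retrieve the data, which
     * often saves one disk write compared to fsync().                   */
    void commit(int fd, const void *buf, size_t len)
    {
        write(fd, buf, len);     /* error handling omitted in this sketch */
        fdatasync(fd);           /* sync just this descriptor, not sync() */
    }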
"I don't have any solution, but certainly admire the problem."
From all I read in your good references: there is no standard. The standard ends somewhere inside the kernel. The kernel controls the device driver, and the device driver (possibly supplied by the disk manufacturer) controls the disk through an API (the device has a small computer on board). The manufacturer may have added capacitors/batteries with just enough power to flush its device buffers in case of power failure, or may not have. The device may provide a sync function, but whether this truly syncs (flushes) the device buffers is not known (device dependent). So unless you select and install a device according to your specifications (and verify those specs), you are never sure.
This is a fair problem. Even after handling error conditions, you cannot be sure the data is actually present on your storage.
The man page of fsync explains this issue clearly!! :)
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
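A minimal sketch of using it on macOS; falling back to plain fsync() is my assumption for filesystems that do not support the fcntl, not something the man page prescribes:

    #include <fcntl.h>
    #include <unistd.h>

    /* On Mac OS X / macOS, fsync() alone does not force the drive's own
     * cache to flush; F_FULLFSYNC asks the drive itself to commit its
     * buffers to permanent storage.                                      */
    int full_sync(int fd)
    {
        if (fcntl(fd, F_FULLFSYNC) == -1)
            return fsync(fd);    /* assumed fallback where unsupported */
        return 0;
    }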
Yes, fflush() ensures the data leaves the process's memory space, but it may still sit in dirty pages of RAM awaiting writeback. That protects against an application abort, but not against a system crash or power failure. Even if the power is backed up, the system could crash due to some software vulnerability! As mentioned in other answers/comments, getting the data from dirty pages written to disk magnetically (or whatever SSDs do), rather than stuck in some volatile buffer in the disk controller or drive, takes a combination of the right calls or open options and the right controllers and devices! Explicit calls give you more control over the overhead, e.g. writing more in bulk at the end of a transaction.
An RDBMS, for instance, needs to worry not only about the files holding the database but even more about the log files that allow recovery, both after disk loss and on any RDBMS restart after a crash. In fact, some may sync the log more aggressively than the database to preserve speed, since recovery is not a frequent process and not usually a long one. Things written to the log by transactions are guaranteed to be recoverable if the log is intact.

Which operating systems, and how, can pin pages in a database buffer pool?

Most relational database construction textbooks discuss the concept of pinning a page, i.e. preventing the operating system from swapping it out of memory. The idea is that the database software can then use its own buffer replacement algorithm, which might be a better fit than whatever the OS virtual memory policy provides.
It is unclear to me whether typical desktop operating systems actually provide the programmer with the capability to pin pages. The best I can find on OS X, for example, refers to wired pages, but these seem to be only usable by the superuser.
Is the concept of pinning pages, and of defining buffer replacement strategies that supersede the OS's, only of theoretical interest and not really implemented by real relational database systems? Or do typical desktop OSes (Linux, Windows, OS X) include hooks for pinning, and does typical relational DB software (Oracle, SQL Server, PostgreSQL, MySQL, etc.) use them?
In PostgreSQL, the database server copies the pages from the file (or from the OS, really) into a shared memory segment which PostgreSQL controls. The OS doesn't know what the mapping is between the file system blocks and the shared memory blocks, so the OS couldn't write those pages back out to their disk locations even if it wanted to, until PostgreSQL tells it to do so by issuing a seek and a write.
The OS could decide to swap parts of shared memory out to disk into a swap partition (for example, if it were under severe memory stress), but it can't write them back to their native location on disk since it doesn't know what that location is.
There are ways to tell the OS not to page out certain parts of memory, such as shmctl(shmid, SHM_LOCK, NULL), sketched below. But these are mostly intended for security purposes, not performance. For example, you use it to prevent very sensitive information (like the decrypted copy of a private key) from accidentally being written to a swap partition, from which the bad guys might recover it.
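A minimal sketch using a throwaway System V segment (size invented); on Linux this requires CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK:

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create a 1 MiB System V shared memory segment... */
        int shmid = shmget(IPC_PRIVATE, 1 << 20, IPC_CREAT | 0600);
        if (shmid == -1) { perror("shmget"); return 1; }

        /* ...and lock it into RAM so it is never written to swap. */
        if (shmctl(shmid, SHM_LOCK, NULL) == -1)
            perror("shmctl(SHM_LOCK)");

        shmctl(shmid, IPC_RMID, NULL);   /* mark segment for removal */
        return 0;
    }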
@jjanes is correct to say that the OS can't really write out Pg's shared memory buffers, and can't control what PostgreSQL reads into them, so it doesn't make sense to "pin" them. But that's only half the story.
PostgreSQL does not offer any feature for pinning pages from tables in its shared memory segment. It could do so, and it might arguably be useful, but nobody has implemented it. In most cases the buffer replacement algorithm does a pretty good job by itself.
Partly this is because PostgreSQL relies heavily on the operating system's buffer caches, rather than trying to implement its own. Data might be evicted from shared_buffers, but it's usually still cached in the OS. It's not unreasonable to think of shared_buffers as a first-level cache, and the OS disk cache as the second-level cache.
The features available to control what's kept in the operating system's disk cache are whatever the OS provides. In general, that's not much, because again modern OSes tend to do a better job if you leave them alone and let them manage things themselves.
The idea of manual buffer management, etc, is IMO largely a relic of times when systems had simpler and less effective algorithms for managing caches and buffers automatically.
The main time that automation falls down is when you have something that's used only intermittently but that you want available with extremely good response times when it is used; i.e. you are willing to degrade the overall system's throughput to make one part of it more responsive. PostgreSQL doesn't offer much control over that; most people simply ensure that something regularly queries the data of interest to keep it warm in the cache.
You could write a relatively simple extension to mmap() a file and mlock() its range (see the sketch below), but it'd be pretty wasteful, and you'd have to fiddle with the default OS limits designed to stop you from locking too much memory.
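For the curious, a minimal sketch of that mmap()/mlock() idea as plain C (not an actual PostgreSQL extension; names invented):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Pin one file's pages in RAM: map it, then lock the mapping.
     * Subject to RLIMIT_MEMLOCK unless the process has CAP_IPC_LOCK. */
    void *pin_file(const char *path, size_t *len)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1) return NULL;

        struct stat st;
        fstat(fd, &st);
        *len = (size_t)st.st_size;

        void *p = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                   /* the mapping keeps the pages valid */
        if (p == MAP_FAILED) return NULL;

        if (mlock(p, *len) == -1) {  /* typically fails once past the limit */
            munmap(p, *len);
            return NULL;
        }
        return p;
    }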
(FWIW, I think Oracle offers quite a bit of control over pinning relations, indexes, etc, in tune with its "manually control everything whether you want to or not" philosophy, and it bypasses much of the operating system in the process.)
Speaking for SQL Server (on Windows, obviously), there's an OS setting that allows the SQL engine to ignore requests from the OS in response to memory pressure. That setting is called Lock Pages in Memory (LPIM). The permission is granted on a per-account basis and needs to be held by the account running your SQL Server service when the service starts.
Keep in mind that this isn't always a good idea. For example, in a virtualized environment, the hypervisor communicates its memory needs via a balloon driver process in the guest. If the hypervisor needs more memory, it inflates the balloon in the guest. If your SQL process has LPIM turned on, it won't respond, and the hypervisor can start flagging as a result. And if the hypervisor isn't happy, ain't nobody happy.

Are there any distributed, high-availability filesystems (for Linux) that are actively developed?

Are there any distributed, high-availability filesystems (for Linux) that are actively developed?
Let me be more specific:
Distributed means it deals gracefully with client-to-server latencies like you'd find over the public worldwide internet (300 ms and up being commonplace) and occasional connectivity flakiness. This means really good client-side caching (i.e. with callbacks) is required. NFS does not do this. It also means encryption of on-the-wire data without needing an IPsec VPN.
High availability means that data can be stored on multiple servers and the client is smart enough to try another server if it encounters problems. Putting that intelligence in the client is really important, and it's why this sort of thing can't just be grafted onto NFS. At a minimum this needs to be possible for read-only data. It would be nice for read-write data but I know that's hard.
Filesystem means a kernel driver exporting a POSIX interface, with permissions and access control enforced in the face of untrustworthy clients. SAN systems often assume the clients are trustworthy.
I'm an OpenAFS refugee. I love it but at this point I can no longer accept its requirement that all the file servers effectively "have root" on all other file servers. The proprietary disk format and overhead of having to run Kerberos infrastructure (which I wouldn't otherwise need) are also becoming increasingly problematic.
Are there any systems other than OpenAFS with these properties? InterMezzo and Coda probably qualify but aren't active projects any longer. Lustre is cool but seems to be designed for ultra-low-latency data centres. Ceph is awesome but not really a filesystem, more a thing that runs under a filesystem (yes, there's CephFS, but it's really a showcase for Ceph, explicitly not production-ready, and with no timetable for that). Tahoe-LAFS is cool, but it and GoogleFS aren't really filesystems in that they don't export a POSIX interface through a kernel module. My understanding of GFS (Global File System) is that the clients can manipulate the on-disk data structures directly, so they're implicitly root-level trusted (and this is part of why it's fast) -- correct me if I'm wrong here.
Needs to be open source since I can't afford to have my data locked up in something proprietary. I don't mind paying for software, but I can't be held hostage in this situation.
Thanks,
First of all, you can use a local file system (mounted with -o user_xattr) to cache NFS (mounted with -o fsc) using cachefilesd (provided by the cachefilesd package on Debian) through the fscache facility.
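As a hypothetical sketch (device, mount points and export paths invented), the relevant /etc/fstab entries would look something like:

    # local cache fs needs user_xattr; the NFS mount opts in with fsc
    /dev/sdb1        /var/cache/fscache  ext4  defaults,user_xattr  0  2
    server:/export   /mnt/data           nfs   fsc                  0  0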
Although the file system you are looking for probably does not exist, IMHO two projects come pretty close, with fairly good FUSE client implementations:
LizardFS (GPL-3 licensed, hosted on GitHub), a fork of the now-proprietary MooseFS.
Gfarm file system (BSD/Apache-2.0, hosted at SourceForge)
After evaluating Ceph for quite a while, I came to the conclusion that it is flawed (with no hope for improvement in the foreseeable future) and not suitable for serious use. XtreemFS is a disappointment too. I hope that the upcoming OrangeFS version 3 (with promised data integrity checks) might not be too bad, but that remains to be seen...

implementing high performance distributed filesystem/database

I need to implement the fastest possible way to store a key/value pair in a distributed system on Linux. Records of the database are tiny, 256 bytes on average.
I am thinking of using the open(), write() and read() system calls and writing the key-value pairs directly at some offset in the file. I can omit the fdatasync() system call since I will be using an SSD with battery backup, so I don't have to worry about ACID compliance if an unexpected shutdown of the system happens.
Linux already provides a disk cache implementation, so no reads/writes will hit the device for sectors already loaded in memory. This (I think) would be the fastest way to store data, much faster than any other cache-capable database engine such as GT.M or InterSystems Globals.
However, the data is not replicated. To achieve replication, I can mount a filesystem of another Linux server with NFS and copy the data there, so, for example, if I have 2 data servers (1 local and 1 remote), I would issue 2 open(), 2 write() and 2 close() calls. If a transaction fails on the remote server, I would mark it as "out of sync" and simply copy the good file again when the remote server comes back.
What do you think of this approach? Will it be fast? I can use NFS over UDP, so I will avoid the TCP stack overhead.
The advantage list so far goes like this:
Linux disk cache reused
Few lines of code
High performance
I will be coding this in C. To locate a record in the file I will keep a B-tree in memory with a pointer to its physical location, along the lines of the sketch below.
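A minimal sketch of the proposed scheme, assuming fixed 256-byte slots; the in-memory B-tree that maps keys to slot numbers is left out:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define REC_SZ 256                 /* fixed slot size for simplicity */

    /* Store one record in its slot; the slot number comes from the
     * in-memory B-tree that maps key -> physical location.            */
    int put_record(int fd, long slot, const void *rec, size_t len)
    {
        char buf[REC_SZ] = {0};
        if (len > REC_SZ) return -1;
        memcpy(buf, rec, len);
        /* pwrite takes the offset explicitly: no lseek() needed, and it
         * stays correct when several threads write concurrently.       */
        return pwrite(fd, buf, REC_SZ, slot * REC_SZ) == REC_SZ ? 0 : -1;
    }

    int get_record(int fd, long slot, void *out)
    {
        return pread(fd, out, REC_SZ, slot * REC_SZ) == REC_SZ ? 0 : -1;
    }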
A few suggestions come to mind.
Is it necessary to open()/write()/close() for every transaction? The system call overhead of open() in particular is probably non-trivial.
Could you use mmap() instead of explicit write()s? (See the sketch after this list.)
If you're doing 2 write() calls (1 local, 1 NFS) for each transaction, it seems like any kind of network problem (latency, dropped packets, etc.) has the potential to bring your application to a screeching halt if you're waiting for the NFS write() call to succeed. And if you're not waiting, for example by doing the NFS writes from a separate thread, your complexity will grow rapidly (I don't think "few lines of code" will remain true).
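To illustrate the mmap() suggestion from the list above, a minimal sketch with invented sizes; each store becomes a memcpy, and the kernel writes dirty pages back on its own schedule (or when you msync()):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define DB_SZ  (1 << 20)           /* 1 MiB file, invented size */
    #define REC_SZ 256

    /* Map the data file once at startup. */
    char *map_db(const char *path)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd == -1) return NULL;
        ftruncate(fd, DB_SZ);          /* size the file before mapping */
        char *base = mmap(NULL, DB_SZ, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);                     /* the mapping stays valid */
        return base == MAP_FAILED ? NULL : base;
    }

    /* A store is now just a memcpy: no system call per record. */
    void put(char *base, long slot, const void *rec, size_t len)
    {
        memcpy(base + slot * REC_SZ, rec, len);
    }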
In general, I would suggest that you really prove to yourself that the available tools don't meet your performance requirements before choosing to re-invent this particular wheel.
You might look into a real distributed filesystem rather than using NFS, which, as you point out, still leaves a single point of failure and provides no replication.
The Andrew File System (AFS), originally developed at CMU, may be a solution for you. It's a commercial product, but you might check out OpenAFS, which works on Linux (and other systems).
Warning though: AFS has a learning curve.
