Lustre file locking for concurrent access - c

I'm trying to develop an application that will be running on multiple computers linked to a shared Lustre storage, performing various actions, including but not limited to:
- Appending data to a file.
- Reading data from a file.
- Reading from and writing to a file, modifying all of its content past a certain offset.
- Reading from and writing to a file, modifying its content at a specific offset.
As you can see, the basic I/O one can wish for.
Since most of this happens concurrently, I need some kind of locking to do the different writes safely. I've seen that Lustre doesn't support flock(2) by default (and I'm not sure I want to use it over fcntl(2); I guess I will if it comes to that), and I haven't seen anything confirming fcntl(2) support.
Researching this mostly resulted in me reading a lot of papers about I/O optimization with Lustre, but those usually explain how their hardware / software / network is structured rather than how it's done in the code.
So, can I use fcntl(2) with Lustre? Should I use it? If not, what are other alternatives to allow different clients to perform concurrent modifications of the data?
Or is it even possible? (I've seen in Lustre tickets that mmap is possible, so fcntl should work too (no real logic behind that statement), but there might be limitations I would want to be aware of.)
I'll keep writing a test application to check it out, but I figured I should still ask in case there are better alternatives (or limitations to its functionality that I should be aware of, since my test will be limited and we don't want unknown limitations to become an issue later in the development process).
Thanks,
Edit: The base question has been properly answered by LustreOne; here I give more specific information about my use case so people can add pertinent additional information about Lustre concurrent access.
The Lustre clients will be servers to other applications.
Clients of those applications will each have their own set of files, but we want to allow those clients to log to their client space from multiple machines at the same time, and for that purpose we need to allow concurrent file reads and writes.
These, however, will always be a pretty small percentage of total I/O operations.
While really interesting insights were given in LustreOne's answer, not many of them apply to this use case (or rather, they do apply, but adding that complexity to the overall system might not be worth the impact on performance).
That is, for the use case considered at present, I'm sure they can be of much help to some, and to ourselves later on. However, what we are seeking right now is more a way to let two nodes, or two threads of one node responding to two requests, try to modify the same data, letting one pass and making the other detect the conflict, effectively blocking the client concerned.
I believed file locking would be enough for that use case, but had a preference for byte-range locking, since some of the files most concerned are appended to non-stop by some clients and read/modified up to the end by others.
However, judging from what I understood from LustreOne's answer:
That said, there is no strict requirement for this if your application
knows what it is doing. Lustre will already keep non-overlapping
writes consistent, and can handle concurrent O_APPEND writes as well.
The latter case is already managed by Lustre out of the box.
Any opinion on what the best alternatives could be? Will a simple flock() on the complete file be enough?
Note that some files will also have an index, which can be used to determine the availability of data without locking any of the data files. Should that be used, or are byte-range locks quick enough for us to avoid increasing the codebase size to support both cases?
A final mention on mmap. I'm pretty sure it doesn't fit our use case much, since we have so many files and many clients that the OSTs might not be able to cache much, but to be sure... should it be used, and if so, how? ^^
Sorry for being so verbose, it's one of my bad traits. :/
Have a nice day,

You should mount all clients with the "-o flock" mount option to enable globally coherent locking. Then flock() (and I think fcntl() locking) will work.
That said, there is no strict requirement for this if your application knows what it is doing. Lustre will already keep non-overlapping writes consistent, and can handle concurrent O_APPEND writes as well. However, since Lustre has to do internal locking for appends, this can hurt write performance significantly if there are a lot of different clients appending to the same file concurrently. (Note this is not a problem if only a single client is appending).
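For reference, here is a minimal sketch of what fcntl(2) byte-range locking looks like from a client, assuming every client mounts with "-o flock" so locks are coherent cluster-wide; the path and the locked range are made up for illustration:

/* Minimal sketch of advisory byte-range locking with fcntl(2).
 * Assumes the Lustre filesystem is mounted with "-o flock" on every
 * client so locks are coherent cluster-wide; the path is made up. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/lustre/project/data.log", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct flock lk;
    memset(&lk, 0, sizeof(lk));
    lk.l_type   = F_WRLCK;   /* exclusive write lock        */
    lk.l_whence = SEEK_SET;
    lk.l_start  = 4096;      /* lock bytes 4096..8191 only  */
    lk.l_len    = 4096;

    if (fcntl(fd, F_SETLKW, &lk) < 0) {   /* block until granted */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return 1;
    }

    /* ... read-modify-write the locked range here ... */

    lk.l_type = F_UNLCK;                  /* release the range   */
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return 0;
}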
If you are writing the application yourself, then there are a lot of things you can do to make performance better:
- have some central thread assign a "write slot number" to each writer (essentially an incrementing integer), and then the client writes to offset = recordsize * slot number (see the sketch after this list). Beyond assigning the slot number (which could be done in batches for better performance), there is no contention between clients. In most HPC applications the threads use the MPI rank as the slot number, since it is unique, and threads on the same node will typically be assigned adjacent slots so Lustre can further aggregate the writes. That doesn't work if you use a producer/consumer model where threads may produce variable numbers of records.
- make the IO recordsize a multiple of 4KiB in size to avoid contention between threads. Otherwise, the clients or servers will be forced to do read-modify-write for the partial records in a disk block, which is inefficient.
- Depending on whether your workflow allows it or not, rather than doing read and write into the same file, it will probably be more efficient to write a bunch of records into one file, then process the file as a whole and write into a second file. Not that Lustre can't do concurrent read and write to a single file, but this causes unnecessary contention that could be avoided.
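A rough sketch of the slot-number idea from the first bullet, assuming each writer has already been handed a unique slot by some central service (not shown) and that records have a fixed size; the function and constant names are hypothetical:

/* Sketch of the "write slot" scheme: each writer gets a unique,
 * incrementing slot number (here passed in by the caller; in MPI codes
 * it is typically the rank) and writes its fixed-size record at
 * offset = slot * RECORD_SIZE, so writers never overlap. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define RECORD_SIZE 4096   /* multiple of 4 KiB to avoid read-modify-write */

int write_record(int fd, uint64_t slot, const char record[RECORD_SIZE])
{
    off_t offset = (off_t)slot * RECORD_SIZE;
    ssize_t n = pwrite(fd, record, RECORD_SIZE, offset);
    if (n != RECORD_SIZE) {
        perror("pwrite");
        return -1;
    }
    return 0;
}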

Related

Usage PPM on top of Aerys

There is "Don’t use any blocking I/O functions in Aerys."
warning at https://amphp.org/aerys/io#blocking-io. Should I use PPM instead of Aerys if I need usage of PDO (e.g., Prooph components) and want to reuse initialized application instance for handling different requests?
I'm not bound to any existent PPM adapter (e.g., Symfony). Is there a way to reuse Aerys code (e.g., Router) for request-response logic when using PPM on top of Aerys (https://github.com/php-pm/php-pm/pull/267)?
You can just increase the worker count using the -w switch of the command line script if you want to use blocking functions. It's definitely not optimal, but with enough workers the blocking shouldn't be too noticeable, apart from some increase in latency.
Another possibility is to move the blocking calls into one or multiple worker threads with amphp/parallel.
As long as the responses are relatively fast everything should be fine. The issue begins if there's a lot of load and things get slower and might time out, because these are very long blocks then.
PHP-PM doesn't offer too much benefit over using Aerys directly. It redirects requests to a currently free worker, but with high enough load the kernel's load balancing will probably be good enough, and not all requests that take longer will be routed to one worker. In fact, using Aerys directly will probably be better, because it's production ready and has multiple independent workers instead of one master that might become a bottleneck. PHP-PM could solve that in a better way, but it's currently not implemented. Additionally, Aerys supports keep-alive connections, which PHP-PM currently does not.

Write once read many in memory key value store

I have a particular use case for multiple in-memory key-value maps that need very fast lookup times. They are set just once a day, so they can be considered immutable for all practical purposes. Redis is not an option, since it gets CPU throttled when multiple threads access it. Multi-instance Redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts: around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client-server architecture with multiple readers connecting to a server to read from shared memory maps. However, I wonder if such an architecture already exists and has been tested thoroughly for this use case, in which case I should not be reinventing the wheel.
So to sum up what is my best alternative? TIA.
Might not be suitable for you but you could try RBLDNSD and store your values in DNS. It's high performance and results will be cached, and it's easy to read the values from pretty much any programming environment. To write values to it you'll need to write directly to its zone files, but the format is simple and easy to write.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in memory key value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh their maps (e.g. Redis PUBLISH, or any other pubsub type framework).
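As a sketch of that notify-and-refresh idea, assuming Redis is used only for the pub/sub channel (not for serving reads) and using the hiredis C client; the channel name and the reload_maps() helper are made up for illustration:

/* Each worker keeps its own in-memory copy of the maps and rebuilds it
 * whenever a message arrives on a pub/sub channel (published once a day
 * after the new data is ready). */
#include <hiredis/hiredis.h>
#include <stdio.h>

static void reload_maps(void)
{
    /* application-specific: rebuild the local in-memory maps here */
    puts("reloading maps");
}

int main(void)
{
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    redisReply *r = redisCommand(c, "SUBSCRIBE maps.refresh");
    freeReplyObject(r);

    /* Block waiting for refresh notifications. */
    void *reply;
    while (redisGetReply(c, &reply) == REDIS_OK) {
        freeReplyObject(reply);
        reload_maps();
    }
    redisFree(c);
    return 0;
}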
At the risk of running afoul of the Stack Overflow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency, so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.

Implementing multithreaded application under C

I am implementing a small database like MySQL. It's part of a larger project.
Right now I have designed the core database, by which I mean I have implemented a parser and I can now execute some basic SQL queries on my database. It can store, update, delete and retrieve data from files. As of now that's fine; however, I want to make this work over a network.
I want more than one user to be able to access my database server and execute queries on it at the same time. I am working under Linux, so there is no issue of portability right now.
I know I need to use sockets, which is fine. I also know that I need to use a concept like a thread pool, where I create a maximum number of threads initially and then, for each client request, wake up a thread and assign it to that client.
What I am unable to figure out right now is how all this is actually going to be bundled together. Where should I implement multithreading: on the client side or the server side? How is my parser going to be configured to take input from each of the clients separately (mostly via files, I think)?
If anyone has an idea about how I can implement this, please do tell me, because I am stuck at this point in the project.
Thanks. :)
If you haven't already, take a look at Beej's Guide to Network Programming to get your hands dirty in some socket programming.
Next I would take his example of a stream client and server and just use that as a single-threaded query system. Once you have got this down, you'll need to choose whether you're going to actually use threads or use select(). My gut says your on-disk database doesn't yet support parallel writes (maybe reads), so a single server thread servicing requests is likely your best bet for starters!
In the multiple client model, you could use a simple per-socket hashtable of client information and return any results immediately when you process their query. Once you get into threading with the networking and db queries, it can get pretty complicated. So work up from the single client, add polling for multiple clients, and then start reading up on and tackling threaded (probably with pthreads) client-server models.
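To make that concrete, here is a rough sketch of a single-threaded select() server that handles one "query" per ready socket; the port is arbitrary and the query handling is just an echo placeholder:

/* Single-threaded select() server: one listening socket, one fd_set,
 * and each ready client's request is handled inline (echoed back here;
 * a real server would parse and execute the query instead). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5432),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);

    fd_set master;
    FD_ZERO(&master);
    FD_SET(lfd, &master);
    int maxfd = lfd;

    for (;;) {
        fd_set ready = master;
        if (select(maxfd + 1, &ready, NULL, NULL, NULL) < 0)
            break;
        for (int fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &ready))
                continue;
            if (fd == lfd) {                       /* new client */
                int cfd = accept(lfd, NULL, NULL);
                FD_SET(cfd, &master);
                if (cfd > maxfd) maxfd = cfd;
            } else {                               /* client sent a query */
                char buf[1024];
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                if (n <= 0) {                      /* disconnect */
                    close(fd);
                    FD_CLR(fd, &master);
                } else {
                    /* parse and execute the query here; echo for now */
                    send(fd, buf, (size_t)n, 0);
                }
            }
        }
    }
    return 0;
}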
Server side, as it is the only one that can make sense of the information. You need to design locks, or come up with your own model, to make sure that modification/editing doesn't affect clients currently being served.
As an alternative to multithreading, you might consider event-based single threaded approach (e.g. using poll or epoll). An example of a very fast (non-SQL) database which uses exactly this approach is redis.
This design has two obvious disadvantages: you only ever use a single CPU core, and a lengthy query will block other clients for a noticeable time. However, if queries are reasonably fast, nobody will notice.
On the other hand, the single-threaded design has the advantage of automatically serializing requests. There are no ambiguities and no locking is needed. No write can come in between a read (or another write); it just can't happen.
If you don't have something like a robust, working MVCC built into your database (or are at least working on it), knowing that you need not worry can be a huge advantage. Concurrent reads are not so much an issue, but concurrent reads and writes are.
Alternatively, you might consider doing the input/output and syntax checking in one thread, and running the actual queries in another (queries passed via a queue). That, too, will remove the synchronisation woes, and it will at least offer some latency hiding and some use of multiple cores.
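A minimal sketch of that two-thread split, using a pthread mutex and condition variables around a small ring buffer of query strings; the queue size and query format are arbitrary:

/* The network/parsing thread pushes validated query strings onto a
 * bounded queue, and a single executor thread pops and runs them, so
 * the storage engine never sees two queries at once. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QCAP 64

static char queue[QCAP][256];
static int head, tail, count;
static pthread_mutex_t mtx      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

void enqueue_query(const char *sql)           /* called by the I/O thread */
{
    pthread_mutex_lock(&mtx);
    while (count == QCAP)
        pthread_cond_wait(&nonfull, &mtx);
    strncpy(queue[tail], sql, sizeof(queue[tail]) - 1);
    queue[tail][sizeof(queue[tail]) - 1] = '\0';
    tail = (tail + 1) % QCAP;
    count++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&mtx);
}

void *executor(void *arg)                     /* the single query runner */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (count == 0)
            pthread_cond_wait(&nonempty, &mtx);
        char sql[256];
        strcpy(sql, queue[head]);
        head = (head + 1) % QCAP;
        count--;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&mtx);

        /* run the parsed query against the storage engine here */
        printf("executing: %s\n", sql);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, executor, NULL);
    enqueue_query("SELECT * FROM users");
    pthread_join(t, NULL);   /* executor loops forever in this sketch */
    return 0;
}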

File writes per second

I want to log visits to my website, which has a high visit rate, to a file. How many writes to a log file can I perform per second?
If you can't use Analytics, why wouldn't you use your webserver's existing logging system? If you are using a real webserver, it almost certainly has a logging mechanism that is already optimized for maximum throughput.
Your question is impossible to answer in all other respects. The number of possible writes is governed by hardware, operating system and contention from other running software.
Don't do that, use Google Analytics instead. You'd end up running into many problems trying to open files, write to them, close them, so on and so forth. Problems would arise when you overwrite data that hasn't yet been committed, etc.
If you need your own local solution (within a private network, etc) you can look into an option like AWStats which operates off of crawling through your log files.
Or just analyze the Apache access log files. For example with AWStats.
File writes are not expensive until you actually flush the data to disk. Usually your operating system will cache things aggressively so you can have very good write performance if you don't try to fsync() your data manually (but of course you might lose the latest log entries if there's a crash).
Another problem, however, is that file I/O is not necessarily thread-safe, and writing to the same file from multiple threads or processes (which will probably happen if we're talking about a web app) might produce wrong results: missing, duplicate or intermingled log lines, for example.
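One common mitigation is to open the log with O_APPEND and emit each complete line with a single write() call; on local POSIX filesystems that keeps lines from different processes from interleaving, though some network filesystems don't give the same guarantee. A small sketch, with a made-up path and log format:

/* Open the log with O_APPEND and write each complete line with one
 * write() call so concurrent writers don't interleave within a line. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

void log_visit(int fd, const char *path_visited)
{
    char line[512];
    int n = snprintf(line, sizeof(line), "%ld %s\n",
                     (long)time(NULL), path_visited);
    if (n > 0) {
        ssize_t w = write(fd, line, (size_t)n);   /* one write() per line */
        (void)w;
    }
}

int main(void)
{
    int fd = open("/var/log/myapp/visits.log",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    log_visit(fd, "/index.html");
    close(fd);
    return 0;
}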
If your hard disk drive can write 40 MB/s and your log file lines are approx. 300 bytes in length, you should be able to write roughly 140,000 HTTP requests per second to your logfile if you keep it open.
Anyway, you should not do that on your own, since most web servers already write logfiles and know very well how to do that, how to roll the files when a maximum size is reached, and how to format the log lines according to well-known patterns.
File access is very expensive, especially when doing writes. I would recommend saving them to RAM (using whatever cache method suits you best) and periodically writing the results to disk.
You could also use a database for this. Something like:
UPDATE stats SET hits = hits + 1
Try out a couple different solutions, benchmark the performance, and implement whichever works fast enough with minimal resource usage.
If using Apache, I'd recommend using the rotatelogs utility supplied as a part of the standard kit.
We use this to allow rotating the server logs out on a daily basis without having to stop and start the server. N.B. Use the new "||" syntax when declaring the log directive.
The site I'm involved with is one of the largest on the Internet with hit rates peaking in the millions per second for extended periods of time.
Edit: I forgot to say that the site uses standard Apache logging directives and we have not needed to customise the Apache logging code at all.
Edit: BTW Unless you really need it, don't log bytes served as this causes all sorts of issues around the midnight boundary.
Let Apache do it; do the analysis work on the back-end.

What's more costly on every page view - Database Writes or File Writes?

What is the most efficient solution when you need to record some data on every page view in your application - should you write to a file or write to the database?
Or maybe neither - perhaps you should cache the data in memory or a file and only write it to the database (or file system if you use a memory cache) occasionally?
If it's purely recording a small amount of data with no subsequent lookups, straight file I/O is almost guaranteed to be more efficient. You're losing all the advantages of a DBMS though -- indexing, transactional integrity (really, ACID in general), concurrent access, etc..
It almost sounds like you're talking about what amounts to simple logging. If that's the case, and you don't need to do frequent complex queries on the resulting data, you're probably better off with straight file I/O if performance is a serious issue. Be careful of concurrent-write issues, though.
If the properties of an RDBMS are desirable, you might think about using SQLite, which for simplistic loads will get you better performance than most RDBMSs with less overhead, at the cost of some of the benefits (highly concurrent access and availability over the network to other machines are a couple of the "biggies"). It still wouldn't be as fast as straight file I/O in the general case, though.
Your later mention of it being for page view tracking causes me to ask: Are you incrementing a counter, rather than logging data about the page view? If so, I'd strongly suggest going with something like SQLite (doing something like UPDATE tbl SET counter = counter+1). You really don't want to get into the timing issues involved in doing this by hand -- if you don't do it right, you'll start losing counts on simultaneous access (A reads "100", B reads "100", A writes "101", B writes "101"; B should have written 102, but has no way of knowing that).
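For illustration, here is a minimal sketch of that counter approach using the SQLite C API; the database file, table, and column names are made up:

/* SQLite serializes the UPDATE, so concurrent requests don't lose
 * counts the way a hand-rolled read/increment/write of a file would. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("stats.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    char *err = NULL;
    const char *sql =
        "CREATE TABLE IF NOT EXISTS pageviews("
        "    page TEXT PRIMARY KEY, counter INTEGER DEFAULT 0);"
        "INSERT OR IGNORE INTO pageviews(page) VALUES('/index.html');"
        "UPDATE pageviews SET counter = counter + 1 WHERE page = '/index.html';";

    if (sqlite3_exec(db, sql, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "exec failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}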
Conceptually, writing to the database is always slower than writing to a file.
The database has to write to a file too, with the extra overhead of communication to get the data to the database so that it can write it to a file. Therefore, it must be slower.
That said, databases do disk I/O very well, probably better than you will. Don't be surprised if you find out that a simple file logger is slower than writing to a database. The database has a lot of I/O optimizations and has some tricks available that you may not (depending on your web language and environment).
Don't be surprised if the answer changes over time. When your site is small, logging to a database is very fast. As your site grows, the logging table can become a major pain: It uses a lot of disk space, makes the backups take forever, and consumes all the disk I/O when you try to query it. This is why you should benchmark both methods yourself. Then you can re-test in the future, when conditions change.
Hitting the database is most likely going to be more expensive than writing to a file.
If your pageviews per second are high, and if the data doesn't need to be available in the database right away, then writing to a file and periodically loading the data into the DB will be a more optimal solution.
However it all depends on the nature of the data you're recording per page view and how critical it is to whatever business function it serves.
That highly depends on your needs for data safety. If you can afford to lose some data in case of a crash then keeping the data in memory and writing it periodically to a persistent store is certainly the most efficient way to go.
Edit: You mentioned pageviews. In that case I would keep the counters in memory and periodically update a database table (like every minute or so).
That depends.
And it really does: it depends on the DBMS and/or the OS+filesystem you use. In other words, your mileage may vary.
If you just append data somewhere, modern DBMSs and OS+filesystems should handle this equally well and fast. Problems arise when you want to change data.
Caching depends, too, on what kind of caching granularity you can afford (the need to have every step logged crash-safe versus the potential savings).
Use a hybrid solution like Redis; it's designed for this sort of thing.
