I'm currently developing a cache server which by its nature uses a lot of RAM (I'm testing it on a server with a lot of HTTP traffic, with both a WordPress site and a custom web application using it to cache data in memory).
The server obviously performs a lot of malloc/realloc/free operations, which are expensive, so I was wondering whether I should use a custom memory allocator: something that preallocates a big "memory pool" at the beginning, hands out free "pieces" of the requested size when a malloc/realloc is performed, and flags them as free again when free is called.
Am I on the right path, or do I not really need such a thing? Is there an existing allocator like that, or do I have to write my own?
Important notes:
The server is single-threaded (using multiplexing), so I don't
need allocators that excel in multi-threaded
applications (such as jemalloc, which, as far as I understand, is
about as good as the normal malloc in single-threaded applications...
Correct me if I'm wrong, please).
Before you ask/suggest, I've already used Valgrind to remove every
possible memory leak. I just need to optimize, not to fix.
Memory fragmentation is a problem, so the approach should help with that too.
Using an appropriate configuration directive, the user can set the
maximum amount of memory the server may use, which is why a preallocated
fixed-size memory pool came to mind.
I don't have performance problems; I'm developing this just for fun and curiosity. I like to learn and experiment with new programming techniques.
And yes, I've used callgrind, and malloc is one of the most expensive operations.
Since you said you don't have a performance problem, you don't need to do anything. Set that aside.
You need a foothold to get any kind of improvement, since malloc is really fast. Offhand I recall about 100-200 cycles, on Mac OS X from a few years ago. (But free can also take more time, which should show up in the profiling statistics.) Writing a better general-purpose allocator is essentially impossible, no matter how much skill or luck you bring.
Patterns specific to your application can still expose opportunities, though. I've had luck with programs that free objects in roughly the same order as their creation.
Create some memory buckets.
malloc returns blocks from the buckets in linear fashion.
free marks blocks as unused. This may utilize a bitmap, a managed smart-pointer, or the like. For pure garbage collection, no explicit call is needed.
When the last bucket is full, sweep the least recently used one(s), but only check whether each is completely empty.
If there is no empty bucket, make a new one.
This is a horrible strategy if there's any fragmentation, but it can improve on plain malloc by a couple of orders of magnitude, because allocating by just moving along a bucket takes 1-2 cycles instead of 100-200, and with little enough fragmentation you can always make the sweeps infrequent enough.
The one-in-a-million slow sweep also disqualifies this approach from many applications.
I don't know much about cache servers, but perhaps this approach would work for HTTP connection objects, which are highly transient. You need to focus not on all of malloc, but on the subset of allocations that incur the biggest cost in the most predictable pattern.
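In case it helps, here's a rough sketch of that bucket scheme in C. It's simplified to fixed-size blocks, with made-up sizes, and the bucket_alloc/bucket_free names are mine; a real allocator for your server would need size classes or variable-size blocks, so treat this as an illustration of the bump-and-sweep idea rather than drop-in code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Arbitrary sizes; a real allocator would tune these to the workload. */
#define BLOCKS_PER_BUCKET 1024
#define BLOCK_SIZE        256

typedef struct bucket {
    unsigned char  data[BLOCKS_PER_BUCKET * BLOCK_SIZE];
    uint8_t        used[BLOCKS_PER_BUCKET]; /* 1 = handed out           */
    size_t         next;                    /* next never-used block    */
    size_t         live;                    /* blocks currently in use  */
    struct bucket *link;                    /* chain of all buckets     */
} bucket;

static bucket *buckets = NULL;   /* whole chain, never unlinked */
static bucket *current = NULL;   /* bucket we allocate from     */

static bucket *bucket_new(void) {
    bucket *b = calloc(1, sizeof *b);
    if (b) { b->link = buckets; buckets = b; }
    return b;
}

/* Allocation is a bump of `next`; when the current bucket is exhausted,
 * sweep the chain for a completely empty bucket and reuse it, and only
 * grow if none is found. */
void *bucket_alloc(void) {
    if (!current || current->next == BLOCKS_PER_BUCKET) {
        current = NULL;
        for (bucket *b = buckets; b; b = b->link) {
            if (b->live == 0) {                    /* fully drained: reuse */
                memset(b->used, 0, sizeof b->used);
                b->next = 0;
                current = b;
                break;
            }
        }
        if (!current) current = bucket_new();      /* nothing empty: grow */
        if (!current) return NULL;
    }
    current->used[current->next] = 1;
    current->live++;
    return current->data + BLOCK_SIZE * current->next++;
}

/* Freeing only flips a flag; the memory is actually reclaimed when the
 * whole bucket drains and bucket_alloc() reuses it. */
void bucket_free(void *p) {
    for (bucket *b = buckets; b; b = b->link) {
        unsigned char *q = p;
        if (q >= b->data && q < b->data + sizeof b->data) {
            size_t i = (size_t)(q - b->data) / BLOCK_SIZE;
            if (b->used[i]) { b->used[i] = 0; b->live--; }
            return;
        }
    }
}
```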
I'm trying to develop an application that will be running on multiple computers linked to a shared Lustre storage, performing various actions, including but not limited to:
Appending data to a file.
Reading data from a file.
Reading from and writing to a file, modifying all of its content past a certain offset.
Reading from and writing to a file, modifying its content at a specific offset.
As you can see, the basic I/O one can wish for.
Since most of this is concurrent, I'll need some kind of locking to make the different writes safe, but I've seen that Lustre doesn't support flock(2) by default (and I'm not sure I want to use it over fcntl(2); I guess I will if it comes to it), and I haven't seen anything about fcntl(2) that confirms its support.
Researching it mostly resulted in me reading a lot of papers about I/O optimization with Lustre, but those usually explain how their hardware/software/network is structured rather than how it's done in the code.
So, can I use fcntl(2) with Lustre? Should I use it? If not, what are other alternatives to allow different clients to perform concurrent modifications of the data?
Or is it even possible? (I've seen in Lustre tickets that mmap is possible, so fcntl should work too (no real logic behind that statement), but there might be limitations I would want to be aware of.)
I'll keep writing a test application to check it out, but I figured I should still ask in case there are better alternatives (or limitations to its functionality that I should be aware of, since my test will be limited and we don't want unknown limitations to become an issue later in the development process).
Thanks,
Edit: The base question has been properly answered by LustreOne; here I give more specific information about my use case, so people can add pertinent details about concurrent access in Lustre.
The Lustre clients will be servers to other applications.
Clients of those applications will each have their own set of files, but we want to allow those clients to log to their client space from multiple machines at the same time, and for that purpose we need to allow concurrent file reads and writes.
These, however, will always be a pretty small percentage of total I/O operations.
While really interesting insights were given in LustreOne's answer, not many of them apply to this use case (or rather, they do apply, but adding that complexity to the overall system might not be worth the impact on performance).
That is, for the use case considered at present, I'm sure it can be of much help to some, and to ourselves later on. However, what we are seeking right now is simpler: when two nodes, or two threads on one node serving two requests, try to modify the same data, we want one to go through and the other to detect the conflict and be stopped.
I believed file locking would be enough for that use case, but I had a preference for byte-range locking, since some of the files concerned are appended to non-stop by some clients and read/modified up to the end by others.
However, judging from what I understood from LustreOne's answer:
That said, there is no strict requirement for this if your application
knows what it is doing. Lustre will already keep non-overlapping
writes consistent, and can handle concurrent O_APPEND writes as well.
The latter case is already handled by Lustre out of the box.
Any opinion on what the best alternatives could be? Will a simple flock() on the complete file be enough?
Note that some files will also have an index, which can be used to determine the availability of data without locking any of the data files. Should that be used, or are byte-range locks quick enough that we can avoid growing the codebase to support both cases?
A final word on mmap. I'm pretty sure it doesn't fit our use case well, since we have so many files and so many clients that the OSTs might not be able to cache much, but just to be sure... should it be used, and if so, how? ^^
Sorry for being so verbose, it's one of my bad traits. :/
Have a nice day,
You should mount all clients with the "-o flock" mount option to enable globally coherent locking. Then flock() (and I think fcntl() locking) will work.
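For reference, byte-range locking with fcntl(2) on a Lustre file looks like ordinary POSIX code; nothing below is Lustre-specific, and the file name and record size are just placeholders for this sketch:

```c
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Lock `len` bytes starting at `off` for writing, blocking until the
 * region is available.  With the filesystem mounted with "-o flock"
 * this should be coherent across nodes; it is advisory locking either way. */
static int lock_range(int fd, off_t off, off_t len)
{
    struct flock fl = {0};
    fl.l_type   = F_WRLCK;           /* F_RDLCK for shared read locks     */
    fl.l_whence = SEEK_SET;
    fl.l_start  = off;
    fl.l_len    = len;               /* 0 would mean "to end of file"     */
    return fcntl(fd, F_SETLKW, &fl); /* F_SETLK to fail instead of wait   */
}

static int unlock_range(int fd, off_t off, off_t len)
{
    struct flock fl = {0};
    fl.l_type   = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = off;
    fl.l_len    = len;
    return fcntl(fd, F_SETLK, &fl);
}

int main(void)
{
    int fd = open("shared.dat", O_RDWR | O_CREAT, 0644); /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    if (lock_range(fd, 4096, 4096) == 0) {   /* lock one 4 KiB record */
        /* read-modify-write of bytes 4096..8191 goes here */
        unlock_range(fd, 4096, 4096);
    } else {
        perror("fcntl(F_SETLKW)");
    }
    close(fd);
    return 0;
}
```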
That said, there is no strict requirement for this if your application knows what it is doing. Lustre will already keep non-overlapping writes consistent, and can handle concurrent O_APPEND writes as well. However, since Lustre has to do internal locking for appends, this can hurt write performance significantly if there are a lot of different clients appending to the same file concurrently. (Note this is not a problem if only a single client is appending).
If you are writing the application yourself, then there are a lot of things you can do to make performance better:
- have some central thread assign a "write slot number" to each writer (essentially an incrementing integer), and then the client writes to offset = recordsize * slot number (see the sketch after this list). Beyond assigning the slot number (which could be done in batches for better performance), there is no contention between clients. In most HPC applications the threads use the MPI rank as the slot number, since it is unique, and threads on the same node will typically be assigned adjacent slots so Lustre can further aggregate the writes. That doesn't work if you use a producer/consumer model where threads may produce variable numbers of records.
- make the I/O record size a multiple of 4 KiB to avoid contention between threads. Otherwise, the clients or servers will be forced to do a read-modify-write for the partial records in a disk block, which is inefficient.
- Depending on whether your workflow allows it or not, rather than doing read and write into the same file, it will probably be more efficient to write a bunch of records into one file, then process the file as a whole and write into a second file. Not that Lustre can't do concurrent read and write to a single file, but this causes unnecessary contention that could be avoided.
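To make the first two bullets concrete, here is a minimal sketch of the slot-number write in C. The record size and the slot source are placeholders: a real system would hand out slots from a central service or use the MPI rank, as described above.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 8192   /* a multiple of 4 KiB, per the second bullet */

/* Stand-in for the central "write slot" service (or the MPI rank);
 * a real system would use an RPC or a shared atomic counter instead. */
static uint64_t get_slot_number(void)
{
    static uint64_t next_slot = 0;
    return next_slot++;
}

/* Each writer gets its own non-overlapping region, so no locking is
 * needed between clients: offset = slot * RECORD_SIZE. */
int write_record(int fd, const void *record, size_t len)
{
    if (len > RECORD_SIZE)
        return -1;

    unsigned char buf[RECORD_SIZE] = {0};      /* pad to a full record */
    memcpy(buf, record, len);

    uint64_t slot  = get_slot_number();
    off_t    offset = (off_t)(slot * RECORD_SIZE);

    ssize_t n = pwrite(fd, buf, RECORD_SIZE, offset);
    return n == RECORD_SIZE ? 0 : -1;
}
```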
I saw this Python question: App Engine Deferred: Tracking Down Memory Leaks
... Similarly, I've run into this dreaded error:
Exceeded soft private memory limit of 128 MB with 128 MB after servicing 384 requests total
...
After handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
According to that other question, it could be that the "instance class" is too small to run this application, but before increasing it I want to be sure.
After checking through the application I can't see anything obvious as to where a leak might be (for example, unclosed buffers, etc.), and so whatever it is, it's got to be a very small but perhaps common mistake.
Because this is running on GAE, I can't really profile it locally very easily, as far as I know, since that's the runtime environment. Might anyone have a suggestion as to how to proceed and ensure that memory is being recycled properly? I'm sort of new to Go, but I've enjoyed working with it so far.
For a starting point, you might be able to try pprof.WriteHeapProfile. It'll write to any Writer, including an http.ResponseWriter, so you can write a view that checks for some auth and gives you a heap profile. An annoying thing about that is that it's really tracking allocations, not what remains allocated after GC. So in a sense it's telling you what's RAM-hungry, but doesn't target leaks specifically.
The standard expvar package can expose some JSON including memstats, which tells you about GCs and the number of allocs and frees for particular allocation sizes (example). If there's a leak, you could use allocs minus frees to get a sense of whether it's large allocations or small ones that are growing over time, but that's not very fine-grained.
Finally, there's a function to dump the current state of the heap, but I'm not sure it works in GAE and it seems to be kind of rarely used.
Note that, to keep GC work down, Go processes grow to be about twice as large as their actual live data as part of normal steady-state operation. (The exact % it grows before GC depends on runtime.GOGC, which people sometimes increase to save collector work in exchange for using more memory.) A (very old) thread suggests App Engine processes regulate GC like any other, though they could have tweaked it since 2011. Anyhow, if you're allocating slowly (good for you!) you should expect slow process growth; it's just that usage should drop back down again after each collection cycle.
A possible approach to check whether your app does have a memory leak is to temporarily upgrade the instance class and check the memory usage pattern (in the developer console, on the Instances page, select the Memory Usage view for the respective module version).
If the pattern eventually levels out and the instance no longer restarts then indeed your instance class was too low. Done :)
If the usage pattern keeps growing (at a rate proportional to the app's activity), then indeed you have a memory leak. During this exercise you might also be able to narrow down the search area, if you manage to correlate the growth areas of the graph with certain activities of the app.
Even if there is a leak, using a higher instance class should increase the time between the instance restarts, maybe even making them tolerable (comparable with the automatic shutdown of dynamically managed instances, for example). Which would allow putting the memory leak investigation on the back burner and focusing on more pressing matters, if that's of interest to you. One could look at such restarts as an instance refresh/self-cleaning "feature" :)
I'm running the same datomic-backed application on a variety of architectures with varying amounts of memory (1GB - 16GB). When I do bulk imports of data I frequently run into timeouts or out-of-memory errors.
After looking at the documentation I happened upon this helpful document (and this one) which seem to outline best practices to obtain good performance under heavy imports.
I'm not as interested in performance as I am in making imports "just work." This leads to my main question:
What is the minimum complexity configuration to ensure that an arbitrarily large import process terminates on a given machine?
I understand that this configuration may be a function of my available memory, that's fine. I also understand that it may not be maximally performant; that's also fine. But I do need to know that it will terminate.
Data Distribution
I think the critical pieces of information missing from your question are the type of data and its distribution and the available metrics for your system during the bulk imports. Why?
Datomic's transaction rate is limited by the cost of background indexing jobs and the cost of that indexing is a function of the distribution of new values and the size of your database.
What this means is that, for example, if you have indexed attributes (i.e. :db/index) and, as your bulk import goes through, the distribution of those attribute values is random, you'll put a lot of pressure on the indexing jobs as they rewrite an ever increasing number of segments. As your database size grows, indexing will dominate the work of the transactor, which won't be able to catch up.
Transactor Memory
As described in the docs, the more memory you can give to object-cache-max, the better. This is especially important if your data has a lot of uniqueness constraints (i.e. :db/unique), since that will prevent the transactor from fetching some storage segments multiple times.
Depending on your data distribution, increasing the memory-index-threshold and memory-index-max settings may let your imports run longer... until the indexing job can't keep up. This seems to be what's happening to you.
Recommendations
Try reducing the memory-index-threshold and memory-index-max settings. That may seem counterintuitive, but you'll have much better chances of having any import complete (of course it will take more time, but you can almost guarantee it will finish). The key is to make the transactor throttle your (peer) requests before it is no longer able to catch up with the indexing jobs.
I deployed an instance of Solr onto an Ubuntu machine with Tomcat. I have a single-threaded client program that reads and injects data into Solr. I've been observing memory and CPU usage, and realized I still have a lot of resources (in terms of memory and CPUs) left to use. I wonder if I should change my indexing code to multi-threading to inject into Solr. Indexing 20 million records with the current single-threaded program takes about 14 hours, which is why I wonder whether I should switch to multi-threading. Thanks in advance for your suggestions and help! :)
Multi-threading while indexing in Solr is widely used.
It's not very clear from what you say whether you can also multi-thread the reading from your source, but I think that is the way to go.
I suggest you try it, but first analyze your code to see which part is the slowest and include that in the multi-threading.
Also keep an eye on your commit strategy.
From the Solr documentation: (http://wiki.apache.org/solr/SolrPerformanceFactors)
"In general, adding many documents per update request is faster than one per update request. ...
Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection."
Here's the deal. We would have taken the completely static HTML road to solve the performance issues, but since the site will be partially dynamic, this won't work out for us.
What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.
Here are the two approaches we have thought of right now:
Using memcache on >>all<< major queries and leaving it alone to do what it does best.
Using memcache for the most commonly retrieved data, and combining it with a standard hard-drive-backed cache for further usage.
The major advantage of only using memcache is of course the performance, but as the number of users increases, memory usage gets heavy. Combining the two sounds like a more natural approach to us, even with the theoretical compromise in performance.
Memcached appears to have some replication features available as well, which may come in handy when it's time to increase the number of nodes.
What approach should we use?
- Is it stupid to compromise and combine the two methods? Should we instead focus on utilizing memcache alone, and simply upgrade the memory as the load increases with the number of users?
Thanks a lot!
Compromising and combining the two methods is a very clever way to go, I think.
The most obvious cache-management rule is the latency vs. size trade-off, which is used in CPU caches as well. In a multi-level cache, each next level should be larger to compensate for its higher latency: we get higher latency, but also a higher cache hit ratio. So I wouldn't recommend placing a disk-based cache in front of memcache; on the contrary, it should be placed behind memcache. The only exception is if your cache directory is mounted in memory (tmpfs). In that case a file-based cache could compensate for high load on memcache, and could also bring latency benefits (because of data locality).
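As a rough illustration of that ordering (memory tier in front, file tier behind it), a lookup could be layered as below; the cache_mem_*/cache_disk_*/backend_fetch functions are hypothetical stand-ins for a memcached client, file I/O and a database query, not a real API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two cache tiers and the backend. */
bool cache_mem_get (const char *key, char *out, size_t cap);
void cache_mem_set (const char *key, const char *val);
bool cache_disk_get(const char *key, char *out, size_t cap);
void cache_disk_set(const char *key, const char *val);
bool backend_fetch (const char *key, char *out, size_t cap);

/* L1 = memcache (small, fast), L2 = disk (big, slower), then the
 * database.  On a lower-tier hit the value is promoted upwards, the
 * same way CPU cache hierarchies behave. */
bool cached_get(const char *key, char *out, size_t cap)
{
    if (cache_mem_get(key, out, cap))          /* fast path           */
        return true;

    if (cache_disk_get(key, out, cap)) {       /* slower, bigger tier */
        cache_mem_set(key, out);               /* promote to memory   */
        return true;
    }

    if (!backend_fetch(key, out, cap))         /* full miss           */
        return false;

    cache_disk_set(key, out);                  /* populate both tiers */
    cache_mem_set(key, out);
    return true;
}
```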
These two stores (file-based and memcache) are not the only ones convenient for caching. You could also use almost any key-value database, as they are very good at concurrency control.
Cache invalidation is a separate question that deserves your attention. There are several tricks you can use to handle cache updates on misses more gracefully. One of them is dog-pile effect prevention: if several concurrent threads get a cache miss simultaneously, all of them go to the backend (the database); the application should instead allow only one of them to proceed while the rest wait on the cache. The second is background cache updating: it's nicer to update the cache not in the web-request thread but in the background, where you can control the concurrency level and update timeouts more gracefully.
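A minimal sketch of the dog-pile protection, using a single in-process mutex; in a multi-server PHP/memcached setup you would typically get the same effect with a short-lived lock key instead, and cache_get/cache_set/backend_fetch are hypothetical stand-ins:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache/backend hooks, as in the sketch above. */
bool cache_get(const char *key, char *out, size_t cap);
void cache_set(const char *key, const char *val);
bool backend_fetch(const char *key, char *out, size_t cap);

/* One mutex guards recomputation (a real system would use one lock per
 * key, or a lock key in the shared cache so it works across servers). */
static pthread_mutex_t rebuild_lock = PTHREAD_MUTEX_INITIALIZER;

bool get_with_dogpile_protection(const char *key, char *out, size_t cap)
{
    if (cache_get(key, out, cap))
        return true;                        /* hit: no contention       */

    pthread_mutex_lock(&rebuild_lock);      /* only one thread rebuilds */

    /* Someone else may have repopulated the cache while we waited. */
    if (cache_get(key, out, cap)) {
        pthread_mutex_unlock(&rebuild_lock);
        return true;
    }

    bool ok = backend_fetch(key, out, cap); /* single trip to the DB    */
    if (ok)
        cache_set(key, out);

    pthread_mutex_unlock(&rebuild_lock);
    return ok;
}
```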
Actually there is one cool method that lets you do tag-based cache tracking (memcached-tag, for example). It's very simple under the hood. With every cache entry you save a vector of the versions of the tags it belongs to (for example: {directory#5: 1, user#8: 2}). When you read a cache entry you also read the current version numbers of those tags from memcached (this can be done efficiently with a multiget). If at least one current tag version is greater than the tag version saved with the cache entry, then the entry is invalid. When you change an object (for example a directory), the appropriate tag version should be incremented. It's a very simple and powerful method, but it has its own disadvantages: in this scheme you can't explicitly purge the invalidated entries, so memcached can easily drop live entries while keeping stale ones.
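Here is roughly what the read-side check looks like, with hypothetical kv_* helpers standing in for memcached calls; a real client would fetch all tag counters in one multiget rather than a loop of single gets:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key-value helpers standing in for a memcached client. */
bool     kv_get_value(const char *key, char *out, size_t cap,
                      uint64_t *saved_tag_versions, size_t ntags);
uint64_t kv_get_counter(const char *tag);       /* current version of a tag  */
void     kv_incr_counter(const char *tag);      /* bump a tag when an object changes */

/* An entry is valid only if every tag version saved with it is still
 * current; a bumped tag makes the entry count as a miss. */
bool tagged_get(const char *key,
                const char *const *tags, size_t ntags,
                char *out, size_t cap)
{
    uint64_t saved[16];
    if (ntags > 16 || !kv_get_value(key, out, cap, saved, ntags))
        return false;                              /* plain miss        */

    for (size_t i = 0; i < ntags; i++)
        if (kv_get_counter(tags[i]) > saved[i])
            return false;                          /* tag bumped: stale */

    return true;
}

/* Invalidation never touches the entries themselves: changing an object
 * just bumps its tag, e.g. kv_incr_counter("directory#5"). */
```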
And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.
Memcached is quite a scalable system. For instance, you can replicate the cache to decrease access time for certain key buckets, or implement the Ketama algorithm, which lets you add/remove Memcached instances from the pool without remapping all keys. In this way, you can easily add new machines dedicated to Memcached when you happen to have extra memory. Furthermore, since instances can be run with different sizes, you can scale up an instance by adding more RAM to an old machine. Generally, this approach is more economical and to some extent not inferior to the first one, especially for multiget() requests. Regarding a performance drop with data growth, the runtime of the algorithms used in Memcached does not vary with the size of the data, so access time depends only on the number of simultaneous requests. Finally, if you want to tune your memory/performance priorities, you can set the expire time and the available-memory configuration values, which will restrict RAM usage or increase cache hits.
At the same time, when you use a hard disk, the file system can become a bottleneck of your application. Besides general I/O latency, things such as fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard disk settings are tuned more for compatibility than for speed, so it is advisable to configure them properly before use (for instance, with the hdparm utility).
Thus, before adding one more integration point, I think you should tune the existing system. Usually a properly designed database, well-configured PHP, Memcached, and proper handling of static data are enough even for a high-load web site.
I would suggest that you first use memcache for all major queries. Then, test to find queries that are least used or data that is rarely changed and then provide a cache for this.
If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.
Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.
In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.
The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).
Also, the way memcached is normally used depends on a low-latency link from your app servers, so you wouldn't normally use the same memcached cluster in different geographical locations within your infrastructure (each DC would have its own cluster).
The process is:
Identify performance problems
Decide how much performance improvement is enough
Reproduce problems in your test lab, on production-grade hardware with necessary driver machines - this is nontrivial and you may need a lot of dedicated (even specialised) hardware to drive your app hard enough.
Test a proposed solution
If it works, release it to production, if not, try more options and start again.
You should not
Cache "everything"
Do things without measuring their actual impact.
As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.
This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
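Something as small as the per-cache counter pair below is enough to compute the hit rate in production; it's a generic sketch, not tied to any particular cache library:

```c
#include <stdatomic.h>
#include <stdio.h>

/* One pair of counters per cache; bump them on every lookup and export
 * or log the ratio so wasted caches can be spotted (and switched off). */
typedef struct {
    const char   *name;
    atomic_ulong  hits;
    atomic_ulong  misses;
} cache_stats;

static inline void record_lookup(cache_stats *s, int hit) {
    if (hit) atomic_fetch_add(&s->hits, 1);
    else     atomic_fetch_add(&s->misses, 1);
}

static void report(cache_stats *s) {
    unsigned long h = s->hits, m = s->misses;
    double rate = (h + m) ? 100.0 * h / (h + m) : 0.0;
    printf("%s: %lu hits, %lu misses (%.1f%% hit rate)\n",
           s->name, h, m, rate);
}
```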
It may also be worth having the individual caches switchable in production.
Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.
You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough).
For Solaris, you can actually even add an SSD layer in the middle; this technology is called L2ARC.
I'd recommend you to read this for a start: http://blogs.oracle.com/brendan/entry/test.