Zend_Cache_Backend_Sqlite vs Zend_Cache_Backend_File - file

Currently I'm using Zend_Cache_Backend_File for caching in my project (especially responses from external web services). I was wondering whether I could gain anything by migrating the structure to Zend_Cache_Backend_Sqlite.
Possible advantages are:
The file system stays well-ordered (only one file in the cache folder)
Removing expired entries should be quicker (my assumption, since Zend wouldn't need to scan each cache entry's internal metadata for its expiration date)
Possible disadvantages:
Finding a record to read: with files, Zend just checks whether a file with that name exists, which should be a bit quicker.
I've tried searching the internet a bit, but there doesn't seem to be much discussion about this.
What do you think about it?
Thanks in advance.

I'd say, it depends on your application.
Switching shouldn't be hard. Just test both cases and see which works best for you; no benchmark is objective except your own.
Measuring just performance, Zend_Cache_Backend_Static is the fastest one.
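As a rough sketch of what "test both cases" could look like with the ZF1 factory (the paths, lifetime and iteration count below are assumptions, not taken from the question):
<?php
// Sketch: build both backends with the same frontend and time your own workload.
require_once 'Zend/Cache.php';
$frontendOptions = array('lifetime' => 3600, 'automatic_serialization' => true);
$backends = array(
    'File'   => array('cache_dir' => '/tmp/cache/'),                      // assumed directory
    'Sqlite' => array('cache_db_complete_path' => '/tmp/cache/cache.db'), // assumed path
);
foreach ($backends as $name => $backendOptions) {
    $cache = Zend_Cache::factory('Core', $name, $frontendOptions, $backendOptions);
    $start = microtime(true);
    for ($i = 0; $i < 1000; $i++) {
        $cache->save('payload ' . $i, 'bench_' . $i);
        $cache->load('bench_' . $i);
    }
    printf("%s: %.4f s\n", $name, microtime(true) - $start);
}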

One other disadvantage of Zend_Cache_Backend_File is that if you have a lot of cache files it could take your OS a long time to load a single one because it has to open and scan the entire cache directory each time. So say you have 10,000 cache files, try doing an ls shell command on the cache dir to see how long it takes to read in all the files and print the list. This same lag will translate to your app every time the cache needs to be accessed.
You can use the hashed_directory_level option to mitigate this issue a bit, but it only nests up to two directories deep, which may not be enough if you have a lot of cache files. I ran into this problem on a project, causing performance to actually degrade over time as the cache got bigger and bigger. We couldn't switch to Zend_Cache_Backend_Memcached because we needed tag functionality (not supported by Memcached). Switching to Zend_Cache_Backend_Sqlite is a good option to solve this performance degradation problem.
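For reference, a hedged sketch of the hashed_directory_level option on the File backend (the directory, umask and lifetime values are assumptions; ZF1 only supports nesting up to level 2):
<?php
// Sketch: spread cache files over two levels of subdirectories so no single
// directory grows too large.
require_once 'Zend/Cache.php';
$cache = Zend_Cache::factory(
    'Core',
    'File',
    array('lifetime' => 3600, 'automatic_serialization' => true),
    array(
        'cache_dir'              => '/tmp/cache/', // assumed cache directory
        'hashed_directory_level' => 2,             // maximum supported depth
        'hashed_directory_umask' => 0700,          // permissions for the created subdirectories
    )
);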

Related

Filesystem performance issue with lots of sub-directories for attachments

All the attachments from issues are put into a #hashed directory. Each attachment creates its own hashed directory, so over time a lot of sub-directories can accumulate in a single directory. A project with 100K issues averaging 10 images each can lead to 1 million of them, which can become a performance bottleneck. What does GitLab suggest self-hosting users do about this? Thanks
Answer here https://forum.gitlab.com/t/filesystem-performance-issue-with-lots-of-sub-directories-for-attachments/76004
Basically, quoting #gitlab-greg from the GitLab forum:
Usually bottlenecks are caused by excessive read/write operations where input/output is the limiting factor. Attachments usually take up minimal storage space, so it’s actually much easier and more common for someone to hit max I/O by downloading/pushing a lot of Git Repo, LFS, or (Package|Container) Registry data in parallel. Alternatively, if storage space is filled up (90%+ full), you might see some performance issues.
I’ve not heard of any situations where the existence of high number of image uploads or a large amount of sub-directories in #hashed directory has caused a performance bottleneck in itself. GitLab uses a database to keep track of where issue attachments are stored, so it’s not like it’s crawling or looking through every directory in #hashed every time an upload or image is requested.
I’ve worked with GitLab admins who have (tens of) thousands of users and I’ve never seen subdirectories for attachments result in a performance bottleneck. If you find it does cause a bottleneck, please create an issue.

Which caching will be faster, data as files in folders and sub folders or data as database content?

After getting an answer from #Andrew Regan, I have edited my question and its explanation.
I want to serve HTML data to millions of users, and I've come to know that this is done by caching.
I knew of HTML as files, and now I've also read that HTML pages can be stored in databases and served from there.
So my question is: which of the following kinds of caching will be quicker?
- Caching of HTML files in different folders and sub folders
or
- Caching of HTML data in database.
Even if this experiment is done on only a single file / table record, which method would be faster? (No doubt the result for a single file or record will come out in nanoseconds; still, which caching happens faster? E.g. one procedure might take 0.000000001 seconds and the other 0.000000002 seconds.)
You haven't told us anything about your application's architecture, or your expected traffic, or whether you've considered existing frameworks for solving your problem. So this will be a very high-level view.
The answer is caches.
With static content you shouldn't need to expose the end user to the performance of either the filesystem or the database. You shouldn't want to anyway.
If you're simply serving up fixed, unchanging, static content, on a single server, the most effective option is to simply read the whole lot into a cache (ideally held in RAM, not disk) on startup and proceed from there without any additional loads or fetches. (Even better, you'd use a proven cache outside of your network.)
That should be extremely fast for the end-user. As far as performance on the machine is concerned, it shouldn't make a big difference how you seed the cache, whether from the filesystem or the database, though disk is perhaps easier to work with.
You can load data lazily into the cache if you prefer, or if you have changing content. It still won't really matter if you load it from disk or from a DB provided you cache aggressively enough. Set as long a TTL (lifetime) as you can to avoid unnecessary reloads.
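As a minimal sketch of the lazy, TTL-based caching described above, assuming PHP with the APCu extension (the key prefix, TTL and the renderPage() helper are hypothetical):
<?php
// Serve a page from an in-memory cache; fall back to the slow path on a miss.
function renderPage($id)
{
    // Hypothetical slow path: read the HTML from disk or a database.
    return file_get_contents('/var/www/pages/' . (int) $id . '.html');
}

function getPage($id)
{
    $key  = 'page_' . (int) $id;
    $html = apcu_fetch($key, $hit);
    if (!$hit) {
        $html = renderPage($id);
        apcu_store($key, $html, 86400); // assumed TTL: as long as you can afford
    }
    return $html;
}

echo getPage(42);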

Will log files reduce the website speed in CakePHP?

Sorry for this basic question
I log ($this->log()) a lot of things from all controllers on my live site. My tmp/logs folder contains around 1 GB of log files. Will this reduce or slow down the site speed? We remove the log files from the live server once a week.
Yes, it will affect performance.
If you have large log files, the process has to seek to the end of the log file before writing to it; although this is not normally noticeable, it can degrade performance with large files.
It's for this reason that tools like logrotate exist, and why it's a good idea to get into the habit of deleting/archiving your log files from time to time, or regularly if they build up quickly.
The impact of writing logs is pretty negligible (although I don't know how much you are logging). Besides, try to limit logging to debugging purposes, and remove it when you are done.
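One way to act on that is to guard verbose logging with the debug setting; a hedged snippet meant to sit inside a CakePHP 2-style controller action ($data and $orderId are hypothetical, and the exact log types depend on your configuration):
// Only write verbose logs when debug mode is on.
if (Configure::read('debug') > 0) {
    $this->log('Debug info: ' . json_encode($data), 'debug');   // assumed to go to tmp/logs/debug.log
}
// Keep genuinely important events in the error log even in production.
$this->log('Payment gateway timeout for order ' . $orderId, 'error');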

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes would be required if we switched to Cassandra? Would Cassandra be a better option than MySQL?
Possibility of Hadoop? Not even sure.
Any other things, which need to be thought of?
I found this post helpful, and this blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
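For instance, a PHP script serving a product image could emit cache headers along these lines (the path, content type and one-day max-age are assumptions; static files served directly by the webserver would get the same effect from its own configuration):
<?php
// Let browsers and intermediate proxies cache this response for a day.
$imagePath = '/var/www/images/product.jpg'; // hypothetical product image
$maxAge    = 86400;                         // assumed lifetime: one day
header('Content-Type: image/jpeg');
header('Cache-Control: public, max-age=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime($imagePath)) . ' GMT');
readfile($imagePath);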
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First, measure how many requests per second your most-visited pages can serve. For a well-written PHP site on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely changed data in memcached/shared memory (a sketch follows at the end of this answer), and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second, if you are still on Apache2, you should switch to lighttpd or nginx+Apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or CDN. Make sure it is served with "Expires" headers of at least 24 hours.
4) Only after all of this should you start thinking about going to EC2/Hadoop and building multiple load-balanced servers (nginx will also help you there).
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single, more powerful server (8-16 cores, lots of RAM for caching & the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware & support expenses).
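A minimal sketch of the memcached caching suggested in step 1, using PHP's Memcached extension (the server address, key, TTL and the loadCategoryTreeFromDb() helper are assumptions):
<?php
// Keep rarely changed data (e.g. the category tree) in memcached so most
// requests never hit the database for it.
function loadCategoryTreeFromDb()
{
    // Hypothetical slow query; replace with your real data access code.
    return array('electronics' => array(), 'books' => array());
}

$memcached = new Memcached();
$memcached->addServer('127.0.0.1', 11211);               // assumed memcached host/port

$categories = $memcached->get('category_tree');
if ($categories === false) {                             // cache miss (or a stored false)
    $categories = loadCategoryTreeFromDb();
    $memcached->set('category_tree', $categories, 600);  // assumed TTL: 10 minutes
}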
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing where your biggest resource usage is matters when evaluating any solution. Stick to the solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth taking a look at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first whether that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database many times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues; if you do, find out where they are happening. Optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you might get big improvements from caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same information all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't hit the same issues elsewhere in a very few months, and enough to find out where your biggest gains are and whether you will get enough value from any of the scaling options. This will also let you come back and ask questions about specific problems, and how those scaling options relate to them.
You should prepare by choosing a flexible framework and accept that things are going to change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze what are the slowest pages.
You can move to cloud, but EC2 is not the best performing one. Again, be sure there's no other optimization you can do.
The database might need to be redesigned, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

File writes per second

I want to log visits to my website, which has a high visit rate, to a file. How many writes to a log file can I perform per second?
If you can't use Analytics, why wouldn't you use your webserver's existing logging system? If you are using a real webserver, it almost certainly has a logging mechanism that is already optimized for maximum throughput.
Your question is impossible to answer in all other respects. The number of possible writes is governed by hardware, operating system and contention from other running software.
Don't do that, use Google Analytics instead. You'd end up running into many problems trying to open files, write to them, close them, so on and so forth. Problems would arise when you overwrite data that hasn't yet been committed, etc.
If you need your own local solution (within a private network, etc) you can look into an option like AWStats which operates off of crawling through your log files.
Or just analyze the Apache access log files. For example with AWStats.
File writes are not expensive until you actually flush the data to disk. Usually your operating system will cache things aggressively so you can have very good write performance if you don't try to fsync() your data manually (but of course you might lose the latest log entries if there's a crash).
Another problem however is that file I/O is not necessarily thread-safe, and writing to the same file from multiple threads or processes (which will probably happen if we're talking about a Web app) might produce the wrong results: missing or duplicate or intermingled log lines, for example.
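If you do roll your own, a hedged sketch of an append that avoids intermingled lines (the log path and line format are assumptions):
<?php
// Append one line per visit; LOCK_EX stops concurrent requests from interleaving
// partial lines, and the OS buffers the actual disk writes.
$logFile = '/var/log/myapp/visits.log';   // assumed path, must be writable
$line = sprintf(
    "%s %s %s\n",
    date('c'),
    isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-',
    isset($_SERVER['REQUEST_URI']) ? $_SERVER['REQUEST_URI'] : '-'
);
file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);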
If your hard disk drive can write 40 MB/s and your log file lines are approx. 300 bytes in length, I'd assume that you can write 140000 HTTP requests per second to your logfile if you keep it open.
Anyway, you should not do this yourself, since most web servers already write logfiles and know very well how to do it, how to roll the files when a maximum size is reached, and how to format the log lines according to well-known patterns.
File access is very expensive, especially when doing writes. I would recommend saving them to RAM (using whatever cache method suits you best) and periodically writing the results to disk.
You could also use a database for this. Something like:
UPDATE stats SET hits = hits + 1
Try out a couple different solutions, benchmark the performance, and implement whichever works fast enough with minimal resource usage.
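One way to sketch the "count in RAM, flush periodically" idea with APCu and the UPDATE above (the key name, flush threshold and database credentials are assumptions):
<?php
// Count hits in shared memory and push them to the database only occasionally,
// so the vast majority of requests never touch the disk or the DB.
apcu_add('hit_counter', 0);               // create the counter if it doesn't exist
$hits = apcu_inc('hit_counter');          // atomic in-memory increment

if ($hits !== false && $hits >= 100) {    // assumed flush threshold
    // Reset the counter only if no other request flushed it in the meantime.
    if (apcu_cas('hit_counter', $hits, 0)) {
        $pdo  = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass'); // assumed DSN
        $stmt = $pdo->prepare('UPDATE stats SET hits = hits + ?');
        $stmt->execute(array($hits));
    }
}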
If using Apache, I'd recommend using the rotatelogs utility supplied as a part of the standard kit.
We use this to allow rotating the server logs out on a daily basis without having to stop and start the server. N.B. Use the new "||" syntax when declaring the log directive.
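For reference, a piped-log directive using rotatelogs could look like this (the paths and the 86400-second rotation interval are assumptions, not the configuration of the site described here):
# httpd.conf: rotate the access log daily without restarting Apache.
# The leading "||" spawns rotatelogs directly instead of going through a shell.
CustomLog "||/usr/sbin/rotatelogs /var/log/apache2/access.%Y-%m-%d.log 86400" combined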
The site I'm involved with is one of the largest on the Internet with hit rates peaking in the millions per second for extended periods of time.
Edit: I forgot to say that the site uses standard Apache logging directives and we have not needed to customise the Apache logging code at all.
Edit: BTW Unless you really need it, don't log bytes served as this causes all sorts of issues around the midnight boundary.
Let Apache do it; do the analysis work on the back-end.
