How to reduce CPU usage of textdb?

Normally my website, which is based on textdb, uses very little CPU.
But I notice that when my website gets visitors or traffic, CPU usage increases very quickly, and because of this my hosting account was temporarily suspended.
So I would like to find a solution to optimize it.
Thanks

In most cases it is better to have actual .html or .php pages linked together rather than a blog driven off a database. You can use PHP includes and just create unique pages to optimize your site.

Related

What would be a good pipeline to create a scalable, distributed web crawler and scraper?

I would like to build a semi-general crawler and scraper for pharmacy product webpages.
I know that most websites are not alike, but most of the URLs I have in a list follow one specific type of logic:
For example, by using Microdata, JSON-ld, etc. I can already scrape a certain group of webpages.
By using XPath stored in configuration files I can crawl and scrape some other websites.
And other methods work well for the rest of the websites; if I can extract the information I need from 80% of the data, I will be more than happy with the result.
In essence, I am worried about building a good pipeline to address issues related to monitoring (to handle webpages that suddenly change their structure), scalability and performance.
I have thought of the following pipeline (not taking into account storage):
Create 2 main spiders. One crawls the websites given their domains; it gets all the URLs inside a webpage (obeying robots.txt, of course) and puts them into a queue system that stores the URLs that are scrape-ready. Then the second spider picks up the last URL in the queue and extracts the data using metadata, XPath or any other method. The result is put into another queue, which is eventually handled by a module that writes all the data in that queue to a database (which I still do not know whether it should be SQL or NoSQL).
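Roughly, I picture something like this minimal Python sketch (the in-process queue.Queue objects, the regex title grab and the print() "database write" are just stand-ins for whatever real queue system, extractors and storage I end up choosing):

    import queue
    import re
    import threading
    import urllib.request

    url_queue = queue.Queue()    # URLs that are scrape-ready (filled by the crawling spider)
    item_queue = queue.Queue()   # extracted records waiting to be written to the database

    def extractor_worker():
        # Second spider: take a URL, fetch it, extract something, hand it to storage.
        while True:
            url = url_queue.get()
            if url is None:                       # sentinel: shut down
                url_queue.task_done()
                break
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
                title = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
                item_queue.put({"url": url, "title": title.group(1).strip() if title else None})
            except Exception as exc:              # real code would log and retry
                item_queue.put({"url": url, "error": str(exc)})
            url_queue.task_done()

    def storage_worker():
        # Drains the item queue; print() stands in for the actual database write.
        while True:
            item = item_queue.get()
            if item is None:
                item_queue.task_done()
                break
            print("store:", item)
            item_queue.task_done()

    threading.Thread(target=extractor_worker, daemon=True).start()
    threading.Thread(target=storage_worker, daemon=True).start()

    # The crawling spider would feed url_queue; here it is seeded by hand.
    url_queue.put("https://example.com/")
    url_queue.put(None)
    url_queue.join()
    item_queue.put(None)
    item_queue.join()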
The advantage of this system is that, by putting queues in between the main extraction and storage processes, parallelization and scalability become feasible.
Is there anything flawed in my logic? What are the things that I am missing?
Thank you so much.
First off, that approach will work; my team and I have built numerous crawlers based on that structure, and they are efficient.
That said, if you're looking to scale, I would recommend a slightly different approach. For my own large-scale crawler, I have a 3-program approach.
There is one scheduler program which decides which URLs to download.
There is a program to perform the actual downloading.
There is a program to extract the information from the downloaded pages and feed new links back to the scheduler.
The other major recommendation is that if you're using cURL at all, you'll want to use the cURL multi interface and a FIFO queue to handle sending the data from the scheduler to the downloader.
The advantage of this approach is that it separates out the processing from the downloading. This allows you to scale up your crawler by adding new servers and operating in parallel.
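To illustrate the shape of the downloader stage, here is a rough sketch in Python rather than PHP (a standard-library thread pool plays the role that the cURL multi interface plays in PHP; the URLs and worker count are made-up examples):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def fetch(url):
        # Download one page; the extraction program picks the result up afterwards.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return url, resp.status, resp.read()
        except Exception as exc:          # real code would record failures for re-scheduling
            return url, None, str(exc)

    def download_batch(urls, workers=8):
        # Fetch a batch of scheduled URLs in parallel and return the raw results.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fetch, urls))

    if __name__ == "__main__":
        batch = ["https://example.com/", "https://example.org/"]   # stands in for the FIFO queue
        for url, status, body in download_batch(batch):
            print(url, status, len(body) if isinstance(body, bytes) else body)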
At Potent Pages, this is the architecture we use for our site spider that handles downloading hundreds of sites simultaneously. We use MySQL for storing the data (links, etc.), but as you scale up you'll need to do a lot of optimization. Also, phpMyAdmin starts to break down if you have a lot of databases, but having one database per site really speeds up the parsing process, since you don't have to go through millions of rows of data.

AngularJS Performance vs Page Size

My site is ~500 KB gzipped, including JS, CSS and images. It is built on AngularJS. A lot of people in my company are complaining about the site being slow on lower bandwidths. There are a few questions I would like to get answered:
Is 500 KB gzipped too high for lower bandwidths? People claim it takes 20 seconds to load on their machines, which I believe is an exaggeration. Is it really due to AngularJS and its evaluation time?
How does the size of the app matter on lower bandwidths? If my site is 500 KB and I reduce it to 150 KB by building a custom framework, would it really help on lower bandwidths? If so, by how much?
It's all subjective, and the definition of "low bandwidth" is rather wide. However, using https://www.download-time.com/ you can get a rough idea of how long it would take to download 500 KB on different bandwidths (as a rough worked example, 500 KB is about 4,000 kilobits, so roughly 8 seconds at 512 Kbps and under a second at 5 Mbps).
So, on any connection above 512 Kbps (the minimum ADSL speed; most are now better than 5 Mbps, and 3G mobile is around the same mark), it's unlikely that the file size is the problem.
If "low bandwidth" also implies "limited hardware" (RAM, CPU), it's possible the performance problem lies in unzipping and processing your application. Angular is pretty responsive, but low-end hardware may struggle.
The above root causes would justify rewriting the application using your own custom framework.
The most likely problem, however, is any assets/resources/templates your angular app requires on initialization - images, JSON files etc. This is hard to figure out without specific details - each app is different. The good news is that you shouldn't need to rewrite your application - you should be able to use your existing application and tweak it. I'm assuming the 500Kb application can't be significantly reduced in size without a rewrite, and that the speed problem is down to loading additional assets as part of start-up.
I'd use Google Chrome's developer tools to see what's going on. The "Performance" tab has an option to simulate various types of network conditions. The "Network" tab allows you to see which assets are loaded and how long they take. I'd look at which assets take time and see which of those can be lazy loaded. For instance, if the application loads a very large image file on start-up, perhaps that could be lazy loaded, allowing the application to appear responsive to the end user more quickly.
A common way to improve perceived performance is to use lazy loading.
To decrease your load time, sort out your caching and use a download-time calculator to estimate how long your files take to fetch; you can use https://downloadtime.org for reference. If you have any issues let me know.
As angular.js itself has a gzipped size of about 57 KB, it seems much more is loaded with this initial page call, roughly 10 times the size of angular.js.
To decrease the page load time, try to split your JavaScript into chunks that contain only the functionality needed for, e.g., the index page.
For example, when you're using Webpack the recommended default maximum asset size is around 244 KB.

What are the ways to decrease GAE CPU% usage?

What did you do to make sure the CPU% is low?
Any sample code to look at?
I ask because every datastore read/query seems to push the CPU% beyond 100%, and I get the yellow and red highlights in my dashboard. I have read elsewhere that this is normal, but surely something can be done about it.
Use appstats to get more detail on any long-running tasks. It does a good job of breaking down exactly how the CPU time is spent, and lets you drill down into individual calls and view the stack to narrow down which command is running long.
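If you haven't enabled it yet, on the legacy Python runtime appstats is wired up with a small piece of middleware in appengine_config.py, roughly like this (recorded requests then show up in the appstats console):

    # appengine_config.py -- records RPC timings for the appstats console.
    from google.appengine.ext.appstats import recording

    def webapp_add_wsgi_middleware(app):
        app = recording.appstats_wsgi_middleware(app)
        return app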
URLFetch and database calls tend to be expensive. As Sam suggests, both can be memcached for very significant savings.
You profile your code and improve its efficiency.
Datastore operations are expensive. Try reducing their usage with the help of memcache
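A rough sketch of that pattern on the Python runtime, with a made-up Product model and a 10-minute expiry, looks something like this:

    from google.appengine.api import memcache
    from google.appengine.ext import db

    class Product(db.Model):          # made-up example model
        name = db.StringProperty()
        price = db.FloatProperty()

    def get_product(product_id):
        # Check memcache first; only fall back to the (expensive) datastore on a miss.
        cache_key = "product:%s" % product_id
        product = memcache.get(cache_key)
        if product is None:
            product = Product.get_by_id(product_id)
            if product is not None:
                memcache.set(cache_key, product, time=600)   # cache for 10 minutes
        return product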
Is your app restarting a lot?
I notice even a very minimal app will take over 1sec to load when it has been inactive for a while -- which brings up a warning marker in the log.
For pages you can cache, you can set a Cache-Control header in your request handler.
self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
In many cases you also can use a cron job to regularly update your cache.
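Putting those two ideas together, a handler might look roughly like this (webapp2 assumed; render_page() and the /tasks/refresh path are made-up placeholders, with a cron.yaml entry pointed at the latter):

    import webapp2
    from google.appengine.api import memcache

    def render_page():
        # Placeholder for whatever expensive rendering/datastore work builds the page.
        return "<html><body>expensive page goes here</body></html>"

    class CachedPage(webapp2.RequestHandler):
        def get(self):
            self.response.headers["Cache-Control"] = "public,max-age=%s" % 86400
            body = memcache.get("cached_page")
            if body is None:
                body = render_page()
                memcache.set("cached_page", body, time=86400)
            self.response.write(body)

    class RefreshPage(webapp2.RequestHandler):
        def get(self):                            # a cron job hits this URL periodically
            memcache.set("cached_page", render_page(), time=86400)

    app = webapp2.WSGIApplication([
        ("/", CachedPage),
        ("/tasks/refresh", RefreshPage),
    ])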
I've written a simple library to reduce datastore operations by using local instance and memcache as storage layers along with datastore. It also supports cached GQL results. I managed to cut my apps' CPU usage by 50% at least. You can give it a try if you're not using any sensitive data.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes would be required if we switched to Cassandra? Would Cassandra be a better option than MySQL?
Possibility of Hadoop, not even sure?
Any other things, which need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing: measure how many requests per second your most-visited pages can serve. For well-written PHP sites on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely-changed data in memcached or shared memory, and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second: if you are still on Apache2, you have to switch to lighttpd or nginx+apache2. Personally, I like the second option.
3) Then move all your static data to a separate server or a CDN. Make sure it is served with "Expires" headers of at least 24 hours.
4) Only after all these things should you start thinking about going to EC2/Hadoop, building multiple servers and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10,000,000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching and the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware and support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what your biggest resource consumer is matters when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- higher-than-needed bandwidth use per user is something you want to address regardless of moving to EC2. It will cost you money either way, so it's worth looking at things like this: http://developer.yahoo.com/yslow/
- don't invest in changing databases if that's a non-issue. Find out first whether that's really the problem; even if you are having issues with the database, it might be a code issue, i.e. hitting the database many times per request.
- unless we are talking about very big numbers, you shouldn't have high CPU usage issues; if you do, find out where they are happening. Optimization is worth it where specific code has a high impact on your overall resource usage.
- after making sure the above is reasonable, you might get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same information all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere in a few months. Also enough to find out where your biggest gains are, and whether you will get enough value from any of the scaling options. This will also allow you to come back and ask questions about specific problems, and how these scaling options relate to those.
You should prepare by choosing a flexible framework and be sure that things are going to change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, be sure there's no other optimization you can do first.
The database might need to be redesigned, but I doubt all of it does. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Zend_Cache_Backend_Sqlite vs Zend_Cache_Backend_File

Currently I'm using Zend_Cache_Backend_File for caching in my project (especially responses from external web services). I was wondering if I could find some benefit in migrating to Zend_Cache_Backend_Sqlite.
Possible advantages are:
The file system stays well-ordered (only one file in the cache folder).
Removing expired entries should be quicker (my assumption, since Zend wouldn't need to scan the internal metadata of each cache file for its expiry date).
Possible disadvantages:
Finding the record to read may be slower in terms of speed (with files, Zend just checks whether a file exists based on the filename, which should be a bit quicker).
I've tried to search a bit on the internet, but it seems there is not a lot of discussion about the matter.
What do you think about it?
Thanks in advance.
I'd say it depends on your application.
The switch shouldn't be hard. Just test both cases and see which is best for you. No benchmark is objective except your own.
Measuring just performance, Zend_Cache_Backend_Static is the fastest one.
One other disadvantage of Zend_Cache_Backend_File is that if you have a lot of cache files it could take your OS a long time to load a single one because it has to open and scan the entire cache directory each time. So say you have 10,000 cache files, try doing an ls shell command on the cache dir to see how long it takes to read in all the files and print the list. This same lag will translate to your app every time the cache needs to be accessed.
You can use the hashed_directory_level option to mitigate this issue a bit, but it only nests up to two directories deep, which may not be enough if you have a lot of cache files. I ran into this problem on a project, causing performance to actually degrade over time as the cache got bigger and bigger. We couldn't switch to Zend_Cache_Backend_Memcached because we needed tag functionality (not supported by Memcached). Switching to Zend_Cache_Backend_Sqlite is a good option to solve this performance degradation problem.
