Should I set max limits on as many entities as possible in my webapp? - database

I am working on a CRUD webapp where users create and manage their data assets.
Since I don't want to be bombarded with tons of data right from the start, I think it would be reasonable to set limits in the database where possible.
For example, limit the number of created items to A = 10, B = 20, C = 50. Then, if a user reaches a limit, I would look at their account and decide whether to raise it, provided that doesn't break the code or hurt performance.
Is it good practice to set such limits for performance/maintenance reasons (rather than business reasons), or should I treat the data entities as unlimited and try to make the app perform well with lots of data from the start?

What you are suggesting amounts to testing your application's performance on real users, which is a bad idea. It also creates inconvenience for users by limiting them when there is no reason for it (at least from their point of view), which decreases satisfaction.
Instead, you should test performance before you release. That will give you an understanding of your application's and infrastructure's limits under high load, and it will help you find and eliminate bottlenecks in your code. You can perform such testing with tools like JMeter and many others.
Also, if you are afraid of getting tons of data right at launch, you can release your application as a private beta: just put up a simple form where users can ask for early access and receive an invite. By sending invites you can easily control the growth of your user base and therefore the load on your app.
But you should, of course, create limits where they are genuinely necessary, for example limiting the number of items per page.
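As a minimal illustration of that last point, here is a Python sketch of clamping a client-supplied page size instead of capping how many items a user may create (the constants and parameter names are invented for the example):

```python
# Minimal sketch: clamp the page size a client asks for, rather than
# limiting how many items a user is allowed to create. Names are illustrative.

DEFAULT_PAGE_SIZE = 20
MAX_PAGE_SIZE = 100


def parse_paging_params(raw_page, raw_page_size):
    """Return a sane (page, page_size) pair from untrusted query parameters."""
    try:
        page = max(1, int(raw_page))
    except (TypeError, ValueError):
        page = 1

    try:
        page_size = int(raw_page_size)
    except (TypeError, ValueError):
        page_size = DEFAULT_PAGE_SIZE

    # Clamp so a single request can never ask the database for unbounded data.
    page_size = min(max(1, page_size), MAX_PAGE_SIZE)
    return page, page_size


# Example: ?page=3&page_size=9999 is clamped to (3, 100)
print(parse_paging_params("3", "9999"))
```

The idea is that the limit protects a single request rather than the user's total data, so it never needs per-account exceptions.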

Related

Notion API Pagination Random Database Entry

I'm trying to retrieve a random entry from a database using the Notion API. There is a page limit on how many entries you can retrieve at once, so pagination is utilized to sift through the pages 100 entries at a time. Since there is no database attribute telling you how long the database is, you have to go through the pages in order until reaching the end in order to choose a random entry. This is fine for small databases, but I have a cron job going that regularly chooses a random entry from a notion database with thousands of entries. Additionally, if I make too many calls simultaneously I risk being rate limited pretty often. Is there a better way to go about choosing a random value from a database that uses pagination? Thanks!
I don't think there is a better way to do it right now (sadly). If your entries don't change often, consider caching the pages; that saves you a lot of execution time in your cron job. For the rate limit, if you use Node.js, you can build a rate-limited queue (3 requests/second) pretty easily with something like bull.
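For what it's worth, here is a rough Python sketch of the cache-then-pick-randomly idea. The answer above talks about Node.js; this version uses the public Notion REST query endpoint and a local JSON file as the cache, so treat the endpoint and field names as something to verify against the current Notion docs:

```python
# Rough sketch: walk the paginated Notion database once, cache the page IDs,
# then pick a random one locally instead of re-paginating on every cron run.
import json
import os
import random
import time

import requests

NOTION_TOKEN = os.environ["NOTION_TOKEN"]          # assumption: token in env
DATABASE_ID = os.environ["NOTION_DATABASE_ID"]     # assumption: database id in env
CACHE_FILE = "notion_page_ids.json"

HEADERS = {
    "Authorization": f"Bearer {NOTION_TOKEN}",
    "Notion-Version": "2022-06-28",
}


def fetch_all_page_ids():
    """Collect every page ID, staying under the ~3 requests/second rate limit."""
    url = f"https://api.notion.com/v1/databases/{DATABASE_ID}/query"
    ids, cursor = [], None
    while True:
        body = {"page_size": 100}
        if cursor:
            body["start_cursor"] = cursor
        resp = requests.post(url, headers=HEADERS, json=body, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        ids.extend(page["id"] for page in data["results"])
        if not data.get("has_more"):
            return ids
        cursor = data["next_cursor"]
        time.sleep(0.4)  # comfortably below 3 requests/second


def random_page_id(refresh=False):
    """Serve from the local cache unless a refresh is requested."""
    if refresh or not os.path.exists(CACHE_FILE):
        ids = fetch_all_page_ids()
        with open(CACHE_FILE, "w") as fh:
            json.dump(ids, fh)
    else:
        with open(CACHE_FILE) as fh:
            ids = json.load(fh)
    return random.choice(ids)
```

The cron job can then refresh the cache only occasionally (or whenever the database changes) and pick randomly from the cached IDs the rest of the time.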

Memory usage in WPF application - how to track and manage? (BEGINNER SCOPE)

I have a small-scale WPF application using VB.net as the code-behind, and I want to add certain features, but I'm concerned about performance. I would really appreciate any responses, especially ones that include beginner-friendly articles on this, but please help me so I can be at ease...
1) My app interacts with a third party database to display "realtime" data to the user. My proposed method is to create a background worker to query a database every 30 seconds and display the data. I query about 2,000 records all long integer type, store them in a dataset, and then use LINQ to create subsets of observable collections which the WPF controls are bound to.
Is this too intensive? How much memory am I using for 2,000 records of long int? Is having the background worker query every 30 seconds too taxing? Will it crash eventually? Will it interfere with the user's other daily work (Excel, email, etc.)?
2) If an application is constantly reading/writing from text files can that somehow be a detriment to the user if they are doing day to day work? I want the app to read/write text files, but I don't want it to somehow interfere with something else the person is doing since this app will be more of a "run in the background check it when I need to" app.
3) Is there a way to quantify how taxing a certain block of code, variable storage, or data storage will be to the end user? What is acceptable?
4) I have several List(Of T) collections that I use as "global" lists, which I can hit from any window in my application to display data. Is there a way to quantify how much memory these lists take up? The lists range from lists of integers to lists of objects with dozens of properties. Can I somehow quantify how taxing this is on the app or the end user?
Thank you for any help and I will continue to search for articles to answer my questions
If you really want/need to get into the details of an application's memory usage, you should use a memory profiler:
http://memprofiler.com/ (commercial)
http://www.red-gate.com/products/dotnet-development/ants-memory-profiler/ (commercial)
http://www.jetbrains.com/profiler/ (commercial)
http://www.microsoft.com/download/en/details.aspx?id=16273 (free)
http://www.scitech.se/blog/ (commercial)
Your other questions are hard to answer since all the relevant aspects are unknown:
what DB is used?
how powerful is the machine running the DB server?
how many users?
etc.
A performance profiler can help with some of this. The memory profilers mentioned above (especially those from Red Gate, JetBrains, etc.) are usually available packaged together with a performance profiler...
I will just try a few. A byte integer uses one byte of memory and an Int32 uses 4 bytes, so 2,000 Int32 values would use about 8 KB; a Long (Int64) uses 8 bytes, so double that for the type in your question.
If you have a query you need to run a lot and it takes 5-10 seconds, look closely at that query and add any missing indexes. If this is dynamic data, then WITH (NOLOCK) may be OK and faster, with less (no) locking.
If the query returns the same data for all users, then I hope you don't have every user running the same query. You should have a two-tier application where the server runs the query every x seconds and sends the answer to the multiple clients that request it.
As for the size of an object, just add it up: a byte is a byte. You can also run your app in debug and get a feel for which statements are fast and slow.
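The arithmetic from that answer, spelled out (raw values only; DataSet and ObservableCollection overhead comes on top of this):

```python
# Back-of-the-envelope memory estimate for 2,000 numeric records.
records = 2000

int32_bytes = records * 4   # 8,000 bytes, roughly 8 KB
int64_bytes = records * 8   # 16,000 bytes, roughly 16 KB (the "long" type in the question)

print(f"2,000 Int32 values: {int32_bytes / 1024:.1f} KB")
print(f"2,000 Int64 values: {int64_bytes / 1024:.1f} KB")
```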

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen a huge traffic surge.
What would be the steps (please mention them in order) to scale it:
Should I consider moving onto Amazon EC2 or similar? What could be the potential problems in switching servers?
Do we need to redesign the database? I read that Facebook switched from MySQL to Cassandra. What kind of code changes would be required if we switched to Cassandra? Would Cassandra be a better option than MySQL?
Should we consider Hadoop? I'm not even sure.
Any other things, which need to be thought of?
I found this post helpful, and this blog has nice articles as well. What I want to know is the list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
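To make the header strategy concrete, here is a small sketch using Flask purely for illustration (the site in the question is LAMP/PHP, so only the header values matter here, not the framework; the routes and file paths are invented):

```python
# Sketch: long-lived caching for stable assets, no caching for dynamic pages.
from flask import Flask, make_response, send_file

app = Flask(__name__)


@app.route("/product-image/<image_id>.jpg")
def product_image(image_id):
    # Stable content: let browsers and intermediate caches keep it for a day.
    resp = make_response(send_file(f"images/{image_id}.jpg"))
    resp.headers["Cache-Control"] = "public, max-age=86400"
    return resp


@app.route("/cart")
def cart():
    # Truly dynamic content: force a fresh copy on every request.
    resp = make_response("cart contents rendered here")
    resp.headers["Cache-Control"] = "no-cache, no-store, must-revalidate"
    return resp
```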
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottleneck, make sure your database has the right indexes so that it gets fast access times during lookups and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries, and make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottleneck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers behind a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First, measure how many requests per second your most-visited pages can serve (a crude measuring sketch follows after this answer). For well-written PHP sites on average hardware this should be in the 200-400 requests per second range. If you are not there, you have to optimize the code by reducing the number of database requests, caching rarely-changed data in memcached/shared memory, and using a PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second, if you are still on plain Apache2, switch to lighttpd or nginx+Apache2. Personally, I like the second option.
3) Then move all your static assets to a separate server or a CDN. Make sure they are served with "Expires" headers of at least 24 hours.
4) Only after all these things should you start thinking about going to EC2/Hadoop, building multiple servers, and balancing the load (nginx would also help you there).
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for a single more powerful server (8-16 cores, lots of RAM for caching & the database).
With step 4 and multiple servers you are on your way to 0.1-1 billion hits per day (but with significantly larger hardware & support expenses).
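The crude measuring sketch mentioned in step 1: a quick-and-dirty requests-per-second estimate in Python. A real load-testing tool (ApacheBench, JMeter, etc.) gives more faithful numbers; this only produces a ballpark figure, and the URL is a placeholder:

```python
# Crude requests-per-second estimate for a single URL.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost/index.php"   # placeholder: your most-visited page
TOTAL_REQUESTS = 500
CONCURRENCY = 20


def hit(_):
    return requests.get(URL, timeout=10).status_code


start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    statuses = list(pool.map(hit, range(TOTAL_REQUESTS)))
elapsed = time.time() - start

ok = sum(1 for status in statuses if status == 200)
print(f"{ok}/{TOTAL_REQUESTS} OK, {TOTAL_REQUESTS / elapsed:.1f} requests/second")
```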
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what your biggest resource consumer is matters when evaluating any solution. Stick to the solutions that will give you the biggest improvement.
Consider:
- Higher-than-needed bandwidth use per user is something you want to address regardless of whether you move to EC2. It will cost you money either way, so it's worth taking a look at things like this: http://developer.yahoo.com/yslow/
- Don't invest in changing databases if that's a non-issue. Find out first whether the database is really the problem; even if you are having issues with it, the cause might be a code issue, i.e. hitting the database many times per request.
- Unless we are talking about very big numbers, you shouldn't have high CPU usage issues. If you do, find out where they are happening; optimization is worth it where specific code has a high impact on your overall resource usage.
- After making sure the above is reasonable, you might get big improvements with caching: in bandwidth (making sure browsers/proxies can play their part in caching) and in local resource usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just far enough to make sure you won't hit the same issues elsewhere in a few months, and enough to find out where your biggest gains are and whether you would get enough value from any of the scaling options. This will also let you come back and ask questions about specific problems, and how those scaling options relate to them.
You should prepare by choosing a flexible framework and expect that things will change along the way. In some situations it's difficult to predict your users' behavior.
If you have seen an explosion of traffic recently, analyze which pages are the slowest.
You can move to the cloud, but EC2 is not the best-performing option. Again, make sure there is no other optimization you can do first.
The database might need to be redesigned, but I doubt all of it will. Again, look at the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Performance Testing - How much data should I create

I'm very new to performance engineering, so I have a very basic question.
I'm working on a client-server system that uses a SQL Server backend. The application is a huge tax-related application that requires performance testing at peak load, meaning there should be something like 10 million tax returns in the system when we run scenarios for creating and submitting tax returns. A proportional number of users will also need to be created.
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
When I bring up creating a smaller dataset and extrapolating the performance numbers, the common answer I hear is that we need the full 10 million records, because we cannot tell from a smaller data set how the database or the network will behave.
So how does one plan capacity and test performance on a large enterprise application without creating the peak level of data or running the peak number of scenarios?
Thanks.
Personally, I would throw as much data and traffic at it as you can. Forget what traffic you "think you need to handle". And just see how much traffic you CAN handle and go from there. Knowing the limits of your system is more valuable than simply knowing it can handle 10 million records.
Maybe it does handle 10 million, but at 11 million it dies a horrible death. Or maybe it's well written and will scale to 100 million before it dies. There's a very distinct difference between the two, even though both pass the "10 million test".
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
Why do you think so?
Of course you can (and should) test with limited amounts of data, but you also really, really need to test with a realistic load, which means testing with the amount (and type) of data that you will use in production.
This is just a special case of a general rule: for system or integration testing, you need to test in a scenario that is as close as possible to production; ideally you just copy/clone a live production system, data, config and all, and use that for testing. That is actually what we do (if we technically can and the client agrees). We just run a few SQL scripts to randomize personal data in the test data set, to prevent privacy concerns.
There are always issues that crop up because production data is somehow different from what you tested on, and this is the only way to prevent (or at least limit) these problems.
I've planned and implemented reporting and imports, and they invariably break or misbehave the first time they're exposed to real data, because there are always special cases or scaling problems you didn't expect. You want that breakage to happen during development, not in production :-).
In short:
Bite the bullet, and (after having done all the tests with "toy data"), get a realistic dataset to test on. If you don't have the hardware to handle that, then you don't have the right hardware for your tests :-).
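A rough sketch of the "randomize personal data in the cloned data set" step mentioned above. The real scripts in that answer are plain SQL; this shows the same idea in Python, with SQLite as a stand-in for the actual DBMS and invented table and column names:

```python
# Sketch: randomize personal data in a cloned test database.
# Table and column names are made up for the example.
import sqlite3   # stand-in for whatever DBMS the test copy uses
import uuid

conn = sqlite3.connect("test_copy.db")
cur = conn.cursor()

cur.execute("SELECT id FROM customers")
for (customer_id,) in cur.fetchall():
    token = uuid.uuid4().hex[:8]
    conn.execute(
        "UPDATE customers SET name = ?, email = ? WHERE id = ?",
        (f"Customer {token}", f"user_{token}@example.invalid", customer_id),
    )

conn.commit()
conn.close()
```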
I would take a look at Redgate's SQL Data Generator. It does a good job of generating representative data.
Have a peek at "The Art of Application Performance Testing" by Ian Molyneaux (O'Reilly, 2009).
Your test data should ideally be a realistic variety of records, but for a first approximation you could take just a few unique records and duplicate them until you reach the desired size. Then use ApacheBench to roughly approximate the traffic.
To help generate data, look at ruby faker and perl data faker. I have had good luck with them in generating large data sets for testing. The SQL Data Generator from Redgate is good too.
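Along the same lines, a minimal sketch with the Python port of Faker (pip install faker), writing a CSV you can bulk-load afterwards; the fields are just plausible placeholders for a tax-return-like record:

```python
# Sketch: generate synthetic test rows with Faker and write them to CSV.
import csv

from faker import Faker

fake = Faker()
ROWS = 1_000_000   # scale up toward the 10 million target in batches

with open("tax_returns.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["name", "email", "ssn", "filing_year", "refund_amount"])
    for _ in range(ROWS):
        writer.writerow([
            fake.name(),
            fake.email(),
            fake.ssn(),
            fake.random_int(2015, 2023),
            fake.pydecimal(left_digits=5, right_digits=2, positive=True),
        ])
```

Generating files like this and bulk-loading them (for example with BULK INSERT or bcp on SQL Server) is usually far faster than inserting rows one at a time from the test tool.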

displaying # views on a page without hitting database all the time

More and more sites are displaying the number of views (and clicks, like on dzone.com) that certain pages receive. What is the best practice for keeping track of view counts without hitting the database on every page load?
I have a bunch of potential ideas in my head on how to do this, but none of them seem viable.
Thanks,
first time user.
I would try the database approach first - returning the value of an autoincrement counter should be a fairly cheap operation so you might be surprised. Even keeping a table of many items on which to record the hit count should be fairly performant.
But the question was how to avoid hitting the db on every call. I'd suggest loading the table into the webapp and incrementing it there, only writing it back to the db periodically or on webapp shutdown.
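A minimal sketch of that idea in Python (thread-based; flush_to_db is a stand-in for whatever database write your stack uses, and a hard crash still loses whatever hasn't been flushed yet):

```python
# Sketch: count views in the web process, flush to the database periodically.
import threading
from collections import Counter


class ViewCounter:
    def __init__(self, flush_to_db, interval_seconds=60):
        self._counts = Counter()
        self._lock = threading.Lock()
        self._flush_to_db = flush_to_db
        self._interval = interval_seconds
        self._schedule()

    def hit(self, page_id):
        with self._lock:
            self._counts[page_id] += 1

    def _schedule(self):
        timer = threading.Timer(self._interval, self._flush)
        timer.daemon = True
        timer.start()

    def _flush(self):
        with self._lock:
            pending, self._counts = self._counts, Counter()
        if pending:
            self._flush_to_db(pending)   # e.g. UPDATE ... SET views = views + n
        self._schedule()


# Usage: counter = ViewCounter(flush_to_db=save_counts); counter.hit("article-42")
```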
One cheap trick would be to simply cache the value for a few minutes.
The exact number of views doesn't matter much anyway, since on a busy site a whole batch of new views comes in during the time a visitor spends on the page.
One way is to use memcached as a counter. You could modify this rate-limit implementation to act as a general counter instead. The key could be in yyyymmddhhmm format with an expiration of 15 or 30 minutes (depending on what you consider to be concurrent visitors), and then you simply get those keys when displaying the page.
Nice libraries for communicating with the memcached server are available in many languages.
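For example, here is a sketch of the time-bucketed counter in Python with pymemcache; the key format and expiry follow the answer above, so adjust them to your definition of "concurrent visitors":

```python
# Sketch: time-bucketed memcached view counter using pymemcache
# (pip install pymemcache).
from datetime import datetime, timedelta, timezone

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))
BUCKET_TTL = 30 * 60   # seconds; roughly your window for "concurrent" visitors


def _bucket_key(page_id, when):
    return f"views:{page_id}:{when.strftime('%Y%m%d%H%M')}"


def record_view(page_id):
    key = _bucket_key(page_id, datetime.now(timezone.utc))
    # add() is a no-op if the bucket already exists; incr() then bumps it.
    mc.add(key, "0", expire=BUCKET_TTL)
    mc.incr(key, 1)


def concurrent_views(page_id, minutes=30):
    # Sum the per-minute buckets for the last `minutes` minutes.
    now = datetime.now(timezone.utc)
    keys = [_bucket_key(page_id, now - timedelta(minutes=m)) for m in range(minutes)]
    values = mc.get_many(keys)
    return sum(int(v) for v in values.values())
```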
You could set up a flat file that has the number of hits in it. This would have issues scaling, but it could work.
If you don't care about displaying the number of page views yourself, you could use something like Google Analytics or Piwik. Both make their requests after the page has already loaded, so they won't impact load times. There might be a way to make an AJAX request to the analytics server, but I don't know for sure. Piwik is open source, so you can probably hack something together.
If you are using server-side scripting, you could increment the count in a variable. It's likely to get reset when you restart the service, so it's not such a good idea if accuracy is needed.
