We have a poorly designed shopping cart database. All processed objects that will be used on the front site are stored in HttpContext.Current.Cache on Application_Start. By "processed objects" I mean the results of SQL scripts with many joins and WHERE conditions.
I'm looking for the best way to either remove this caching or improve the current caching process. I'm thinking of storing the processed objects in a SQL Server table that is repopulated every midnight, then using Dapper to retrieve data from that table and adding output caching on top.
I hope someone can share a fast, maintainable solution for this problem. :)
Thanks!
What you are describing is really duplicating the data into a second (technically redundant) model that is more suitable for querying. If that is the case, then sure: have fun with that - it isn't exactly uncommon. However, before doing all that, you might want to try indexed views - it could be that this solves most everything without you having to write all the maintenance code.
I would suggest, however, not to "remove caching" - but simply "make the cache expire at some point"; there's an important difference. Hitting the database for the same data on every single request is not a great idea.
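If you want to try the indexed-view route first, here is a minimal sketch against a hypothetical pair of cart tables (your real schema, columns and join conditions will differ, and indexed views come with extra restrictions such as schema binding, inner joins only, and COUNT_BIG(*) when grouping):

```sql
-- Hypothetical tables: dbo.Products and dbo.OrderLines (Quantity assumed NOT NULL).
CREATE VIEW dbo.vProductSummary
WITH SCHEMABINDING
AS
SELECT  p.ProductId,
        p.Name,
        COUNT_BIG(*)     AS LineCount,
        SUM(ol.Quantity) AS TotalQuantity
FROM    dbo.Products p
JOIN    dbo.OrderLines ol ON ol.ProductId = p.ProductId
GROUP BY p.ProductId, p.Name;
GO

-- The unique clustered index is what makes the view "indexed": SQL Server
-- materializes the result and keeps it up to date as the base tables change.
CREATE UNIQUE CLUSTERED INDEX IX_vProductSummary
    ON dbo.vProductSummary (ProductId);
```

Once the clustered index exists, queries against the view read the precomputed rows instead of re-running the joins on every request.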
I'm trying to design a database schema for a Django REST framework web application.
At some point, I have two choices:
1. Choose a schema in which, in one or several APIs, I have to fetch a queryset from the database and iterate over and order it in Python. (For example, I could store some data in an array-typed column, fetch it from the database and sort it with Python.)
2. Store the data in another table, inserting a fairly large number of rows with each insert. This way, I can get the data in the format I want with far fewer lines of ORM code.
I tried some basic tests and benchmarking to see which way is faster, and letting the database handle more of the job (the second way) didn't let me down. But I don't have the means to set up a more realistic situation, so here's the question:
Is it still a good idea to let the database handle the job when it also has to handle hundreds of requests per second from other APIs and clients?
Is the database (and the ORM) usually faster and more reliable than the backend code?
As a general rule, you want to let the database do work when the work is appropriate for the database. Sorting result sets would be in that category.
Keep in mind:
The database runs on a server, often on a distributed system, so it has access to more resources.
Databases are designed to handle large amounts of data, so they are not limited by the memory available to a single application process.
When this question comes up, doing the work in the application often means passing back more data than is strictly needed. Consider a problem such as getting the top 10 of something (see the sketch at the end of this answer).
Mixing processing in the application and the database often requires multiple queries and passing data back and forth, which is expensive.
(And there are no doubt other considerations.)
There are some situations where it might be more efficient or convenient to do work in the application. A common example is formatting result sets for the application -- say turning 1234.56 into $1,234.56. Other examples would be when the application language has capabilities that are not directly in SQL or are hard to implement in SQL.
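To illustrate the top-10 point from the list above, here is a rough sketch (the table and columns are invented): letting the database sort and limit means only 10 rows cross the network, while doing it in the application means fetching the whole table first.

```sql
-- Done in the database: sort + limit happen server-side, 10 rows come back.
-- (TOP is SQL Server syntax; MySQL/PostgreSQL would use ORDER BY ... LIMIT 10.)
SELECT TOP 10 player_id, score
FROM   scores
ORDER  BY score DESC;

-- Done in the application: every row is fetched,
--   SELECT player_id, score FROM scores;
-- and the sorting/truncation happens in application code afterwards.
```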
I tried but couldn't find a similar post; I apologize if I have missed one and created a duplicate here.
I need to find the best mechanism to save data for the following requirement, and I'd like your opinion.
The main requirement
We receive a lot of data from a collection of electronic sensors. The amount of data is about 50,000 records per second and each record contains a floating point value and a date/time stamp.
Also, we need to keep this data for at least 5 years and process them to make predictions.
Currently we are using MS SQL Server, but we are very keen to explore new areas like NoSQL.
We can be flexible on these:
We wouldn't need a great deal of consistency, as the structure of the data is very simple.
We can manage atomicity from code when saving (if required).
We would need the DB end to be reliable on these:
Fast retrieval - so that it won't add much time to what's already required by the heavy prediction algorithms.
Reliability when saving - our middle tier will throw a lot of data at it at high speed, and the DB needs to save all of it.
Data need to be safe (durability)
I have been reading up on this and I am beginning to wonder if we could use both MS SQL and NoSQL in conjunction. What I am thinking of is continuing to use MS SQL Server for regular use of the data and using a NoSQL solution for long-term storage/processing.
As you may have realized by now, I am very new to NoSQL.
What do you think is the best way to store this much data while retaining the performance and accuracy?
I would be very grateful if you could shed some light on this so we can provide an efficient solution to this problem.
We are also thinking about eliminating almost identical records that arrive close to each other (e.g. 45.9344563V, 45.9344565V, 45.9344562V arriving within 3 microseconds - we would ignore the first two and take the third). Have any of you solved a similar problem before, and are there any algorithms you used?
I am not trying to get a complete solution here. Just trying to start a dialog with other professionals out there... please give your opinion.
Many thanks for your time, your opinion is greatly appreciated!
NoSQL is pretty cool and will handle one of your requirements well (quick storage and non-relational retrieval). However, the problem with NoSQL ends up becoming what to do when you start trying to use the data relationally, where it won't really perform quite as well as an RDBMS.
When storing large quantities of data in an RDBMS, there are several strategies you can use. The most obvious one that comes to mind is partitioning. You can read more about that for SQL Server here: https://msdn.microsoft.com/en-us/library/ms190787.aspx
You might also want to consider creating a job to periodically move historical data that isn't accessed as often to a separate disk. This may enable you to use a feature introduced in SQL Server 2014 called In-Memory OLTP for the more heavily used recent data (assuming it's under 250 GB): https://msdn.microsoft.com/en-us/library/dn133186.aspx
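If you stay on SQL Server, a minimal sketch of date-based partitioning for the sensor data might look like this (names, boundary dates and filegroups are invented; in practice older partitions would live on cheaper storage):

```sql
-- One partition per year of readings.
CREATE PARTITION FUNCTION pfReadingYear (datetime2)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2015-01-01', '2016-01-01');

-- Keep everything on PRIMARY for the sketch; map old years to slower disks for real.
CREATE PARTITION SCHEME psReadingYear
AS PARTITION pfReadingYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.SensorReadings
(
    SensorId  int       NOT NULL,
    ReadingAt datetime2 NOT NULL,
    Value     float     NOT NULL,
    CONSTRAINT PK_SensorReadings PRIMARY KEY CLUSTERED (ReadingAt, SensorId)
)
ON psReadingYear (ReadingAt);
```

Partitioning also makes the "move old data elsewhere" job cheap, because switching or archiving a whole partition avoids row-by-row deletes.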
A little context first:
I would say I have good SQL Server experience, but I am a developer, not a DBA. My current task is to improve the performance of a database that has been built with a very heavy reliance on views. The application is peppered with inline SQL, and the Tuning Advisor only suggests a couple of missing indexes, which look reasonable.
I feel that reworking the database design so that the data created by these particular views (lots of CASE WHENs) is persisted as hard data is too much work given the budget and time scales here. A full rewrite of these enormous views and all of the code that relies on them also appears to be out of the question.
My proposed solution:
Testing has shown that if I do a SELECT INTO with the view data to persist it in a permanent table, and then replace references to the view with this table, query times go down to 44% of what they were when using the view.
Since the data is refreshed by a spider process overnight, I think I can just drop and recreate this table on a daily basis, and then make minor modifications to the queries to use this table instead.
Can someone with good DBA experience give me an opinion on whether that is a good / *&?!! awful idea. If the latter, is there a better way to approach this?
Thanks.
You say: "...reworking the database design so that the data created by these particular views ... is persisted as hard data is too much work given the budget and time scales here." and yet this is exactly what you are proposing to do. It's just that you use the views themselves instead of making the code a function or a stored procedure.
Anyway, my opinion is that if you invest a bit of effort in making this robust (i.e. you ensure that if the spider runs, the persisted data always get refreshed, and you never run the select-into before the end of the spidering) your solution is ok.
What would concern me is that this is a hack - however brilliant - so whoever inherits your solution may find it difficult to understand the why and how. See if you can provide a good explanation, either in comments or in a separate document.
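For the refresh itself, a rough sketch of what the overnight step could look like, assuming a view named dbo.vBigView and a persisted table dbo.BigViewData (both names invented); building into a staging table and swapping it in keeps readers from ever seeing a missing table:

```sql
-- Build the new copy under a staging name.
IF OBJECT_ID('dbo.BigViewData_Staging') IS NOT NULL
    DROP TABLE dbo.BigViewData_Staging;

SELECT *
INTO   dbo.BigViewData_Staging
FROM   dbo.vBigView;

-- Swap: drop the old table and rename the fresh copy into place.
BEGIN TRANSACTION;
    IF OBJECT_ID('dbo.BigViewData') IS NOT NULL
        DROP TABLE dbo.BigViewData;
    EXEC sp_rename 'dbo.BigViewData_Staging', 'BigViewData';
COMMIT TRANSACTION;

-- Recreate whatever indexes the replaced queries rely on (column is hypothetical).
CREATE INDEX IX_BigViewData_SomeKey ON dbo.BigViewData (SomeKeyColumn);
```

Scheduling this as a SQL Server Agent job step that only runs after the spider finishes covers the "never run the SELECT INTO before the end of the spidering" concern.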
I am trying to lay out the tables for use in a new public-facing website. Seeing as there will be lots more reading than writing of data (I'm guessing >85% reading), I would like to optimize the database for reading.
Whenever we list members, we are planning on showing summary information about them - something akin to the reputation points and badges that Stack Overflow uses. Instead of doing a subquery to find the information each time we run a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: is this an appropriate approach to optimizing the database, or are the subqueries fast enough that performance would not suffer?
There are two parts:
1. Caching
2. A tuned query, using either:
   - Indexed views (AKA materialized views)
   - A tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against them is faster than querying an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (materialized views in general) are very constrained in what they support - they can't use non-deterministic functions (e.g. GETDATE()), aggregate support is extremely limited, etc.
If what you need can't be handled by an indexed view, a table where the data is dumped and refreshed via a SQL Server job is the next alternative. Like the indexed view, indexes would be applied to make fetching data faster. But data changes mean maintaining those indexes to ensure queries keep running as well as they can, and this maintenance can take time.
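A minimal sketch of that "tuned table" option for the member summary, refreshed by a scheduled job (all table and column names are invented to match the question):

```sql
-- Run by a SQL Server Agent job: rebuild the summary from the detail tables.
TRUNCATE TABLE dbo.MemberSummary;

INSERT INTO dbo.MemberSummary (MemberId, ReputationPoints, BadgeCount)
SELECT  m.MemberId,
        COALESCE((SELECT SUM(p.Points)
                  FROM   dbo.PointAwards p
                  WHERE  p.MemberId = m.MemberId), 0),
        (SELECT COUNT(*)
         FROM   dbo.MemberBadges b
         WHERE  b.MemberId = m.MemberId)
FROM    dbo.Members m;

-- Created once, not per run: lets member listings sort by reputation cheaply.
-- CREATE NONCLUSTERED INDEX IX_MemberSummary_Reputation
--     ON dbo.MemberSummary (ReputationPoints DESC);
```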
The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance caching technology (for example, memcached) to store query results in your application can be a much better strategy than trying to trick out the database to be highly scalable.
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say, it's definitely premature to optimize.
You might want to redesign your database a few days later; you might find that things work fast enough without any clever hacks; or you might find that they are slow, but in a different way than you expected. In any of these cases, time spent optimizing now would be wasted.
The approach you describe is generally fine; you could keep some pre-computed values, either using triggers/stored procedures to preserve data consistency, or running a job to update these values from time to time.
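If you go the trigger route, a minimal sketch of keeping the cached points total in sync (tables and columns are hypothetical stand-ins for your schema):

```sql
-- Recalculate the cached total only for members whose point rows changed.
CREATE TRIGGER trg_PointAwards_SyncMemberPoints
ON dbo.PointAwards
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE m
    SET    m.ReputationPoints = (SELECT COALESCE(SUM(p.Points), 0)
                                 FROM   dbo.PointAwards p
                                 WHERE  p.MemberId = m.MemberId)
    FROM   dbo.Members m
    WHERE  m.MemberId IN (SELECT MemberId FROM inserted
                          UNION
                          SELECT MemberId FROM deleted);
END;
```

Because the total is recomputed from the detail rows rather than incremented, rerunning the same calculation in a job is all it takes to fix the field if it ever drifts out of sync.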
All databases are more than 85% read only! Usually high nineties too.
Tune it when you need to and not before.
I have developed several projects based on the Python framework Django, and it greatly improved my productivity. But once a project was released and got more and more visitors, the database became the performance bottleneck.
I tried to address the issue and found that it was the ORM (Django's) that made it so slow. Why? Because Django has to provide a uniform interface for the programmer no matter which database backend you use, so it inevitably sacrifices some database performance (turning what could be one raw SQL statement into several, and never using database-specific operations).
I understand that an ORM is definitely useful and that it can:
Offer a uniform OO interface for programmers
Make database backend migration much easier (from MySQL to SQL Server or others)
Improve the robustness of the code (using an ORM means less code, and less code means fewer errors)
But if I don't have a requirement to migrate databases, what does the ORM really give me?
PS: Recently a friend told me that what he is doing now is just rewriting his ORM code as raw SQL to get better performance. What a pity!
So what is the real value of an ORM, beyond what I mentioned above?
(Please correct me if I made a mistake. Thanks.)
You have mostly answered your own question when you listed the benefits of an ORM. There are definitely some optimisation issues that you will encounter, but the abstraction of the database interface probably outweighs these downsides.
You mention that the ORM sometimes uses many sql statements where it could use only one. You may want to look at "eager loading", if this is supported by your ORM. This tells the ORM to fetch the data from related models at the same time as it fetches data from another model. This should result in more performant sql.
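To make that concrete, eager loading essentially collapses the "N+1 queries" pattern into a single joined statement; roughly (the table names here are only illustrative):

```sql
-- Without eager loading: one query for the parent rows...
--   SELECT id, title, author_id FROM blog_post;
-- ...then one extra query per post for its author (N more round trips):
--   SELECT id, name FROM auth_user WHERE id = ?;

-- With eager loading, the ORM issues one joined query instead:
SELECT  p.id, p.title, u.id, u.name
FROM    blog_post p
JOIN    auth_user u ON u.id = p.author_id;
```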
I would suggest that you stick with your ORM and optimise the parts that need it, but explore any methods within the ORM that allow you to increase performance before falling back to writing raw SQL for data access.
A good ORM allows you to tune the data access if you discover that certain queries are a bottleneck.
But the fact that you might need to do this does not in any way remove the value of the ORM approach, because it rapidly gets you to the point where you can discover where the bottlenecks are. It is rarely the case that every line of code needs the same amount of careful hand-optimisation. Most of it won't. Only a few hotspots require attention.
If you write all the SQL by hand, you are "micro optimising" across the whole product, including the parts that don't need it. So you're mostly wasting effort.
Here is the definition from Wikipedia:
Object-relational mapping is a programming technique for converting data between incompatible type systems in relational databases and object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language.
A good ORM (like Django's) makes it much faster to develop and evolve your application; it lets you assume that all related data is available without having to account for every use case in hand-tuned queries.
But a simple one (like Django's) doesn't relieve you from good old DB design. If you're seeing a DB bottleneck with fewer than several hundred simultaneous users, you have serious problems: either your DB isn't well tuned (typically you're missing some indexes), or it doesn't appropriately represent the data (if you need many different queries for every page, this is your problem).
So I wouldn't ditch the ORM unless you're Twitter or Flickr. First do all the usual DB analysis: seeing a lot of full-table scans? Add appropriate indexes. Lots of queries per page? Rethink your tables. Every user needs lots of statistics? Precalculate them in a batch job and serve them from there.
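For the index point, a minimal sketch (the table, columns and query are hypothetical; take the real ones from whatever your slow-query log or EXPLAIN output shows doing full scans):

```sql
-- A frequent filter like
--   SELECT ... FROM orders WHERE customer_id = ? ORDER BY created_at DESC
-- that shows up as a full-table scan usually just wants a matching composite index:
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at);
```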
An ORM spares you from having to write that pesky SQL.
It's also helpful for when you (never) port your software to another database engine.
On the downside, you lose performance, which you fix by writing a custom flavor of SQL - the very thing the ORM tried to insulate you from writing in the first place.
An ORM generates SQL queries for you and then returns the results to you as objects; that's why it is slower than accessing the database directly. But I think it is only a little slower. I recommend that you tune your database first - maybe you need to check the table indexes, etc.
Oracle, for example, needs to be tuned if you want it to go faster (I don't know why, but my DB admin did that, and queries involving lots of data now run faster).
One recommendation: if you need to run complex queries (e.g. reports) beyond plain CRUD, and your application won't switch to another database, use raw SQL directly (I think Django has a feature for that).