Why use NoSQL over Materialized Views? - database

There has been a lot of talk recently about NoSQL.
The #1 reason why I hear people use NoSQL is because they start to de-normalize their DBMS data so much so, to increase performance, that they end up with just one table with all of their data within that single table.
With Materialized Views however, you can keep your data normalized, yet have it stored as a single table view for the same reasons why you'd use NoSQL.
As such, why would someone use NoSQL over Materialized Views?

One reason is that materialized views will perform poorly in an OLTP situation where there is a heavy amount of INSERTs vs. SELECTs.
Everytime data is inserted the materialized views indexes must be updated, which not only slows down inserts but selects as well. The primary reason for using NoSQL is performance. By being basically a hash-key store, you get insanely fast reads/writes, at the cost of less control over constraints, which typically must be done at the application layer.
So, while materialized views may help reads, they do nothing to speed up writes.

NoSQL is not about getting better performance out of your SQL database. It is about considering options other than the default SQL storage when there is no particular reason for the data to be in SQL at all.
If you have an established SQL Database with a well designed schema and your only new requirement is improved performance, adding indexes and views is definitely the right approach.
If you need to save a user profile object that you know will only ever need to be accessed by its key, SQL may not be the best option - you gain nothing from a system with all sorts of query functionality you won't use, but being able to leave out the ORM layer while improving the performance of the queries you will be using is quite valuable.

Another reason is the dynamic nature of NoSQL. Each view you create will need created before-hand and a "guess" as to how an application might use it.
With NoSQL you can change as the data changes; dynamically varying your data to suit the application.

Related

Multiple linq queries or just build a SQL view?

I have a mvc view that requires data from 5 different DB tables. I currently have a big LINQ query that joins all the tables and returns the results, works fine. However, I am wondering if it would better to build a DB view to make the LINQ query simple.
Querying 5 tables via a single query isn't necessarily a problem. It depends on a ton of external factors like how performant is your database setup and the characteristics of the tables themselves: are they huge with millions of rows or only a few hundred?
Assuming it is a problem, causing excessive load on your database or long page load times, then yes, you might want to look into an alternative solution, but a view is almost certainly not the right choice.
Views have a very key negative in that they cannot have keys nor indexes. That means unless you plan to just return everything in the view, it will almost always be slower to query into a view than even doing joins across tables. Frankly, I've pretty much never found a good use for a database view in a web application context. Maybe they work in other environments, such as reporting, but other than that, they're useless. If you need an alternative to Entity Framework, use a stored procedure.
Since your objective is performance, keep with the 5 joins. You could enable SQL Profiler and track the query that is being generated by EF. Probably, if you write the query manually and then send to EF execute it, you'll get a better performance too.

Architecting a high performing "inserting solution"

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.
It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.
It is important for the solution to be able to handle the heavy database inserting load.
After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.
We are initial looking at using the MEAN stack with MongoDB and migrating some data to MySql for reporting purposes. I am also wondering about the use of some sort of queue-ing before insert into db or caching like memcached
I didn't manage to find much help on this elsewhere. I did see this post but it is now close to 5 years old, feels a bit outdated and don't quite ask the same questions.
Your thoughts and comments are most appreciated. Thanks.
Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?
It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.
Some initial theory
To understand how you can optimize for the heavy insert workload, I suggest to understand the main overheads involved in inserting data in a database. Once the various overheads are understood, all kings of optimizations will come to you naturally. The bonus is that you will both have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQl, Oracle, etc.).
I'm first making a non-exhaustive list of insertion overheads and then show simple solutions to avoid such overheads.
1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on.
And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.
Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).
2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modification to the database ahead of time and require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row in a database page.
Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).
3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.
Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done you can re-created them. For example, it is significantly faster to create an index from scratch rather than populate it through individual insertions.
In practice
Now let's see how we can apply these concepts to your particular design. The main issues I see in your case is that the insert requests are sent by many distributed clients so there is little chance for bulk processing of the inserts.
You could consider adding a caching layer in front of whatever database engine you end up having. I dont think memcached is good for implementing such a caching layer -- memcached is typically used to cache query results not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source but not all features are free so I'm not sure if you need to pay for a license or not. If you cannot use VoltDB you could look at the memory engine for MySQL or other similar in-memory engines.
Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the result as a new set of tables/materialized for later access. The import/analysis process can be repeated continuously in the background.
Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.
Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

is it needed to normalize your database when you are using mongodb?

i am making an web application and using a mongodb for my database. but im new in mongodb and i just want to know if i still needed to normalize my database like what others do in using RDBMS. that before making a table or database it is normalize.
Normalizing your data like you would with a relational database is usually not a good idea in MongoDB.
Normalization in relational databases is only feasible under the premise that JOINs between tables are relatively cheap. The $lookup aggregation operator provides some limited JOIN functionality, but it doesn't work with sharded collections. So joins often need to be emulated by the application through multiple subsequent database queries, which is very slow (see question MongoDB and JOINs for more information).
For that reason, you should design your database in a way that the most common queries can be satisfied by querying a single collection, even when this means that you will have some redundancy in your database. A good tool to model 1:n relations are often arrays of sub-objects. However, try to avoid objects which grow indefinitely over time, because in that case they will be moved around the file a lot which will impact write performance.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20.000.000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advices below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code) on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering to buy Enterprise Edition for use of partitioned tables. If the purchase cannot go through I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well I assume you know some things about the subject. Depending on your goals you could go for a star-schema (a multidemensional model with a fact and dimensiontables). Store all fastchanging data in 1 table (per subject) and the slowchaning data in another dimension/'snowflake' tables.
An other option is the DataVault method by Dan Lindstedt. Which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge anmout of records and SQl server should handle with ease.
A partioned single table is usually the best way to go. Trying to maintain separate indivudal customer tables is very costly in termas of time and effort and far more probne to errors.
Also examine you current queries if you are experiencing performance issues. If you don't have proper indexing (did you for instance index the foreign key fields?) queries will be slow, if you don't have sargeable queries they will be slow if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is striclty needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views or if you used EAV table, you willhave performance iisues at this level. If you allowed a framework to autogenerate SQl code, you may well have badly perforimng queries. Remember Profiler is your friend. Of course you could also have a hardware issue, you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you need to hire a professional dba with performance tuning experience. It is quite complex stuff. Databases desigend by application programmers often are bad performers when they get a real number of users and records. Database MUST be designed with data integrity, performance and security in mind. If you didn't do that the changes of having them are slim indeed.
Partioning is definately something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35million records. I have since merged this into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keep partitioning in mind. One thing that I have noticed, compared to when the data was sharded, and that's the data import. It is now slower, but I can live with that as the Import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform (but partitioning is not usually about performance but about maintenance being able to add and drop partitions quickly) over very large data sets - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggrevation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at BTree searching than your own invention is)
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

What should I keep in mind if I wish to merge many DBs into one DB?

I am working with a half dozen DBs. The DBs all have the same schemas, the same SPs, etc. Speaking to the person who originally designed the DBs, a big part of the motivation for using many DBs was efficiency; the alternative would be to add a column to pretty much every table and sp in the database indicating which set of data was being worked in, resulting in one giant (and thus slower) DB instead of several samll DBs. In place of having a column to indicate which set of data is being queried, the connection string is used to select which database is being hit.
The only reason I really dislike this organization is that it involves a lot of code duplication and thus hurts maintenance. For example, every time I wish to change a stored procedure, I need to run the alter statement on every database.
One solution I have considered is to combine all of the data into one big database, adding an extra column all over the place to indicate which database the data would be in if I had not combined it. Then, I could partition all of the tables by this column's value. In theory, the result of all of this is that the underlying representation of all of the data itself will be morally the same as it is now, but without the redundancies in the indexes, schemas, SPs, etc.
My questions are this:
Is this a good idea? Is there a better way to accomplish this?
Are there any gotchas in doing this?
Will this have any impact on performance?
Everyone will deal with this at some point. My own personal opinion is that multiple databases are a pain in the backside and are not faster. They are a pain because of the maintenance headaches. Adding an extra column in each table as necessary will not slow your process done that much, if indexing is set properly. And your maintenance will be much easier. Plus, doing transactions across multiple DB's can be a hassle and involve MTC.
BTW, using a single database is often called a multi-tenant database. You might want to research this a bit. But I would avoid multiple DB's like this if possible.
I'm of a different mind than Randy.
The multi-tenant model has its advantages.
For one, maintenance is not really much different whether you have 5 databases or 500. At some point you stop looking at maintenance of individual databases and look at the set. Yes you must serialize backups and you can't be performing index reorg/rebuild across all databases at once.
But for code changes across multiple more-or-less identical databases, there are easy ways to script a lot of things to be done to multiple databases without really lifting an extra finger. I use a tool called SQLFarms Combine (now sold by JNetDirect), but there are other offerings such as RedGate MultiScript that I haven't played with.
What I like most about the multi-tenant model is that when you grow and scale and suddenly need a new database server, it is very easy to move one of the tenants (say, the busiest or fastest growing) to the new server. If everybody is jammed into the same database, this extraction of only their data becomes quite difficult, especially if there is to be minimized downtime. In the multi-tenant model, you can set up mirroring for just their database, and then switch the primary when you're ready.
I'd be in favor of combining these databases. There are other facilities built into SQL Server to account for the potential performance downfalls of a very large database, like additional indexing on a second physical disk, partitioning, clustering, etc. The headache and overhead involved in deploying schema updates to that many different databases can be time consuming when it's easily handled in a single database. I think SQL Server scales really well in cases like this - let the database server do what it's designed to do and provide responsive access to your data. You can focus on application design and leave the storage model to SQL Server.
Also, though this isn't mentioned above, I'd suspect that there's some level of dynamic SQL involved in the applications that use this "many database" model because you've got to switch between databases based on something you know, so it can't be hard coded into the application or in a configuration file, meaning that either connection strings or actual SQL statements have to be generated on the fly, and that can be a really big security risk (read about "SQL Injection" if you're unfamiliar with the potential risks of dynamic SQL).

Resources