Do NoSQL databases use or need indexes? - database

I am a newbie in NoSQL databases and this may sound a bit stupid but I was wondering if NoSQL databases use or need indexes?
If yes, how to make or manage them? any links?
Thanks

CouchDB and MongoDB definitely yes. I mentioned that in my book:
http://use-the-index-luke.com/sql/testing-scalability/response-time-throughput-scaling-horizontal
Here are the respective docs:
http://guide.couchdb.org/draft/btree.html
http://www.mongodb.org/display/DOCS/Indexes
NoSQL is, however, too fragmented to give a definite "yes, all NoSQL systems need indexes", I believe. Most systems require and provide indexes but not at level most SQL databases do. Recently, the Cassandra people were proudly introducing secondary indexes, i.e., more than a single clustered index.
http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes (well, not so recently as I remember)

Definitely nosql databases need index,
i.e. but in most popular databases you need not to maintain index by yourself because as per current needs of nosql databases communities of nosql databases is developing with new features and with "Code Less, Get More"

Related

Database selection for performance and scalability for C# APIs

I've been working on a project for dating-like app, kind of tinder/bumble. I've been debating which database to use Cassandra or MongoDB. So far I have experience only with MS SQL, mysql and Unidata... I've been looking into Cassandra and MongoDB because of scalability, but I've heard Tinder had issues with their MongoDB, thus they had to call in for help. Even if it is not any of those 2, what else would you suggest? Learning DB would not be an issue for me, but I am looking for performance and scalability. Main programming language will be C# (if it helps) and preferably I am looking for building this in cloud (Azure Cosmos DB, aws dynamoDB or similar). My thoughts are NoSQL DB because of scalability but I wouldn't be opposed to select RDBMS if there is strong reason.
Suggestions, comments, thoughts?
Cassandra has some advantages over mongodb.
There is no master-slave in cassandra. Any node can receive any
query. If master goes down on mongodb, you'll face with little down time.
It is easy to scale cassandra, adding a node is not a challange.
Writes are very fast.
Read query with primary key is fast.
Also
There is no aggregation in cassandra
Bad performance for very high update/delete (increasing tombstones causes bad performance impact : http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html)
Not efficient for fulltext search applications
No transactions
No joins
Secondary indexes are not equal to rdbs indexes and should not use very often
So you can not use cassandra for every use cases. If your data model does not fit for cassandra just consider another db which fits your requirements.
Also take a look at : https://blog.pythian.com/cassandra-use-cases/

How is database design done in Joomla?

In phpMyAdmin, the tables do not show any relationships and they cannot be implemented in MyISAM tables. That means all the database relationships and data integrity need to be taken care of in the code level rather than at the database level? Is there any advantage or disadvantage of this?
I think the lack of relationships is probably a legacy from the early days of mysql. I have seen that the latest versions of Joomla is using InnoDB - tables instead of MyIsam, and this might indicate that they are thinking of adding relationships? (Innodb supports foreign keys and triggers, while myisam does not).
Obviously, it is much easier to maintain data integrity by using relationships and triggers, that doing everything at the application level. The only disadvantage I can think of is that it might require a bit more work with the coding to make sure the application don't give errors when you try to delete items with foreign keys ...
regards Jonas

advantage of noSql over newSql

It seems now that Google bet on NewSql solutions for big data storages.
I'm wondering if there is still some advantages of a NoSql solution comparing to a newSql solution ? (Like memory managment or others things)
NewSql databases are a new "strain" of databases if you will that are attempting to take the long established benefits of the traditional Relationl Database Management System (RDBMS) and make it compete with the highlights of NoSql data stores. They are not updates or improvements to RDBMS but more often rewrites that include middleware that abstracts the practice of database "sharding" or the ability to distribute a database over a grid of computers like NoSql does.
The power of the RDBMS comes mostly from queryability via the Structured Query Language (SQL), their transactionality and adhereance to the ACID principal (Atomicity, Consistency, Isolation, Durability) and the powerful tools developed over time to manage them. A lesser benefit comes from the fact that the relational model eliminates repetitive storage of the same information in multiple places.
The benefits of the NoSql is high speed, the ability to scale laterally across a comuting grid, and the lack of schema to maintain. This makes them very highly performant even against hugh data stores. But they lack the benefits that you get from the traditional RDBMS in that the query language to manipulate data isn't really there (yet), they can't be transactional across a computing grid, and they lack the tools to work against them like MS Sql Server Management Studio.
NewSql is attempting to take the best parts of both worlds and I think it eventually will. Here is a great write up of the RDBMS V.s. NoSql V.s. NewSql on bananagunprogramming.com.

Is there any different between a Column-oriented DBMS and RDBMS when designing a database schema?

Is there any different between these two kind of database? If yes, what is the different? Thank you.
The question isn't really answerable because "RDBMS" and "column-oriented" refer to very different aspects of a DBMS and are not mutually exclusive.
A RDBMS is any DBMS that implements the relational model.
A column-oriented DBMS is any DBMS that uses a columnar storage for data. That could be an RDBMS or it could be something else.
A column-oriented database is typically used for data warehouses and where you need to aggregate large amounts of data. It can be substantially different than a 'typical' transactional database.
Is this this what you are desiring to build (a data warehouse)?
When the column-oriented DBMS supports SQL, it replaces the SQL schema internally with a fully normalised version. Therefore performance considerations in the design of the schema that usually apply to traditional RDBMS no longer apply.

What do I need to know about databases?

In general, I think I do alright when it comes to coding in programming languages, but I think I'm missing something huge when it comes to databases.
I see job ads requesting knowledge of MySQL, MSSQL, Oracle, etc. but I'm at a loss to determine what the differences would be.
You see, like so many new programmers, I tend to treat my databases as a dumping ground for data. Most of what I do comes down to relatively simple SQL (INSERT this, SELECT that, DELETE this_other_thing), which is mostly independent of the engine I'm using (with minor exceptions, of course, mostly minor tweaks for syntax).
Could someone explain some common use cases for databases where the specific platform comes into play?
I'm sure things like stored procedures are a big one, but (a) these are mostly written in a specific language (T-SQL, etc) which would be a different job ad requirement than the specific RDBMS itself, and (b) I've heard from various sources that stored procedures are on their way out and that in a lot of cases they shouldn't be used now anyway. I believe Jeff Atwood is a member of this camp.
Thanks.
The above concepts do not vary much for MySQL, SQL Server, Oracle, etc.
With this question, I'm mostly trying to determine the important difference between these. I.e. why would a job ad demand n years experience with MySQL when most common use cases are relatively stable across RDBMS platforms.
CRUD statements, joins, indexes.. all of these are relatively straightforward within the confines of a certain engine. The concepts are easily transferable if you know a different RDBMS.
What I'm looking for are the specifics which would cause an employer to specify a specific engine rather than "experience using common database engines."
I believe that the essential knowledge about databases should be:
What database are for?
Basic CRUD Operations
SELECT queries with JOINs
Normalization
Basic Indexing
Referential Integrity with Foreign Key Constraints
Basic Check Constraints
The above concepts do not vary much between MySQL, SQL Server, Oracle, Postgres, and other relational database systems. However you'd find a different set of concepts for the now-popular NoSQL databases, such as CouchDB, MongoDB, SimpleDB, Cassandra, Bigtable, and many others.
After the CRUD statements, to be an effective DB programmer I think some of the most important things to understand are JOIN statements. Understand the difference between LEFT and RIGHT, OUTER and INNER joins, and know when to use each. Most importantly, know what the database is actually constructing when it performs a JOIN.
For me, the Wikipedia article was very helpful.
Also, indexing is very important - this is how relational databases can perform fast queries. Understand how to use them and what happens under the hood.
Wikipedia article on DB indexing.
You should also know how to construct a many-to-one relationship (using foreign keys) and a many-to-many relationship (using join tables).
I know that in your question you're asking about specific DB implementations, but if you're to be taken literally and you only know about SELECT, INSERT, UPDATE, and DELETE, then the above concepts will be far more valuable than learning the intricacies of a particular implementation.
It's not just stored procs and functions. Each database has fundamental differences and quirks that are important to understand even though SQL works more or less the same.
Examples:
Oracle and MySQL handle locking differently, in different situations.
Oracle doesn't have autoincrementing primary keys like MySQL and SQL Server.
Subtle vendor-specific behavior, like the way Oracle does sorting for VARCHARs differently depending on locale.
If you really want to improve your applications, you eventually have to become familiar with the details about how your specific database works. Most of the time it doesn't make a lot of difference, but when it does matter, it usually makes a big difference, especially when it comes to performance.
Some things which seem to come up when talking with my Database-keen colleagues:
Row vs page vs table locking escalation when doing multiple complex joins, implies sometimes doing very different things on different vendors dbs. This is where the theory is really hitting the tarmac and often it is non-intuitive.
Differences between how cursors are best used on different vendor db implementations
Odd stuff in the stored proc language variants, like how best to handle failure cases
Differences in how temporary tables and views are best used depending on the underlying implementations.
All of these kind of things don't really matter until you are trying to solve something that either has to
- Run very fast
- Contain lots and lots of data
- Gets very big and complex (i.e. multiple queries hitting same tables simultaneously)
These are the kinds of things that DBAs should be helping with, so depends on if you are aiming to be a DBA or a programmer. None of the above have really hurt me yet, because I've not worked on db-intensive systems, but I've worked near a few, and the programmers on those end up knowing a lot about the internals, restrictions, and good features about the specific database they are using.
Best way to get knowledge like that (other than on the job) is to read the manuals or hang out with people that already know and ask them about it.
Don't forget relation schemas, Primary and foreign keys and how they are related. To start with DB, I would use MySql and MSSQL as these are most common in the market. I take Oracle as more advanced and complex db
As for the level of differences there are between vendors, it is because SQL is a standard (http://en.wikipedia.org/wiki/SQL#Standardization) and vendors implement that std differently.
Each of these vendors try to offer extras to have the crowd by their side... that's why you see functions available to one and not the other. But sometimes that function make its way into the standard so its not always a bad thing.
For stored proc. I would agree as ORMs and practices of today tend to do a greater separation of concerns by removing business logic from the database and considering it "only" a repository.
My 2 cents
I see job ads requesting knowledge of MySQL, MSSQL, Oracle, etc. but I'm at a loss to determine what the differences would be.
I'm what's called a SQL Developer. You won't see the differences much when you are doing run of the mill database work (CRUD). However the differences become quite apparent when you are dealing with the databases own brand of SQL.
When talking SQL outside of the standards, there are 4 distinctive types of commands. These are:
Data Manipulation Language (DML)
Data Definition Language (DDL)
Data Control Language (DCL)
Transactional Control Language (TCL)
The biggest differences come in the last two, DCL and TCL. Those have a LOT of database specific non-standard SQL commands. The first two, DML and DDL, are very similar across any database that use the relational model.
Also the bigger database vendors have nicknamed their SQL implementation. Here's a short sample:
SQL Server : T-SQL
Oracle : PL-SQL
PostgreSQL : P-SQL or NG-SQL
Firebird : IB-SQL
MySQL : mSQL
The list goes on, but you get the point. Wikipedia has good articles on the different command acronyms.
I have found that most employers won't be able to articulate this, because most will use non-technical managers and/or HR to do the hiring. They are basically being told by the tech managers that the new hires need to know X technology. This and also, because the majority are too lazy to hire for intelligence, instead they fall back on the "We have X, so darn it, we need to hire somebody that knows X!" meme. The differences are actually not that hard to learn, for the people who frequent StackOverflow. I'm confident that anybody here can learn these fairly fast.
Even something as simple as an auto-incrementing primary key can be very different in Oracle, mysql, and SQL Server.
Some other important differences:
SQL Server makes a distinction between clustering key and primary key; other database do not. This choice comes with major performance implications.
SQL Server allows the SET #Total = Total = #Total + Amount syntax for fast computations of things like running totals. mysql lets you use a user variable in a similar way (I think). In other databases you'd probably have to use a correlated subquery. Huge difference in performance.
SQL Server can generate "sequential GUIDs" with newsequentialid. I'm not sure which other databases have this feature, but as with the above two points, there are significant performance implications to using a traditional GUID as opposed to a sequential or comb.
Oracle's CONNECT BY is a very useful and pretty unique syntax. Common Table Expressions in SQL Server and mysql are similar but not exactly the same.
Support for ranking/ordering functions varies vastly across different databases. I'm constantly posting answers here invoking ROW_NUMBER. A lot of queries are much harder to write without this - but at the same time, abusing it can hurt performance.
XML support is all over the map. Most databases have reasonably good support for it now, but both syntax and semantics are completely different on every platform.
Date/time handling can be quite different. Oracle has several different date/time-related types, some including time zone information. In general, Oracle is way better than other databases at managing temporal data, and has several features that you will miss if you switch. Until recently, Microsoft didn't have the date and time types, just datetime, which was much harder to normalize.
Spatial types are different and/or nonexistent in different databases. mysql exposes an entire OpenGIS model; Microsoft's support is a bit more basic but still competent. Oracle has it, but it's a little hard to find information on, and it's some sort of optional add-on. I think DB2 is starting to get it, but support is still a little spotty.
mysql actually lets you choose how to store an index (i.e. btree or hash). This is also an important performance consideration.
SQL Server allows you to INCLUDE columns in an index - very important for performance.
Oracle allows you to create function-based indexes, bitmap indexes, and so on. These can be pretty difficult to wrap your head around.
Oracle can perform "skip seeks" in very specific situations, something that I don't believe is supported in other databases (yet). This might factor into how you order index columns.
SQL Server has CLR types/functions/aggregates. Obviously not supported in any other database product.
Trigger support varies significantly. SQL Server has AFTER and INSTEAD OF. mysql has BEFORE and AFTER. Oracle has all of those and more. These all behave quite differently.
I'm sure that there are many, many more differences, but that should give you at least a basic idea of why 5 years of experience with Oracle is completely different from 5 years of experience with SQL Server.
That databases are encoded collections of assertions of fact.
That the logical structure of the tables corresponds to the syntactical structure of those "assertions of fact".
That Normalization theory helps you find the most optimal logical structure of the database, by minimizing redundancy, i.e. minimizing the possibility for contradictions in said assertions of fact to occur.
That database constraints are really nothing else than business rules, expressed in a formal way and in terms of the components of the database.
That really every and any business rule can be expressed as a database constraint.
That therefore, it is possible for the DBMS to enforce any and every business rule you can imagine.
That there is a very important difference between logical design and physical design.
That SQL and SQL systems are, eurhm, not really helpful (and that's putting it mildly), in supporting developers to recognise this important distinction.
That SQL and SQL systems are, eurhm, significantly deficient (and that's putting it mildly), in their support for database constraints.
That these latter two examples are a very good illustration of the importance of the difference between a model (Codd's RM) and its implementation (some particular SQL system). As far as relational database technology is concerned, the latters deviate ever more propostrously from the former.
And whatever else I forgot to remember.

Resources