Is Crate a relational DB?

At first I thought it wasn't a relational DB, but after reading that I can join tables, and seeing what's written on their site https://crate.io/overview/ (see Use cases), I'm not sure.
I was especially confused by this sentence:
CrateDB is based on a NoSQL architecture, but features standard SQL.
from https://crate.io/overview/high-level-architecture/

Going by Codd's 12 rules (which have been used to identify relational databases), CrateDB is not a relational database - yet. CrateDB's eventual consistency model does not prohibit it from becoming one.
Rule 0: For any system that is advertised as, or claimed to be, a relational data base management system, that system must be able to manage data bases entirely through its relational capabilities.
CrateDB has no interface other than SQL through which data can be inserted, retrieved, or updated.
Rule 1: All information in a relational data base is represented explicitly at the logical level and in exactly one way — by values in tables.
Exactly what can be found in CrateDB.
Rule 2: Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
This is strictly enforced. Access through primary keys will even give you read-after-write consistency.
Rule 3: Null values (distinct from the empty character string or a string of blank characters and distinct from zero or any other number) are supported in fully relational DBMS for representing missing information and inapplicable information in a systematic way, independent of data type.
CrateDB supports null.
Rule 4: The data base description is represented at the logical level in the same way as ordinary data, so that authorized users can apply the same relational language to its interrogation as they apply to the regular data.
CrateDB has, among other meta-tables, Information Schema tables.
Rule 5: A relational system may support several languages and various modes of terminal use (for example, the fill-in-the-blanks mode). However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all of the following items:
Data definition.
View definition.
Data manipulation (interactive and by program).
Integrity constraints.
Authorization.
Transaction boundaries (begin, commit and rollback).
CrateDB supports the data definition and data manipulation parts, and only a single integrity constraint (the primary key). This is definitely incomplete.
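The supported subset - data definition, data manipulation, and the primary-key constraint - can be sketched like this (SQLite is used here purely for illustration; CrateDB's SQL dialect differs in detail):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data definition, including the one integrity constraint (primary key).
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Data manipulation.
cur.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")

# The constraint is enforced: a duplicate primary key is rejected.
try:
    cur.execute("INSERT INTO users (id, name) VALUES (1, 'bob')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

View definition, authorization, and transaction boundaries are the parts listed above as missing.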
Rule 6: All views that are theoretically updatable are also updatable by the system.
CrateDB does not support views yet.
Rule 7: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update and deletion of data.
CrateDB currently only does that for data retrieval...
Rule 8: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representations or access methods.
CrateDB's use of SQL allows for this; performance/storage level improvements are even delivered via system upgrades.
Rule 9: Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.
Parts of this are still missing (the views, inserts/updates on joins). However for retrieving data, this is already the case.
Rule 10: Integrity constraints specific to a particular relational data base must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
This is quite tricky for a distributed database, specifically the foreign key constraints. CrateDB only supports primary key constraints for now.
Rule 11: A relational DBMS has distribution independence.
In CrateDB any kind of sharding/partitioning/distribution is handled transparently for the user. Any kinds of constraints/settings for data distribution are applied on the data definition level.
Rule 12: If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time).
One could argue that COPY FROM directly violates this rule, since no type checking and conversion happens underneath. However, there is no other command/language/API that would allow data manipulation.
While CrateDB certainly has some catching up to do, there is no reason why it wouldn't become a relational database in this sense soon. Its SQL support may not be on par with Oracle's or Postgres' but many people can live without some very use-case specific features.
As said above, none of these rules is directly violated; some are simply not yet implemented in a satisfactory manner, so there is no reason why CrateDB can't become a fully relational database eventually.
(Disclaimer: I work there)

Since the beginning of the relational model, the three main components that a system must have to be considered relational are (applying Codd's three-component definition of "data model" to the relational model):
data is presented as relations (tables)
manipulation is via relation and/or logic operators/expressions
integrity is enforced by relation and/or logic operators/expressions
Also, a multi-user DBMS has been understood to support apparently atomic persistent transactions while benefiting from implementation via overlapped execution (ACID), and a distributed DBMS has been understood to support an apparent single database while benefiting from implementation at multiple sites.
By these criteria CrateDB is not relational.
It has tables, but its manipulation of tables is extremely limited and it has almost no integrity functionality. Re manipulation, it allows querying for rows of a table meeting a condition (including aggregation), and it allows joining multiple tables, but that's not optimized, even for equijoin. Re constraints, its only functionality is column typing, primary keys and non-null columns. It uses a tiny subset of SQL.
See the pages at your link re Supported Features and Standard SQL Compliance as addressed in:
Crate SQL
Data Definition
Constraints (PRIMARY KEY Constraint, NOT NULL Constraint)
Indices
Data Manipulation
Querying Crate
Retrieving Data (FROM Clause, Joins)
Joins
Crate SQL Syntax Reference
As usual with non-relational DBMSs, their documentation does not reflect an understanding or appreciation of the relational model or other fundamental DBMS functionality.

CrateDB is a distributed SQL database. The underlying technology is similar to what so-called NoSQL databases typically use (shared-nothing architecture, columnar indexes, eventual consistency, support for semi-structured records), but CrateDB makes it accessible via a traditional SQL interface.
So, therefore: yes, CrateDB is somewhat of a relational SQL DB.

Related

Does SQLXML break 1NF?

I see that Oracle, DB2 and SQL Server now support a new XML column type. I'm developing with DB2, and from a database-design standpoint you can break 1NF if the XML contains a list.
Am I wrong to assume that SQLXML can break 1NF?
The relational model is orthogonal to types and places no particular limitations on type complexity. A type could be arbitrarily complex, perhaps containing documents, images, video, etc, as long as all relational operations are supported for relations containing that type. First Normal Form is really just the definition of what a relation schema is, so in principle XML types are permissible by 1NF.
Oracle, DB2 and Microsoft SQL Server are not truly relational however and don't always represent relations and relational operations faithfully. For example SQL Server doesn't support comparison between XML values which means operations like σ(x=x)R or even π(x)R are not possible if x is an XML column. I haven't tried the same with DB2 and Oracle. It is moot whether such tables can properly be said to satisfy 1NF since the XML is implemented as "special" data that doesn't behave as we expect data to behave in relations. Given such limitations I think the important question is whether the proprietary XML type in your chosen DBMS is actually fit for your purposes at all.
The SQL standard defines in its part 14 the XML data type, its semantics and functions around that data type ("SQL/XML"). You could "legally" store a few bytes in the XML column or stuff an entire database into a single XML value. It is up to the user and yes, it breaks classic database design. However, if the rest of the database is in 1NF and the XML-typed column is used only for some special payloads (app data, configurations, legal docs, digital signatures, ...) they make a great combination.
There are already other data types and SQL features that allow breaking 1NF. As above, it is up to the user.

Difference between DBMS and RDBMS with some example tools?

What is the difference between a DBMS and an RDBMS, with some example tools? Why can't we just use a DBMS instead of an RDBMS, or vice versa?
A relational DBMS will expose to its users "relations, and nothing else". Other DBMS's will violate that principle in various ways. E.g. in IDMS, you could do ACCEPT <hostvar> FROM CURRENCY, and this would expose the internal record id of the "current record" to the user, violating the "nothing else".
A relational DBMS will allow its users to operate exclusively at the logical level, i.e. work exclusively with assertions of fact (which are represented as tuples). Other DBMS's made/make their users operate more at the "record" level (too "low" on the conceptual-logical-physical scale) or at the "document" level (in a certain sense too "high" on that same scale, since a "document" is often one particular view of a multitude of underlying facts).
A relational DBMS will also offer facilities for manipulation of the data, in the form of a language that supports the operations of the relational algebra. Other DBMS's, seeing as they don't support relations to boot, obviously cannot build their data manipulation facilities on relational algebra, and as a consequence the data manipulation facilities/language is mostly ad-hoc. On the "too low" end of the spectrum, this forces DBMS users to hand-write operations such as JOIN again and again and again. On the "too high" end of the spectrum, it causes problems of combinatorial explosion in language complexity/size (the RA has some 4 or 5 primitive operators and that's all it needs - can you imagine 4 or 5 operators that will allow you to do just any "document transform" anyone would ever want to do ?)
(Note very carefully that even SQL systems violate basic relational principles quite seriously, so "relational DBMS" is a thing that arguably doesn't even exist, except then in rather small specialized spaces, see e.g. http://www.thethirdmanifesto.com/ - projects page.)
DBMS: a database management system; here we can store some data and retrieve it.
Imagine a single table: save and read.
RDBMS: a relational database management system; here you can join several tables together and query related data (say, the data for a particular user or a particular order, not all users or all orders).
Normalization comes into play in an RDBMS: we don't need to store repeated data again and again. We can store it in one table and reference it by id from another table, which makes updates easier; for reading, we can join the two tables and get what we want.
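That normalization point can be made concrete with a small sketch (SQLite shown; the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The repeated data (the user's name) is stored once, in one table;
# the orders table references it by id.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    item TEXT)""")
cur.execute("INSERT INTO users VALUES (1, 'fred')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 'book'), (2, 1, 'pen')])

# For reading, join both tables and get what we want.
rows = cur.execute("""
    SELECT u.name, o.item
    FROM orders o JOIN users u ON u.id = o.user_id
    ORDER BY o.id""").fetchall()
print(rows)  # -> [('fred', 'book'), ('fred', 'pen')]
```

An update to the user's name now touches a single row instead of every order.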
DBMS:
DBMS applications store data as files. In a DBMS, data is generally stored in either a hierarchical or a navigational form. Normalization is not present in a DBMS.
RDBMS:
RDBMS applications store data in tabular form. In an RDBMS, the tables have an identifier called the primary key, and the data values are stored as tables. Normalization is present in an RDBMS.

Efficient persistence strategy for many-to-many relationship

TL;DR: should I use an SQL JOIN table or Redis sets to store large amounts of many-to-many relationships
I have in-memory object graph structure where I have a "many-to-many" index represented as a bidirectional mapping between ordered sets:
group_by_user | user_by_group
--------------+---------------
louis: [1,2] | 1: [louis]
john: [2,3] | 2: [john, louis]
| 3: [john]
The basic operations that I need to be able to perform are atomic "insert at" and "delete" operations on the individual sets. I also need to be able to do efficient key lookup (e.g. lookup all groups a user is a member of, or lookup all the users who are members of one group). I am looking at a 70/30 read/write use case.
My question is: what is my best bet for persisting this kind of data structure? Should I be looking at building my own optimized on-disk storage system? Otherwise, is there a particular database that would excel at storing this kind of structure?
Before you read any further: stop being afraid of JOINs. This is a classic case for using a genuine relational database such as Postgres.
There are a few reasons for this:
This is what a real RDBMS is optimized for
The database can take care of your integrity constraints as a matter of course
This is what a real RDBMS is optimized for
You will have to push "join" logic into your own code
This is what a real RDBMS is optimized for
You will have to deal with integrity concerns in your own code
This is what a real RDBMS is optimized for
You will wind up reinventing database features in your own code
This is what a real RDBMS is optimized for
Yes, I am being a little silly, but that's because I'm trying to drive home a point.
I am beating on that drum so hard because this is a classic case that has a readily available, extremely optimized and profoundly stable tool custom designed for it.
When I say that you will wind up reinventing database features I mean that you will start having to make basic data management decisions in your own code. For example, you will have to choose when to actually write the data to disk, when to pull it, how to keep track of the highest-frequency data and cache it in memory (and how to manage that cache), etc. Baking performance assumptions into your code early can give your whole codebase cancer without you noticing it -- and if those assumptions prove false later, changing them can require a major rewrite.
If you store the data on either end of the many-to-many relationship in one store and the many-to-many map in another store you will have to:
Locate the initial data on one side of the mapping
Extract the key(s)
Query for the key(s) in the many-to-many handler
Receive the response set(s)
Query whatever is relevant from your other storage based on the result
Build your answer for use within the system
If you structure your data within an RDBMS to begin with your code will look more like:
Run a pre-built query indexed over whatever your search criteria is
Build an answer from the response
JOINs are a lot less scary than doing it all yourself -- especially in a concurrent system where other things may be changing in the course of your ad hoc locate-extract-query-receive-query-build procedure (which can be managed, of course, but why manage it when an RDBMS is already designed to manage it?).
JOIN isn't even a slow operation in decent databases. I have some business applications that join 20 tables constantly over fairly large tables (several million rows) and it zips right through them. It is highly optimized for this sort of thing which is why I use it. Oracle does well at this (but I can't afford it), DB2 is awesome (can't afford that, either), and SQL Server has come a long way (can't afford the good version of that one either!). MySQL, on the other hand, was really designed with the key-value store use-case in mind and matured in the "performance above all else" world of web applications -- and so it has some problems with integrity constraints and JOINs (but has handled replication very well for a very long time). So not all RDBMSes are created equal, but without knowing anything else about your problem they are the kind of datastore that will serve you best.
Even slightly non-trivial data can make your code explode in complexity -- hence the popularity of database systems. They aren't (supposed to be) religions, they are tools to let you separate a generic data-handling task from your own program's logic so you don't have to reinvent the wheel every project (but we tend to anyway).
But
Q: When would you not want to do this?
A: When you are really building a graph and not a set of many-to-many relations.
There is another type of database designed specifically to handle that case. You need to keep in mind, though, what your actual requirements are. Is this data ephemeral? Does it have to be correct? Do you care if you lose it? Does it need to be replicated? etc. Most of the time the requirements are relatively trivial and the answer is "no" to these sorts of higher-flying questions -- but if you have some special operational needs then you may need to take them into account when making your architectural decision.
If you are storing things that are actually documents (instead of structured records) on the one hand, and need to track a graph of relationships among them on the other then a combination of back-ends may be a good idea. A document database + a graphing database glued together by some custom code could be the right thing.
Think carefully about which kind of situation you are actually facing instead of assuming you have case X because it is what you are already familiar with.
In relational databases (e.g. SQL Server, MySQL, Oracle...), the typical way of representing such data structures is with a "link table". For example:
users table:
userId (primary key)
userName
...
groups table:
groupId (primary key)
...
userGroups table: (this is the link table)
userId (foreign key to users table)
groupId (foreign key to groups table)
compound primary key of (userId, groupId)
Thus, to find all groups with users named "fred", you might write the following query:
SELECT g.*
FROM users u
JOIN userGroups ug ON ug.userId = u.userId
JOIN groups g ON g.groupId = ug.groupId
WHERE u.userName = 'fred'
To achieve atomic inserts, updates, and deletes of this structure, you'll have to execute the queries that modify the various tables inside transactions. ORMs such as Entity Framework (for .NET) will typically handle this for you.
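A sketch of such a transactional modification, following the schema above (SQLite used for illustration; "groups" is quoted because it is an SQL keyword):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userId INTEGER PRIMARY KEY, userName TEXT)")
conn.execute('CREATE TABLE "groups" (groupId INTEGER PRIMARY KEY)')
conn.execute("""CREATE TABLE userGroups (
    userId INTEGER REFERENCES users(userId),
    groupId INTEGER REFERENCES "groups"(groupId),
    PRIMARY KEY (userId, groupId))""")

# All three inserts commit together or roll back together.
with conn:  # the context manager opens a transaction and commits on success
    conn.execute("INSERT INTO users VALUES (1, 'fred')")
    conn.execute('INSERT INTO "groups" VALUES (10)')
    conn.execute("INSERT INTO userGroups VALUES (1, 10)")

# Look up all groups for the user named 'fred' via the link table.
rows = conn.execute("""
    SELECT g.groupId
    FROM users u
    JOIN userGroups ug ON ug.userId = u.userId
    JOIN "groups" g ON g.groupId = ug.groupId
    WHERE u.userName = 'fred'""").fetchall()
print(rows)  # -> [(10,)]
```

If any statement inside the `with` block raises, the whole transaction is rolled back, so the two sides of the relationship can never get out of sync.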

In ACID is C the responsibility of the DBMS implementation?

I am confused on the very concept of ACID.
All references/textbooks describe ACID as a set of properties that the database system is expected/required to maintain in order to preserve data integrity.
But the C in ACID, i.e. Consistency, does not really seem to be a responsibility of the database.
Some references (e.g. Silberschatz) describe it in the sense that the transaction's own code (if run in isolation) leaves the database in a consistent state, i.e. the transaction code is correct; that makes it the application programmer's responsibility, not the DBMS's.
Other references are vague, describing it merely as "leaving the database in a consistent state".
So which is correct?
In transactions, the technical term consistent means "satisfying all known integrity constraints".
By definition, integrity constraints are declared to the dbms, and are expected to be enforced by the dbms. Why? Because if application programmers were responsible, every application programmer might make a different decision about what constitutes a consistent update operation.
For example, one application programmer might decide that every unit price that's more than $10,000 is in error. Another application programmer might decide that every unit price that's more than $12,000 is in error--but that $11,000 is a valid unit price. One programmer's work will accept a unit price of $11,000; the other's will kick it out as an error.
Relational databases resolve that inconsistency by centralizing decisions about unit prices. In this particular case, that decision might be centralized in the form of a CHECK() constraint on the "unit_price" column. Once that integrity constraint is in place, every update operation will result in a consistent value in the "unit_price" column. It doesn't matter
how many application programmers there are,
how well (or how poorly) trained they are,
how many different applications there are,
what languages they're written in,
whether the sleep-deprived DBA is executing an update from a command-line console.
All those update operations will find it impossible to set the "unit_price" column to a value that fails to satisfy all the known integrity constraints. That means all those update operations will result in a state that satisfies all the known integrity constraints. That's the definition of consistent.
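The unit-price scenario can be sketched like this (SQLite shown purely for illustration; the $10,000 limit is the hypothetical rule from the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The decision about valid unit prices is centralized in the schema,
# not scattered across application programs.
conn.execute("""CREATE TABLE products (
    name TEXT PRIMARY KEY,
    unit_price NUMERIC CHECK (unit_price > 0 AND unit_price <= 10000))""")

conn.execute("INSERT INTO products VALUES ('widget', 9500)")  # consistent

# Any update path -- any program, any programmer -- hits the same constraint.
try:
    conn.execute("INSERT INTO products VALUES ('gadget', 11000)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

No application code gets to decide what a valid unit price is; the DBMS enforces the declared constraint on every update operation.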
In the relational model, integrity constraints and business rules mean the same thing.¹ Some people use business rules to mean something else; you have to determine their meaning by careful reading.
Integrity constraints (or business rules) should never be under the control of end users if your data is important. Constraints can easily be changed by a DBA, usually with a single SQL statement. But knowing which statement to execute and when to execute it is not in most end user's skill set.
The terms consistent and correct mean two different things. A database state can be consistent without being correct. A unit price that is within the range of a CHECK() constraint might still be the wrong price, a person's name might be misspelled, etc.
Neither the relational model nor the SQL standards are defined by a particular SQL implementation. They're especially not defined by MySQL's behavior, which is just barely SQL. (CHECK constraints parsed but not enforced, indeterminate results using GROUP BY, no analytic functions, nonstandard quoting in backticks, etc.)
If I had Database System Concepts in front of me, and I wanted to understand what the author meant by "Ensuring consistency for an individual transaction is the responsibility of the programmer who codes the transaction.", I'd ask myself these questions.
Is there any errata that refers to the section in question?
What does the author say about ACID properties in earlier editions? In later editions?
Does the author distinguish correct transactions (they have the right data) from consistent transactions (they don't violate any of the integrity constraints)?
Does the author always say that consistency is the responsibility of the programmer? (I'd say that correctness is; consistency isn't.)
What programmer is the author talking about? (Application programmer writing a Rails app? DBA writing a stored proc? Hard-core programmer writing the transactional subsystem for PostgreSQL?)
Does the author never say that consistency is the responsibility of the dbms?
Does the author ever say that, of the four ACID properties of transactions, A, I, and D are the responsibility of the dbms, but C isn't? (I'll bet he doesn't.)
Is consistency of an individual transaction different from consistency of a database? Where else does the author talk about that?
¹ "Centralized control of the database can help in avoiding such problems--insofar as they can be avoided--by permitting the data administrator to define, and the DBA to implement, integrity constraints (also known as business rules) to be checked whenever any update operation is performed." An Introduction to Database Systems, 7th ed., C.J. Date, p. 18.

Database with support for JOINs and a flexible data schema

For our project, we need a database that supports JOINs and has the ability to easily add and modify attributes of the entity (schema-less/free). Key points:
The system is designed to work with customers (CRM)
Basic entities: User, Customer, Case, Case Interaction, Order
Currently in the database there are ~200k customers and ~250k orders
Customer entity contains 15-20 optional attributes that are most often not filled
About 100 new cases a day
The data is synchronized with several other sources in the background
Requirements (high to low priority):
Ability to implement search/sort by related entities, e.g. Case by linked Customer name (support JOINs)
Flexibility to change the schema of the data, and not having to store NULL for a large number of attributes
Performance
An ORM for Python that supports change tracking and can persist only the changes to the database
What we've tried:
MongoDB does not satisfy requirement 1.
PostgreSQL with all the attributes in one table does not satisfy requirement 2.
PostgreSQL with a separate table for each attribute, or an EAV design, does not satisfy requirement 3 (a lot of slow joins), but seems a better solution than the others.
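For reference, the EAV variant mentioned in point 3 looks roughly like this; every attribute read or sort needs an extra join against the attribute table, which is where the slowness comes from (SQLite and the table names here are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
# One row per (entity, attribute, value) triple instead of one column
# per attribute -- optional attributes simply have no row.
conn.execute("""CREATE TABLE customer_attrs (
    customer_id INTEGER REFERENCES customers(id),
    attr TEXT,
    value TEXT,
    PRIMARY KEY (customer_id, attr))""")

conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO customer_attrs VALUES (1, 'phone', '555-0100')")

# Reading even a single attribute already requires a join.
row = conn.execute("""
    SELECT c.name, a.value
    FROM customers c
    JOIN customer_attrs a ON a.customer_id = c.id AND a.attr = 'phone'
""").fetchone()
print(row)  # -> ('Acme', '555-0100')
```

Filtering or sorting on several attributes multiplies these joins, one per attribute, which is what makes EAV slow at scale.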
Can you suggest any database or design of the system that will meet our needs?
Datomic might be worth checking out (http://www.datomic.com/). It satisfies requirements 1-3, and although there's no python ORM, there is a REST API.
Datomic is based on an Entity-Attribute-Value schema (it's not quite schema-free - you need to specify a name and type for each attribute - but any entity can have any attribute). It is transactional and has support for joins, unlike some of the other flexible "NoSQL" solutions. Interestingly, it also has first-class support for time (e.g. what is the history of this entity, what did the database look like at time t, etc.), which might be useful if you're tracking cases and interactions.
Queries are based on Datalog, which queries by unification. Query by unification looks a bit odd at first but is brilliant once you get used to it.
For example, a query to find cases by linked customer name would be something like this:
[:find ?x
 :in $
 :where [?x :case/linked-customers ?c]
        [?c :customer/name "Barry"]]
The query engine looks in the database, and tries to satisfy the where clause by unifying all occurrences of a given variable. In this case, only ?c appears twice (the case has a linked customer c whose name is Barry), but queries can obviously get a lot more complex. The $ here represents the database.
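For comparison, the same lookup in a conventional SQL schema would be a two-table join (the schema below is hypothetical, simplified to one linked customer per case, and sketched with SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE cases (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")
conn.execute("INSERT INTO customers VALUES (1, 'Barry')")
conn.execute("INSERT INTO cases VALUES (100, 1)")

# The datalog unification on ?c corresponds to this join condition:
rows = conn.execute("""
    SELECT cs.id
    FROM cases cs JOIN customers c ON c.id = cs.customer_id
    WHERE c.name = 'Barry'""").fetchall()
print(rows)  # -> [(100,)]
```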
You may want to consider storing the "flexible" part as XML. Some databases, e.g. DB2, allow XML indexing so lookup performance should be as good as with the relational data store. DB2 Express-C is free and does not have an artificial limit on the database size.
Update: Since 2015, DB2 Express-C limits the database user data volume to 15 TB, which should still be plenty.
