Flat data with struct type vs document store - database

I know this is a 'soft' question, which is usually frowned upon on SO, but I have been using BigQuery to do data analysis on (obviously) flat data, which contains both structs and repeated data. Let's just use a very basic example; a row might look like this:
ID
Title (str)
ReleaseYear (int)
Genres (str[])
Credits (struct[])
And an example piece of data might look like:
{
  "ID": "T-1997",
  "Title": "Titanic",
  "ReleaseYear": 1997,
  "Genres": ["Drama", "Romance"],
  "Credits": {
    "Actors": ["Leonardo DiCaprio", "Kate Winslet"],
    "Directors": ["James Cameron"]
  }
}
My question is basically what type of operations or queries can be done in a native document store, such as MongoDB or CouchBase, that couldn't be done in a relational DB that supports arbitrarily-nested data. In other words, my assumption (and I hope I'm wrong or misguided) is that as long as a DB supports structs, it can do everything that a document-store can do. If not, what are some places where it is either: (1) something that can be done in MongoDB (or any other document-store) that cannot be done in BigQuery (or any other database that supports structs)? and (2) something that can be done much more easily in MongoDB than in a relational DB?

what type of operations or queries can be done in a native document
store, such as MongoDB or CouchBase, that couldn't be done in a
relational DB that supports arbitrarily-nested data.
Even if it does support arbitrarily nested data, BigQuery allows more limited nesting than MongoDB; MongoDB supports more levels of nesting.
In BigQuery, your schema cannot contain more than 15 levels of nested STRUCTs. MongoDB supports up to 100 levels of nesting for BSON documents.
In other words, my assumption (and I hope I'm wrong or misguided) is
that as long as a DB supports structs, it can do everything that a
document-store can do.
Not exactly - nested columns are columns within columns. But sharding in an RDBMS is a complex endeavor compared to a NoSQL database like Mongo. Technically you can do it, but it wasn't designed for the same purpose. It's like using a wrench as a hammer - sure you can, but its purpose was something different. You should use the right tool for the right purpose.
If not, what are some places where it is either: (1) something that
can be done in MongoDB (or any other document-store) that cannot be
done in BigQuery (or any other database that supports structs)? and
(2) something that can be done much more easily in MongoDB than in a
relational DB?
The crux of the matter is, an RDBMS may tack on features to "technically" allow you to do some things that you can do in a NoSQL database. But that doesn't mean it will work just as well. For example, because of the features that make an RDBMS an RDBMS (ACID compliance, transactions etc), there will always be an additional performance hit compared to a NoSQL database. If an RDBMS removes these features, then it is no longer an RDBMS!
This answer illustrates how MongoDB achieves better performance because it doesn't need to support RDBMS features:
https://softwareengineering.stackexchange.com/questions/54373/when-would-someone-use-mongodb-or-similar-over-a-relational-dbms
- MongoDB has a lower latency per query & spends less CPU time per query because it is doing a lot less work (e.g. no joins, transactions). As a result, it can handle a higher load in terms of queries per second and is thus often used if you have a massive # of users.
- MongoDB is easier to shard (use in a cluster) because it doesn't have to worry about transactions and consistency.
- MongoDB has a faster write speed because it does not have to worry about transactions or rollbacks (and thus does not have to worry about locking).
- MongoDB does not have a schema, in case you have a special use case that can take advantage of that.
Another feature is sharding - sharding is easier with MongoDB because it doesn't need to support many of the features that make an RDBMS an RDBMS, such as being ACID compliant. In contrast, sharding is complex for an RDBMS precisely because it must remain ACID compliant.
Consider the analogy of a speed boat versus an amphibious car.
The speed boat would outperform the amphibious car in the water 10/10 times. The amphibious car technically can navigate in water, but it wasn't designed to, hence it is much slower and unsuited for the purpose.
Likewise, compare the aerodynamics of the speed boat with those of a purpose-built car. Even if you tacked wheels onto the boat, it's not going to perform as well as the car on land. (As an analogy, you could say that NoSQL databases don't do joins - you have to implement them yourself - but will that perform better than an RDBMS for join-heavy operations?)
The point I'm making with the analogies, is that each kind of database was initially designed for a specific goal, and over time features have been added to try and make it solve problems it was not designed for (hence it doesn't do it as well as something specifically designed for that purpose).
Hence in your question, even if BigQuery or some RDBMS can do something, it doesn't mean that you should use them for the job. The same applies for NoSQL databases. You should use the best tool for the job.

Disclaimer: I don't have experience with MongoDB or CouchBase. My answer is based on BigQuery's STRUCT capabilities.
Performance
BigQuery's STRUCT is optimized for querying. For example, if you query select a.nested_b.nested_c.nested_d from table_t, the query only scans data for the leaf STRUCT field nested_d, so it is fast and cheap.
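As a sketch using the movie row from the question (and assuming it lives in a hypothetical table called movies), selecting a single nested field keeps the scan narrow:
-- only the ID and Credits.Directors columns are scanned, not the whole row
SELECT Credits.Directors
FROM movies
WHERE ID = 'T-1997'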
Usability
If your data is write-once or append-only, then a STRUCT column is comparable to a document store, AFAIK.
But if you want to update only a certain nested field later, nested STRUCTs make that pretty difficult, because there is no way to update a single item in a REPEATED field: you have to load the whole array, scan and change it, and repack it to update the column. You will be writing something like:
UPDATE table
SET Credits.Actors = (SELECT ARRAY_AGG(...) FROM UNNEST(Credits.Actors) WHERE ...)
WHERE ...
It may become a bigger problem when there is an array of structs of arrays (and even more nesting levels). Based on my understanding of document stores, updating a single nested field of a document should be easier than this. Basically, this is the price you have to pay to get the performance benefit mentioned earlier.
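For illustration, here is roughly what that pattern looks like with the movie schema from the question - a sketch that assumes a hypothetical movies table and removes one actor from the nested Credits.Actors array:
UPDATE movies
SET Credits.Actors = (
  -- rebuild the entire array just to drop a single element
  SELECT ARRAY_AGG(actor)
  FROM UNNEST(Credits.Actors) AS actor
  WHERE actor != 'Kate Winslet'
)
WHERE ID = 'T-1997'
(One extra wrinkle: if the filter removes every element, ARRAY_AGG returns NULL rather than an empty array.)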

Related

Using Doctrine 2 for large data models

I have a legacy in-house human resources web app that I'd like to rebuild using more modern technologies. Doctrine 2 is looking good. But I've not been able to find articles or documentation on how best to organise the Entities for a large-ish database (120 tables). Can you help?
My main problem is the Person table (of course! it's an HR system!). It currently has 70 columns. I want to refactor that to extract several subsets into one-to-one sub tables, which will leave me with about 30 columns. There are about 50 other supporting one-to-many tables called person_address, person_medical, person_status, person_travel, person_education, person_profession etc. More will be added later.
If I put all the Doctrine associations (http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/working-with-associations.html) in the Person entity class, along with the set/get/add/remove methods for each, the original 30 columns and their methods, and some supporting utility functions, then the Person entity is going to be 1000+ lines long and a nightmare to test.
FWIW I plan to create a PersonRepository to handle the common bulk queries, a PersonProfessionRepository for the bulk queries / reports on that sub table etc, and Person*Services which will contain some of the more complex business logic where needed. So organising the rest of the app logic is fine: this is a question about how to correctly organise lots of sub-table Entities with Doctrine that all have relationships / associations back to one primary table. How do I avoid bloating out the Person entity class?
Identifying types of objects
It sounds like you have a nicely normalized database and I suggest you keep it that way. Removing columns from the people table to create separate tables for one-to-one relations isn't going to help performance or maintainability.
The fact that you recognize several groups of properties in the Person entity might indicate you have found cases for a Value Object. Even some of the one-to-many tables (like person_address) sound more like Value Objects than Entities.
Starting with Doctrine 2.5 (which is not yet stable at the time of this writing) it will support embedding single Value Objects. Unfortunately we will have to wait for a future version for support of collections of Value objects.
Putting that aside, you can mimic embedding Value Objects; Ross Tuck has blogged about this.
Lasagna Code
Your plan of implementing an entity, repository, service (and maybe controller?) for Person, PersonProfession, etc sounds like a road to Lasagna Code.
Without extensive knowledge about your domain, I'd say you want to have an aggregate Person, of which the Person entity is the aggregate root. That aggregate needs a single repository. (But maybe I'm off here and being simplistic, as I said, I don't know your domain.)
Creating a service for Person (and other entities / value objects) indicates data-minded thinking. For services it's better to think of behavior. Think of what kind of tasks you want to perform, and group coherent sets of tasks into services. I suspect that for a HR system you'll end up with many services that evolve around your Person aggregate.
Is Doctrine 2 suitable?
I would say: yes. Doctrine itself has no problems with large amounts of tables and large amounts of columns. But performance highly depends on how you use it.
OLTP vs OLAP
For OLTP systems an ORM can be very helpful. OLTP involves many short transactions, writing a single (or short list) of aggregates to the database.
For OLAP systems an ORM is not suited. OLAP involves many complex analytical queries, usually resulting in large object-graphs. For these kinds of operations, native SQL is much more convenient.
Even in case of OLAP systems Doctrine 2 can be of help:
You can use DQL queries (instead of native SQL) to use the power of your mapping metadata. Then use scalar or array hydration to fetch the data (a sketch follows below).
Doctrine also supports arbitrary joins, which means you can join entities that are not associated with each other according to the mapping metadata.
And you can make use of the NativeQuery object, with which you can map the results to whatever you want.
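To make the first of these points concrete, here is a hedged DQL sketch of a reporting-style query; the Person entity comes from the question, but the educations association and the field names are assumptions. Fetching it with scalar or array hydration (e.g. getArrayResult()) skips building full entity objects, which is what you want for this kind of report:
SELECT p.id, p.lastName, COUNT(e.id) AS educationCount
FROM Person p
LEFT JOIN p.educations e
GROUP BY p.id, p.lastName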
I think a HR system is a perfect example of where you have both OLTP and OLAP. OLTP when it comes to adding a new Person to the system for example. OLAP when it comes to various reports and analytics.
So there's nothing wrong with using an ORM for transactional operations, while using plain SQL for analytical operations.
Choose wisely
I think the key is to carefully choose when to use what, on a case by case basis.
Hydrating entities is great for transactional operations. Make use of lazy loading associations which can prevent fetching data you're not going to use. But also choose to eager load certain associations (using DQL) where it makes sense.
Use scalar or array hydration when working with large data sets. Data sets usually grow where you're doing analytical operations, where you don't really need full blown entities anyway.
@Quicker makes a valid point by saying you can create specialized View objects. You can fetch only the data you need in specific cases and manually mold that data into objects. This is accompanied by his point not to bloat the user interface with options a user with a certain role doesn't need.
A technique you might want to look into is Command Query Responsibility Segregation (CQRS).
I understood that you have a fully normalized persons table and now you are asking how best to denormalize it.
As long as you do not hit any technical constraints (such as a maximum of 64 KB), I find 70 columns definitely not overloaded for a persons table in an HR system. Do yourself a favour and do not segment that information, for the following reasons:
selects potentially become more complex
each extracted table needs (an) extra index/indices, which increases your overall memory utilization -> this sounds like a minor issue as disk is cheap, but keep in mind that, via caching, the RAM-to-disk utilization ratio determines your performance to a huge extent
changes become more complex as extra relations demand extra care
since any edit/update/read view can be restricted to deal only with slices of your physical data from the tables, no "cosmetics" pressure arises from the end-user (or even admin) perspective
In summary, the table subsetting causes lots of issues and effort but adds little if any value.
Btw, databases are optimized for data storage; millions of rows and some dozens of columns are no-brainers at that end.

Efficient persistence strategy for many-to-many relationship

TL;DR: should I use an SQL JOIN table or Redis sets to store large amounts of many-to-many relationships
I have in-memory object graph structure where I have a "many-to-many" index represented as a bidirectional mapping between ordered sets:
group_by_user  |  user_by_group
---------------+------------------
louis: [1,2]   |  1: [louis]
john:  [2,3]   |  2: [john, louis]
               |  3: [john]
The basic operations that I need to be able to perform are atomic "insert at" and "delete" operations on the individual sets. I also need to be able to do efficient key lookup (e.g. lookup all groups a user is a member of, or lookup all the users who are members of one group). I am looking at a 70/30 read/write use case.
My question is: what is my best bet for persisting this kind of data structure? Should I be looking at building my own optimized on-disk storage system? Otherwise, is there a particular database that would excel at storing this kind of structure?
Before you read any further: stop being afraid of JOINs. This is a classic case for using a genuine relational database such as Postgres.
There are a few reasons for this:
- This is what a real RDBMS is optimized for
- The database can take care of your integrity constraints as a matter of course
- This is what a real RDBMS is optimized for
- You will have to push "join" logic into your own code
- This is what a real RDBMS is optimized for
- You will have to deal with integrity concerns in your own code
- This is what a real RDBMS is optimized for
- You will wind up reinventing database features in your own code
- This is what a real RDBMS is optimized for
Yes, I am being a little silly, but that's because I'm trying to drive home a point.
I am beating on that drum so hard because this is a classic case that has a readily available, extremely optimized and profoundly stable tool custom designed for it.
When I say that you will wind up reinventing database features I mean that you will start having to make basic data management decisions in your own code. For example, you will have to choose when to actually write the data to disk, when to pull it, how to keep track of the most frequently used data and cache it in memory (and how to manage that cache), etc. Baking performance assumptions into your code early can give your whole codebase cancer without you noticing it -- and if those assumptions prove false later, changing them can require a major rewrite.
If you store the data on either end of the many-to-many relationship in one store and the many-to-many map in another store you will have to:
Locate the initial data on one side of the mapping
Extract the key(s)
Query for the key(s) in the many-to-many handler
Receive the response set(s)
Query whatever is relevant from your other storage based on the result
Build your answer for use within the system
If you structure your data within an RDBMS to begin with your code will look more like:
Run a pre-built query indexed over whatever your search criteria is
Build an answer from the response
JOINs are a lot less scary than doing it all yourself -- especially in a concurrent system where other things may be changing in the course of your ad hoc locate-extract-query-receive-query-build procedure (which can be managed, of course, but why manage it when an RDBMS is already designed to manage it?).
JOIN isn't even a slow operation in decent databases. I have some business applications that join 20 tables constantly over fairly large tables (several millions of rows) and it zips right through them. It is highly optimized for this sort of thing which is why I use it. Oracle does well at this (but I can't afford it), DB2 is awesome (can't afford that, either), and SQL Server has come a long way (can't afford the good version of that one either!). MySQL, on the other hand, was really designed with the key-value store use-case in mind and matured in the "performance above all else" world of web applications -- and so it has some problems with integrity constraints and JOINs (but has handled replication very well for a very long time). So not all RDBMSes are created equal, but without knowing anything else about your problem they are the kind of datastore that will serve you best.
Even slightly non-trivial data can make your code explode in complexity -- hence the popularity of database systems. They aren't (supposed to be) religions, they are tools to let you separate a generic data-handling task from your own program's logic so you don't have to reinvent the wheel every project (but we tend to anyway).
But
Q: When would you not want to do this?
A: When you are really building a graph and not a set of many-to-many relations.
There is another type of database designed specifically to handle that case. You need to keep in mind, though, what your actual requirements are. Is this data ephemeral? Does it have to be correct? Do you care if you lose it? Does it need to be replicated? etc. Most of the time requirements are relatively trivial and the answer is "no" to these sorts of higher-flying questions -- but if you have some special operational needs then you may need to take them into account when making your architectural decision.
If you are storing things that are actually documents (instead of structured records) on the one hand, and need to track a graph of relationships among them on the other, then a combination of back-ends may be a good idea. A document database + a graph database glued together by some custom code could be the right thing.
Think carefully about which kind of situation you are actually facing instead of assuming you have case X because it is what you are already familiar with.
In relational databases (e.g. SQL Server, MySQL, Oracle...), the typical way of representing such data structures is with a "link table". For example:
users table:
userId (primary key)
userName
...
groups table:
groupId (primary key)
...
userGroups table: (this is the link table)
userId (foreign key to users table)
groupId (foreign key to groups table)
compound primary key of (userId, groupId)
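Spelled out as DDL, that outline might look roughly like this (column types and sizes are assumptions; some engines want an explicit FOREIGN KEY clause instead of inline REFERENCES):
CREATE TABLE users (
    userId    INT PRIMARY KEY,
    userName  VARCHAR(100) NOT NULL
);

CREATE TABLE groups (
    groupId   INT PRIMARY KEY
);

CREATE TABLE userGroups (
    userId    INT NOT NULL REFERENCES users (userId),
    groupId   INT NOT NULL REFERENCES groups (groupId),
    PRIMARY KEY (userId, groupId)
);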
Thus, to find all groups with users named "fred", you might write the following query:
SELECT g.*
FROM users u
JOIN userGroups ug ON ug.userId = u.userId
JOIN groups g ON g.groupId = ug.groupId
WHERE u.userName = 'fred'
To achieve atomic inserts, updates, and deletes of this structure, you'll have to execute the queries that modify the various tables inside transactions. ORMs such as Entity Framework (for .NET) will typically handle this for you.
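For example, inserting a new user together with a group membership atomically might look like this (the literal values are made up, and the exact syntax varies slightly between BEGIN, BEGIN TRANSACTION and START TRANSACTION depending on the RDBMS):
BEGIN;
INSERT INTO users (userId, userName) VALUES (42, 'fred');
INSERT INTO userGroups (userId, groupId) VALUES (42, 7);  -- assumes group 7 already exists
COMMIT;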

Is polyglot persistence with a graph database for relationships a good idea?

I would like to know whether it is worth using a graph database to work specifically with relationships.
I intend to use a relational database for storing entities like "User", "Page", "Comment", "Post" etc.
But in most cases of a typical social-graph workload, I have to do deep traversals, which relational databases are not good at and which involve slow joins.
Example: Comment -(made_in)-> Post -(made_in)-> Page etc...
I'm thinking of doing something like this:
Example:
User id: 1
Query: Get all followers of user_id 1
Query Neo4j for all outgoing edges named "follows" for the user node with id 1
With the resulting list of IDs, query them on the users table:
SELECT *
FROM users
WHERE user_id IN (ids)
Is this slow?
I have seen this question Is it a good idea to use MySQL and Neo4j together?, but still cannot understand why the accepted answer says that it is not a good idea.
Thanks
Using Neo4j is a great choice of technologies for an application like yours, that requires deep traversals. The reason it's a good choice is two-fold: one is that the Cypher language makes such queries very easy. The second is that deep traversals happen very quickly, because of the way the data is structured in the database.
In order to reap both of these benefits, you will want to have both the relationships and the people (as nodes) in the graph. Then you'll be able to do a friend-of-friends query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
and a friend-of-friend-of-friend query as follows:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->()-[:friend]->fofof
RETURN john, fofof
...and so on. (Same idea for posts and comments, just replace the name.)
Using Neo4j alongside MySQL is fine, but I wouldn't do it in this particular way, because the code will be much more complex, and you'll lose too much time hopping between Neo4j and MySQL.
Best of luck!
Philip
In general, the more databases/systems/layers you've got, the more complex the overall setup and operating will be.
Think about all those tasks like synchronization, export/import, backup/archive etc. which become quite expensive if your database(s) grow in size.
People use polyglot persistence only if the benefits of having dedicated and specialized databases outweigh the drawbacks of having to cope with multiple data stores. For example, this can be the case if you have a large number of data items (activity or transaction logs, for example), each related to a user. It would probably make no sense to store all the information in a graph database if you're only interested in the connections between the data items. So you would be better off storing only the relations in the graph (where the nodes have just a pointer into the other database), and the data per item in a K/V store or the like.
For your example use case, I would go with only one database, namely Neo4j, because it's a graph.
As the other answers indicate, using Neo4j as your single data store is preferable. However, in some cases there might not be much choice in the matter, when you already have another database behind your product. I would just like to add that if this is the case, running Neo4j as your secondary database does work (the product I work on operates in this mode). You do have to work extra hard at figuring out what functionality you expect out of Neo4j, what kind of data you need for it, how to keep the data in sync, and the consequence of not always having real-time results. Most of our use cases can work with near-real-time results, so we are fine. But it may not be the case for your product. Still, to me, using Neo4j in this mode is preferable to running without it.
We are able to produce a lot of great graph-y stuff as a result of it.

How to improve the performance for indexing data in cassandra

Cassandra's CQL doesn't have anything like the LIKE clause in MySQL to search for more specific data in the database.
I have looked through some material and came up with some ideas:
1. Using Hadoop
2. Using a MySQL server as another database server
But are there any ways I can improve my Cassandra DB performance more easily?
Improving your Cassandra DB performance can be done in many ways, but I feel like you need to query the data efficiently, which has nothing to do with performance tweaks on the DB itself.
As you know, Cassandra is a NoSQL database, which means that when dealing with it, you are sacrificing query flexibility for fast reads/writes, scalability and fault tolerance. That means querying the data is slightly harder. There are many patterns which can help you query the data:
Know what you need in advance. Since querying with CQL is slightly less flexible than what you would find in an RDBMS engine, you can take advantage of the fast reads and writes and save the data you want to query in the proper format by duplicating it. Too complex?
Imagine you have a user entity that looks like that:
{
  "pk": "someTimeUUID",
  "name": "someName",
  "address": "address",
  "birthDate": "someBirthDate"
}
If you persist users like that, you will get a sorted list of users in the order they joined your db (i.e. the order in which you persisted them). Let's assume you want to get the same list of users, but only those who are named "John". It is possible to do that with CQL but slightly inefficient. What you could do here to amend this problem is to de-normalize your data by duplicating it in order to fit the query you are going to execute over it. You can read more about this here:
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
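For example, a rough CQL sketch of such a duplicated table (the table name, columns and types are assumptions): because the table is keyed by name, the "users named John" query becomes a simple lookup that is still ordered by join time.
CREATE TABLE users_by_name (
    name      text,
    joined    timeuuid,
    address   text,
    birthdate text,
    PRIMARY KEY (name, joined)
);

-- all users named "John", ordered by when they joined
SELECT * FROM users_by_name WHERE name = 'John';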
However, this approach is fine for simple queries, but for complex queries it is somewhat hard to achieve, and also, if you are unsure what you are going to query in advance, there is no way to store the data in the proper manner beforehand.
Hadoop comes to the rescue. As you know, you can use Hadoop's map-reduce to solve tasks involving a large amount of data, and Cassandra data, in my experience, can become very, very large. With Hadoop, to solve the above example, you would iterate over the data as it is, checking in each map call whether the user is named John and, if so, writing the record to the context.
Here is how the pseudocode would look:
// pseudocode for the map phase: emit only users named "John"
map(data) {
    if ("John".equals(data.getColumn("name"))) {
        context.write(data);
    }
}
At the end of the map phase, you would end up with a list of all users who are named John. You could put a time range (range slice) on the data you feed to Hadoop, which will give you
all the users who joined your database over a certain period and are named John. As you can see, here you are left with a lot more flexibility and you can do virtually anything. If the resulting data set is small enough, you could put it in some RDBMS as summary data or cache it somewhere so further queries for the same data can easily retrieve it. You can read more about Hadoop here:
http://hadoop.apache.org/

NoSQL DB and Reporting

I am in the architecture stage of an academic project involving billions of records. The project should be very lightweight in terms of computing power and highly scalable.
The information structure is very simple: I need to store a list of items, each one with different features. The features are integers, decimals, dates, strings etc. When the data is imported, the types of the features are known. Also, features can be used to reference other items.
I need to be able to get and sort a list of items by their features (more than one) - possibly using operators such as >, <, =, as well as regexes and string functions (length, left, right, mid) between feature values and against arbitrary user input.
Reporting in the sense of sums, averages and grouping is also necessary, but the demands for that are more relaxed - there is no need for full cube capabilities, but more is better.
I am very new to the whole NoSQL world. What would you recommend?
If you check out the tutorials for MongoDB, they have, in my opinion, the best introduction to the Map/Reduce system that is used to query and aggregate.
I do wonder, though, why you have concluded in advance that NoSQL is the route to go. Although different items may have different schemas, is there a fixed number of entities and attributes? And why have you (if you have) ruled out SQL, which, after all, has decades of accumulated features for storing and querying data?
If you are going to use aggregates then you could use map-reduce to populate aggregate tables and then serve that data.
Writing map-reduce for every query may be cumbersome; you can also have a look at Apache Pig and Hive. This is especially helpful for the kind of ad-hoc queries you are talking about.
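For a feel of what such an ad-hoc query might look like in Hive (the table and column names here are hypothetical):
-- average value per feature, computed directly over the raw item data
SELECT feature_name, AVG(feature_value) AS avg_value
FROM item_features
GROUP BY feature_name;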
