neo4j - how do I model node schema less? - database

I read somewhere that Neo4j and other NoSQL databases are schemaless. What does schemaless mean? I would like to know more about it, with a use case.

You don't need to define a schema like you would have to in, e.g., MySQL with a table. Instead, you can add properties and their values to each individual node (entry) as you like.
E.g., if you look at the address book on an Android phone, a person entry can have a multitude of properties: phone numbers, addresses, names. Some people have a lot of attributes, some have none.
Doing something like that with a schema (e.g. table structure) is really hard, and requires advance planning of what your fields are, and how you want to query them in the future.
Without a schema you can more or less play it by ear, and add things as needed.
What needs deciding, though, is what to add as a property of a node and what to add as a related node. E.g., is an address a node, or just a property of a person? (Most likely a separate node, but it depends on your use case.)
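A minimal sketch of the idea above, using plain Python dicts rather than a real graph database. The helper names (`add_node`, `relate`, `residents`) and the sample data are made up for illustration; the point is only that two "person" nodes can carry different properties, and that an address can live in its own node.

```python
# Hypothetical in-memory property graph: each node is just a dict of properties.
nodes = {}
edges = []  # (from_id, relationship_type, to_id)

def add_node(node_id, **properties):
    """Store a node with whatever properties the caller supplies; no schema is declared."""
    nodes[node_id] = dict(properties)

def relate(from_id, rel_type, to_id):
    """Record a directed, typed relationship between two existing nodes."""
    edges.append((from_id, rel_type, to_id))

# Two "person" entries with different property sets.
add_node("alice", name="Alice", phones=["555-0100", "555-0101"], email="alice@example.com")
add_node("bob", name="Bob")

# The address is modelled as a separate node rather than as a property.
add_node("addr1", street="1 Main St", city="Springfield")
relate("alice", "LIVES_AT", "addr1")
relate("bob", "LIVES_AT", "addr1")

def residents(addr_id):
    """Everyone who has a LIVES_AT relationship to the given address node."""
    return sorted(frm for (frm, rel, to) in edges if rel == "LIVES_AT" and to == addr_id)
```

Because the address is a node, asking "who lives here?" is a lookup on relationships (`residents("addr1")`) instead of a comparison of address strings across every person.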

Related

Modeling for Graph DBs

Coming from a SQL/NoSQL background, I am finding it quite challenging to model (efficiently, that is) even the simplest of exercises on a graph DB.
While different technologies have their limitations and best practices, I am uncertain whether the mindset I am using while creating the models is the correct one. Hence, I am in need of guidance, advice and/or resources to help me get closer to the right practices.
The initial exercise I have tried is representing the entire directory tree of a file share (subfolders and files) in a graph DB. For instance, some of the attributes and queries I would like to include are:
The hierarchical structure of the folders
The aggregate size at the current node
Being able to search based on who created a file/folder
Being able to search on file types
This brings me to the following questions:
When should attributes be used for edges, and which ones? Only those on which I intend to search? Only relationships?
What if I wish to extend my graph's capabilities later, for instance to search for files bigger than X? How does one maximize the future capabilities/flexibility of the model so that such changes do not cause massive impacts?
Currently I am exploring InfiniteGraph and TitanDB.
1) The only attribute I can think of to describe an edge in a folder hierarchy is whether it is a contains or contained-by relationship.
(You don't even need that if you decide to consider all your edges one or the other. In your case, it looks like you'll almost always be interrogating descendants to search and to return aggregate size).
This is a lot simpler than a network, or a hierarchy where the edges may be of different types. Think an organization chart that tracks not only who manages whom, but who supports whom, mentors whom, harasses whom, whatever.
2) I'm not familiar with the two databases you mentioned, but Neo4j allows indexes on node properties, so adding an index on file_size later should not have much impact on the model. It's also "schema-less," so you can add attributes on the fly, and different nodes may contain different attributes.
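As a rough illustration of the folder exercise, here is a sketch in plain Python rather than in any particular graph database. The `Node` class, its attribute names and the sample files are assumptions made for the example. Every edge is an implicit "contains" edge (the `children` list), and the aggregate size and the searches are all answered by walking descendants.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

@dataclass
class Node:
    name: str
    created_by: str
    size: int = 0                      # bytes; 0 for folders
    file_type: Optional[str] = None    # None for folders
    children: List["Node"] = field(default_factory=list)  # outgoing "contains" edges

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

def descendants(node: Node) -> Iterator[Node]:
    """Depth-first walk over everything contained in node."""
    for child in node.children:
        yield child
        yield from descendants(child)

def aggregate_size(node: Node) -> int:
    """Size of the node itself plus everything below it."""
    return node.size + sum(d.size for d in descendants(node))

root = Node("share", created_by="admin")
docs = root.add(Node("docs", created_by="alice"))
docs.add(Node("report.pdf", "alice", size=2000, file_type="pdf"))
docs.add(Node("notes.txt", "bob", size=500, file_type="txt"))
root.add(Node("logo.png", "bob", size=1500, file_type="png"))

# A query that was not planned up front (files bigger than X) needs no model change.
big_files = [n.name for n in descendants(root) if n.size > 1000]
by_bob = [n.name for n in descendants(root) if n.created_by == "bob"]
```

In a real graph database the two list comprehensions would be served by property indexes instead of a full walk, but the model itself stays the same.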

Is it ever a good idea to have a record in a reference table in your database that represent "all other records"?

I have an asp.net-mvc website with a SQL Server backend. I am simplifying my situation to highlight and isolate the issue. I have 3 tables in the DB
Article table (id, name, content)
Location table (id, name)
ArticleLocation table (id, articleId, locationId)
On my website, when you create an article, you select from a multiselect listbox the locations where you want that article sent.
There are about 25 locations so I was debating adding a new location called "Global" as a shortcut instead of having the person select 25 different items from a listbox. I could still do this as a shortcut on the front end but now I am debating if there is benefit for this to flow through to the backend.
So if I have an article that goes global, instead of having 25 records in the ArticleLocation table, I would only have one and then I would do some tricks on the front end to select all of the items. I am trying to figure out if this is a very bad idea.
Things I can think about that are making me nervous:
What if I create an article and choose global, but then later 3 new locations are added? Without this global setting, these 3 locations would not get the article, but in the new way they would. I am not sure which is better, as the second might actually be what you want, but it's a little less explicit.
I have a requirement on a report: I want to filter by all articles that are global. Imagine I would need an article.IsGlobal() method. Right now I guess I could say that if an article has the same count of locations as there are records in the Location table, I could treat it as global, but again, since people can add new locations, I feel like this approach is somewhat flaky.
Does anyone have any suggestions for this dilemma around creating records in a reference data table that really reflect "all records"? I appreciate any advice.
By request, here is my comment promoted to an answer. It's an opportunity to expand on it, too.
I'll limit my answer to a system with a single list of locations. I've done the corporate hierarchy thing: Companies, Divisions, Regions, States, Counties, Offices and employees or some such. It gets ugly.
In the case of the OP's question, it seems that adding an AllLocations bit to the Articles table makes the intention clear. Any article with the flag set to 1 appears in all locations, regardless of when they were created, and need not have any entries in the ArticleLocation table. An article can still be explicitly added to all existing locations if the author does not want it to automatically appear in future locations.
Implementation involves a little more work. I would add INSERT and UPDATE triggers to the Article and ArticleLocation tables to enforce the rule that either the AllLocations bit is set and there are no corresponding rows in ArticleLocation, or the bit is clear and locations may be explicitly set. (It's a personal preference to have the database defend itself against "bad data" whenever it's practical to do so.)
Depending on your needs, a table-valued function is a good way to hide some of the dirty work, e.g. dbo.GetArticleIdsForLocation( LocationId ) can handle the AllLocations flag internally. You can use it in stored procedures and ad-hoc queries to JOIN with Article. In other cases a view may be appropriate.
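A sketch of the flag approach, using Python's built-in sqlite3 module instead of SQL Server so that it runs anywhere. The column name `all_locations`, the sample rows and the Python function standing in for dbo.GetArticleIdsForLocation are assumptions for illustration; a real implementation would be a T-SQL table-valued function, and the triggers are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Article (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    all_locations INTEGER NOT NULL DEFAULT 0   -- the AllLocations bit
);
CREATE TABLE ArticleLocation (
    article_id INTEGER NOT NULL REFERENCES Article(id),
    location_id INTEGER NOT NULL REFERENCES Location(id),
    PRIMARY KEY (article_id, location_id)
);
INSERT INTO Location VALUES (1, 'London'), (2, 'Paris');
INSERT INTO Article VALUES (10, 'Local news', 0), (11, 'Company policy', 1), (12, 'Draft', 0);
INSERT INTO ArticleLocation VALUES (10, 1);
""")

def get_article_ids_for_location(location_id):
    """Articles visible at a location: flagged articles plus explicit assignments."""
    rows = conn.execute("""
        SELECT id FROM Article WHERE all_locations = 1
        UNION
        SELECT article_id FROM ArticleLocation WHERE location_id = ?
        ORDER BY 1
    """, (location_id,))
    return [r[0] for r in rows]

def orphan_article_ids():
    """Candidates for an "exceptions" page: flag clear and no locations assigned."""
    rows = conn.execute("""
        SELECT a.id FROM Article a
        WHERE a.all_locations = 0
          AND NOT EXISTS (SELECT 1 FROM ArticleLocation al WHERE al.article_id = a.id)
        ORDER BY a.id
    """)
    return [r[0] for r in rows]

# A location added later picks up the flagged article without any new rows.
conn.execute("INSERT INTO Location VALUES (3, 'Tokyo')")
```

Callers never look at the flag themselves; they only call `get_article_ids_for_location`, which is the "hide the dirty work" point made above.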
Another feature that you are welcome to borrow ("Steal from your friends!") is to have the administrator's landing page be an "exceptions" page. It's a place where I display things that vary from massive flaming disasters to mere peccadillos. In this case, articles that are associated with zero locations would qualify as something non-critical, but worth checking up on.
Articles that are explicitly shown in every location might be of interest to someone adding a new location, so I would probably have a web page for that. It may be that some of the articles should be updated to account for the new location explicitly or reconsidered for being changed to all locations.
Is it ever a good idea ... that represent “all other records”?
Is it ever a good idea to represent a tree in a table? The root of a tree represents “all other records”.
Trees and hierarchies are not simple to work with, but there are many examples, articles and books that tackle the problem -- like Celko's Trees and Hierarchies in SQL; Karwin's SQL Antipatterns.
So what you actually have here is a hierarchy (maybe just a tree) -- it may help to approach the problem that way from the start. The Global from your example is just another Location (root of a tree), so when a new location is added, you may decide if it will be a child of the Global or not.
Facts
Location(LocationID) exists.
Location(LocationID) is contained in Parent Location(LocationID).
Article(ArticleID) exists.
Article(ArticleID) is available at Location(LocationID).
Constraints
Each Location is contained in at most one Parent Location. It is possible that for some Parent Location, more than one Location is contained in that Parent Location.
It is possible that some Article is available at more than one Location
and that for some Location, more than one Article is available at that Location.
Logical
This way you can assign any location to an article -- but have to resolve it to the leaf level when needed.
The hierarchy (tree) is here represented in the "naive way"; use closure table, nested sets or enumerated path instead -- or, if you like recursion...
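For the recursive option, a sketch using a recursive common table expression, run here through Python's sqlite3 module. The table layout follows the facts listed above with an assumed `parent_id` column (the "naive" adjacency representation); the location names and ids are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Location (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES Location(id)   -- NULL for the root
);
CREATE TABLE ArticleLocation (article_id INTEGER, location_id INTEGER);
INSERT INTO Location VALUES
    (1, 'Global', NULL),
    (2, 'Europe', 1),
    (3, 'London', 2),
    (4, 'Paris', 2),
    (5, 'Tokyo', 1);
INSERT INTO ArticleLocation VALUES (100, 1), (101, 2), (102, 5);
""")

def leaf_locations(article_id):
    """Resolve the locations assigned to an article down to leaf level."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT location_id FROM ArticleLocation WHERE article_id = ?
            UNION
            SELECT l.id FROM Location l JOIN subtree s ON l.parent_id = s.id
        )
        SELECT l.name FROM Location l
        JOIN subtree s ON s.id = l.id
        WHERE NOT EXISTS (SELECT 1 FROM Location c WHERE c.parent_id = l.id)
        ORDER BY l.name
    """, (article_id,))
    return [r[0] for r in rows]
```

Article 100 is assigned only to the root, yet `leaf_locations(100)` resolves to every leaf under it; a leaf added later under Global would be included automatically, while one added outside the tree would not.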
tl;dr
In this case as I understand it, I think it is a good idea to create a "global" location in the Location table. I definitely find it preferable to creating a "global" flag in the Article table.
"Is it ever a good idea...?" is not a question we like to answer on SO. It's mostly a debate question, not a Q&A question, and besides, we have enough creativity in our community to come up with some example where "it" would be a good idea, regardless.
To your more specific question, how do I represent "all locations" in the database? that is a judgement call based on your business requirements.
Do you want "all locations" to include future locations?
If not, then probably you should only implement "all locations" as a helper that selects all current locations in the database.
Do you anticipate having a hierarchy of locations?
Real-world locations have significant hierarchy:
Global
Multi-national (continent, trading block)
Country
Administrative region (state, province, canton, etc.)
City
Neighborhood
If you think you are going to want to have the option to choose, say, a Country, instead of Global, then implementing a hierarchical representation such as Damir suggests is the best way to go. However, if you are not sure if you are ever going to have any other grouping of locations besides Global, a hierarchical data structure is too much work for now. All you need to do is make sure your current implementation has a migration path to a possible future hierarchical representation.
Global as a pseudo-location
If you do want future locations included in Global and do not need a hierarchical location structure, then my instinct based on years of experience would be to create "Global" as a pseudo-location. That is, Global would be one of the locations in the Location table, but it would have a special meaning. This is definitely a trade-off, but has the benefit of not altering the data structure to support Global which means that all the special cases that "Global" creates are handled by excluding or including some Locations in queries rather than by checking some flags somewhere. (Or if you like flags, you can add a 'pseudo-location' flag to the Location table.)
With Global as a location, additions or deletions to the Location table are handled automatically. The query for all Global articles is straightforward: the same as the query for all articles for any other Location. Reporting on articles by location is also straightforward, with Global articles appearing in reports just like any other location. You can also represent the difference between a "Global" article (all current and future locations) and an "all locations" article (all current locations but no future locations).
Selecting all articles that should be visible at a specific location is slightly harder: it's now a check against "Global" as well as that location. But at least it is checking for 2 values in the same table versus checking two different tables.
SELECT article_id FROM ArticleLocation WHERE location_id in (1, 5);
vs
SELECT article_id FROM ArticleLocation WHERE location_id = 5
UNION
SELECT id FROM Article WHERE is_global;
From the logic, as you described it, GLOBAL should be actually global and stay global, even if you add new locations (problem 1 solved). But this also implies that GLOBAL is not the same thing as "all locations" (as there might also exist some other locations we don't have defined yet). I think this logic is needed especially by your requirement 2 - otherwise it would completely fail on adding new locations.
Analysis done! From the above we see that GLOBAL is something above all those locations. There's no sense in trying to define it as a Location. Go for the easiest solution!
Article table (id, name, content, global)
i.e. boolean flag - article is global or not. In the UI, do it simply as a checkbox - if checked, the multiselect box will be disabled. Simple, easy, requirements met. Done!
Is there a need to automatically add some articles to new locations when new locations are added? If yes then in such case I’d consider adding new ‘global’ property in the backend.
Otherwise it probably isn’t worth the effort. Even if you had 10000 articles and 20 different locations selected for each article that would be about 200k records which is not that bad when you set indexes.
Check your existing data and see how people are already choosing locations. If most users select only several locations and not all then it’s really an edge case and you shouldn’t be working on it unless it really creates problems.
I agree with @HABO's comment (he should have posted it--if he does, upvote him). Adding an attribute to table Article to identify those items that are to be associated with all Locations, present and future, presumably for the lifetime of the article, should save you time and effort over the long run. Sure, triggers and counts-against-all will do the trick, but they're awkward and would be a pain to support if/when subsequent system changes come along. The UI would be simpler to use, as the user just has to click a checkbox (or whatever) and not multi-click everything in a dropdown of unforeseeable length.
(@Damir's hierarchy idea would work as well, but--speaking from a bit too much experience--hierarchies are a hassle to work with, and I wouldn't introduce one here unless there was significantly more system and/or business use to get out of it.)

Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often, there are hundreds of different kinds of items, all with often very varying data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database ?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items, the database just holds an ID and a blob.
Pro's: takes like 10 seconds to implement.
Con's: Basically sacrifices every database feature, hard to maintain, near impossible to refactor.
A table per item type.
Pro's: Clean, flexible.
Con's: With a wide variety come hundreds of tables, and every search for an item has to query them all since SQL doesn't have the concept of table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pro's: takes like 10 seconds to implement, still searchable.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
A few tables with a few 'base profiles' for storage where similar items get thrown together and use the same fields for different data.
Pro's: I've got nothing.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT
    id pk
    type
    att1
PRODUCT_X
    id pk fk PRODUCT
    att2
    att3
PRODUCT_Y
    id pk fk PRODUCT
    att4
    att5
For attributes that you don't need to search/sort/analyze, then use a blob or xml
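A runnable sketch of the table-inheritance layout, using Python's sqlite3 module. The concrete table and column names (`product_book`, `page_count`, `capacity`) are invented stand-ins for PRODUCT_X, PRODUCT_Y and att2..att5 above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    name TEXT NOT NULL                  -- attribute shared by every item
);
CREATE TABLE product_book (
    id INTEGER PRIMARY KEY REFERENCES product(id),
    page_count INTEGER NOT NULL         -- attribute only books have
);
CREATE TABLE product_container (
    id INTEGER PRIMARY KEY REFERENCES product(id),
    capacity INTEGER NOT NULL           -- attribute only containers have
);
INSERT INTO product VALUES (1, 'rock', 'Rock'), (2, 'book', 'Spellbook'), (3, 'container', 'Chest');
INSERT INTO product_book VALUES (2, 320);
INSERT INTO product_container VALUES (3, 40);
""")

# Searching across all items touches only the base table.
all_names = [r[0] for r in conn.execute("SELECT name FROM product ORDER BY id")]

# Type-specific attributes stay sortable and filterable through a join.
long_books = conn.execute("""
    SELECT p.name, b.page_count
    FROM product p JOIN product_book b ON b.id = p.id
    WHERE b.page_count > 100
""").fetchall()
```

A simple item such as the rock needs only the base row, so the number of subtype tables grows with the number of distinct attribute sets, not with the number of item kinds.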
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogenous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
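The second alternative can be sketched as follows. Note the substitutions: this uses SQLite with the JSON stored as plain TEXT and parsed in Python, not Postgres JSONB, and the `REQUIRED_FIELDS` mini schema and item kinds are made up. It is meant only to show the "kind declares what to expect, and you roll your own validation" part.

```python
import json
import sqlite3

# Hand-rolled mini schema: which keys each kind must carry, and their Python types.
REQUIRED_FIELDS = {
    "rock": {},
    "book": {"title": str, "pages": int},
    "container": {"capacity": int},
}

def validate(kind, data):
    """Reject unknown kinds and payloads with missing or wrongly typed keys."""
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown kind: {kind}")
    for key, expected in REQUIRED_FIELDS[kind].items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"{kind}.{key} must be {expected.__name__}")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, data TEXT NOT NULL)"
)

def insert_item(kind, data):
    """Validate the payload, then store it as a JSON string next to its kind."""
    validate(kind, data)
    cur = conn.execute(
        "INSERT INTO item (kind, data) VALUES (?, ?)", (kind, json.dumps(data))
    )
    return cur.lastrowid

def load_item(item_id):
    """Return (kind, payload dict) for a stored item."""
    kind, raw = conn.execute(
        "SELECT kind, data FROM item WHERE id = ?", (item_id,)
    ).fetchone()
    return kind, json.loads(raw)
```

With a real JSONB column the database could additionally index and query inside the payload; here that part is left to the application.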
Yes, it is a pain to design database formats like this. I'm designing a notification system and ran into the same problem. My notification system is, however, less complex than yours - the data it holds is at most ids and usernames. My current solution is a mix of 1 and 3 - I serialize the data that differs between notifications, and use columns for the 2 usernames (a notification may have 1 or 2). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMS - it sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if the items differ a lot from one another.
I'm sure this has been asked here a million times before, but in addition to the options which you have discussed in your question, you can look at EAV schema which is very flexible, but which has its own sets of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to query things accurately which are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building a MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
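One possible way to express that split in code, as a sketch only: the `VISIBLE` attribute names and the sample item are hypothetical, and JSON is used as the serialization format purely for convenience.

```python
import json

# Hypothetical attributes the database must be able to search or index on.
VISIBLE = ("owner_id", "item_type", "weight")

def to_row(item):
    """Split an item into searchable columns plus an opaque serialized blob."""
    columns = {name: item.get(name) for name in VISIBLE}
    hidden = {k: v for k, v in item.items() if k not in VISIBLE}
    return columns, json.dumps(hidden, sort_keys=True).encode("utf-8")

def from_row(columns, blob):
    """Rebuild the full item from its columns and its blob."""
    item = {k: v for k, v in columns.items() if v is not None}
    item.update(json.loads(blob.decode("utf-8")))
    return item

chip = {"owner_id": 7, "item_type": "chip", "weight": 1, "program": [1, 2, 3]}
columns, blob = to_row(chip)
```

Only `columns` would become real, indexed fields; the database never needs to understand what is inside `blob`.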
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
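A very small in-memory sketch of the "bag of components" idea. The `World` class and its methods are invented for illustration and leave out persistence entirely; `Position` and `Collectable` follow the component names mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Position:
    x: int
    y: int

@dataclass
class Collectable:
    category: str
    num_items: int

class World:
    """Entities are just ids; each id maps to a bag of components keyed by type."""

    def __init__(self):
        self.entities = {}   # entity id -> {component type: component instance}
        self._next_id = 1

    def create(self, *components):
        entity_id = self._next_id
        self._next_id += 1
        self.entities[entity_id] = {type(c): c for c in components}
        return entity_id

    def with_component(self, component_type):
        """Ids of all entities that carry the given component type."""
        return [eid for eid, bag in self.entities.items() if component_type in bag]

world = World()
rock = world.create(Position(0, 0), Collectable("mineral", 1))
tree = world.create(Position(5, 2))
potion = world.create(Collectable("consumable", 3))

# Rendering the inventory means asking for everything that is Collectable.
inventory = world.with_component(Collectable)
```

Persisting this would mean one table per component type plus a mapping from entity id to component ids, as described above.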
Here's an interesting read about component-based entity systems.

Database structure and associations for open-ended data and data types

I'm trying to build a custom CMS tool (yes I know - ANOTHER CMS) where the users can create as many nested "nodes" as they want.
Example "Nodes": Restaurants, People, Shoes, Continents ... anything. Within each Node, there can be as many sub nodes as needed and so on and so forth.
While looking through Wordpress, Drupal... etc etc, I keep seeing tables like "taxonomy", and "terms".
This seems like a "normal" thing, but I can't wrap my head around how it should be done or how they're doing it. I assume these tables are related to the overall structure and table relationships, but... searches online come up shy of explaining what's actually going on or how best I would go about designing my database for this kind of structure.
Ideas I've had so far (that obviously aren't fleshed out or I wouldn't be here asking):
1) Store known data types and bindModel():
Create tables like data_locations and data_texts that would have fields respective of their data. So - in the data_locations table, I'd have city, and longitude, and address. And in the data_texts table, I'd have title, subtitle, content, author.
Then, each time they created a new "Node", they could pick what kind of data-types it should have, and I would use bindModel() to create the associations (I guess?).
This wouldn't be as flexible, but maybe easier to manage, and faster to run queries on...etc? Dunno.
2) Custom fields for each Node with single "data" table: Have a data table, and a fields table... each node would haveMany fields - each with a type and maxLength...etc. Then, in the admin, I'd list those fields, and each chunk of data title, shoe_size...etc would have a row in the data table that related to the node and to the field.
This one seems more like what I THINK the "taxonomy" thing is - but again, I really have no idea.
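For reference, idea 2 might look roughly like this in table form, sketched with Python's sqlite3 module. All names are assumptions; in particular the `record_id` column is an addition not mentioned above, used here to group the values that belong to one item within a node.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    parent_id INTEGER REFERENCES node(id)      -- nesting: sub nodes point at their parent
);
CREATE TABLE field (
    id INTEGER PRIMARY KEY,
    node_id INTEGER NOT NULL REFERENCES node(id),
    name TEXT NOT NULL,
    type TEXT NOT NULL
);
CREATE TABLE data (
    record_id INTEGER NOT NULL,                -- assumed grouping key, one per item
    node_id INTEGER NOT NULL REFERENCES node(id),
    field_id INTEGER NOT NULL REFERENCES field(id),
    value TEXT,
    PRIMARY KEY (record_id, field_id)
);
INSERT INTO node VALUES (1, 'Shoes', NULL), (2, 'Sneakers', 1);
INSERT INTO field VALUES (1, 1, 'title', 'text'), (2, 1, 'shoe_size', 'integer');
INSERT INTO data VALUES (1, 1, 1, 'Runner'), (1, 1, 2, '42'),
                        (2, 1, 1, 'Loafer'), (2, 1, 2, '40');
""")

def load_records(node_id):
    """Pivot the one-row-per-value data table back into one dict per record."""
    rows = conn.execute("""
        SELECT d.record_id, f.name, f.type, d.value
        FROM data d JOIN field f ON f.id = d.field_id
        WHERE d.node_id = ?
        ORDER BY d.record_id, f.id
    """, (node_id,))
    records = {}
    for record_id, name, ftype, value in rows:
        records.setdefault(record_id, {})[name] = int(value) if ftype == "integer" else value
    return records
```

The pivot step in `load_records` is where this design gets awkward: every value is stored as text, and reassembling or filtering records takes extra work compared with ordinary columns.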
Which database are you considering? Graph databases tend to be more natural for this kind of thing in my opinion.
In a relational database it is tricky. Nested queries of arbitrary depth are not natural (but doable); dynamic schemas are also unnatural (search for 'EAV schemas' to see the arguments around that) and are very difficult to query well.
Have a look at neo4j. I think you can express your requirement directly and naturally.

Object oriented programming in Graph databases

Graph databases store data as nodes, properties and relations. If I need to retrieve some specific data from an object based upon a query, then I would need to retrieve multiple objects (as the query might have a lot of results).
Consider this simple scenario in object oriented programming in graph-databases:
I have a (graph) database of users, where each user is stored as an object. I need to retrieve a list of users living in a specific place (the place property is stored in the user object). So, how would I do it? I mean unnecessary data will be retrieved every time I need to do something (in this case, the entire user object might need to be retrieved). Isn't functional programming better in graph databases?
This example is just a simple analogy of the above stated question that came to my mind. Don't take it as a benchmark. So, the question remains, How great is object oriented programming in graph-databases?
A graph database is more than just vertices and edges. In most graph databases, such as Neo4j, in addition to vertices having an id and edges having a label, both have a list of properties. Typically, in Java-based graph databases these properties are limited to Java primitives -- everything else needs to be serialized to a string (e.g. dates). This mapping to vertex/edge properties can either be done by hand using methods such as getProperty and setProperty, or you can use something like Frames, an object mapper that uses the TinkerPop stack.
Each node has attributes that can be mapped to object fields. You can do that manually, or you can use spring-data to do the mapping.
Most graph databases have at least one kind of index for vertices/edges. InfiniteGraph, for instance, supports B-Trees, Lucene (for text) and a distributed, scalable index type. If you don't have an index on the field that you're trying to use as a filter, you'd need to traverse the graph and apply predicates yourself at each step. Hopefully, that would reduce the number of nodes to be traversed.
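The difference between the two access paths can be sketched in plain Python. Nothing here is InfiniteGraph or Neo4j API; the `users` and `knows` structures and both helpers are made up to contrast "traverse and test a predicate at each step" with "look the value up in an index".

```python
from collections import defaultdict, deque

users = {
    1: {"name": "Ann", "place": "New York"},
    2: {"name": "Raj", "place": "London"},
    3: {"name": "Mia", "place": "New York"},
}
knows = {1: [2], 2: [3], 3: []}  # hypothetical KNOWS edges

def traverse_and_filter(start, predicate):
    """Breadth-first traversal from start, testing the predicate at every visited vertex."""
    seen, queue, matches = {start}, deque([start]), []
    while queue:
        vid = queue.popleft()
        if predicate(users[vid]):
            matches.append(vid)
        for nxt in knows[vid]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return matches

# An index on the "place" property: one lookup instead of a traversal.
place_index = defaultdict(list)
for vid, props in users.items():
    place_index[props["place"]].append(vid)

via_traversal = traverse_and_filter(1, lambda u: u["place"] == "New York")
via_index = place_index["New York"]
```

Both paths find the same users here, but the traversal has to visit every reachable vertex, while the index lookup touches only the matches.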
"I need to retrieve a list of users living in a specific place (the place property is stored in the user object)."
There is a better way. Separate location from user. Instead of having a location as a property, create a node for locations.
So you can have (u:User)-[:LIVES_IN]->(l:Location) type of relationship.
It becomes easier to retrieve a list of users living in a specific place with a simple query:
MATCH (u:User)-[:LIVES_IN]->(l:Location) WHERE l.name = 'New York'
RETURN u, l
This will return all users living in New York without having to scan all the properties of each node. It's a faster approach.
Why not use an object-oriented graph database?
InfiniteGraph is a graph database built on top of Objectivity/DB, which is a massively scalable, distributed object-oriented database.
InfiniteGraph allows you to define your vertices and edges using a standard object-oriented approach, including inheritance. You can also embed a defined data type as an attribute in another data type definition.
Because InfiniteGraph is object-oriented, it gives you access to query capabilities on complex data structures that are not available in the popular graph databases. Consider the following diagram:
In this diagram I create a query that determines the inclusion of the edge based on an evaluation of the set of CallDetail nodes hanging off the Call edge. I might only include the edge in my results if there exists a CallDetail with a particular date, or if the sum of the callDurations of all of the CallDetails that occurred between two dates is over some threshold. This is the real power of an object-oriented database in solving graph problems: you can support a much more complex data model.
I'm not sure why people have commingled the terms graph database and property graph. A property graph is but one way to implement a graph database, and not a particularly efficient one. InfiniteGraph is a schema-based database, and the schema provides several distinct advantages, one of which is object placement.
Disclaimer: I am the Director of Field Operation for Objectivity, Inc., maker of InfiniteGraph.
