Is it good practice to use the same edge name between multiple vertices in Gremlin? - graph-databases

Is it good practice to use the same edge name in multiple places in Gremlin?
I have a situation where I can use "has" as an edge name between multiple vertices. Is that OK, or is it better to use a different name for performance? For understanding, I guess a different name/label is better. What about performance?

You really have to look for edge cases to get a notable impact on performance. An example: if you did an index lookup on the combination of edge label and property key, and the edge label was reused once, execution time would scale as log(2N) = log(N) + log(2). I consider this an edge case because starting a traversal on a relation is often considered an anti-pattern.
Personally, I like reuse of edge labels, but only if the semantics of the relation are exactly the same. Compare this with the semantic web ontology language OWL, where you can define a domain and range for each relation; there, the domain and range can consist of multiple vertex types.
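For illustration, here is a minimal gremlinpython sketch (the server URL, vertex labels, and the 'owns' edge label are assumptions, not from the question) in which one edge label is reused between two different vertex types while keeping its meaning identical:

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Assumes a Gremlin Server running locally.
g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# Two different vertex types, one shared edge label 'owns'.
person = g.addV('person').property('name', 'alice').next()
company = g.addV('company').property('name', 'acme').next()
asset = g.addV('asset').property('code', 'A-1').next()

g.addE('owns').from_(person).to(asset).iterate()
g.addE('owns').from_(company).to(asset).iterate()

# Traversals stay uniform because the relation means the same thing in both cases.
owners = g.V(asset).in_('owns').elementMap().toList()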

Related

Graph database modeling: multiple edges are better than single edges with properties?

This is for a project that will map metadata. There are many more nodes but this particular one became a debate in the team.
Which model would yield the best query performance? Or does it not matter?
Option 1
Permission metadata is represented explicitly as edges between nodes.
Option 2
Permission metadata is stored in the properties of the edge.
Option 3
???
Let me comment for ArangoDB here, being one of its developers.
There is a third possibility, namely to have a single vertex collection and multiple edge collections for the different access methods. You would then "officially" have 3 graphs that share the same vertex set.
I would expect this to perform better, because each access type only has to deal with a single type of edge, so access is fast.
Obviously it all depends on your queries. My statement holds for queries like "what are all the Entities a Person can update?" or "who can select this Entity?".
I could imagine that your standard query is more "Can this person delete that Entity?" or "Which access rights does this person have for that Entity?".
These two questions are probably not efficient with any of the approaches suggested, because as far as I see, all of them would then require a search, either in the outgoing edges of the Person or in the incoming edges of the Entity.
What would be needed here is a kind of "vertex-centric index", that is, an index that can be used for the set of outgoing or incoming edges of a given vertex. If you, for example, used your option 2 (or indeed 1, this does not matter so much) and had a sorted index on all edges, sorted first by Person and then by Entity, then finding the (probably singleton) set of edges from a given Person to a given Entity would be a lookup with time complexity O(log(#edges)).
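To make the idea concrete, here is a small, database-agnostic Python sketch in which a sorted list of (Person, Entity, permission) tuples plays the role of such a vertex-centric index, so the edges from a given Person to a given Entity are found with a binary search (all names are made up):

import bisect

edges = sorted([
    ("alice", "report-7", "update"),
    ("alice", "report-9", "select"),
    ("bob",   "report-7", "delete"),
])

def permissions(person, entity):
    """Edges from `person` to `entity`, found in O(log(#edges)) plus the matches."""
    start = bisect.bisect_left(edges, (person, entity))
    result = []
    for p, e, perm in edges[start:]:
        if (p, e) != (person, entity):
            break
        result.append(perm)
    return result

print(permissions("alice", "report-7"))  # ['update']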
We at ArangoDB are currently busy adding this feature, which will appear in one of the next two releases.
I can only speak for Neo4j here:
I don't know that it would matter much, but definitely benchmark! Both relationships and properties are stored as linked lists, so Neo4j will still need to traverse them. But if you have many relationships between Person and Entity nodes, then putting them in properties starts to become more attractive.
I recommend checking out the free O'Reilly book Graph Databases to learn more about the internals of Neo4j. But benchmarks will always be the gold standard.
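If you do want to benchmark the two shapes, a minimal sketch with the official Neo4j Python driver might look like this (connection details, labels, and relationship names are assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Option 1: the permission is a dedicated relationship type.
    session.run(
        "MERGE (p:Person {name: $name}) "
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (p)-[:CAN_UPDATE]->(e)",
        name="alice", entity="report-7",
    )
    # Option 2: the permission is a property on a generic relationship.
    session.run(
        "MERGE (p:Person {name: $name}) "
        "MERGE (e:Entity {name: $entity}) "
        "MERGE (p)-[r:HAS_PERMISSION]->(e) SET r.action = 'update'",
        name="alice", entity="report-7",
    )

driver.close()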

Representing multi-dimensional data and their attributes

I am building an application where I will store some facts corresponding to the product, location and time dimensions. For example, a particular product P1 sold 10 units at a store S1 in a particular month T1. All the dimensions will have levels with a hierarchy among them - for example - Year/Month/Week/Day for time dimension.
The members (not sure if "members" is the right word) of each level will also have a hierarchy among them - for example, 2014/Sep/1st Week/3rd Sep - and of course this hierarchy matches the hierarchy among the corresponding levels. The same goes for the other dimensions. Implementing this structure by itself is a bit tough, going by the options for representing hierarchical data, and the choice should be dictated by the frequency and volume of data that is inserted/updated/deleted versus selected. I can do some research and pick the most suitable solution for my case.
However, the real difficulty I am currently facing is modeling an alternate space where the fact data will live. Referring to the example I cited above, assume that P1 is a member of the product dimension level "Article" in the hierarchy Category/Subcategory/Article, and S1 is a member of the store dimension level "Store" in the hierarchy Country/City/Store. Now assume the store S1 does not keep the item P1 in the month T1, and we represent this decision using the flag IS_ACTIVE. That is, IS_ACTIVE=N is a fact and its context is {P1,S1,T1}. Also note that IS_ACTIVE is the attribute and N is its value. However, this context {P1,S1,T1} is itself an instance of the meta context {Article, Store, Month}, and I need to store this meta context in the application as well. The reason is that there may be a place in the application where I need to fetch a list of other possible attributes (for example, REBATE_OFFERED_PERCENT) corresponding to the meta context {Article, Store, Month}.
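For concreteness, the three layers could be spelled out like this (plain Python, names purely illustrative):

meta_context = ("Article", "Store", "Month")        # one level per dimension

context = ("P1", "S1", "T1")                        # an instance of that meta context

facts = {
    ("P1", "S1", "T1"): {"IS_ACTIVE": "N"},         # attribute -> value at that context
}

# Attributes that are possible for a given meta context, independent of any instance:
attributes_by_meta_context = {
    ("Article", "Store", "Month"): ["IS_ACTIVE", "REBATE_OFFERED_PERCENT"],
}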
I have figured out a normalized relational schema design for all this, but it is too convoluted and in my opinion will not be performant. I am looking for an alternative solution, like a NoSQL database, which can serve my needs, since there is some hierarchy involved here. Or is my problem domain more amenable to a relational schema design?
This seems like a standard problem that should appear in multiple domains, but I could not find any articles about it. Also, is there a branch of abstract mathematics relevant to this problem? Is there standard terminology to describe such problems? I am willing to read up on some theory before implementing a solution.

What is a good way to manage keys in a key-value store?

Trying to define some policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (try to use as little space as possible in the DB for keys, to allow for as much data as possible)
As collision-less as possible (avoid two different objects ending up with the same key)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>). The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if an id has the character : in it?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like Ordered Set or Hash) to store namespaces. The main drawback to this would be loss of "shardability", since the structure to store the namespaces would need to be in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one of these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1 - i.e., namespaces separated by a character such as a colon. That said, the namespaces are almost always one level deep. For example: person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can go to a different shard or the same shard, depending on how you set it up.
Namespaced - Namespaces as a way to avoid collisions work with this approach. However, namespaces as a way to group keys don't work out. In general, using keys as a way to group data is a bad idea. For example, what if a person moves from one department to another? If you change the key, you will have to update all references - and that gets tricky.
It's best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, let's say you want to group people by department, by salary range, by location. Here's how you'd do it (a sketch in code follows):
Individual people go in separate hashes with keys like persons:12321.
Create a set for each grouping - for example persons_by:department - and store only the numeric identifiers of each person in this set, for example [12321, 43432]. This way, you get the advantages of Redis' integer set encoding (see the sketch below).
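A rough sketch of the two steps with redis-py (the department value "hr" in the set name is an assumption, added only to make the grouping concrete):

import redis

r = redis.Redis(decode_responses=True)

# Each person lives in its own hash ...
r.hset("persons:12321", mapping={"name": "Jane", "department": "hr"})
r.hset("persons:43432", mapping={"name": "Raj", "department": "hr"})

# ... and grouping is a separate set holding only numeric ids.
r.sadd("persons_by:department:hr", 12321, 43432)

# "Everyone in hr" is a set read plus hash reads; the person keys never change.
people = [r.hgetall(f"persons:{i}") for i in r.smembers("persons_by:department:hr")]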
Efficient - The method explained above is pretty efficient memory-wise. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision-free - This depends on your application. Each User or Person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
You mentioned two problems with this approach, and I will try to address them.
What if the id has a colon?
It is of course possible, but your application's design should prevent it. It's best not to allow special characters in identifiers, because they will be used across multiple systems. For example, the identifier will very likely be part of a URL, and colon is a reserved character even for URLs.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
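A tiny sketch of such a wrapper in Python (the function names are made up):

from urllib.parse import quote, unquote

def make_key(*parts):
    """Join key parts with ':' after percent-encoding each part (including any ':')."""
    return ":".join(quote(str(p), safe="") for p in parts)

key = make_key("human_resources", "person", "some:id")
# -> 'human_resources:person:some%3Aid'

original_id = unquote(key.split(":")[-1])  # -> 'some:id'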
Size Efficiency
There is a cost to long keys, but it isn't much. In general, you should worry about the data size of your values rather than the keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.
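Such an alias wrapper could be as simple as this sketch (the alias table is hypothetical):

ALIASES = {"human_resources": "hr", "persons": "p"}

def short_key(key):
    """Rewrite each namespace segment to its short alias, if one exists."""
    return ":".join(ALIASES.get(part, part) for part in key.split(":"))

short_key("human_resources:persons:12321")  # -> 'hr:p:12321'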

Multiple Keys for exact and approximate lookups in C

I am trying to develop a network resource manager component in C which keeps track of various network elements over TCP/UDP sockets. For this, I use three values:
Hardware Location Number
Service Group Number
Node Number
The rule is that no two elements on the network may have the same set of these three numbers. Thus, each location's identity will be unique on the network. This information needs to be saved in the program (non-persistently) in such a way that, given any of the parameters (a single number, a combination of any two, or all three), the program returns the eligible candidates by performing a quick search.
Addition and deletion should also be efficient, but given that there will be few insertions or deletions after the initial transient phase, it is acceptable for them to be a bit slower than search. Using trees is one option, but the answer to 'Which one to use?' still eludes me (not that I know of many, but I look forward to learning newer ones if they serve my purpose).
To do this, I could have three different trees maintained separately, with similar nodes pointing to the same structure in memory, but I feel that is inefficient and not compact. I am looking for a unified data structure which can handle these variations, like multiple keys.
Or I could have a single AVL tree with multiple keys (if that is allowed).
The number of elements in the network is dynamic, so using a 3D array is out of the question.
A friend also suggested hashing, but I am not too sure.
Please help.
Hashing seems like a silly choice for this. Perhaps the most significant reason is that you seem interested in approximate lookups. Hashing your values will likely mean iterating through the entire collection to find a group of nodes that have a common prefix, or a similar prefix.
PATRICIA tries are commonly used in routing tables and lend themselves well to searching for items that have similar keys. Note that I have found much misleading information about PATRICIA tries, which I've written about here. I found this resource to be particularly helpful.
As with an AVL tree, you'll need to combine the three keys to form one (preferably without hashing).
unsigned int key[3] = { hardware_location_number, service_group_number, node_number };
/* ^------- Use something like this as your key */
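Language aside, the kind of exact and prefix lookups that a PATRICIA trie (or an AVL tree) over such a combined key buys you can be sketched in Python with a sorted list of (hardware_location, service_group, node) tuples; note that only lookups on a prefix of that ordering are cheap, which is also true of the trie:

import bisect
from itertools import takewhile

# Elements keyed by (hardware_location, service_group, node), kept sorted.
elements = sorted([
    (1, 10, 100), (1, 10, 101), (1, 20, 100), (2, 10, 100),
])

def find(prefix):
    """All elements whose key starts with `prefix`, e.g. (1,) or (1, 10)."""
    start = bisect.bisect_left(elements, prefix)
    return list(takewhile(lambda e: e[:len(prefix)] == prefix, elements[start:]))

print(find((1, 10)))  # [(1, 10, 100), (1, 10, 101)]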

Naming conventions for non-normalized fields

Is it a common practice to use special naming conventions when you're denormalizing for performance?
For example, let's say you have a customer table with a date_of_birth column. You might then add an age_range column because sometimes it's too expensive to calculate that customer's age range on the fly. However, one could see this getting messy because it's not abundantly clear which values are authoritative and which ones are derived. So maybe you'd want to name that column denormalized_age_range or something.
Is it common to use a special naming convention for these columns? If so, are there established naming conventions for such a thing?
Edit: Here's another, more realistic example of when denormalization would give you a performance gain. This is from a real-life case. Let's say you're writing an app that keeps track of college courses at all the colleges in the US. You need to be able to show, for each degree, how many credits you graduate with if you choose that degree. A degree's credit count is actually ridiculously complicated to calculate, and it takes a long time (more than one second per degree). If you have a report comparing 100 different degrees, it wouldn't be practical to calculate the credit counts on the fly. What I did when I came across this problem was to add a credit_count column to our degree table and calculate each degree's credit count up front. This solved the performance problem.
I've seen column names use the word "derived" when they represent that kind of value. I haven't seen a generic style guide for other kinds of denormalization.
I should add that in every case I've seen, the derived value is always considered secondary to the data from which it is derived.
In some programming languages, e.g. Java, variable names with an _ prefix are used for private methods or variables. Private means they should not be modified or invoked by any methods outside the class.
I wonder if this convention can be borrowed in naming derived database columns.
In Postgres, column names can start with _, e.g. _average_product_price.
It can convey the meaning that you can read this column, but don't write it because it's derived.
I'm in the same situation right now, designing a database schema that can benefit from denormalisation of central values. For example, table partitioning requires the partition key to exist in the table. So even if the data can be retrieved by following some levels of foreign keys, I need the data right there in most tables.
Maybe the suffix "copy" could be used for this, because after all, the data is just a copy of some other location where the primary data is stored. Since it's a word, it works with all naming conventions, like .NET PascalCase, which can be mapped to SQL snake_case, e.g. CompanyIdCopy and company_id_copy. And it's a short word, so you don't have to write too much. And it's not an abbreviation, so you don't have to spell it out or ever wonder what it means. ;-)
I could also think of the suffix "cache" or "cached" but a cache is usually filled on demand and invalidated some time later, which is usually not the case with denormalised columns. That data should exist at all times and never be outdated or missing.
The word "derived" is just a bit longer than "copy". I know that one special DBMS, an expensive one, has a column name limit of 30 characters, so that could be an issue.
If all of the values required to derive the calculation are in the table already, then it is extremely unlikely that you will gain any meaningful (or even measurable) performance benefit by persisting these calculated values.
I realize this doesn't answer the question directly, but it would seem that the premise is faulty: if such conditions existed for the question to apply, then you don't need to denormalize it to begin with.
