I am creating a graph and initially used a partition key that seems like the only logical one given the data. However, the number of vertices and edges turned out to be too large for a single partition. I have not created a partitioned collection yet, only a single 10GB collection, and I filled it up because I wasn't sure how many vertices and edges I would have. The data is a set of categories with a varying number of subcategories (and subcategories of those, down to an arbitrary depth). Each record has a category id, a name, and the market to which the category applies. The partition key is currently the market. Within a given market, the category/subcat/subcat/... hierarchy exhausted the 10GB partition for that market.
If all I have is a unique category id, a category name, and a market (as a vertex), plus a parentOf edge that connects a parent category to its children, then what else would make sense as a partition key? Say I have a parent category (vertex) with an id of 1 and a market of 'US', and it has 100 subcategories, each with its own id, and the corresponding 100 parentOf edges, all with the same market of 'US'. The only option I have for a partition key other than the market is the category id. The issue is: how efficient would lookups and traversals be if the children, the children of those children, and the edges are in other partitions?
How do you build a very large graph with a scenario like this?
Given an arbitrary category id, what would the performance be like to find all the children and walk the edges down to find all the children in the hierarchy of those edges?
What would the partition key attribute for the edges need to be? The same partition key as the parent vertex or the same partition key as the child vertex?
Am I thinking about this wrong?
My recommendation for any non-trivial graph implementation is to make a super generic property that all your docs must include such as (quite literally) partitionKey. Then you're free to use the value for market in that field where it makes sense and something else to support a different query pattern.
The important thing to understand is that queries across multiple partitions are going to be slow. So as much as possible you should tailor your partition key to support the best balance between reads and writes.
Ask yourself "What queries will I need to perform against this data most often?" and then adjust the partitionKey for the various documents accordingly.
As for edges, when you add an edge between two vertices using Gremlin, Cosmos automatically places the edge document in the same partition as the out vertex.
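A minimal sketch of the generic-property idea in plain Python (no Cosmos SDK involved). The document shapes, the `make_vertex` and `make_edge` helpers, the sample categories, and the choice of "market:topLevelCategoryId" as a finer-grained key are all assumptions for illustration, not something Cosmos prescribes. The edge simply copies the partition key of its out vertex, mirroring the placement described above.

```python
from collections import defaultdict

def make_vertex(category_id, name, market, partition_key):
    """Build a vertex document; `partitionKey` is the generic property."""
    return {
        "id": str(category_id),
        "label": "category",
        "name": name,
        "market": market,
        "partitionKey": partition_key,
    }

def make_edge(out_vertex, in_vertex, label="parentOf"):
    """Build an edge document stored alongside its out (source) vertex."""
    return {
        "label": label,
        "outV": out_vertex["id"],
        "inV": in_vertex["id"],
        "partitionKey": out_vertex["partitionKey"],
    }

# Hypothetical key: market plus top-level category id, so one market's
# hierarchy is spread over several logical partitions.
root = make_vertex(1, "Electronics", "US", "US:1")
child = make_vertex(2, "Cameras", "US", "US:1")
other_root = make_vertex(50, "Garden", "US", "US:50")
edge = make_edge(root, child)

# Group the documents the way a partitioned store would.
partitions = defaultdict(list)
for doc in (root, child, other_root, edge):
    partitions[doc["partitionKey"]].append(doc)
```

With a key like this, walking down from category 1 stays inside "US:1", while a query over the whole US market has to fan out across partitions. Whether that trade-off is right depends on which of those queries you run most often.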
Related
I am new to Neo4j. I want to find out what is the best way to group your nodes so that you can make faster queries based on location using Neo4j.
I will have a lot of users in the database. I will be making queries to return the top 10 users closest to a particular user/location. When making the query, I want to narrow down the search before working out the distances between the users and then ranking them based on who is the closest.
I need advice on the best approach using Neo4j. I can think of two ways to group them but I am not sure if it's feasible or how to execute them yet.
Divide the world up into grids, give each grid a cell id and create a node for each id. Make a direct relationship between the users and the cell id.
Create a node for each city or town. All users living in that town or city will have a direct relationship. But if I do that how should I group a user that shares two borders?
Luckily we got you covered.
Neo4j has built-in spatial indexes that you can declare on your location property, which needs to be a point({latitude, longitude}).
Then you can use the point.distance function to find nodes that are close to a starting point, e.g. passed in from the outside as a location or from a (set of) start node(s).
https://neo4j.com/docs/cypher-manual/current/functions/spatial/#functions-distance
https://neo4j.com/docs/cypher-manual/current/query-tuning/indexes/#administration-indexes-spatial-distance-searches-single-property-index
i.e.
WITH point({latitude:$latitude, longitude:$longitude}) as poi
MATCH (n:Node)
// will use the index
WHERE point.distance(n.location, poi) < 1000
RETURN n
ORDER BY point.distance(n.location, poi) ASC LIMIT 10
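For intuition about what the query computes, here is a rough Python sketch of the same "filter by radius, then take the 10 nearest" logic using the haversine formula. It is a brute-force scan with no index; the `nearest` function, the Earth-radius constant, and the sample coordinates are made up for illustration, and Neo4j's own distance calculation may give slightly different numbers.

```python
import heapq
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearest(users, lat, lon, radius_m=1000, limit=10):
    """Return up to `limit` (distance, name) pairs within `radius_m`, closest first."""
    distances = (
        (haversine_m(lat, lon, u["lat"], u["lon"]), u["name"]) for u in users
    )
    in_range = (pair for pair in distances if pair[0] < radius_m)
    return heapq.nsmallest(limit, in_range)

users = [
    {"name": "near", "lat": 51.5007, "lon": -0.1246},
    {"name": "nearer", "lat": 51.5008, "lon": -0.1240},
    {"name": "far", "lat": 48.8584, "lon": 2.2945},
]
result = nearest(users, 51.5010, -0.1240)
```

The spatial index is what lets the database avoid this full scan: it narrows the candidates to the area around the point before any distances are ranked.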
I have the table order with following fields:
ID
Serial
Visitor
Branch
Company
Assume there are relations between Visitor, Branch and Company in the database, but every visitor can belong to more than one Branch. I want to create a hierarchy between these three fields for my order table.
How can I do that?
You would need to create a denormalised dimension table, with the distinct result of the denormalisation process of the table order. In this case, you would have many rows for the same visitor. One for each branch.
In your fact table, the activity record which would have BranchKey in the primary key, would reference this dimension. This obviously would be together with the VisitorKey...
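A small Python sketch of the denormalisation step, assuming made-up order rows. The single surrogate key `VisitorBranchKey` is a simplification chosen for the example (the answer speaks of BranchKey together with VisitorKey); in practice this would be done in SQL or an ETL tool before SSAS sees the data.

```python
orders = [
    {"ID": 1, "Serial": "S-001", "Visitor": "Ann", "Branch": "North", "Company": "Acme"},
    {"ID": 2, "Serial": "S-002", "Visitor": "Ann", "Branch": "South", "Company": "Acme"},
    {"ID": 3, "Serial": "S-003", "Visitor": "Ann", "Branch": "North", "Company": "Acme"},
    {"ID": 4, "Serial": "S-004", "Visitor": "Bob", "Branch": "South", "Company": "Acme"},
]

# Distinct (Company, Branch, Visitor) combinations: one dimension row each,
# so a visitor who belongs to two branches appears twice.
combos = sorted({(o["Company"], o["Branch"], o["Visitor"]) for o in orders})
dimension = {combo: key for key, combo in enumerate(combos, start=1)}

# Fact rows keep the order attributes and point at the dimension row
# through the (hypothetical) surrogate key.
facts = [
    {
        "ID": o["ID"],
        "Serial": o["Serial"],
        "VisitorBranchKey": dimension[(o["Company"], o["Branch"], o["Visitor"])],
    }
    for o in orders
]
```

Here Ann gets two dimension rows, one per branch, and each order lands under the branch it was actually placed in.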
Then in SSAS you would need to build the hierarchy, and set the relationships between the keys... When displaying this data in a client, such as excel, you would drag the hierarchy in the rows, and when expanding, data from your fact would fit in according to the visitors branch...
With regards to dimensions, it's important to set relationships between the attributes, as this will give you a massive performance gain when processing the dimension, and the cube. Take a look at this article for help regarding that matter http://www.bidn.com/blogs/DevinKnight/ssis/1099/ssas-defining-attribute-relationships-in-2005-and-2008. In this case it's the same approach also for '12.
I have relationships between nodes which are only valid for a specific time. Simple example: Person P lived at Address A from time t1 to time t2. I can put a validFrom and a validUntil property on the relationship, but when using Cypher, I have to filter on these in the WHERE clause, whereas I would want to do it in the MATCH clause.
The reason why I want this in the MATCH clause is that it may potentially traverse a large subgraph, only to find out later that most of it could be ignored. This is especially so with a large set of historic relationships.
I can make separate validFrom and validUntil relationships to specific date nodes. That way I can use MATCH. This would be OK if I had a limited set of such time nodes, but when I have to store timestamps it is not practical.
How can you optimize Cypher for this type of querying?
How to time-slice a graph with nodes and relationships having time-validity indicators?
Have you seen the timeline modeling described here: http://docs.neo4j.org/chunked/milestone/cypher-cookbook-path-tree.html
Specifically a Multigraph.
Some colleague suggested this and I'm completely baffled.
Any insights on this?
It's pretty straightforward to store a graph in a database: you have a table for nodes, and a table for edges, which acts as a many-to-many relationship table between the nodes table and itself. Like this:
create table node (
id integer primary key
);
create table edge (
start_id integer references node,
end_id integer references node,
primary key (start_id, end_id)
);
However, there are a couple of sticky points about storing a graph this way.
Firstly, the edges in this scheme are naturally directed - the start and end are distinct. If your edges are undirected, then you will either have to be careful in writing queries, or store two entries in the table for each edge, one in either direction (and then be careful writing queries!). If you store a single edge, I would suggest normalising the stored form - perhaps always consider the node with the lowest ID to be the start (and add a check constraint to the table to enforce this). You could have a genuinely unordered representation by not having the edges refer to the nodes, but rather having a join table between them, but that doesn't seem like a great idea to me.
Secondly, the schema above has no way to represent a multigraph. You can extend it easily enough to do so; if edges between a given pair of nodes are indistinguishable, the simplest thing would be to add a count to each edge row, saying how many edges there are between the referred-to nodes. If they are distinguishable, then you will need to add something to the edge table to allow them to be distinguished - an autogenerated edge ID might be the simplest thing (it would then replace (start_id, end_id) as the primary key, since that pair is no longer unique).
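A minimal sketch of the "store one normalised edge" option, using Python's built-in sqlite3 module so it can be run as-is. The `add_undirected_edge` and `neighbours` helpers are hypothetical names; the check constraint is the one suggested above, and it also rules out self-loops as a side effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    create table node (id integer primary key);
    create table edge (
        start_id integer not null references node,
        end_id   integer not null references node,
        primary key (start_id, end_id),
        check (start_id < end_id)  -- canonical order: lower id is the start
    );
""")
conn.executemany("insert into node (id) values (?)", [(1,), (2,), (3,)])

def add_undirected_edge(a, b):
    """Store the edge once, lower id first; repeats are ignored."""
    lo, hi = sorted((a, b))
    conn.execute("insert or ignore into edge values (?, ?)", (lo, hi))

def neighbours(node_id):
    """An undirected lookup still has to check both columns."""
    rows = conn.execute(
        "select end_id from edge where start_id = ? "
        "union select start_id from edge where end_id = ?",
        (node_id, node_id),
    )
    return sorted(r[0] for r in rows)

add_undirected_edge(2, 1)
add_undirected_edge(1, 2)  # same edge in the other direction, ignored
add_undirected_edge(3, 2)
```

The `neighbours` query shows the "be careful in writing queries" part: even with a canonical form, a node can appear in either column.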
However, even having sorted out the storage, you have the problem of working with the graph. If you want to do all of your processing on objects in memory, and the database is purely for storage, then no problem. But if you want to do queries on the graph in the database, then you'll have to figure out how to do them in SQL, which doesn't have any inbuilt support for graphs, and whose basic operations aren't easily adapted to work with graphs. It can be done, especially if you have a database with recursive SQL support (PostgreSQL, Firebird, some of the proprietary databases), but it takes some thought. If you want to do this, my suggestion would be to post further questions about the specific queries.
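As one example of a graph query done in recursive SQL, here is a sketch that finds every node reachable from a starting node. It runs against SQLite through Python's sqlite3 module; the sample edges and the `reachable_from` function are made up for illustration, and the same common-table-expression shape should carry over to PostgreSQL with minor changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table node (id integer primary key);
    create table edge (
        start_id integer references node,
        end_id   integer references node,
        primary key (start_id, end_id)
    );
    insert into node (id) values (1), (2), (3), (4), (5);
    insert into edge values (1, 2), (2, 3), (3, 1), (4, 5);
""")

def reachable_from(start):
    """Ids of every node reachable from `start` along directed edges."""
    rows = conn.execute(
        """
        with recursive reach(id) as (
            select ?
            union  -- union (not union all) drops repeats, so cycles terminate
            select edge.end_id
            from edge join reach on edge.start_id = reach.id
        )
        select id from reach order by id
        """,
        (start,),
    )
    return [r[0] for r in rows]
```

Nodes 1, 2 and 3 form a cycle in the sample data, so `reachable_from(1)` only terminates because `union` discards rows it has already produced.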
It's an acceptable approach. You need to consider how that information will be manipulated. More than likely you'll need a language separate from your database to do the kinds of graph-related computations this type of data implies. Skiena's Algorithm Design Manual has an extensive section on graph data structures and their manipulation.
Without considering what types of queries you might execute, start with two tables: vertices and edges. Vertices are simple: an identifier and a name. Edges are complex given the multigraph. Edges should be uniquely identified by a combination of two vertices (i.e. foreign keys) and some additional information. The additional information depends on the problem you're solving. For instance, for flight information it would be the departure and arrival times and the airline. Furthermore, you'll need to decide if the edge is directed (i.e. one way) or not, and keep track of that information as well.
Depending on the computation you may end up with a problem that's better solved with some sort of artificial intelligence / machine learning algorithm. For instance, optimal flights. The book Programming Collective Intelligence has some useful algorithms for this purpose. But where the data is kept doesn't change the algorithm itself.
Well, the information has to be stored somewhere, a relational database isn't a bad idea.
It would just be a many-to-many relationship, a table of a list of nodes, and table of a list of edges/connections.
Consider how Facebook might implement the social graph in their database. They might have a table for people and another table for friendships. The friendships table has at least two columns, each being foreign keys to the table of people.
Since friendship is symmetric (on Facebook) they might ensure that the ID for the first foreign key is always less than the ID for the second foreign key. Twitter has a directed graph for its social network, so it wouldn't use a canonical representation like that.
How to design data storage for huge tagging system (like digg or delicious)?
There is already a discussion about it, but it assumes a centralized database. Since the data is expected to grow, we'll need to partition it into multiple shards sooner or later. So the question becomes: how to design data storage for a partitioned tagging system?
The tagging system basically has 3 tables:
Item (item_id, item_content)
Tag (tag_id, tag_title)
TagMapping(map_id, tag_id, item_id)
That works fine for finding all items for given tag and finding all tags for given item, if the table is stored in one database instance. If we need to partition the data into multiple database instances, it is not that easy.
For table Item, we can partition its content with its key item_id. For table Tag, we can partition its content with its key tag_id. For example, we want to partition table Tag into K databases. We can simply choose number (tag_id % K) database to store given tag.
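The modulo rule can be written down in a couple of lines of Python. The value of `K`, the `tag_shard` name, and the sample tags are arbitrary choices for the example.

```python
K = 4  # number of Tag databases (arbitrary for this example)

def tag_shard(tag_id: int, k: int = K) -> int:
    """Index of the database that stores the given tag."""
    return tag_id % k

# Lists stand in for the K database instances.
shards = {i: [] for i in range(K)}
for tag_id, title in [(1, "python"), (2, "sql"), (5, "graphs"), (8, "neo4j")]:
    shards[tag_shard(tag_id)].append((tag_id, title))
```

One thing to keep in mind with a plain modulo is that changing K later moves most ids to a different database.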
But, how to partition table TagMapping?
The TagMapping table represents the many-to-many relationship. I can only imagine using duplication. That is, the same TagMapping content is stored in two copies: one partitioned by tag_id and the other partitioned by item_id. To find the tags for a given item, we use the copy partitioned by item_id. To find the items for a given tag, we use the copy partitioned by tag_id.
As a result, there is data redundancy. And, the application level should keep the consistency of all tables. It looks hard.
Is there any better solution to solve this many-to-many partition problem?
I doubt there is a single approach that optimizes all possible usage scenarios. As you said, there are two main scenarios that the TagMapping table supports: finding tags for a given item, and finding items with a given tag. I think there are some differences in how you will use the TagMapping table for each scenario that may be of interest. I can only make reasonable assumptions based on typical tagging applications, so forgive me if this is way off base!
Finding Tags for a Given Item
A1. You're going to display all of the tags for a given item at once
A2. You're going to ensure that all of an item's tags are unique
Finding Items for a Given Tag
B1. You're going to need some of the items for a given tag at a time (to fill a page of search results)
B2. You might allow users to specify multiple tags, so you'd need to find some of the items matching multiple tags
B3. You're going to sort the items for a given tag (or tags) by some measure of popularity
Given the above, I think a good approach would be to partition TagMapping by item. This way, all of the tags for a given item are on one partition. Partitioning can be more granular, since there are likely far more items than tags and each item has only a handful of tags. This makes retrieval easy (A1) and uniqueness can be enforced within a single partition (A2). Additionally, that single partition can tell you if an item matches multiple tags (B2).
Since you only need some of the items for a given tag (or tags) at a time (B1), you can query partitions one at a time in some order until you have as many records needed to fill a page of results. How many partitions you will have to query will depend on how many partitions you have, how many results you want to display and how frequently the tag is used. Each partition would have its own index on tag_id to answer this query efficiently.
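The "query partitions one at a time until the page is full" idea might look like this in Python. Each partition is modelled as a dict from tag_id to item ids, standing in for that partition's index on tag_id; the function name and sample data are made up for the sketch.

```python
def page_of_items(partitions, tag_id, page_size):
    """Visit partitions in order and stop as soon as the page is full.

    Returns the page and how many partitions had to be queried.
    """
    page, visited = [], 0
    for partition in partitions:
        visited += 1
        # Lookup in this partition's own tag_id index.
        for item_id in partition.get(tag_id, []):
            page.append(item_id)
            if len(page) == page_size:
                return page, visited
    return page, visited

partitions = [
    {7: [101, 104]},
    {7: [205], 9: [206]},
    {7: [301, 302]},
]
page, visited = page_of_items(partitions, tag_id=7, page_size=3)
rare_page, rare_visited = page_of_items(partitions, tag_id=9, page_size=5)
```

A frequently used tag fills the page after two partitions here, while the rare tag forces a visit to all three, which matches the dependence on tag frequency described above.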
The order you pick partitions in will be important as it will affect how search results are grouped. If ordering isn't important (i.e. B3 doesn't matter), pick partitions randomly so that none of your partitions get too hot. If ordering is important, you could construct the item id so that it encodes information relevant to the order in which results are to be sorted. An appropriate partitioning scheme would then be mindful of this encoding. For example, if results are URLs that are sorted by popularity, then you could combine a sequential item id with the Google Page Rank score for that URL (or anything similar). The partitioning scheme must ensure that all of the items within a given partition have the same score. Queries would pick partitions in score order to ensure more popular items are returned first (B3). Obviously, this only allows for one kind of sorting and the properties involved should be constant since they are now part of a key and determine the record's partition. This isn't really a new limitation though, as it isn't easy to support a variety of sorts, or sorts on volatile properties, with partitioned data anyways.
The rule is that you partition by the field that you are going to query by. Otherwise you'll have to look through all partitions. Are you sure you'll need to query the Tag table by tag_id only? I believe not; you'll also need to query by tag title. It's not so obvious for the Item table, but you would probably also like to query by something like URL, to find the item_id when another user assigns tags to the same item.
But note that the Tag and Item tables have an immutable title and URL respectively. That means you can use the following technique:
Choose partition from title (for Tag) or URL (for Item).
Choose sequence for this partition to generate id.
You either use the partition-localID pair as the global identifier or give each partition a non-overlapping set of numbers. Either way, you can now compute the partition from both the id and the title/URL fields. Don't know the number of partitions in advance, or worried it might change in future? Create more of them than you need and join them in groups, so that you can regroup them later.
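A sketch of that technique for the Tag table, under several assumptions: the partition is derived from a SHA-256 hash of the title, the id is a (partition, local id) tuple, and in-memory counters stand in for per-partition database sequences. All function names are hypothetical.

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 16  # deliberately more logical partitions than databases
NUM_DATABASES = 4    # logical partitions are grouped onto these

def partition_for_title(title: str) -> int:
    """Stable partition number computed from the immutable title."""
    digest = hashlib.sha256(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

_sequences = {}  # partition -> counter, standing in for a DB sequence

def new_tag_id(title: str):
    """Allocate a global id as a (partition, local id) pair."""
    partition = partition_for_title(title)
    sequence = _sequences.setdefault(partition, count(1))
    return (partition, next(sequence))

def partition_for_id(tag_id) -> int:
    """The partition can be read straight off the id."""
    return tag_id[0]

def database_for_partition(partition: int) -> int:
    """Grouping of logical partitions; only this mapping changes on regrouping."""
    return partition % NUM_DATABASES

tag_id = new_tag_id("python")
```

Because the title never changes, a lookup by title and a lookup by id both land on the same partition without consulting any other database.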
Sure, you can't do the same for TagMapping table, so you have to duplicate. You need to query it by map_id, by tag_id, by item_id, right? So even without partitioning you have to duplicate data by creating 3 indexes. So the difference is that you use different partitioning (by different field) for each index. I see no reason to worry about.
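To make the "one partitioning per index" idea concrete, here is a small in-memory Python sketch in which each mapping is written twice: once to a copy sharded by item_id and once to a copy sharded by tag_id. The shard count, the dict-based stores, and the function names are assumptions for illustration; a real system would also have to deal with the two writes not being atomic.

```python
K = 3  # number of shards per copy (arbitrary)

by_item = [dict() for _ in range(K)]  # copy sharded by item_id: item_id -> tag ids
by_tag = [dict() for _ in range(K)]   # copy sharded by tag_id: tag_id -> item ids

def add_mapping(tag_id: int, item_id: int) -> None:
    """Write the same mapping to both copies."""
    by_item[item_id % K].setdefault(item_id, set()).add(tag_id)
    by_tag[tag_id % K].setdefault(tag_id, set()).add(item_id)

def tags_for_item(item_id: int):
    """Single-shard read from the copy sharded by item_id."""
    return sorted(by_item[item_id % K].get(item_id, ()))

def items_for_tag(tag_id: int):
    """Single-shard read from the copy sharded by tag_id."""
    return sorted(by_tag[tag_id % K].get(tag_id, ()))

add_mapping(tag_id=1, item_id=10)
add_mapping(tag_id=2, item_id=10)
add_mapping(tag_id=1, item_id=11)
```

Each read touches exactly one shard, which is the payoff for storing the data twice.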
Most likely your queries are going to be related to a user or a topic. Meaning that you should have all info related to those in one place.
You're talking about distributing the DB, which is usually mostly an issue of synchronization. Reading, which is usually about 90% of the work, can be done on a replicated database. The issue is how to update one DB and remain consistent with all the others without killing performance. This depends on the details of your scenario.
The other possibility is to partition, like you asked, all the data without overlapping. You would probably partition by user ID or topic ID. If you partition by topic ID, one database could reference all topics and just record which dedicated DB holds the data for each. You can then query the correct one. Since you partition by ID, all info related to that topic could be on that specialized database. You could also partition by language or country for an international website.
Last but not least, you'll probably end up mixing the two: Some non-overlapping data, and some overlapping (replicated) data. First find usual operations, then find how to make those on one DB in least possible queries.
PS: Don't forget about caching, it'll save you more than distributed-DB.