Datomic many-to-many with data on the relationship

I'd like to implement a many to many relationship which also has metadata
describing the relationship.
One could think of the relationship as a labelled edge.
Specifically, a path consists of an ordered collection of series, and a series can
be within more than one path, on each occasion having a position within such path.
If I understand correctly, some reification of the relationship is needed in Datomic
(as we cannot label edges directly), such as a join entity like:
:path/path-member ; ref, many
:path-member/series ; ref, one
:path-member/position ; long, one
Or to reify it more completely:
:path-member/series ; ref, one
:path-member/path ; ref, one
:path-member/position ; long, one
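Spelled out as actual schema data, that fuller reified option would look roughly like this (value types and cardinalities as per the comments above):

[{:db/ident       :path-member/series
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident       :path-member/path
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/one}
 {:db/ident       :path-member/position
  :db/valueType   :db.type/long
  :db/cardinality :db.cardinality/one}]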
Are there any other data modelling options that could work?
Are composite attributes relevant here?
This question has been asked before, but I wondered whether any additions to Datomic since that question was asked (2015) offer any new possibilities.

Almost everyone encounters this question when they start data modelling with Datomic, as evidenced by the large number of Stack Overflow questions on exactly the same point.
There is great news: heterogeneous tuples, added in June 2019, are a powerful new feature which solves this beautifully - it's exactly the feature we all thought was missing.
What it means is that an attribute value, i.e. the V in the [e a v tx op] 5-tuple, can now itself be a tuple.
This is a Clojure vector of at most 8 elements. While that length limit does not go all the way to allowing an arbitrary amount of metadata to be stored as labels on an edge, as in a true graph database, it adds great modelling power to Datomic while retaining all the rest of the leverage and simplicity Datomic provides.
See the official blog post announcing the feature, and the discussion of the release on Twitter.
To use tuples in Datalog queries, all you need are the tuple and untuple functions.
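For the path/series example above, one sketch (attribute names are mine and purely illustrative, and it assumes a cardinality-many tuple attribute is acceptable for your model) pairs the series ref with its position in a single tuple:

[{:db/ident       :path/member
  :db/valueType   :db.type/tuple
  :db/tupleTypes  [:db.type/ref :db.type/long]   ; [series position]
  :db/cardinality :db.cardinality/many}]

A datom such as [path-id :path/member [series-id 3]] then says "this series is the member of this path at position 3", and untuple pulls the pieces apart in a query:

[:find ?series ?position
 :in $ ?path
 :where
 [?path :path/member ?member]
 [(untuple ?member) [?series ?position]]]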

Related

Adjustable, versioned graph database

I'm currently working on a project where I use natural language processing to extract emotions from text to correlate them with contextual information.
Definition of contextual information: any information that is relevant to describing an entity's situation in time and space.
Description of the data structure I'm looking for:
There is an arbitrary number of entities (an entity can be either a person or a group, for example Twitter hashtags) whose contextual information and conversations with other entities I want to track. Conversations between entities are processed in order to classify their emotional features. Basic emotional features consist of a vector that specifies their occurrence as proportions: {fear: 0.1, happiness: 0.4, joy: 0.1, surprise: 0.9, anger: 0}
Entities can also submit any contextual information they'd like to share, for example: location, room temperature, blood pressure, and so on (I will refer to these as contextual variables).
Because neither the number of conversations of an entity, nor the number of contextual variables they want to share is clear at any point in time, the data structure needs to be able to adjust accordingly.
Important: every change in the data must also be represented as its own state, as I intend to correlate certain changes in state with each other.
Example: Bob and Alice have a conversation that shows high magnitude of fear. A couple of hours later they have another conversation that shows no more fear, but happiness.
Now, one could argue that high-magnitude fear followed by happiness could actually be interpreted as the emotion "relief".
However, in order to be able to extract this very information I need to be able to correlate different states with each other.
Same goes for using contextual information to correlate them with the tracked emotions in conversations.
This is why every state change must be recorded and available.
To make this more clear to you, I've created a graphic and attached it to the question.
Now, the actual question I have is: Which database/data structure can I use to solve this problem?
I've looked into event-sourcing databases but wasn't quite convinced that I could easily recreate a graph structure with them. I also looked at graph databases but didn't find what I was looking for.
Therefore it would be nice if someone here could at least point me in the right direction or help me adjust my structure to solve the problem. If, however, there are data structures supporting what I'd call graph databases with snapshots, then ease of use is probably the most important feature to filter for.
There's a database called Datomic by Rich Hickey (of Clojure fame) that stores facts over time. Every entry in the database is a fact with a timestamp, append-only as in Event Sourcing.
These facts can be queried with a relational/logical language a la Datalog (reminiscent of Prolog). Please see this post by kisai for a quick overview. It has been used for querying graphs with some success in the past: Using Datomic as a Graph Database.
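For a flavour of what such a query looks like, here is a rough sketch (the attribute names are my own and purely illustrative, not taken from those posts). It asks who a given entity has conversed with, and because every Datomic database value is immutable, you can run the same query against a past database value obtained with (d/as-of db some-past-time) to ask it about any earlier point in time:

[:find ?other-name
 :in $ ?name
 :where
 [?e     :entity/name           ?name]
 [?e     :entity/conversed-with ?other]
 [?other :entity/name           ?other-name]]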
While I have no experience with Datomic, it does seem to be quite suitable for your particular problem.
You have an interesting project. I do not work on things like this directly, but for my 2 cents -
It seems to me your picture is a bit flawed. You are trying to represent a graph database over time, but there isn't really a way to represent time this way.
If we examine the image, you have conversations and context data changing over time, but the facts of "Bob" and "Alice" and "Malory" actually don't change over time. So let's remove them from the equation.
Instead, focus on the things you can model over time: a conversation, a context, a location. These things will change as new data comes in. These objects are excellent candidates for an event-sourced model. In your app, the conversation would be modeled as a series of individual events, which your aggregate would combine and factor to generate a final state - which would be your 'relief' determination.
For example, you could write logic where, if a conversation was angry and then a very happy event came in, the subject is now feeling relief.
What I would do is model these conversation states in your graph db connected to your 'Fact' objects "Bob", "Alice", etc. A query such as 'What is Alice feeling right now?' would then be a graph traversal through your conversation states, factoring in the context data connected to Alice.
To answer a question such as 'What was Alice feeling 5 minutes ago?' you would take all the event streams for the conversations, rewind them to the appropriate point, and then examine the state of the conversations.
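As a rough sketch (my own illustrative data, not the poster's), "rewinding" then just means folding a prefix of the event stream instead of the whole thing:

(def conversation-events
  [{:at #inst "2020-01-01T10:00" :emotions {:fear 0.8 :happiness 0.1}}
   {:at #inst "2020-01-01T14:00" :emotions {:fear 0.0 :happiness 0.7}}])

(defn state-at
  "Fold the events that happened up to time t into a conversation state."
  [events t]
  (reduce (fn [state {:keys [at emotions]}]
            (if (<= (compare at t) 0)          ; only replay events up to t
              (assoc state :emotions emotions :as-of at)
              state))
          {}
          events))

(state-at conversation-events #inst "2020-01-01T12:00")
;; => {:emotions {:fear 0.8, :happiness 0.1}, :as-of #inst "2020-01-01T10:00:00.000-00:00"}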
TLDR:
Separate the time-dependent variables from the time-independent variables and use event sourcing to model time.
There is an obvious 1:1 correspondence between your states at a given time and a relational database with a given schema. So there is an obvious 1:1 correspondence between your set of states over time and a changing-schema database, ie a variable whose value is a database plus metadata, manipulated by both DDL and DML update commands. So there is no evidence that you shouldn't just use a relational DBMS.
Relational DBMSs allow generic querying with automated implementation at a certain computational complexity with certain opportunities for optimization. Any application can have specialized queries that make a specialized data structure and operators a better choice. But you must design your application and know about such special aspects to justify this. As it is, with the obvious correspondences between your states and relational states, this has not been justified.
EAV is frequently used instead of DDL and a changing schema. But under EAV the DBMS does not know the real tables you are concerned with, which have columns that are EAV attributes, and which are explicit in the DDL/DML changing-schema approach. So EAV forgoes simplicity, clarity, optimization and, most of all, integrity and ACID. It can only be justified (compared to DDL/DML, assuming a relational representation is otherwise appropriate) by demonstrating that DDL with schema updates (adding, deleting and changing columns and tables) is worse (per the above) than EAV in your particular application.
Just because you can draw a picture of your application state at some time using a graph does not mean that you need a graph database. What matters is what specialized queries/expressions you will be evaluating. You should understand what these are in terms of your problem domain, which is probably most easily expressed per some specialized data structure and operators, and relationally. Then you can compare the expressive and computational demands against a specialized data structure, a relational representation, and the models of particular graph databases. Be sure to search Stack Overflow.
According to Wikipedia "Neo4j is the most popular graph database in use today".

Representing multi dimensional data and their attributes

I am building an application where I will store some facts corresponding to the product, location and time dimensions. For example, a particular product P1 sold 10 units at a store S1 in a particular month T1. All the dimensions will have levels with a hierarchy among them - for example - Year/Month/Week/Day for time dimension.
The members (not sure if "members" is the right word) of each level will also have a hierarchy among them - for example - 2014/Sep/1st Week/3rd Sep - and of course this hierarchy matches the hierarchy among the corresponding levels. Similar is the case for the other dimensions. Implementing this structure itself is a bit tough going by the options for representing hierarchical data, and the choice should be dictated by the frequency and volume of data that is to be inserted/updated/deleted versus selected. I can do some research and pick the optimal solution for my case.
However, the real difficulty I am facing currently is modeling an alternate space where the fact data will live. Referring to the example I cited above, assume that P1 is a member of the product dimension level "Article" in the hierarchy Category/Subcategory/Article and S1 is a member of store dimension level "Store" in the hierarchy Country/City/Store. Now assume the store S1 does not keep the item P1 in the month T1 and we represent this decision using the flag IS_ACTIVE. That is, IS_ACTIVE=N is a fact and its context is {P1,S1,T1}. Also note that IS_ACTIVE is the attribute and N is its value. However this context {P1,S1,T1} itself is an instance of the meta context {Article, Store, Month}. And I need to store this meta context in the application as well. Reason is that there may be a place in the application where I may need to fetch a list of other possible attributes (for example, REBATE_OFFERED_PERCENT) corresponding to the meta context {Article, Store, Month}.
I have figured out a normalized relational schema design for all this, but it is too convoluted and in my opinion will not be performant. I am looking for an alternative solution, like a NoSQL database, which can serve my needs since there is some hierarchy involved here. Or is my problem domain more amenable to a relational schema design?
This seems like a standard problem that should appear in multiple domains, but I could not find any articles regarding it. Also, is there a branch of abstract mathematics which has relevance to this problem? Is there a standard terminology to describe such problems? I am willing to read up on some theory before implementing a solution.

Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often there are hundreds of different kinds of items, often with very different data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database?
Some attempts at a solution that I have seen, none of which I find satisfying:
1. Binary serialization of the items; the database just holds an ID and a blob.
Pros: takes like 10 seconds to implement.
Cons: basically sacrifices every database feature, hard to maintain, near impossible to refactor.
2. A table per item type.
Pros: clean, flexible.
Cons: with a wide variety you get hundreds of tables, and every search for an item has to query them all, since SQL doesn't have the concept of a table/type 'reference'.
3. One table with a lot of fields that aren't used by every item.
Pros: takes like 10 seconds to implement, still searchable.
Cons: wastes space and performance, and it is hard to tell from the database which fields are in use.
4. A few tables with a few 'base profiles' for storage, where similar items get thrown together and use the same fields for different data.
Pros: I've got nothing.
Cons: wastes space and performance, and it is hard to tell from the database which fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT
    id    pk
    type
    att1

PRODUCT_X
    id    pk, fk -> PRODUCT
    att2
    att3

PRODUCT_Y
    id    pk, fk -> PRODUCT
    att4
    att5
For attributes that you don't need to search/sort/analyze, use a blob or XML column.
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; you can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogeneous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
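At the application level the same idea can be mimicked before any database is involved. A rough sketch (my own, with made-up item kinds) of a kind tag plus a free-form map, including the roll-your-own validation mentioned above:

(def kind->required-keys
  {:book      #{:title :pages}
   :container #{:capacity}})

(defn valid-item?
  "Does the item's free-form data map contain every key its kind requires?"
  [{:keys [kind data]}]
  (every? data (kind->required-keys kind)))

(valid-item? {:kind :book :data {:title "Field Guide" :pages 320}})
;; => true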
Yes, it is a pain to design database formats like this. I'm designing a notification system and reached the same problem. My notification system is, however, less complex than yours - the data it holds is at most IDs and usernames. My current solution is a mix of 1 and 3 - I serialize the data that differs between notifications, and use a column for the 2 usernames (some notifications may have 2, some 1). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMS - it sounds like a non-RDBMS (especially a key/value store) may be a better fit for this data, especially if item 1 and item 2 differ from each other a lot.
I'm sure this has been asked here a million times before, but in addition to the options which you have discussed in your question, you can look at the EAV schema, which is very flexible but has its own set of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to query things accurately which are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building an MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
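A rough sketch of that split (illustrative names, my own): keep the searchable fields as real columns and pack everything else into one opaque, serialized value.

(def visible-keys #{:id :name :type})

(defn prepare-row
  "Split an item into indexed columns and an opaque blob of everything else."
  [item]
  (merge (select-keys item visible-keys)
         {:blob (pr-str (apply dissoc item visible-keys))}))

(prepare-row {:id 1 :name "rock" :type :junk :weight 3.2 :flavor "Just a rock."})
;; => a row map with :id, :name and :type columns, plus a :blob string holding the rest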
That's where the fun starts, and you'll probably need an "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
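As a very rough data sketch (my own, illustrative names only) of that storage shape - components living in their own collection, entities holding an array of component IDs, and the inventory query filtering on the Collectable component:

(def components
  [{:id 1 :type :position    :x 10 :y 4}
   {:id 2 :type :collectable :category "book" :num-items 1}
   {:id 3 :type :position    :x 0  :y 0}])

(def entities
  [{:id 100 :name "ancient tome" :component-ids [1 2]}
   {:id 101 :name "campfire"     :component-ids [3]}])

;; "Which entities have a Collectable component?"
(let [collectable-ids (set (map :id (filter #(= :collectable (:type %)) components)))]
  (filter #(some collectable-ids (:component-ids %)) entities))
;; => ({:id 100, :name "ancient tome", :component-ids [1 2]})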
Here's an interesting read about component-based entity systems.

Extracting information triples from tables

I have a very large dataset of HTML tables (extracted originally from Wikipedia). I want to extract meaningful triple sets from each of these tables (this is not to be confused with extracting triples from Wikipedia infoboxes, which is a relatively much easier task).
The triples have to be semantically meaningful to humans, not like DBpedia, where triples are extracted as URIs and other formats. So I am OK with just extracting the table text values.
Keep in mind the variety of table orientations and shapes.
The main task I see is to extract the main entity of the table records (the student name in a school record, for example), so that it can be used as the triple's "Subject".
Example
For a table like this, we should know the main entity is "Server" and the others are only objects, so the relations should be like:
<AOLserver> <Developed by> <NaviSoft>.
<AOLserver> <Open Source> <Yes>.
<AOLserver> <Software license> <Mozilla>.
<AOLserver> <Last stable version> <4.5.1>.
<AOLserver> <Release date> <2009-02-02>.
Also, keep in mind that the main entity does not always lie in the first column of the table; there are even tables that do not, by any means, talk about a single subject.
This is a table where the main entity is in the last column, not the first:
This table should generate relations like:
<Arsène Wenger> <Position> <Manager>.
<Steve Bould> <Position> <Assistant manager>.
Questions
My first question: can this be done using rule-based methods - crafting some rules around examples and trying to generalize so that I can detect the right entity? Can you suggest example rules?
The second question is about evaluation: how can I evaluate such a system? How can I measure its performance so that I can improve it?
Fantastic project!! If you get it to work, definitely try to get it incorporated into DBpedia's crawlers/extractors - http://wiki.dbpedia.org/Documentation.
For reference - http://en.wikipedia.org/wiki/Comparison_of_web_server_software
If you look at the HTML, the column titles are in a thead element, while the rows are all contained in tr elements inside tbody elements, with the title of the entity (/rdfs:label) in a th element - this should go a long way towards solving your problem without getting too dirty and imprecise.
I suppose that checking the HTML structure to see how many rows have th elements would be worthwhile to evaluate this approach.
In the second example (http://en.wikipedia.org/wiki/Arsenal_F.C.) does the fact that it doesn't have a thead element help, i.e. allow us to assume that the page itself (i.e. Arsenal) is the subject of the data in the table?
There are also microformats like vCard scattered about Wikipedia that might help elucidate the relationships.
I'm not sure how generalisable it is across all of the tables in Wikipedia, but it should be a good start. I would imagine that it's vastly superior to stick to HTML structure and microformats as much as possible rather than getting into anything too tricky.
Also - each link has a DBpedia URI to identify it, which is very useful in these circumstances. E.g.:
http://example.com/resource/AOLserver http://example.com/property/Server http://dbpedia.org/resource/AOLserver.
http://example.com/resource/AOLserver http://example.com/property/Developed_by http://dbpedia.org/resource/NaviSoft.
http://example.com/property/Developed_by a rdf:Property.
http://example.com/property/Developed_by rdfs:label "Developed by"@en
Have you seen http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/ ? It could be worthwhile for generating mappings.
So, finally, I've been able to achieve the goal of my project; it required a lot of work and testing, but it was achieved.
The idea rested mainly on a pipeline like the following:
1. A component to extract the tables and import them into an in-memory object.
2. A component to exclude bad tables - things that use table tags but are not really tables (sometimes the writers of a page want to control how the data appears, so they put it in a table).
3. A component to strip off the styling of the tables and to resolve column/row spans by repeating the data by the number of the span.
4. A machine-learning-based classifier to classify the orientation of the table (horizontal/vertical) and the header row/column of that table.
5. A machine-learning-based classifier to classify the rows/columns that should be the "subject" of the relationship triple <subject> <predicate> <object>.
The first classifier is a support vector machine that takes features like character count, table/row cell-count ratio, numbers-to-text ratio, capitalization, etc.
We achieved about 80-85% on both precision and recall.
The second classifier is a Random Forest classifier that takes features more related to the relevance of cells within one row/column. We also achieved about 85% on both precision and recall.
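To give a concrete idea of what such per-column features might look like (purely illustrative, not the author's code), a small sketch computing a few of them from a column's cell texts:

(defn column-features
  "Compute simple surface features for one table column, given its cell strings."
  [cells]
  (let [text   (apply str cells)
        chars  (max 1 (count text))
        digits (count (filter #(Character/isDigit %) text))
        uppers (count (filter #(Character/isUpperCase %) text))]
    {:avg-cell-length (/ (count text) (double (max 1 (count cells))))
     :numbers-to-text (/ digits (double chars))
     :capitalization  (/ uppers (double chars))}))

(column-features ["AOLserver" "Apache HTTP Server" "nginx" "4.5.1"])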
Some other refinement components and heuristics were involved in the process to make the output cleaner and more related to the context of the table.
Generally, no additional Wikipedia-specific data was used, so as to keep the tool general enough for any HTML table on the web, but the training data for the classifiers was mainly biased towards Wikipedia content!
I'll be updating the question with the source code once it's finalized.

Designing a database for an e-commerce store

Hi, I am trying to design a database for an e-commerce website, but I can't seem to find a way to do this right. This is what I have so far:
The problem appears with the products. I have 66 types of products, most of them having different fields. I have two ideas, but neither of them seems very practical:
OPTION A:
At first I thought I'd make a table for each product type, but that would result in 66 tables, which is not very easy to maintain. I had already started to do that - I created the Product_Notebook and Product_NotebookBag tables - and then I stopped, thought about it a bit, and decided this solution is not very good.
OPTION B:
After thinking about it a bit more, I came up with option B, which is storing the data in a single field called description. For example:
"Color : Red & Compatibility : 15.6 & CPU : Intel"
In this approach I could take the string and manipulate it after retrieving it from the database.
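For instance, something like this little sketch (just to illustrate the kind of string munging this option forces):

(require '[clojure.string :as str])

(defn parse-description
  "Turn the packed description string back into a map of attribute -> value."
  [s]
  (into {}
        (for [pair (str/split s #" & ")]
          (let [[k v] (str/split pair #" : ")]
            [k v]))))

(parse-description "Color : Red & Compatibility : 15.6 & CPU : Intel")
;; => {"Color" "Red", "Compatibility" "15.6", "CPU" "Intel"}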
I know this approach is also not a very good idea, that's why I am asking for a more practical approach.
See my answer to this question here on Stack Overflow. For your situation I recommend using Entity Attribute Value (EAV).
As I explain in the linked answer, EAV is to be avoided almost all of the time for many good reasons. However, tracking product attributes for an online catalog is one application where the problems with EAV are minimal and the benefits are extensive.
Simply create a ProductProperties table and put all the possible fields there. (You can actually just add more fields to your Products table)
Then, when you list your products, just use the fields you need.
Surely, there are many fields in common as well.
By the way, if you're thinking of storing the data in an array (option B?), you'll regret it later. You won't be able to easily sort your table that way.
Also, that option will make it hard to find a particular item by a specific characteristic.

Resources