How do we assign an identifier to a relationship in OntoRefine RDF mapping?

I'm working on a transformation task in which I need to transform a property graph dataset into an RDF dataset. There are many n-ary relationships that need to be treated as classes, but I don't know how to assign a unique identifier to these relationships. I tried to use the row index, but I have more than one file in this project, so that can't work. So I would like to know how you assign a unique identifier to relationships; if a URI is the solution, how do we do this in OntoRefine mapping? Thank you for your answers.
Lee

There are several ways to address this:
Ideally, use some characteristics of the related entities to make a deterministic URL. E.g., if you're making a position (membership) node between a person and an org that involves a mandatory role and start date, you could use a URL like org/<org_id>/person/<person_id>/role/<role_id>/date/<date>
Use a blank node. In that case you don't need to worry about a URI at all
Use the row index if you prepend it with the table/file name (as a constant)
Use the GREL function random(). It doesn't produce a globally unique identifier, but if you ask for a large enough range, it'll be unique with a very high probability
Use a Jython function, as shown in "How to create UUID in OpenRefine based on the MD5 hash of the values" (see the sketch after this list)
If you do your mapping using SPARQL, then use the built-in uuid() function
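As a concrete illustration of the hash-based options, here is a minimal Python sketch (the base URI and the column names org_id, person_id and role_id are made up for the example); the same idea works as a Jython expression inside OntoRefine:

import hashlib
import uuid

def relationship_uri(*keys):
    # Derive a deterministic URI for an n-ary relationship from the keys
    # of the entities it connects: MD5 of the joined keys, read as a UUID.
    digest = hashlib.md5("|".join(keys).encode("utf-8")).hexdigest()
    return "http://example.com/relationship/" + str(uuid.UUID(digest))

# The same inputs always produce the same URI, so the mapping stays
# consistent even when the data is split across several files:
print(relationship_uri("org42", "person7", "role3"))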

Related

When should I use ObjectId vs UUID in MongoDB

I'm making a simple CRUD application with MongoDB so I can learn more about it.
The application is a simple blog, I have a collection named "articles" which stores various documents, each one representing a post for my blog.
When I display the list of all blog posts, I can do a db.collection.find(), and list all of them.
But the question arises when I need to show a single post individually, i.e. when I need to query the collection for a single, specific document.
The logical solution would be to use an RDBMS with an auto-increment feature, but MongoDB is NoSQL and does not have auto-increment.
I'm using the auto-generated _id field of the document, which stores an ObjectId by default, which means that my URLs look like this:
http://localhost/blog/article.php?_id=5d41f6e5fc1a2f3d80645185
I saw in the documentation that the ObjectId contains a unique identifier for the server, together with a timestamp and a counter; isn't exposing these things a security risk?
As a solution, I stumbled upon UUID https://docs.mongodb.com/manual/reference/method/UUID/, which is an auto-generated unique ID that doesn't expose timestamp and machine info. It seems like a logical solution to use this instead of the _id that contains my ObjectId for querying and finding a document.
So I can make my URLs look like this:
http://localhost/blog/article.php?_id=23829651-26f7-4092-99d0-5be8658c966e
But still: should I keep the _id property? Should I add another one called "id" that stores the UUID? Should I even use UUIDs at all?
Here's what I would consider before choosing an identifier:
Collision
Risk of collision is very low for both UUIDs and ObjectIDs. This has been discussed in detail in another question.
Nature
UUIDs are random, whereas ObjectID values always increase over time. This makes ObjectIDs a bad choice for sharding: with range-based sharding, monotonically increasing keys send all new writes to the same shard.
Other uses
ObjectIDs embed the creation timestamp and can be used as a substitute for the commonly used createdAt field. A sort by ObjectID is a sort by creation time.
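For instance, with the bson package that ships with PyMongo you can read that embedded timestamp directly (a minimal sketch, reusing the ObjectId from the question):

from bson import ObjectId

oid = ObjectId("5d41f6e5fc1a2f3d80645185")
print(oid.generation_time)  # creation time embedded in the ObjectId (UTC)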
Insecure object references (OWASP)
Short def: an attacker cannot deduce the ID of another object if they have the ID of one object. You can read more about this here. Neither UUIDs nor ObjectIDs are vulnerable to this.
Link to another question that discusses the security of ObjectIDs (thanks zbee).
Ease of use
Note: This is subjective
Using ObjectIDs is a lot easier in the Mongo ecosystem. The existence of special aggregation operators for ObjectIDs, plus library support, adds to that.
Portability
UUIDs are more portable than ObjectIDs. I do not know of any system other than Mongo that uses ObjectIDs internally, whereas other DBs such as Postgres have a dedicated data type for UUIDs, plus extensions for random generation, etc.
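If you do go with UUIDs, one common pattern is to store the UUID (as a string) in _id itself, so the document keeps a single identifier; a minimal PyMongo sketch, assuming a local server and a "blog" database:

import uuid
from pymongo import MongoClient

db = MongoClient()["blog"]
db.articles.insert_one({
    "_id": str(uuid.uuid4()),  # replaces the auto-generated ObjectId
    "title": "My first post",
})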

Storing arbitrary key/value entries alongside a datomic entity

Say I have entities that I want to store in datomic. If the attributes are all known in advance, I just add them to my datomic schema once and can then make use of them.
What if, in addition to known attributes, entities could have an arbitrary number of arbitrary keys mapping to arbitrary values? Of course I could just store that list in some "blob" attribute that I also add to the schema, but then I couldn't easily query those attributes.
The solution that I've come up with is to define a key attribute and a value attribute in Datomic, each of type string, and to treat each of those additional key/value entries as an entity in its own right, using the aforementioned attributes. Then I can connect all those key/value entities to the actual entity by means of a 1:n relation using the ref type.
That allows me to query. Is that the way to go or is there a better way?
I would be reluctant to lose the power of attribute definitions. Datomic attributes can be added at any time, and the limit is reasonably high (2^20), so it may be reasonable to model the dynamic keys and values as they come along, creating a new attribute for each.

Custom Fields for a Form representing an object

I have an architectural question concerning custom fields in a view for an object. Let's say you have a User Object with some basic information like firstname, lastname, ... that can be used by all customers.
Now, we often get a request from a customer to add a couple of custom fields typical for their domain. Our current solution is an XML data column where key/value pairs are stored. This has been OK so far, but now we have to find a more architectural solution.
For instance, now a customer wants a dropdown where they can select the value for their custom field. We could still store the selected value in the XML data column, but where do we store all those dropdown values?
I know that in SharePoint you can also add custom fields like dropdowns, and I was wondering how to deal with this best. I want to avoid creating custom tables for customers, or having a table with 90 columns (10 basic and then 10 for each customer), ...
You get the idea, it should be generic and be able to deal with all sorts of problems in the future.
What I was thinking about is a table UserConfiguration where each record has a foreign key to the customer (Channel in our database), then a column FieldName, a column FieldType and a column Values. The Values column should be an XML-type column, because for a dropdown we'll need to store multiple values. Also, each value can have extra data attached to it (not just a name). The other problem then is how to store the selected value. I don't like the idea of having foreign keys into XML in my database (I read somewhere that Azure can't handle this all too well). Do you just store the name of the value (and what if the value were to disappear out of the XML)?
Any documentation, links on this kind of problems would also be great. I'm trying to find a design pattern that deals with this kind of problem in the database.
I want to answer your question in two parts:
1) Implementing custom fields in a database server
2) Restricting custom fields to an enumeration of values
Although common solutions to 1) are discussed in the question referenced by #Simon, maybe you are looking for a bit of discussion on what the problem is and why it hasn't been solved for us already.
databases are great for structured, typed data
custom fields are inherently less structured
therefore, custom fields are more difficult to work with in a database
some or many of the advantages of using a database are lost
some queries may be more difficult or impossible
type safety may be lost (in the database)
data integrity may no longer be enforced (by the database)
it's a lot more work for the implementers and maintainers
As discussed in the other question, there's no perfect solution.
But these benefits/features still need to be implemented somewhere, and so often the application becomes responsible for data integrity and type safety.
For situations like these, people have created Object-Relational Mapping (ORM) tools, although, as Jeff Atwood says, even using an ORM could create more problems than it solves. However, you mentioned that it 'should be generic and be able to deal with all sorts of problems in the future' -- this makes me think an ORM might be your best bet.
So, to sum up my answer, this is a known problem with known solutions, none of which are completely satisfactory (because it's so hard). Pick your poison.
To answer the second part of (what I think is) your question:
As mentioned in the linked question, you could implement Entity-Attribute-Value (EAV) in your database for custom fields, and then add an extra table to hold the legal values for each entity. The (attribute, value) pair of the EAV table is then a foreign key into the attribute-value table.
For example,
CREATE TABLE `attribute_value` ( -- enumerations go in this table
  `attribute` varchar(30),
  `value` varchar(30),
  PRIMARY KEY (`attribute`, `value`)
);

CREATE TABLE `eav` ( -- now the values of attributes are restricted
  `entityid` int,
  `attribute` varchar(30),
  `value` varchar(30),
  PRIMARY KEY (`entityid`, `attribute`),
  FOREIGN KEY (`attribute`, `value`) REFERENCES `attribute_value`(`attribute`, `value`)
);
Of course, this solution isn't perfect or complete -- it's only supposed to illustrate the idea. For instance, it uses varchars, and lacks a type column. Also, who gets to decide what the possible values for each attribute are? Can these be changed at any time by the user?
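To see the restriction in action, here is a quick sketch using Python's sqlite3 module (same two tables; foreign-key enforcement has to be switched on explicitly in SQLite):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when asked
con.executescript("""
    CREATE TABLE attribute_value (
        attribute TEXT,
        value     TEXT,
        PRIMARY KEY (attribute, value)
    );
    CREATE TABLE eav (
        entityid  INTEGER,
        attribute TEXT,
        value     TEXT,
        PRIMARY KEY (entityid, attribute),
        FOREIGN KEY (attribute, value)
            REFERENCES attribute_value (attribute, value)
    );
""")
con.execute("INSERT INTO attribute_value VALUES ('colour', 'red')")
con.execute("INSERT INTO eav VALUES (1, 'colour', 'red')")  # accepted
try:
    con.execute("INSERT INTO eav VALUES (2, 'colour', 'purple')")  # not enumerated
except sqlite3.IntegrityError as e:
    print(e)  # FOREIGN KEY constraint failed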
I'm doing something similar for a customer. I've created a JSON FieldType which holds the entire JSON stream of a complex object, plus a string containing the FQTN (fully qualified type name) of my C# model class.
By using custom New, Edit and Display forms, we ensured that our custom objects are rendered the correct way for the best user experience.
To promote fields from the complex C# model to the SharePoint list, we've built something like Microsoft did in InfoPath. Users are able to select properties or metadata from the complex C# type, which will be automatically promoted to the hosting SharePoint list.
The big advantage of JSON is that it's smaller than XML and easier to work with in the web world (JavaScript...).
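The same idea is easy to reproduce in other stacks; a rough Python sketch of storing a serialized object together with its type name (all names here are hypothetical):

import json

def to_field_value(obj):
    # Store the serialized object next to its fully qualified type name,
    # so the application knows which model class to rehydrate it into.
    return json.dumps({
        "type": type(obj).__module__ + "." + type(obj).__qualname__,
        "data": vars(obj),
    })

class Contract:
    def __init__(self, number, amount):
        self.number = number
        self.amount = amount

print(to_field_value(Contract("C-17", 1200.0)))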
When you let users create the data models, I would recommend looking at a document database or 'NoSQL' store, since that is exactly what you want: to store schemaless data structures.
Also, SharePoint stores metadata the way you mentioned (10 columns for text, 5 for dates, etc.).
That said, in my current project (locked into SharePoint, so .NET Framework 3.5 + SQL Server and all the constraints that follow) we use a structure somewhat similar to the one below:
Form
  Id
Attribute (or Field)
  Name
  Type (enum): Text, List, Dates, Formulas, etc.
  Hidden (bool)
  Mandatory
  DefaultValue
  Options (for lists)
  Readonly
  Mask (for SSN etc.)
  Length (for text fields)
  Order
Metadata
  FormId
  AttributeId
  Text (the value for everything but dates)
  Date (the value for dates)
Our formulas employ functions such as Increment: INC([attribute1][attribute2], 6), which would produce something like 000999 for the 999th instance of the combined values of attribute 1 and attribute 2 for a form (a rough sketch of the evaluation follows below). This is stored as:
AttributeIncrementFormula
  AttributeId
  Counter
  Token
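A rough sketch of how such an increment formula could be evaluated (names are hypothetical; in the real system the counter lives in the AttributeIncrementFormula table, keyed by the token):

counters = {}  # token -> last counter value; stands in for the table above

def inc(token, width=6):
    # INC([attribute1][attribute2], 6): next counter for the combined
    # attribute values, zero-padded to the requested width (999 -> '000999').
    counters[token] = counters.get(token, 0) + 1
    return str(counters[token]).zfill(width)

print(inc("Smith|John"))  # '000001' for the first Smith|John form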
Other 'formulas' (aka anything non-trivial) such as barcodes are stored as single metadata values. In the actual implementation, we would have something like this:
var form = formRepository.GetById(1);
var firstName = form.Metadata["firstname"].Value;
Value above is a read-only property that decides whether we should get the value from Text or Date, and whether some additional transform is required. Note that the database here is merely storage; we keep all the domain complexity in the application.
We also let our customer decide which attribute is the form title, for example; so if firstname is the form title, they'll set an in-memory param that spans the entire application, something like Params.InMemory.TitleAttributeId = <user-defined-id>.
I hope this gives you some insight into a production implementation of a similar scenario.
This is really more of a comment than an answer, but I need more space than SO will allow for comments, so here 'tis:
I think your UserConfiguration table approach is good, and would suggest only abstracting the "type" and "value" pieces of your design a bit more:
Since your application will need to validate user input, each notion of "type" will have an associated piece of evaluation logic. Obviously the more of this you can abstract into data the easier it will be to keep your code small. Enumerated lists are a good start, but if your "validator" logic can be extended to handle pattern matching for text strings and Boolean logical expressions (e.g. to describe/enforce constraints on input values), then you can express pretty much any "type" of input that your application may need to handle in terms of (relatively) simple "atoms" that you can map naturally to DB tables.
When storing a user-specified value, you can either store the "raw" data (e.g. in JSON) together with a foreign key to the associated "type", or you can add a lookup/cache system that assigns an integer to each new value encountered by the system ("novelty" can be checked via a hash of the "raw" data, for example). The latter approach obviously scales better if you're expecting lots of data duplication (which of course you would in the case of a multiple-choice menu).
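As a sketch of the "abstract the evaluation logic into data" idea from the previous paragraph (the type kinds and spec fields here are invented for illustration):

import re

VALIDATORS = {
    "enum":    lambda v, spec: v in spec["choices"],
    "pattern": lambda v, spec: re.fullmatch(spec["regex"], v) is not None,
    "range":   lambda v, spec: spec["min"] <= float(v) <= spec["max"],
}

def validate(value, spec):
    # spec would be one row of the UserConfiguration table, decoded to a dict.
    return VALIDATORS[spec["kind"]](value, spec)

print(validate("red", {"kind": "enum", "choices": ["red", "green"]}))  # True
print(validate("ab1", {"kind": "pattern", "regex": "[a-z]+"}))         # False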

Table and column naming conventions when plural and singular forms are odd or the same

In my search I found mostly arguments for whether to use plurality in database naming conventions, and ways to handle it in either case. I have decided I prefer plural table names, so I don't want to argue that.
I need to represent an animal's species, genus and so on in a database. The singular and plural forms of 'species' are the same, and the plural of 'genus' is 'genera'.
I'm using Microsoft's Entity Data Model, by the way. My concern is mainly about whether I'll have problems later on depending on my naming choices.
I think I can get by with:
Table: Genera | Column: Genus
But I'm unsure how I should handle:
Table: Species | Column: Species
If I really wanted to be lazy about this I'd just name them 'species > specie' and 'genuses > genus', but I would prefer to read them in their correct forms.
Any advice would be appreciated.
I would go for Genera/Genus and Species/Species. That's how you say it in English, so why use an incorrect form of the word?
I generally avoid having a column name that is the same as a table name because it can be confusing to human readers. The database engine knows whether it expects a table name or a column name in any given context, so I don't recall that ever being a problem. (Is there some context where either would be valid? I can't think of one.)
That said, if you run into this issue, it indicates to me that you have a poorly chosen name for one or the other. Species makes good sense as a table name: this table contains data about a species. So if a field in that table is called "species" ... what about the species? Presumably everything in the table is about a species. I'd guess it was probably some sort of identifier and not, say, the number of chromosomes or method of reproduction. But is it an ID number? An abbreviation? The common name? The binomial nomenclature name? Etc. If it's, say, the common name, I'd call it "common_name" and not "species".
By the way, another naming convention you should decide on is whether column names that could be ambiguous out of context should carry the context themselves, or whether you use the table name to eliminate the ambiguity. For example, many things have a "name". You could call any such field simply "name" and, where there's ambiguity, qualify it, like "species.name", "laboratory.name", etc. Or you could give each field a unique name, like "species_name", "laboratory_name", etc. That's one of those questions that I think has no definitively right answer, just pros and cons: make a decision and be consistent.

Analysis Services Dimension - best way to handle description or friendly name

If I have a dimension in Analysis Services where the base table has columns like this:
TransTypeKey TransTypeCode TransTypeDescription TransCategoryCode TransCategoryDescription
where the description columns are just friendly names for the corresponding 'code,' what's the best way to capture that? Concatenate the code and description when loading the dimension? Keep them separate?
That depends on what the user wants to see in the final cube. Is the dimension going to be sorted by the concatenated field? Do they normally sort/search by description or by code? If it is both, you will need attributes for both versions, or concatenations both ways: Code-Description as well as Description-Code.
In any case I'd leave the base table as is, then concatenate the columns in a view if you have access to the source database, or in the cube DSV if that is the only choice. That gives you some flexibility going forward.
If the code is unique on its own or as part of a composite key, you can assign the code to the member key property and the description to the member name property.
This works really well and keeps your key sizes small, assuming your codes are simple integers or short character fields compared to the larger description fields.
