I'm planning a system that combines various data sources and lets users run simple queries against them. Part of the system needs to act as an abstraction layer that knows all connected data sources: the user shouldn't need to know about the underlying data "providers". A data provider could be anything: a relational DBMS, a bug tracking system, ..., a weather station. They are hooked up to the query system through a common API that defines how to "offer" data. The type of queries a particular data provider understands is given by its "offer" (e.g. "I know these entities", "I can give you aggregates of type X for relationship Y", ...).
My concern right now is the unification of the data: the various data providers need to agree on a common vocabulary (e.g. the name of the entity "customer" could vary across different systems). Thus, defining a high level representation of the entities and their relationships is required.
So far I have the following requirements:
I need to be able to define objects and their properties/attributes. Further, arbitrary relations between these objects need to be represented: a verb that defines the nature of the relation (e.g. "knows"), the multiplicity (e.g. 1:n) and the direction/navigability of the relation.
It occurs to me that RDF is a viable option, but is it "the right tool" for this job?
What other solutions/frameworks do exist for semantic data modeling that have a machine readable representation and why are they better suited for this task?
I'm grateful for every opinion and pointer to helpful resources.
If you need cardinality restrictions on relations (for example, "a Person knows 1:n Languages"), then RDF alone is not enough (see http://www.w3.org/TR/2004/REC-rdf-primer-20040210/#richerschemas). You will need an ontology language (at least OWL-DL for cardinalities greater than 1: http://www.w3.org/TR/owl-guide/#owl_cardinality).
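For concreteness, here is a minimal sketch using Python's rdflib of the kind of restriction the OWL guide describes; the ex: namespace and the class/property names are invented for the example.

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import OWL, RDF, RDFS, XSD

EX = Namespace("http://example.org/schema#")
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

# Plain RDF/RDFS covers the entities, properties and their domains/ranges ...
g.add((EX.Person, RDF.type, OWL.Class))
g.add((EX.Language, RDF.type, OWL.Class))
g.add((EX.knows, RDF.type, OWL.ObjectProperty))
g.add((EX.knows, RDFS.domain, EX.Person))
g.add((EX.knows, RDFS.range, EX.Language))

# ... but the 1:n multiplicity needs an OWL restriction on the class.
restriction = BNode()
g.add((restriction, RDF.type, OWL.Restriction))
g.add((restriction, OWL.onProperty, EX.knows))
g.add((restriction, OWL.minCardinality, Literal(1, datatype=XSD.nonNegativeInteger)))
g.add((EX.Person, RDFS.subClassOf, restriction))

print(g.serialize(format="turtle"))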
I'd also consider an XML database and XQuery, and perhaps Topic Maps (which are quite similar to RDF but less widely known).
There is also a broad range of less standardised tools to consider, such as CouchDB (which uses JSON).
There's rarely a single 'right tool', but RDF is a very strong contender given your requirements.
I have picked up the book named "Fundamentals of Database Systems, 3rd Edition" by Elmasri and Navathe to get a basic understanding first. I have started reading it from the first chapter.
A database is a logically coherent collection of data with some inherent meaning, representing some aspect of real world and which is designed, built and populated with data for a specific purpose.
What does the above paragraph mean?
A database is seen as a particular perspective on data and its representation in a framework of well-defined structures and interdependencies.
Breaking the definition down into parts:
'collection of data':
What it's all about.
'with some inherent meaning':
Mostly tautological, it would not constitute data otherwise. It shows, however, that databases do not exist to elicit the meaning of data. They may aid in doing so, though.
'representing some aspect of real world':
Contestable, as databases may represent data over abstract domains like mathematics (e.g. a database of twin primes), unless this also counts as 'real world', which would make this part tautological.
'logically coherent':
Data items are related in a non-arbitrary way that allows reasoning about them. Often this aspect also includes comprehensiveness (as an objective at least) for the purpose at hand.
'for a specific purpose':
The intended perspective on the data, which co-determines the nature of structures and relationships the database will be composed of.
In particular, the choice of representations and the abstractions applied (e.g. which parts of the available data are dropped) depend on the intended purpose.
'designed, built and populated with data':
Implies that databases comprise a model and use a technical base. It also implies that databases focus on the description of data.
The usefulness of such high-level descriptions is probably limited but may help to focus on some key issues wrt databases:
- Describing data \
- Structuring data > modelling data
- Relating data items to each other /
- Reasoning over data
- Databases are tools
My teacher wrote the answer for what a database is in the slides; it says this:
Database:
a collection of data
represents some aspect of the real world (database represent something in the real world)
logically coherent collection (not a random collection)
designed, built & populated for a specific purpose
Please share the difference between homonyms and synonyms in data science with examples.
Synonyms for concepts:
When you determine that two concepts are synonyms (say, sofa and couch), you use owl:equivalentClass. The entailment here is that any instance that was a member of the class sofa is now also a member of the class couch, and vice versa. One of the nice things about this approach is that the "context" of this equivalence is automatically scoped to the ontology in which you make the equivalence statement. If you had a very small mapping ontology between a furniture ontology and an interior decorating ontology, you could say in the map that these two are equivalent. In another situation, if you needed to retain the (subtle) difference between a couch and a sofa, you do that by merely not including the mapping ontology that declared them equivalent.
Homonyms for concepts:
As Led Zeppelin put it, "and you know sometimes words have two meanings…" When a "word" has two meanings, we have what WordNet would call "word senses." In a particular language, a set of characters may represent more than one concept. One example is the English word "mole," for which WordNet has six word senses. The Semantic Web approach is to give each its own namespace; for instance, I might refer to the counterspy mole as cia:mole and the burrowing mammal as mammal:mole. (These are shortened qnames for what would be full namespace names.) The nice thing about this is that if the CIA ever needed to refer to the animal, they could unambiguously refer to mammal:mole.
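A small rdflib sketch of both ideas, for anyone who wants to see them side by side; all ontology IRIs and prefixes (furn:, decor:, cia:, mammal:) are invented for illustration.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

FURN = Namespace("http://example.org/furniture#")
DECOR = Namespace("http://example.org/decorating#")
CIA = Namespace("http://example.org/counterspy#")
MAMMAL = Namespace("http://example.org/mammal#")

# Synonyms: a tiny mapping ontology stating that sofa and couch are the same class.
mapping = Graph()
mapping.add((FURN.sofa, OWL.equivalentClass, DECOR.couch))

# Homonyms: the two senses of "mole" simply live in different namespaces,
# so cia:mole and mammal:mole can never collide.
facts = Graph()
facts.add((CIA.mole, RDF.type, OWL.Class))
facts.add((MAMMAL.mole, RDF.type, OWL.Class))

# The equivalence only holds when you choose to load the mapping ontology.
merged = facts + mapping

Leaving the mapping graph out is how you retain the subtle difference again; nothing in the fact data itself has to change.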
1. Homonyms are words that sound the same but differ in meaning.
2. Synonyms are words that have the same or almost the same meaning.
Homonyms
Machine learning algorithms are now the subject of ethical debate. Bias, in layman's terms, is a pre-formed view created before the facts are known. In machine learning and data mining, it refers to an estimating procedure's tendency to produce estimates or predictions that are, on average, off target.
A policy's strength can be measured in a variety of ways, one of which is called confidence; "decision trees" can also mean diagrams that show how decisions are made and what consequences are available; and to "normalise" a statistic can mean rescaling it to match the scale of other variables in the model.
To a statistician, however, confidence is a metric for how reliable a sample is (we are 95 percent confident that the average blood sugar in the group lies between X and Y, based on a sample of N patients), and decision tree algorithms are methods that split data into pieces that become more and more homogeneous in terms of the outcome measure as they advance.
Likewise, to a statistician a graph is a graphical representation of data (plots and charts), whereas to a computer programmer a graph is a data structure that captures the ties and links among items. And normalisation, in the database sense, is the act of arranging relational tables and their columns so that table relationships are consistent.
Synonyms
Depending on who you ask, the same piece of data goes by different names: a row of data may be called a record, instance, sample, or example, and a column may be called a variable, attribute, input, or feature, with statisticians and machine learning practitioners favouring different terms. "Estimation" is shared too, though in machine learning its use is generally limited to predicting numeric outcomes, whereas in statistics estimation more often refers to using a sample statistic to measure something about a population.
The spreadsheet format, in which each column is a variable and each row is a record, is perhaps the most common non-time-series data layout.
Predictive modelling in machine learning and artificial intelligence often begins with very low-level predictors and builds aggregations of them into more informative "features".
What is the difference between a DBMS and an RDBMS? Please give some examples, including some newer tools. And why can't we simply use a DBMS instead of an RDBMS, or vice versa?
A relational DBMS will expose to its users "relations, and nothing else". Other DBMSs will violate that principle in various ways. E.g. in IDMS, you could do ACCEPT <hostvar> FROM CURRENCY, and this would expose the internal record id of the "current record" to the user, violating the "nothing else".
A relational DBMS will allow its users to operate exclusively at the logical level, i.e. work exclusively with assertions of fact (which are represented as tuples). Other DBMS's made/make their users operate more at the "record" level (too "low" on the conceptual-logical-physical scale) or at the "document" level (in a certain sense too "high" on that same scale, since a "document" is often one particular view of a multitude of underlying facts).
A relational DBMS will also offer facilities for manipulating the data, in the form of a language that supports the operations of the relational algebra. Other DBMSs, seeing as they don't support relations in the first place, obviously cannot build their data manipulation facilities on the relational algebra, and as a consequence their data manipulation facilities/language are mostly ad hoc. On the "too low" end of the spectrum, this forces DBMS users to hand-write operations such as JOIN again and again (see the sketch below). On the "too high" end of the spectrum, it causes combinatorial explosion in language complexity/size (the RA has some 4 or 5 primitive operators and that's all it needs; can you imagine 4 or 5 operators that would allow you to do just any "document transform" anyone would ever want to do?)
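To make the hand-written-JOIN point concrete, here is a rough Python sketch of what a record-level store pushes onto the application; the record layout and field names are invented.

# What a declarative "SELECT ... FROM customers JOIN orders ON ..." does for you
# has to be re-coded by hand, for every pair of record types, in a record-level store.
customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"order_id": 10, "customer_id": 1, "total": 25.0},
          {"order_id": 11, "customer_id": 1, "total": 7.5}]

by_id = {c["id"]: c for c in customers}          # build the lookup ...
joined = [{**by_id[o["customer_id"]], **o}       # ... and merge matching records
          for o in orders if o["customer_id"] in by_id]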
(Note very carefully that even SQL systems violate basic relational principles quite seriously, so "relational DBMS" is a thing that arguably doesn't even exist, except then in rather small specialized spaces, see e.g. http://www.thethirdmanifesto.com/ - projects page.)
DBMS: database management system; here we can store some data and read it back.
Imagine a single table: save and read.
RDBMS: relational database management system; here you can join several tables together and get related, queried data (say, data for a particular user or a particular order, not all users or all orders).
The normalization forms come into play in an RDBMS: we don't need to store repeated data again and again; we can store it in one table and use its id in another table. That makes updates easier, and for reading we can join the tables and get what we want (see the sketch below).
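A minimal sketch of that store-once-and-join idea, assuming SQLite and invented table names:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         item TEXT);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'mug');
""")

# The user's name is stored once; orders only carry the id,
# and a join brings the related data back together on read.
rows = con.execute("""
    SELECT u.name, o.item
    FROM orders o JOIN users u ON u.id = o.user_id
    WHERE u.id = 1
""").fetchall()
print(rows)   # [('Alice', 'book'), ('Alice', 'pen')]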
DBMS:
DBMS applications store data as files. In a DBMS, data is generally stored in either a hierarchical form or a navigational form. Normalization is not present in a DBMS.
RDBMS:
RDBMS applications store data in tabular form. In an RDBMS, the tables have an identifier called a primary key, and the data values are stored in the form of tables. Normalization is present in an RDBMS.
I'm currently working on a project where I use natural language processing to extract emotions from text to correlate them with contextual information.
Definition of contextual information: any information that is relevant to describing an entity's situation in time and space.
Description of the data structure I'm looking for:
There is an arbitrary number of entities (an entity can be, for example, either a person or a group (e.g. a Twitter hashtag)) for which I want to track contextual information and their conversations with other entities. Conversations between entities are processed in order to classify their emotional features. Basic emotional features consist of a vector that specifies their occurrence as percentages: {fear: 0.1, happiness: 0.4, joy: 0.1, surprise: 0.9, anger: 0}
Entities can also submit any contextual information they'd like to share, for example location, room temperature, blood pressure, ... and so on (I will refer to these as contextual variables).
Because neither the number of conversations of an entity, nor the number of contextual variables they want to share is clear at any point in time, the data structure needs to be able to adjust accordingly.
Important: every change in the data must also be represented as a state of its own, as I intend to correlate certain changes in state with each other.
Example: Bob and Alice have a conversation that shows high magnitude of fear. A couple of hours later they have another conversation that shows no more fear, but happiness.
Now, one could argue that a high magnitude of fear followed by happiness could actually be interpreted as the emotion relief.
However, in order to be able to extract this very information I need to be able to correlate different states with each other.
Same goes for using contextual information to correlate them with the tracked emotions in conversations.
This is why every state change must be recorded and available.
To make this more clear to you, I've created a graphic and attached it to the question.
Now, the actual question I have is: Which database/data structure can I use to solve this problem?
I've looked into event-sourcing databases but wasn't quite convinced if I can easily recreate a graph structure with them. I also looked at graph databases but didn't find what I was looking for.
Therefore it would be nice if someone here could at least point me in the right direction or help me adjust my structure accordingly to solve the problem. If, however, there are data structures supporting what I would call graph databases with snapshots, then ease of use is probably the most important feature to filter for.
There's a database called Datomic by Rich Hickey (of Clojure fame) that stores facts over time. Every entry in the database is a fact with a timestamp, append-only as in Event Sourcing.
These facts can be queried with a relational/logical language à la Datalog (reminiscent of Prolog). See this post by kisai for a quick overview. It has been used for querying graphs with some success in the past: Using Datomic as a Graph Database.
While I have no experience with Datomic, it does seem to be quite suitable for your particular problem.
You have an interesting project. I don't work on things like this directly, but here are my 2 cents.
It seems to me your picture is a bit flawed. You are trying to represent a graph database over time, but there isn't really a way to represent time this way.
If we examine the image, you have conversations and context data changing over time, but the facts "Bob", "Alice" and "Malory" don't actually change over time. So let's remove them from the equation.
Instead, focus on the things you can model over time: a conversation, a context, a location. These will change as new data comes in, and they are excellent candidates for an event-sourced model. In your app, a conversation would be modeled as a series of individual events, which your aggregate would combine and factor to generate a final state, which would be your 'relief' determination.
For example, you could write logic where, if a conversation was angry and then a very happy event came in, the subject is now feeling relief.
What I would do is model these conversation states in your graph DB, connected to your 'fact' objects "Bob", "Alice", etc. A query such as 'What is Alice feeling right now?' would then be a graph traversal through your conversation states, factoring in the context data connected to Alice.
To answer a question such as 'What was Alice feeling 5 minutes ago?', you would take all the event streams for the conversations, rewind them to the appropriate point, and then examine the state of the conversations (see the sketch below).
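A bare-bones sketch of that rewind idea, assuming a simple in-memory append-only log; all names are illustrative, and a real system would persist the events and build richer aggregates than last-write-wins.

from dataclasses import dataclass

@dataclass
class EmotionEvent:
    conversation_id: str
    timestamp: float          # seconds since epoch
    emotions: dict            # e.g. {"fear": 0.8, "happiness": 0.1}

event_log: list[EmotionEvent] = []   # append-only, as in event sourcing

def record(event: EmotionEvent) -> None:
    event_log.append(event)

def state_at(conversation_id: str, as_of: float) -> dict:
    """Rewind: fold all events for one conversation up to a point in time."""
    state: dict = {}
    for e in sorted(event_log, key=lambda e: e.timestamp):
        if e.conversation_id == conversation_id and e.timestamp <= as_of:
            state = e.emotions   # naive aggregate: keep the latest emotion vector
    return state

record(EmotionEvent("bob-alice", 1000.0, {"fear": 0.8}))
record(EmotionEvent("bob-alice", 8200.0, {"happiness": 0.7}))
print(state_at("bob-alice", 2000.0))   # {'fear': 0.8}
print(state_at("bob-alice", 9000.0))   # {'happiness': 0.7}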
TLDR:
Separate the time dependent variables from the time independent variables and use event sourcing to model time.
There is an obvious 1:1 correspondence between your states at a given time and a relational database with a given schema. So there is an obvious 1:1 correspondence between your set of states over time and a changing-schema database, ie a variable whose value is a database plus metadata, manipulated by both DDL and DML update commands. So there is no evidence that you shouldn't just use a relational DBMS.
Relational DBMSs allow generic querying with automated implementation at a certain computational complexity with certain opportunities for optimization. Any application can have specialized queries that make a specialized data structure and operators a better choice. But you must design your application and know about such special aspects to justify this. As it is, with the obvious correspondences between your states and relational states, this has not been justified.
EAV is frequently used instead of DDL and a changing schema. But under EAV the DBMS does not know the real tables you are concerned with, which have columns that are EAV attributes, and which are explicit in the DDL/DML changing schema approach. So EAV foregoes simplicity, clarity, optimization and most of all integrity and ACID. It can only be justified (compared to DDL/DML, assuming a relational representation is otherwise appropriate) by demonstrating that DDL with schema updates (adding, deleting and changing columns and tables) is worse (per the above) than EAV in your particular application.
Just because you can draw a picture of your application state at some time using a graph does not mean that you need a graph database. What matters is what specialized queries/expressions you will be evaluating. You should understand what these are in terms of your problem domain, which is probably most easily expressed via some specialized data structure and operators, and relationally. Then you can compare the expressive and computational demands of a specialized data structure, a relational representation, and the models of particular graph databases. Be sure to search Stack Overflow.
According to Wikipedia, "Neo4j is the most popular graph database in use today".
Object-relational mappers were created to help applications (which think in terms of objects) deal with stored data in an application-friendly way, treating it like any other class/object.
However, I have never seen an OKM (object-key/value mapper) for NoSQL "key/value" storage systems. That seems odd, because the need should be far greater: more value relations have to be hard-coded into the app than for a regular, single SQL table row mapped to an object.
four requests:
user:id
user:id:name
user:id:email
user:id:created
vs one request:
user = [id => ..., name => ..., email => ...]
Plus you must keep track of "lists" (post has_many comments) since you don't have has_many through tables or foreign keys.
INSERT INTO user_groups (user_id, group_id) VALUES (23, 54)
vs
usergroups:user_id = {54,108,32,..}
groupsuser:group_id = {23,12,645,..}
And there are many more examples of the extra logic an application would need to replicate basic features that normal relational databases provide (a sketch follows below). All of these reasons make the idea of an OKM sound like a shoo-in.
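Roughly, the hand-rolled mapping code looks like this with redis-py; the key names follow the examples above and are purely illustrative.

import redis

r = redis.Redis()

# One hash per user instead of four separate keys / four requests:
r.hset("user:23", mapping={"name": "Ada", "email": "ada@example.org",
                           "created": "2011-01-01"})
user = r.hgetall("user:23")      # one round trip

# The "has_many through" bookkeeping is maintained by hand, typically as a
# pair of sets that the application must keep in sync on every write:
r.sadd("usergroups:23", 54)      # user 23 belongs to group 54
r.sadd("groupusers:54", 23)      # group 54 contains user 23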
Are there any? Are there any reasons there are not any?
Ruby's DataMapper project is an ORM and will happily talk to a key-value store through the use of an adapter.
Redis and MongoDB have adapters that already exist. CouchDB has an adapter — it's not maintained, but at one point it worked pretty well. I don't think anyone's done anything with Cassandra yet, but there's no reason it couldn't be done. The Dubious framework for Google App Engine takes a very similar approach to Data Mapper to make the Data Store available to applications.
So it's very possible to do ORM with key-value stores. The ORM just really needs to avoid the assumption that SQL is its primary vocabulary.
One of the design goals of SQL is that any data can be stored and queried in any relational database. There are some differences between platforms, but in general the correct way to handle a particular data structure is well known and easily automated, though it requires fairly verbose code. That is not the case with NoSQL: generally you will store the data directly, as used in your application, rather than trying to map it to a relational structure, and without joins or other object/relational differences the mapping code is trivial.
Beyond generating the boilerplate data access code, one of the main purposes of an ORM is abstraction of differences between platforms. In my experience the ability to switch platforms has always been purely theoretical, and this lowest-common-denominator approach simply won't work for NoSQL, as the platform is usually chosen specifically for capabilities not present on other platforms. Your example is only for the most trivial key-value store; depending on your platform you most likely have some useful additional commands, so your first example could be:
MGET user:id:name user:id:email ... (multiget - get any number of keys in a single call)
GET user:id:* (key wildcards)
HGETALL user:id (redis hash - gets all subkeys of user)
You might also have your user object stored in a serialized form - unlike in a relational database this will not break all your queries.
Working with lists isn't great if your platform doesn't have support built in - native list/set support is one of the reasons I like to use redis - but aside from potentially needing locks it's no worse than getting the list out of sql.
It's also worth noting that you may not need all the relationships you would define in SQL. For example, if you have a group containing a million users, the ability to get a list of all users in a group is completely useless, so you would never create the groupsuser list at all; rather than a separate usergroups list, you could have user:id:groups as a multivalue property. If you just need to check for membership, you could set up keys as usergroups:userid:groupid and get constant-time lookup.
I find it helps to think in terms of indexes rather than relationships: when setting up your data access code, decide which fields will need to be queried and add appropriate index records when those fields are written (sketched below).
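A small redis-py sketch of that index-on-write approach; the key names and helper are made up for the example.

import redis

r = redis.Redis()

def save_user(user_id: int, email: str, group_ids: list[int]) -> None:
    # Primary data.
    r.hset(f"user:{user_id}", mapping={"email": email})
    # Index records for the queries we know we'll need:
    r.set(f"user:email:{email}", user_id)           # look up a user by email
    r.sadd(f"user:{user_id}:groups", *group_ids)    # groups as a multivalue property
    for gid in group_ids:
        r.set(f"usergroups:{user_id}:{gid}", 1)     # constant-time membership check

save_user(23, "ada@example.org", [54, 108])
print(r.exists("usergroups:23:54"))   # 1 -> user 23 is in group 54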
ORMs don't map terribly well to the schema-less nature of key-value stores. That being said, if you're using Riak and Ruby, you could take a look at Ripple. There are a number of other drivers for Riak which might fit with your language.
If you're looking into MongoDB (more of a document store than a k/v store), there are a number of drivers available.
The UniVerse DB, which is a descendant of Pick, lets you store a list of key-value pairs for a given key. However, this is very old technology, and the world moved away from these databases a long time ago.
You can implement this in an SQL database with a three-column table:
CREATE TABLE ATTRS ( KEYVAL   VARCHAR(32),
                     ATTRNAME VARCHAR(32),
                     ATTRVAR  VARCHAR(1024)
                   )
Although most DBAs will hit you over the head with the very thick Codd and Date hardback if you propose this, it is in fact a very common pattern in packaged applications for allowing site-specific attributes to be added to a system.
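A quick sqlite3 sketch of using that ATTRS table for site-specific attributes; the sample keys and attribute names are invented.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ATTRS ( KEYVAL   VARCHAR(32),
                                    ATTRNAME VARCHAR(32),
                                    ATTRVAR  VARCHAR(1024) )""")

# "Schema-less" writes: any attribute name for any key.
con.executemany("INSERT INTO ATTRS VALUES (?, ?, ?)",
                [("user:23", "name", "Ada"),
                 ("user:23", "favourite_colour", "green"),   # site-specific extra
                 ("order:7", "status", "shipped")])

# Reassembling one "object" means pivoting its rows back together.
rows = con.execute("SELECT ATTRNAME, ATTRVAR FROM ATTRS WHERE KEYVAL = ?",
                   ("user:23",)).fetchall()
print(dict(rows))   # {'name': 'Ada', 'favourite_colour': 'green'}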
To paraphrase Richard Stallman's comments on Lisp:
"Any reasonably functional data storage system will eventually end up implementing its own version of an RDBMS."