How would I create a DCG to parse queries about a database?

How would I create a DCG to parse queries about a database? - database

I'm just playing around with learning Prolog and am trying to build a database of information and then use natural English to query about relationships. How would I go about doing this?
Example:
%facts
male(bob).
female(sarah).
male(timmy).
female(mandy).
parent(bob, timmy).
parent(bob, mandy).
parent(sarah, timmy).
parent(sarah, mandy).
spouse(bob, sarah).
spouse(sarah, bob).
%some basic rules
married(X,Y) :- spouse(X,Y).
married(X,Y) :- spouse(Y,X).
child(C,X) :- parent(X,C).
Now, I want to ask some "who" questions, i.e., "who is the parent of timmy".
I read something about DCGs, but, can anyone point me to some good resources or get me going in the right direction?
Thank you!

First, I would also like to ask a "who" question: In a fact like
parent(X, Y).
WHO is actually the parent? Is it X or is it Y?
To make this clearer, I strongly recommend you use a naming convention that makes clear what each argument means. For example:
parent_child(X, Y).
Now, to translate "informal" questions to Prolog goals that reason over your data, consider for example the following DCG:
question(Query, Person) --> [who,is], question_(Query, Person).
question_(parent_child(Parent,Child), Parent) --> [the,parent,of,Child].
question_(parent_child(Parent,Child), Child) --> [the,child,of,Parent].
question_(married(Person0,Person), Person) --> [married,to,Person0].
question_(spouse(Person0,Person), Person) --> [the,spouse,of,Person0].
Here, I am assuming you have already converted given sentences to tokens, which I represent as Prolog atoms.
In order to conveniently work with DCGs, I strongly recommend the following declaration:
:- set_prolog_flag(double_quotes, chars).
This lets you write for example:
?- Xs = "abc".
Xs = [a, b, c].
Thus, it becomes very convenient to work with such programs. I leave converting such lists of characters to tokens as an exercise for you.
Once you have such tokens, you can use the DCG to relate lists of such tokens to Prolog queries over your program.
For example:
?- phrase(question(Query, Person), [who,is,the,parent,of,timmy]).
Query = parent_child(Person, timmy) .
Another example:
?- phrase(question(Query, Person), [who,is,the,spouse,of,sarah]).
Query = spouse(sarah, Person).
To actually run such queries, simply post them as goals. For example:
?- phrase(question(Query, Person), [who,is,married,to,bob]),
Query.
Query = married(bob, sarah),
Person = sarah .
This shows that such a conversion is quite straight-forward in Prolog.
As #GuyCoder already mentioned in the comments, make sure to check out dcg for more information about DCG notation!

Related

How do I select for a row using multiple clauses in PACT smart contract language

The PACT documentation clearly states how to select for a single condition in the where clause, but it is not so clear on how to select for multiple clauses which seems much more general and important for real world use cases than a single clause example.
Pact-lang select row function link
For instance I was trying to select a set of dice throws across the room name and the current round.
(select 'throws (where (and ('room "somename") ('round 2)))
But this guy didn't resolve and the error was not so clear. How do I select across multiple conditions in the select function?

The first thing we tried was to simply select via a single clause which returns a list:
(select 'throws (where (and ('room "somename")))
A: [object{throw-schema},object{throw-schema}]
And then we applied the list operator "filter" to the result:
(filter (= 'with-read-function' 1) (select 'throws (where (and ('room "somename"))))
Please keep in mind we had a further function that read the round and spit back the round number and we filtered for equality of the round value.
This ended up working but it felt very janky.

The second thing we tried was to play around with the and syntax and we eventually found a nice way to express it although not quite so intuitive as we would have liked. It just took a little elbow grease.
The general syntax is:
(select 'throws (and? (where condition1...) (where condition2...))
In this case the and clause was lazy, hence the ? operator. We didn't think that we would have to declare where twice, but its much cleaner than the filter method we first tried.

The third thing we tried via direction from the Kadena team was a function we had yet to purview: Fold DB.
(let* ((qry (lambda (k obj) true)) ;;
(f (lambda(x) [(at 'firstName x), (at 'b x)])) ) (fold-db people (qry) (f)) )
This actually is the most correct answer but it was not obvious from the initial scan and would be near inscrutable for a new user to put together with no pact experience.
We suggest a simple sentence ->
"For multiple conditions use fold-db function."
In the documentation.
This fooled us because we are so used to using SQL syntax that we didn't imagine that there was a nice function like this lying around and we got stuck in our ways trying to figure out conditional logic.

Using Watson as a testing tool

I'm wondering about using Watson assistant as a simple tool for informal testing of medical students. I'm a bit confused as to whether this is an appropriate use. I have played around but am quite stuck.
I have a symptom X in mind, that, if the user asks about, Watson would spit out 3 questions sequentially, and test the users responses against some specific terms.
These questions look like
1. how much water does a 'symptom X' patient drink ?
Watson would take their input and compare it against definition somehow
what are the 3 diseases that can manifest with 'symptom X' ?
Watson would then take their input and compare it against the known list
what tests should be run immediately on a patient presenting with 'symptom X' ?
Watson would then compare their input to known list
Am I way off base with how I am using trying to use it?
-So far I have set up
intent = test_me (eg Can you test me)
#entity = symptom X
My first dialog node is if #test_me and #symptom X ->
'Sure, I can test you on symptom X'. I'm going to ask you 3 questions on this.
Pause.
Response -> how much water does a 'symptom X' patient drink ?
Their response would be along the lines of 'more than 100ml/kg/day'
How can I evaluate this response?
Is what I'm trying to do beyond the scope of a chatbot / WA?

The simple way would be by adding NLU (Natural Language Understanding) to the solution. If the language is English, NLU by default would get the 100ml as a Quantity and you can also use the syntax enchantment, if you need to apply a different rule when the user writes things like "more".
If there are more complexity to the sentences and NLU by default is not enough, you can train a custom model using WKS (Watson Knowledge Studio) and use it with NLU. The same applies for languages where the default model doesn't give you enough info.
NLU also have some understand of a good number of medical terms, that seems to be of use for your solution.
If you need to do it using only Watson Assistant, the only solution I can imagine is to use regex to get the number and the type (ml/day/km/etc). Something like "(\d+)(\w{2})"

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?

Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought, was: "Do we really need match step"? Secondarily, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
so here we used local step to restrict the traversal within local to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
gremlin> __.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But, it still left me feeling like: "Do we really need match step"? That's when Mr. Kuppitz closed in on the final answer which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.

Neo4j output format

After working with neo4j and now coming to the point of considering to make my own entity manager (object manager) to work with the fetched data in the application, i wonder about neo4j's output format.
When i run a query it's always returned as tabular data. Why is this??
Sure tables keep a big place in data and processing, but it seems so strange that a graph database can only output in this format.
Now when i want to create an object graph in my application i would have to hydrate all the objects and this is not really good for performance and doesn't leverage true graph performace.
Consider MATCH (A)-->(B) RETURN A, B when there is one A and three B's, it would return:
A B
1 1
1 2
1 3
That's the same A passed down 3 times over the database connection, while i only need it once and i know this before the data is fetched.
Something like this seems great http://nigelsmall.com/geoff
a load2neo is nice, a load-from-neo would also be nice! either in the geoff format or any other formats out there https://gephi.org/users/supported-graph-formats/
Each language could then implement it's own functions to create the objects directly.
To clarify:
Relations between nodes are lost in tabular data
Redundant (non-optimal) format for graphs
Edges (relations) and vertices (nodes) are usually not in the same table. (makes queries more complex?)
Another consideration (which might deserve it's own post), what's a good way to model relations in an object graph? As objects? or as data/method inside the node objects?
#Kikohs
Q: What do you mean by "Each language could then implement it's own functions to create the objects directly."?
A: With an (partial) graph provided by the database (as result of a query) a language as PHP could provide a factory method (in C preferably) to construct the object graph (this is usually an expensive operation). But only if the object graph is well defined in a standard format (because this function should be simple and universal).
Q: Do you want to export the full graph or just the result of a query?
A: The result of a query. However a query like MATCH (n) OPTIONAL MATCH (n)-[r]-() RETURN n, r should return the full graph.
Q: you want to dump to the disk the subgraph created from the result of a query ?
A: No, existing interfaces like REST are prefered to get the query result.
Q: do you want to create the subgraph which comes from a query in memory and then request it in another language ?
A: no i want the result of the query in another format then tabular (examples mentioned)
Q: You make a query which only returns the name of a node, in this case, would you like to get the full node associated or just the name ? Same for the edges.
A: Nodes don't have names. They have properties, labels and relations. I would like enough information to retrieve A) The node ID, it's labels, it's properties and B) the relation to other nodes which are in the same result.
Note that the first part of the question is not a concrete "how-to" question, rather "why is this not possible?" (or if it is, i like to be proven wrong on this one). The second is a real "how-to" question, namely "how to model relations". The two questions have in common that they both try to find the answer to "how to get graph data efficiently in PHP."
#Michael Hunger
You have a point when you say that not all result data can be expressed as an object graph. It reasonable to say that an alternative output format to a table would only be complementary to the table format and not replacing it.
I understand from your answer that the natural (rawish) output format from the database is the result format with duplicates in it ("streams the data out as it comes"). I that case i understand that it's now left to an alternative program (of the dev stack) to do the mapping. So my conclusion on neo4j implementing something like this:
Pro's - not having to do this in every implementation language (of the application)
Con's - 1) no application specific mapping is possible, 2) no performance gain if implementation language is fast
"Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results."
I don't understand this point entirely, are you saying that these formats are no able to hold deduplicated results (in certain cases)?? So infact that there is no possible textual format with which a graph can be described without duplication??
"There is also the questions on what you want to include in your output?"
I was under the assumption that the cypher language was powerful enough to specify this in the query. And so the output format would have whatever the database can provide as result.
"You could just return the paths that you get, which are unique paths through the graph in themselves".
Useful suggestion, i'll play around with this idea :)
"The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it".
Does the enriching process fetch additional data from the database or is the data already contained in the initial result?

There is more to it.
First of all as you said tabular results from queries are really commonplace and needed to integrate with other systems and databases.
Secondly oftentimes you don't actually return raw graph data from your queries, but aggregated, projected, sliced, extracted information out of your graph. So the relationships to the original graph data are already lost in most of the results of queries I see being used.
The only time that people need / use the raw graph data is when to export subgraph-data from the database as a query result.
The problem of doing that as a de-duplicated graph is that the db has to fetch all the result data data in memory first to deduplicate, extract the needed relationships etc.
Normally it just streams the data out as it comes and uses little memory with that.
Even if you use geoff, graphml or the gephi format you have to keep all the data in memory to deduplicate the results (which are returned as paths with potential duplicate nodes and relationships).
There is also the questions on what you want to include in your output? Just the nodes and rels returned? Or additionally all the other rels between the nodes that you return? Or all the rels of the returned nodes (but then you also have to include the end-nodes of those relationships).
You could just return the paths that you get, which are unique paths through the graph in themselves:
MATCH p = (n)-[r]-(m)
WHERE ...
RETURN p
Another way to address this problem in Neo4j is to use sensible aggregations.
E.g. what you can do is to use collect to aggregate data per node (i.e. kind of subgraphs)
MATCH (n)-[r]-(m)
WHERE ...
RETURN n, collect([r,type(r),m])
or use the new literal map syntax (Neo4j 2.0)
MATCH (n)-[r]-(m)
WHERE ...
RETURN {node: n, neighbours: collect({ rel: r, type: type(r), node: m})}
The dump command of the neo4j-shell uses the approach of pulling the cypher results into an in-memory structure, enriching it and then outputting it as cypher create statement(s).
A similar approach can be used for other output formats too if you need it. But so far there hasn't been the need.
If you really need this functionality it makes sense to write a server-extension that uses cypher for query specification, but doesn't allow return statements. Instead you would always use RETURN *, aggregate the data into an in-memory structure (SubGraph in the org.neo4j.cypher packages). And then render it as a suitable format (e.g. JSON or one of those listed above).
These could be a starting points for that:
https://github.com/jexp/cypher-rs
https://github.com/jexp/cypher_websocket_endpoint
https://github.com/neo4j-contrib/rabbithole/blob/master/src/main/java/org/neo4j/community/console/SubGraph.java#L123
There are also other efforts, like GraphJSON from GraphAlchemist: https://github.com/GraphAlchemist/GraphJSON
And the d3 json format is also pretty useful. We use it in the neo4j console (console.neo4j.org) to return the graph visualization data that is then consumed by d3 directly.

I've been working with neo4j for a while now and I can tell you that if you are concerned about memory and performances you should drop cypher at all, and use indexes and the other graph-traversal methods instead (e.g. retrieve all the relationships of a certain type from or to a start node, and then iterate over the found nodes).
As the documentation says, Cypher is not intended for in-app usage, but more as a administration tool. Furthermore, in production-scale environments, it is VERY easy to crash the server by running the wrong query.
In second place, there is no mention in the docs of an API method to retrieve the output as a graph-like structure. You will have to process the output of the query and build it.
That said, in the example you give you say that there is only one A and that you know it before the data is fetched, so you don't need to do:
MATCH (A)-->(B) RETURN A, B
but just
MATCH (A)-->(B) RETURN B
(you don't need to receive A three times because you already know these are the nodes connected with A)
or better (if you need info about the relationships) something like
MATCH (A)-[r]->(B) RETURN r

Take the qualifiers - postgresql

I'm working on the postgresql 8.4 source code. I need to extrapolate the qualifiers (where part) from the query.
For example if the query is: select name from student where age > 18
I need to know "age" and "18".
I've already took the target list, and the range list in this way
Query *query_idr = (Query *)linitial(querytree_list);
ListCell *l;
ListCell *tl;
foreach(l, query_idr->rtable){
Oid tab_idT = ((RangeTblEntry *) lfirst(l)) ->relid;
}
foreach(tl, query_idr->targetList){
TargetEntry *tle = (TargetEntry *) lfirst(tl);
Oid col_id = tle->resorigtbl;
}
and it works, and I've got the id of the table student (with the first foreach) and id of name column (with the second foreach), but I can't understand how I have to take the qualifier.
Here is the navigable Query structure http://doxygen.postgresql.org/structQuery.html

I doubt you are going to get an answer here. In general hacking the PostgreSQL source code is not likely to have enough people here who can answer it that a general site like this will be helpful. However rather than leave this withough any such resources I would like to reply to provide a list of resources for answering questions like this one as well as my reading of the docs as someone with quite a bit of experience buildign things on Pg.
In essence what you are trying to do is navigate through the parse tree of the query. It looks to me like the setOperations member might be the place to look just because I can't think of anywhere else and because this might help with both join conditions and where clause filters (remember these are considered interchangeable by the planner). However I have little experience in this area and so I could be wrong.
I would entirely second the suggestion that the pgsql-hackers list is likely to be the best place to ask this sort of question. You will probably get a better answer there.