How do I build a domain-specific query language? - database

I have a biology database that I would like to query. There is also a given terminology bank I have access to that has formalizable predicates. I would like to build a query language for this DB using the predicates mentioned. How would you go about it? My solution is the following:
1. Formalize the predicates.
2. Translate them into a query language (SQL, SPARQL, it depends).
3. Build a specific language with ANTLR or other such tools.
4. Translate from 3 to 2.
Is this a valid approach? Are there better ones? Any pointers would be much appreciated.

Take a look at Booleano.

Use BNF to get a head start on the language's grammar. GOLD Parser (http://www.devincook.com/) will help you experiment with the syntax. Once you have the BNF grammar sorted out, you can attach actions to the rules. For example, a section of the grammar could deal with extracting the genetic makeup classification of a limb (I do not know whether such a thing exists; it is an abstract example, but you get the gist) for a query such as 'fetch stats on limb where limb is leg'. Behind the scenes you would then issue a SQL SELECT on a column alias or name from a predefined table. I could be wrong about the approach. Hope it helps.
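A minimal sketch of that last step, assuming a single sentence shape and using a regular expression in place of a generated parser. The table `limb_stats`, the column `limb_type` and both whitelists are hypothetical names chosen for illustration; they are not part of any real schema.

```python
import re

# Hypothetical whitelists: DSL entity -> (table, filter column); DSL target -> select list.
ENTITIES = {"limb": ("limb_stats", "limb_type")}
TARGETS = {"stats": "*"}

# One sentence shape only: fetch <target> on <entity> where <field> is <value>
SENTENCE = re.compile(
    r"fetch\s+(?P<target>\w+)\s+on\s+(?P<entity>\w+)"
    r"\s+where\s+(?P<field>\w+)\s+is\s+(?P<value>\w+)",
    re.IGNORECASE,
)

def translate(sentence):
    """Translate one DSL sentence into (sql, params), or raise ValueError."""
    match = SENTENCE.fullmatch(sentence.strip())
    if match is None:
        raise ValueError(f"cannot parse: {sentence!r}")
    target, entity, field = (match[g].lower() for g in ("target", "entity", "field"))
    if target not in TARGETS or entity not in ENTITIES or field != entity:
        raise ValueError(f"unknown name in: {sentence!r}")
    table, column = ENTITIES[entity]
    # Identifiers come only from the whitelists; the value is bound as a parameter.
    return f"SELECT {TARGETS[target]} FROM {table} WHERE {column} = ?", (match["value"],)

sql, params = translate("fetch stats on limb where limb is leg")
```

A real grammar built with a parser generator would replace the regular expression, but the action attached to the rule would look much the same: look names up in a whitelist and bind user values as parameters rather than pasting them into the SQL text.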

I suggest you take a look at the i2b2 framework. It's a graphical query language and query engine platform for patient databases.
It's probably hard to grasp it all at first, but do take a look at the CRC cell or web service in there. You'll see how they approached SQL generation from a clinical graphical query language in an interesting way (albeit not a very performance-friendly one :))

Consider using Irony.NET.

Related

Is it possible to use only one field to index traditional and simplified Chinese?

The official Solr documentation points to separating Simplified and Traditional Chinese by using different tokenizers. I wonder whether people use the ICU Transform Filter for Traditional <-> Simplified conversion so that a single field can serve both variants of Chinese.
At the same time, it seems that this conversion is a really hard task and it doesn't seem to be solved.
The simple question is what is the recommended way of indexing traditional and simplified Chinese in Solr? It would be really convenient to have a unique field for both, but I couldn't find a good success case for that.
The truth is, it is possible. This video shows how you could create a field with as many languages as possible. But it looks tricky.

Query equivalence evaluation

My question is rooted in T-SQL and the SQL Server environment, but its scope is not confined to this technology. I am working on a database with quite complex business logic, with existing views and stored procedures and new ones to be designed. From comparing different queries, or parts of them, I have a strong feeling that some sections perform the same job with a different arrangement. But of course, to refactor the whole mess I need something more than a feeling, so I am trying to find a way to demonstrate that two statements are equivalent.
An obvious but weak approach would be to check that the two queries A and B produce the same recordset: if A's result is a subset of B's and B's is a subset of A's, they are the same recordset. But I am not sure this is a good idea because, of course, a recordset is not a query; the results could depend on the data and on specific parameter values. My question is: is there a method to prove the equivalence of two different queries? I would say yes, because the optimization performed by the database should rely on this. Could someone point me to documentation or books that dig into this? If there is no general method to prove equivalence, is there some smart approach based on regression testing, guided by some effective heuristic, that does the job?
Edited later: could reverse engineering the queries (by hand?) into relational algebra be a better way to assess query equivalence than using other queries and/or the computer? And are there automated tools that help with this "reverse engineering"?
Thanks a lot for helping
You probably can't prove it in general: query equivalence is NP-complete even for simple conjunctive queries and undecidable for full relational algebra. Check this SO question on query equivalence (that one is about Oracle, but there are a couple of answers / links that should be relevant for you).
You can check the execution plans of the two queries. If they are the same, you have your answer!
You can only check this via the execution plan. Apart from that, I don't think there is any way to prove it.
You'll need to implement some "canonical query plan" generator for this (an "optimal query plan" as generated by the DBMS can be nondeterministic). In most cases, using alphabetical ordering of terms and tables as a tie-breaker will get you there.
I doubt you are going to be able to formally prove or disprove this, but my take on this would be to
identify all use cases
identify all boundary values
identify all parameters
and derive a test plan from that. It would require you to
create test data for each case
run both queries against that data
compare the results
If you don't find any differences after testing, you can be reasonably assured that both statements are equivalent.
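The test plan above can be sketched as follows, assuming an in-memory SQLite database as a stand-in for the real server. The `orders` table, its rows and the three queries are made up for illustration. Results are compared as multisets, so row order is ignored but duplicate rows still count.

```python
import sqlite3
from collections import Counter

def same_results(conn, query_a, query_b, params=()):
    """Compare two queries as multisets of rows (order ignored, duplicates counted)."""
    rows_a = Counter(conn.execute(query_a, params).fetchall())
    rows_b = Counter(conn.execute(query_b, params).fetchall())
    return rows_a == rows_b

# Hypothetical test data; the NULL amount is a deliberate boundary value.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
    INSERT INTO orders (customer, amount) VALUES ('a', 10), ('b', 0), ('a', NULL);
""")

QUERY_A = "SELECT customer FROM orders WHERE amount > 5"
QUERY_B = "SELECT customer FROM orders WHERE NOT amount <= 5"
QUERY_C = "SELECT customer FROM orders WHERE amount > 5 OR amount IS NULL"

print(same_results(conn, QUERY_A, QUERY_B))
print(same_results(conn, QUERY_A, QUERY_C))
```

Here the NULL row is what separates `QUERY_C` from the other two, and only the duplicate count reveals it: a plain set comparison would call all three queries equal on this data. Agreement on test data is evidence of equivalence, not a proof.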

Is there a way to translate database table rows into Prolog facts?

After doing some research, I was amazed with the power of Prolog to express queries in a very simple way, almost like telling the machine verbally what to do. This happened because I've become really bored with Propel and PHP at work.
So, I've been wondering if there is a way to translate database table rows (Postgres, for example) into Prolog facts. That way, I could stop using so many boring joins and using ORM, and instead write something like this to get what I want:
mantenedora_ies(ID_MANTENEDORA, ID_IES) :-
    papel_pessoa(ID_PAPEL_MANTENEDORA, ID_MANTENEDORA, 1),
    papel_pessoa(ID_PAPEL_IES, ID_IES, 6),
    relacionamento_pessoa(_, ID_PAPEL_IES, ID_PAPEL_MANTENEDORA, 3).
To see why I've become bored, look at this post. The code there would be replaced by the few simple lines above, which are much easier to read and understand. I'm just curious, since it will be impossible to replace things around here.
It would also be cool if something like this could be done in PHP. Does anyone know of something like that?
Check the ODBC interface of SWI-Prolog (maybe there is something equivalent for other Prolog implementations too):
http://www.swi-prolog.org/pldoc/doc_for?object=section%280,%270%27,swi%28%27/doc/packages/odbc.html%27%29%29
I can think of a few approaches to this:
1. On initialization, call a method that selects all data from a table and asserts it into the Prolog database. Do this for each table. You will need to declare the shape of each row, e.g. :- dynamic ies_row/4.
2. You could modify load_files by overriding user:prolog_load_file. From that hook you could do something similar to #1. This has the benefit of looking like a load_files call. http://www.swi-prolog.org/pldoc/man?predicate=prolog_load_file%2F2 ... This documentation mentions library(http_load), but I cannot find this anywhere (I was interested in this recently)!
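A related way to get rows into Prolog is to export them as fact text that Prolog can then consult. The sketch below assumes SQLite as a stand-in for Postgres, and the column names of `papel_pessoa` are guesses made for illustration. Rendering NULL as the atom `null` is likewise an arbitrary choice.

```python
import sqlite3

def prolog_term(value):
    """Render a SQL value as a Prolog term.

    Numbers are written as-is (special floats such as inf or nan are not
    handled), NULL becomes the atom null, anything else a quoted atom.
    """
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"

def table_to_facts(conn, table):
    """Yield one Prolog fact per row; `table` must be a trusted, lowercase name."""
    for row in conn.execute(f"SELECT * FROM {table}"):
        args = ", ".join(prolog_term(value) for value in row)
        yield f"{table}({args})."

# Hypothetical schema and data, only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papel_pessoa (id INTEGER PRIMARY KEY, id_pessoa INTEGER, tipo INTEGER);
    INSERT INTO papel_pessoa VALUES (10, 1, 1), (20, 2, 6);
""")
facts = list(table_to_facts(conn, "papel_pessoa"))
print("\n".join(facts))
```

Writing the facts to a file and consulting it keeps the Prolog side free of database code, at the cost of working on a snapshot rather than live data.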
There is the Draxler Prolog to SQL compiler, which translates some patterns (like the conjunction you wrote) into the more verbose SQL joins. You can find more info in the related post (prolog to SQL converter).
But beware that Prolog has its weaknesses too, especially regarding aggregates. Without a library, getting sums, counts and the like is not very easy. And such libraries aren't very common or easy to use.
I think you could try to specialize the PHP DB interface for equijoins, using the built-in features that allow you to shorten the query text (when this results in more readable code). Working in SWI-Prolog / ODBC, where (like in PHP) you need to compose SQL, I effectively found myself working that way to handle something very similar to what you have shown in the other post.
Another approach I found useful: I wrote a parser for the subset of SQL used by the MySQL backup interface (phpMyAdmin, really). So I routinely dump my CMS's DB locally, load it into memory, perform whatever task I need, computing and writing (or applying) the insert/update/delete statements, then upload these. This works because of the limited size of the DB, which fits in memory. I've developed and am now maintaining this small e-commerce site with this naive approach.
Writing Prolog from PHP should not be too difficult: I'd try to modify an existing interface, like the awesome Adminer, that already offers a choice among basic serialization formats.

AI program to generate paragraph pattern

Is there any software, service or AI program that can rebuild an English paragraph using a different set of vocabulary, grammar rules, etc.?
I mean to say, if the source paragraph is
“Gwalior is a good tourist place near
to Jhansi. Jhansi is very famous due
their queen Rani Laxmi Bai
(Manikandana)”
Can any software generate a version or pattern of it, like
“Rani Laxmi Bai (Manikandana) was the
queen of Jhansi which is nearer to a
good tourist palace Gwalior.”
Or something else. I know that 100% correctness is not possible without human intervention.
This guy wrote a JavaScript app that generates corporate bullshit ready for distribution (He's also got a great buzzword bingo generator). It's not AI, it just simply follows linguistic rules. From what I understand of your question, you don't need AI, you could learn a lot from just studying what this guy did. He seeds the program with nouns, verbs, adjectives, adverbs, etc and generates text that your eyes can parse (it's grammatical but it doesn't necessarily make sense). If you're looking for something to write your thesis paper, you have a lot more looking to do.
From your question, it looks like you're also looking for a program to parse English and generate the seed data for the aforementioned generator. Abiword uses such a grammar parser for grammar checking. I haven't looked at it in much depth, but I figure you could easily use it to list the parts of speech contained in a section of text. If you used this program to generate the seed data, you could pump the output directly into the other program to generate more text.
The Python NLTK library does natural language parsing, including building parse trees that record whether a word is a verb or a noun, its tense, etc. Perhaps you could take these trees and reorganize them according to some simple rules you come up with and verify. I don't think you would need too many rules before the results of your program are very different from the source document. Some example rules:
Replace words with synonyms
active voice to passive voice and vice-versa (The hunter saw the deer -> the deer was seen by the hunter)
http://www.nltk.org/
Rapid Rewrite is software that can do what you want, though it's not free and the website is terrible: http://www.rapidrewriter.com/?hop=qushy
Here's another one - same story
http://thebestspinner.com/?id=eprocent
watch their video and tell me that's not what you are looking for...
Here are a few links to various programs to alter written text. One of them should be able to provide you with some tips on how to implement what you're looking for.
http://www.worldlingo.com/ma/enwiki/en/Jive_filter
http://bytes.com/topic/python/answers/476939-filters-like-old-skool-jive-fudd-valspeak-text-transformation-python
http://www.rinkworks.com/dialect/
I disagree that NLP is not the path you need to follow.
However, if you don't want to go the NLP route, you could generate some good sounding sentences without using NLP, by training a custom language model using n-grams to build a fourth or fifth order model. You would then use statistical probability to generate your sentences.
Once you have your model, you randomly pick a starting word (from the set of known sentence-starting words, or words that begin with a capital letter), and then use conditional probability to pick the next word.
An easy example of this is in this article: Wordmills are coming...
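The n-gram idea described above can be sketched in a few lines. This is a toy under stated assumptions: the corpus is one made-up sentence, and the model uses a two-word context rather than the fourth- or fifth-order model suggested, because anything longer would simply replay so small a corpus.

```python
import random
from collections import defaultdict

def train(tokens, order=2):
    """Map each `order`-word context to the list of words observed after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, start, length, rng):
    """Extend `start` word by word; repeated successors encode the probabilities."""
    words = list(start)
    order = len(start)
    while len(words) < length:
        successors = model.get(tuple(words[-order:]))
        if not successors:
            break  # context never seen in training: stop early
        words.append(rng.choice(successors))
    return words

# Made-up corpus, far too small for real use.
corpus = "the cat sat on the mat and the cat ran".split()
model = train(corpus, order=2)
sample = generate(model, ("the", "cat"), length=6, rng=random.Random(0))
print(" ".join(sample))
```

Picking uniformly from the successor list is the conditional-probability step: a word that followed the context twice in training appears twice in the list and is twice as likely to be chosen.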
Of course, you would need ample training material in order to accomplish this, as just training on a single paragraph would not work well for the way you want to rephrase a paragraph. Detecting nouns, verbs, etc. in your sample paragraph without NLP techniques (which would require well-trained models as well) and then rearranging them into an opposite sentence structure would be more effort than just using NLP in the first place.
What you are trying to do is perform entity extraction, and also location awareness. Not only that, but relationships between entities and locations. A very tall order if you are not going to use any NLP.

Template Matching for relational database

I am trying to do the following:
We are trying to design a fraud detection system for the stock market.
I know the specifications for the frauds (they are like templates).
So I want to know whether I can design a template and find all records that match it.
Notice:
I can't use traditional queries because the templates are complex.
For example, one of my frauds is circular trading. It looks like this:
A bought from B, B bought from C, and C bought from A (it's a cycle),
and this cycle can include 4 or 5 persons.
Is there any good suggestion for this situation?
I don't see why you can't use "traditional queries" as you've stated. SQL can be used to write extraordinarily complex queries. For that matter I'm not sure that this is a hugely challenging question.
Firstly, I'd look at the behavior you have described as very transactional, therefore I'd treat the transactions as a model. I'd likely have a transactions table with some columns like buyer, seller, amount, etc.
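With such a table, the three-party circular trade from the question is an ordinary self-join. The sketch below uses in-memory SQLite, and the table name `trades` and its rows are invented for illustration.

```python
import sqlite3

# Hypothetical transactions table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (buyer TEXT, seller TEXT, amount INTEGER);
    INSERT INTO trades VALUES
        ('A', 'B', 100), ('B', 'C', 100), ('C', 'A', 100), ('D', 'A', 50);
""")

# t1: X bought from Y, t2: Y bought from Z, t3: Z bought from X.
THREE_CYCLE = """
    SELECT t1.buyer, t2.buyer, t3.buyer
    FROM trades AS t1
    JOIN trades AS t2 ON t2.buyer = t1.seller
    JOIN trades AS t3 ON t3.buyer = t2.seller AND t3.seller = t1.buyer
    WHERE t1.buyer < t2.buyer      -- report each cycle once, starting
      AND t1.buyer < t3.buyer      -- from its smallest party name
      AND t2.buyer <> t3.buyer     -- ignore self-trades in the middle
"""
cycles = conn.execute(THREE_CYCLE).fetchall()
print(cycles)
```

Cycles of 4 or 5 parties need one more join per extra party, or a recursive common table expression where the database supports one.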
You could alternatively give shares their own table and store, say, the previous 100 owners of each share in that table using STI (Single Table Inheritance), by putting all the primary keys of the owners into an "owners" column in your shares table, like 234/823/12334/1234/... That way you can do complex queries and see if that share was owned by the same person, or look for patterns in the string really easily and quickly.
-update-
I wouldn't suggest making up a "small language". I don't see why you'd want to do something like that when you have a huge selection of wonderful languages and databases to choose from, all of which have well-refined and tested methods to solve exactly what you are doing.
My best advice is to pop open your IDE (thumbs up for TextMate) and pick your favorite language (Ruby in my case). Find some sample data, create your database and start writing some code! You can't go wrong experimenting like this; it will expose better ways to go about it than we can dream up here on Stack Overflow.
Definitely Data Mining. But as you point out, you've already got the models (your templates). Look up fraud DETECTION rather than prevention for better search results?
I know some banks use SPSS PASW Modeler for fraud detection. It is very intuitive, and you can see what you are doing as you play around with the data, so you can implement your templates. I agree with Joseph: you need to get playing and make some new data structures.
Maybe a timeseries model?
Theoretically you could develop a "Small Language" first, something with a simple syntax (that makes expressing the domain - in your case fraud patterns - easy) and from it generate one or more SQL queries.
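As a rough illustration of generating SQL from a compact pattern description, here the "language" is reduced to a single number: the length of the trading cycle. The function name `cycle_query` and the `trades(buyer, seller)` table are hypothetical, and the generated query does not stop an intermediate party from appearing twice in longer cycles.

```python
import sqlite3

def cycle_query(n):
    """Build SQL that finds n-party cycles in a hypothetical trades(buyer, seller) table."""
    if n < 2:
        raise ValueError("a cycle needs at least two parties")
    aliases = [f"t{i}" for i in range(1, n + 1)]
    joins = [f"trades AS {aliases[0]}"]
    # Chain the trades: each party bought from the next one.
    for prev, cur in zip(aliases, aliases[1:]):
        joins.append(f"JOIN trades AS {cur} ON {cur}.buyer = {prev}.seller")
    # Close the loop: the last party bought from the first.
    conditions = [f"{aliases[-1]}.seller = {aliases[0]}.buyer"]
    # Report each cycle once by starting it at its smallest party name.
    conditions += [f"{aliases[0]}.buyer < {a}.buyer" for a in aliases[1:]]
    columns = ", ".join(f"{a}.buyer" for a in aliases)
    return (f"SELECT {columns} FROM " + " ".join(joins)
            + " WHERE " + " AND ".join(conditions))

# Made-up data: A bought from B, B from C, C from D, D from A.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (buyer TEXT, seller TEXT);
    INSERT INTO trades VALUES ('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A');
""")
four_cycles = conn.execute(cycle_query(4)).fetchall()
three_cycles = conn.execute(cycle_query(3)).fetchall()
print(four_cycles, three_cycles)
```

A fuller small language would add more pattern kinds and parameters, but the shape stays the same: a short domain-level description on one side and generated joins on the other.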
As with most solutions, this can be thought of as a slider: at one extreme there is the "full Fraud Detection Language"; at the other, you could just build stored procedures for the most common cases, and write new stored procedures that use the more "basic" blocks you wrote before to implement the various patterns.
What you are trying to do falls under the Data Mining umbrella, so you could also try to learn more about it: maybe you can find a Data Mining package for your specific DB (you didn't specify) and see if it helps you find common patterns in your data.

Resources