Datomic query for maximum of aggregated value - datomic

Assume I have entity author with many related book entities.
What's the query to fetch author with biggest amount of books?

OK. Since I found an answer myself, I am posting it here in case somebody else searches for it:
The solution is to build two Datomic queries, passing the output of the first one to the second one.
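;; The first query's result is threaded in as the last argument of the second.
;; A set of tuples needs the relation binding form [[?count ?a]].
(->> (d/q '[:find (count ?b) ?a :where [?a :author/books ?b]] db)
     (d/q '[:find (max ?count) ?a :in $ [[?count ?a]]] db))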
As far as I understand, this is the common way to handle less trivial queries in Datomic: split them into several subqueries and chain them together, letting the DB do its job.
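When the intermediate result is small, the last step can also be done in plain Clojure instead of a second query. As far as I understand Datomic's aggregation, non-aggregated :find variables act as grouping keys, so it is worth checking that the second query really returns a single row. The sketch below assumes the first query returns [count author] tuples as above; the sample rows and the name `top-authors` are made up for illustration.

```clojure
;; Hypothetical result of the first query: a set of [book-count author-id] tuples.
(def counts #{[3 101] [7 102] [7 103] [1 104]})

(defn top-authors
  "Returns the sorted author ids that share the highest book count,
  or nil when there are no rows."
  [rows]
  (when (seq rows)
    (let [best (apply max (map first rows))]
      (->> rows
           (filter #(= best (first %)))
           (map second)
           sort))))

(top-authors counts)
```

Returning every author tied for the top count avoids silently picking one of them at random.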

Related

Is it possible to pass the datalog wildcard `_` into a parameterized query?

Is it possible to pass a wildcard _ into a parameterized query? Something like this:
(d/q [:find ?e
:in $ ?type
:where [?e :type ?type]] db _)
When I tried this as written above it threw an error. Is there a way to do this?
I know that I can get everything with a query that looks like this:
(d/q '[:find ?e
       :where [?e :type]] db)
But my goal is to avoid building separate queries when I don't want to filter results by :type. The use case is, e.g., an API endpoint that may or may not filter results.
If I understand you correctly, you should be able to type:
(d/q '[:find ?e
       :in $
       :where [?e :type]] db)
In Datomic, any unspecified values are considered to be wildcards. The above query will return a list of all entities that have the :type attribute, regardless of value.
Update
Datomic's query is designed to accept a plain value like 5 or :awesome to be substituted into the ?type variable. A symbol like _ (or the quoted version '_) does not fit the pattern expected by Datomic.
Just for fun, I tried several variations and could not get Datomic to accept the symbol '_ for the ?type variable in the way you proposed. I think you'll have to write a separate query for the wildcard case.
Essentially, the wildcard _ is a special symbol (aka "reserved word") in the Datomic query syntax just like $. Datomic also enforces that query variables begin with a ? like ?e or ?type. These requirements are a part of the Datomic DSL that you can't change.
The only workaround besides hand-writing separate queries would be to dynamically compose the query vector from a base-part and add-on parts. Whether that is easier or harder than hand-writing the different queries depends on your specific situation.
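Composing the query from parts might look roughly like the sketch below. It relies on Datomic also accepting queries in map form; the names `base-query` and `entities-query` are made up for illustration, and the d/q calls are left as comments because they need a live database.

```clojure
;; Base part shared by both variants, in map form.
(def base-query
  '{:find  [?e]
    :in    [$]
    :where []})

(defn entities-query
  "Returns a map-form query. When filter-by-type? is true the query takes
  an extra ?type input; otherwise any :type value matches."
  [filter-by-type?]
  (if filter-by-type?
    (-> base-query
        (update :in conj '?type)
        (update :where conj '[?e :type ?type]))
    (update base-query :where conj '[?e :type])))

;; Intended use (assumes `d` is datomic.api and `db` is a database value):
;; (d/q (entities-query false) db)
;; (d/q (entities-query true) db :some-type)
```

Because the query is ordinary Clojure data, the add-on parts are just vector updates, and the caller only has to pass the extra argument when the filter is enabled.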

Find entity with most occurrences in a relation in Datomic

Is there a way to express this kind of logic purely inside a query?
(def e-top
  (let [res (d/q '[:find ?e (count ?p)
                   :where [?p :likes ?e]] db)]
    (first (apply max-key last res))))
If you need to work within one query, aggregate-of-aggregates problems are best tackled with a subquery (a nested call to query inside the query). See this answer on the Datomic mailing list, which includes a similar (not identical) query on the results of an aggregate against mbrainz:
(d/q '[:find ?track ?count
       :where [(datomic.api/q '[:find ?track (count ?artist)
                                :where [?track :track/artists ?artist]] $) [[?track ?count]]]
              [(> ?count 1)]]
     (d/db conn))
For your case (assuming the work stays in Clojure), using apply on the query result as you already do will be faster and simpler. Subqueries that only need to do something simple (e.g. get something associated with the max value) tend to make more sense if you're using the REST API or some other client wrapping around Datomic, where you don't have the performance benefits of the Peer library being in process.

How to use SynonymFilterFactory with ShingleFilterFactory in Solr?

What I want to achieve is that searching for 'deodorant spray' matches 'antiperspirant spray', 'deo spray', etc.
I'm using a SynonymFilterFactory to add synonyms at index time for deodorant, deo and antiperspirant. I can see this working correctly in the analyzer.
After this I'm running a ShingleFilterFactory (maxShingleSize="3") to split the text into combinations of words. This, again, gives me the correct result, e.g. analysing 'test shingle phrase' gives:
test
test shingle
test shingle phrase
shingle
shingle phrase
phrase
Which is the desired result. The problem comes when I combine synonym terms with shingles. For example, searching for 'deodorant spray' should give me:
deodorant spray
deo spray
antiperspirant spray
for all my synonyms. But what I actually see is:
deodorant
deodorant deo
deodorant deo antiperspirant
deo
deo antiperspirant
deo antiperspirant spray
antiperspirant
antiperspirant spray
Which clearly is making shingles from each of the synonym terms too. I've tried swapping the order of my filter factories but can't seem to get it to work. What am I doing wrong?
The only thing you can do is use the synonym filter without expansion (expand="false"), which reduces all synonyms to the first one in the list. You then have to apply it at index time as well as at query time.
This approach does not cause the problem described in the documentation, since the filter is also applied to the index.
Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Television and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score than docs that match "TV" comparably -- which may be somewhat counterintuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
However, you might still run into problems if you want to support multi-word synonyms, as described in the documentation.
I do not know whether shingles made up of synonyms affect search results at all. If they do not, the only cost is extra space in the index, so consider whether that is something you want to save on.

what is a good way of finding duplicates in datomic?

I have a bunch of records containing business names and I wish to do a query to find all the duplicates. How can this be done?
{:business/name "<>"}
If you're trying to enforce uniqueness on the attribute value you should look at the :db/unique schema attribute instead.
To find the duplicated values and how often they repeat, use:
(->> (d/datoms db :aevt :business/name)
     (map :v)
     (frequencies)
     (filter #(> (second %) 1)))
which uses the datomic.api/datoms API to access the raw AEVT index to stream :business/name attribute values, calculate their frequency and filter them based on some criteria i.e. more than one occurrence. You can also achieve the same result using datalog and aggregation functions:
(->> (d/q '[:find (frequencies ?v)
            :with ?e
            :in $ ?a
            :where [?e ?a ?v]]
          db :business/name)
     (ffirst)
     (filter #(> (second %) 1)))
To find the entities with duplicated attribute values, use:
(->> (d/datoms db :aevt :business/name)
     (group-by :v)
     (filter #(> (count (second %)) 1))
     (mapcat second)
     (map :e))
which also leverages the d/datoms API to accomplish it. For a full code sample, including datalog implementations, see https://gist.github.com/a2ndrade/5641681
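To see what these pipelines do without a database at hand, here is a minimal sketch that runs the same steps over plain maps. The maps only imitate the :e and :v keys of real datoms, and the sample business names and var names are made up for illustration.

```clojure
;; Hypothetical stand-ins for datoms: maps with :e (entity id) and :v (value).
(def sample-datoms
  [{:e 1 :v "Acme"}
   {:e 2 :v "Globex"}
   {:e 3 :v "Acme"}
   {:e 4 :v "Initech"}
   {:e 5 :v "Acme"}])

;; Values that occur more than once, mapped to how often they occur.
(def duplicate-values
  (->> sample-datoms
       (map :v)
       frequencies
       (filter (fn [[_ n]] (> n 1)))
       (into {})))

;; Entity ids grouped under each duplicated value.
(def duplicate-entities
  (->> sample-datoms
       (group-by :v)
       (filter (fn [[_ ds]] (> (count ds) 1)))
       (map (fn [[v ds]] [v (mapv :e ds)]))
       (into {})))
```

Keeping the entity ids grouped by value, rather than flattened into one list, makes it easier to decide which of the duplicates to keep.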

Solr / Lucene - Why is this OR query failing when the two individual queries succeed?

I have a Solr document schema with a solr.TrieDateField and noticed this boolean query (not authored by me), which I thought could benefit from some simplification:
q=-(-event_date:[2011-12-02T00:00:00.000Z TO NOW/DAY+90DAYS] OR (event_date:[* TO *]))
which means events within the next 90 days or non-events (see Pure Negative for Solr boolean NOT notation). My simplification looked like this:
q=event_date:[2011-12-02T00:00:00.000Z TO NOW/DAY+90DAYS] OR -event_date:[* TO *]
As stated, this didn't work (0 results). As a test, I ran the two sides of my modified OR query individually: both returned results, and their combined count equaled the result count of the original query. I can't come up with a good explanation for why the combined query fails. Running with debugQuery=true didn't reveal anything helpful.
I've put this on solr-user, will post back with any solutions.
