What is a good way of finding duplicates in Datomic?


I have a bunch of records containing business names and I wish to do a query to find all the duplicates. How can this be done?
{:business/name "<>"}

If you're trying to enforce uniqueness on the attribute value you should look at the :db/unique schema attribute instead.
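For reference, a uniqueness constraint is declared in the attribute's schema. A minimal sketch, assuming :business/name is a cardinality-one string attribute (the value type and cardinality are assumptions, since the question does not state them):

```clojure
;; Schema sketch; value type and cardinality are assumed for illustration.
(def business-name-attr
  {:db/ident       :business/name
   :db/valueType   :db.type/string
   :db/cardinality :db.cardinality/one
   :db/unique      :db.unique/value})
```

With :db.unique/value, a transaction that asserts a name already held by another entity is rejected. :db.unique/identity would instead resolve a temporary id to the existing entity (an upsert).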
To find the duplicated values and how often they repeat, use:
(->> (d/datoms db :aevt :business/name)
     (map :v)
     (frequencies)
     (filter #(> (second %) 1)))
This uses the datomic.api/datoms API to read the raw AEVT index, stream the :business/name values, compute how often each occurs, and keep only the values that occur more than once. You can get the same result with Datalog and an aggregation function:
(->> (d/q '[:find (frequencies ?v)
            :with ?e
            :in $ ?a
            :where [?e ?a ?v]]
          db :business/name)
     (ffirst)
     (filter #(> (second %) 1)))
To find the entities with duplicated attribute values, use:
(->> (d/datoms db :aevt :business/name)
     (group-by :v)
     (filter #(> (count (second %)) 1))
     (mapcat second)
     (map :e))
This also reads the index through d/datoms, grouping the datoms by value and keeping the entity ids of every group with more than one member. For a full code sample, including Datalog implementations, see https://gist.github.com/a2ndrade/5641681
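To see what the two pipelines produce without a database at hand, here is a small sketch that runs the same steps over plain maps. The sample data is hypothetical and only stands in for datoms, which expose their parts through the same :e and :v keys:

```clojure
;; Hypothetical sample data: plain maps standing in for Datomic datoms.
(def sample-datoms
  [{:e 1 :v "Acme"}
   {:e 2 :v "Globex"}
   {:e 3 :v "Acme"}
   {:e 4 :v "Initech"}])

;; Duplicated values and how often each one occurs.
(def duplicate-counts
  (->> sample-datoms
       (map :v)
       frequencies
       (filter #(> (second %) 1))
       (into {})))

;; Ids of entities whose value is shared with at least one other entity.
(def duplicate-entities
  (->> sample-datoms
       (group-by :v)
       (filter #(> (count (second %)) 1))
       (mapcat second)
       (map :e)
       set))
```

Here duplicate-counts is {"Acme" 2} and duplicate-entities is #{1 3}.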

Related

Query that excludes some ids

I'd like to write a query that takes a list of entity ids to exclude from the results.
The following query still returns every entity, including those in the id list:
(d/q '[:find (pull ?e [:db/id
                       :user/first-name])
       :in $ ?account [?id ...]
       :where [?e :user/account ?account]
              (not [(= ?e ?id)])]
     db 18625726974632500 [40809473576669559 47437329668874807])

It turns out that I could do this by using a scalar input rather than a collection input:
(d/q '[:find (pull ?e [:db/id
                       :user/first-name])
       :in $ ?account ?ids
       :where [?e :user/account ?account]
              (not [(contains? ?ids ?e)])]
     db 18625726974632500 #{40809473576669559 47437329668874807})
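A likely reason the collection version returns everything: a collection binding such as [?id ...] evaluates the rest of the query once per bound value and unions the results, so each pass only drops the single id it was given. The plain-Clojure sketch below imitates that behaviour with made-up entity ids. It illustrates the semantics only and is not Datomic code:

```clojure
(def entities [1 2 3 4]) ; hypothetical entity ids
(def excluded [2 3])     ; hypothetical ids to filter out

;; Collection-binding style: one pass per excluded id, results unioned.
(def per-id-result
  (set (for [id excluded
             e  entities
             :when (not= e id)]
         e)))

;; Scalar-set style: each entity is checked once against the whole set.
(def whole-set-result
  (set (remove (set excluded) entities)))
```

per-id-result still contains all four ids, because 2 survives the pass for 3 and 3 survives the pass for 2, while whole-set-result is #{1 4}.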

Passing ?input map as Datomic query argument

How to pass a map as ?input to a Datomic query and destructure the map for matching against facts?
When I run the following query, I get a NullPointerException:
(d/q '[:find ?e
       :in $ ?input
       :where
       [?e :amount ?amount]
       [(:amount ?input) ?amount]]
     (d/db conn)
     {:amount 123.0M})
=> Syntax error (NullPointerException) compiling at ...
However, passing the amount as an input argument, works:
(d/q '[:find ?e
       :in $ ?amount
       :where
       [?e :amount ?amount]]
     (d/db conn)
     123.0M)
=> [[1234]]

You can't pass a map into a Datalog query; you're limited to scalars, tuples, collections and relations (see Datomic Query Bindings).
If you had a much more complex map than your example and needed to use many values from it, you'd have to destructure it outside the query (as @Alan Thompson suggested) and pass the values in as a tuple:
(let [input-fn   (juxt :amount :timestamp :quantity)
      input-data {:timestamp "29/08/2019" :quantity 3 :amount 123.0}
      inputs     (input-fn input-data)]
  (d/q '[:find ?e
         :in $ [?amount ?timestamp ?quantity]
         :where
         [?e :amount ?amount]
         [?e :timestamp ?timestamp]
         [?e :quantity ?quantity]]
       (d/db conn)
       inputs))
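The destructuring step can be checked on its own in a REPL. juxt builds a function that applies each keyword to the map and returns the results as a vector, in the order the keywords were given, which is the order the tuple binding expects:

```clojure
(def input-data {:timestamp "29/08/2019" :quantity 3 :amount 123.0})

;; One value per keyword, in the order the keywords were passed to juxt.
(def inputs ((juxt :amount :timestamp :quantity) input-data))
```

inputs is [123.0 "29/08/2019" 3], regardless of the key order in the map.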

Is it possible to pass the datalog wildcard `_` into a parameterized query?

Is it possible to pass a wildcard _ into a parameterized query? Something like this:
(d/q '[:find ?e
       :in $ ?type
       :where [?e :type ?type]] db _)
When I tried this as written above it threw an error. Is there a way to do this?
I know that I can get everything with a query that looks like this:
(d/q '[:find ?e
       :where [?e :type]] db)
But my goal is to avoid building separate queries when I don't want to filter results by :type. The use case is, e.g., an API endpoint that may or may not filter results.

If I understand you correctly, you should be able to type:
(d/q '[:find ?e
       :in $
       :where [?e :type]] db)
In Datomic, any unspecified values are considered to be wildcards. The above query will return a list of all entities that have the :type attribute, regardless of value.
Update
Datomic's query is designed to accept a plain value like 5 or :awesome to be substituted into the ?type variable. A symbol like _ (or the quoted version '_) does not fit the pattern expected by Datomic.
Just for fun, I tried several variations and could not get Datomic to accept the symbol '_ for the ?type variable in the way you proposed. I think you'll have to write a separate query for the wildcard case.
Essentially, the wildcard _ is a special symbol (aka "reserved word") in the Datomic query syntax just like $. Datomic also enforces that query variables begin with a ? like ?e or ?type. These requirements are a part of the Datomic DSL that you can't change.
The only workaround besides hand-writing separate queries would be to dynamically compose the query vector from a base-part and add-on parts. Whether that is easier or harder than hand-writing the different queries depends on your specific situation.
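A minimal sketch of that dynamic composition, using the map form of a query so that clauses can be added with ordinary collection functions. The names base-query and type-query are hypothetical, and treating a nil type as "no filter" is an assumption about the endpoint:

```clojure
;; Base part shared by both variants (map form of a Datalog query).
(def base-query '{:find [?e] :in [$] :where []})

(defn type-query
  "Returns the query map plus the extra arguments to pass after the db.
  A nil type means: match any entity that has :type at all."
  [type]
  (if (nil? type)
    {:query (update base-query :where conj '[?e :type])
     :args  []}
    {:query (-> base-query
                (update :in conj '?type)
                (update :where conj '[?e :type ?type]))
     :args  [type]}))

;; Intended use against a real database (not run here):
;; (let [{:keys [query args]} (type-query :awesome)]
;;   (apply d/q query db args))
```

Only the data structure is built here; the commented call at the end shows where d/q would consume it.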

Find entity with most occurrences in a relation in Datomic

Is there a way to express this kind of logic purely inside a query?
(def e-top
  (let [res (d/q '[:find ?e (count ?p)
                   :where [?p :likes ?e]] db)]
    (first (apply max-key last res))))

If you need to work within one query, aggregate-of-aggregates problems are best tackled with a subquery (a nested call to query inside a query). See this answer on the Datomic mailing list, which includes a similar (not identical) query over the results of an aggregate against mbrainz:
(d/q '[:find ?track ?count
       :where [(datomic.api/q '[:find ?track (count ?artist)
                                :where [?track :track/artists ?artist]] $) [[?track ?count]]]
              [(> ?count 1)]]
     (d/db conn))
For your case (assuming work stays in Clojure), apply will be faster and simpler. Subqueries that only need to do something simple (e.g. get something associated with the max value) tend to make more sense if you're using the REST API or some other client wrapping around Datomic where you don't have the perf benefits associated with the Peer library being in process.
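The in-process approach is easy to try on a literal result. The sketch below uses a made-up set of [entity count] tuples shaped like the output of the :find ?e (count ?p) query, and wraps the max-key step in a hypothetical helper that also tolerates an empty result:

```clojure
;; Hypothetical result, shaped like the output of [:find ?e (count ?p)].
(def res #{[:alice 3] [:bob 5] [:carol 1]})

(defn top-entity
  "Returns the entity from the tuple with the highest count, or nil when
  there are no tuples (max-key needs at least one argument)."
  [tuples]
  (when (seq tuples)
    (first (apply max-key last tuples))))

(def e-top (top-entity res))
```

e-top is :bob here. If several entities tie for the highest count, which one is returned depends on the iteration order of the result.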

Datomic query for maximum of aggregated value

Assume I have entity author with many related book entities.
What's the query to fetch author with biggest amount of books?

OK. Since I found an answer myself, I am posting it here in case somebody else searches for it:
The solution is to build two datomic queries passing output of first one to second one.
(->> (d/q '[:find (count ?b) ?a :where [?a :author/books ?b]] db)
     (d/q '[:find (max ?count) ?a :in $ [?count ?a]] db))
As far as I understand, this is the common way to handle less trivial queries in Datomic: split the work into several subqueries and chain them together, letting the DB do its job.
