Pull expression in query with limit/default? - datomic

How can I use limit/default in a pull expression in a query? Given a cardinality-many attribute, how can I control how many of its values are returned (the default is a maximum of 1000 values)?
(I found it hard to figure out the correct syntax from the documentation and examples.)

Limit (for cardinality-many attributes)
Return max 2 values of cardinality-many attribute :ns/ints:
[:find (pull ?a [(limit :ns/ints 2)])
 :where [?a :ns/str ?b]]
Return all values of cardinality-many attribute :ns/ints:
[:find (pull ?a [(limit :ns/ints nil)])
 :where [?a :ns/str ?b]]
Default
Return default value 2000 if attribute :ns/ints has no value:
[:find (pull ?a [(default :ns/ints 2000)])
 :where [?a :ns/str ?b]]
Return default values 2000 and 2001 if cardinality-many attribute :ns/ints has no values:
[:find (pull ?a [(default :ns/ints [2000 2001])])
 :where [?a :ns/str ?b]]

Related

Evaluating test dataset using eval() in LightGBM

I have trained a ranking model with LightGBM with the objective 'lambdarank'.
I want to evaluate my model to get the nDCG score for my test dataset using the best iteration, but I have never been able to use either the lightgbm.Booster.eval() or the lightgbm.Booster.eval_train() function.
First, I have created 3 dataset instances, namely the train set, valid set and test set:
lgb_train = lgb.Dataset(x_train, y_train, group=query_train, free_raw_data=False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train, group=query_valid, free_raw_data=False)
lgb_test = lgb.Dataset(x_test, y_test, group=query_test)
I then train my model using lgb_train and lgb_valid:
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1500,
                categorical_feature=chosen_cate_features,
                valid_sets=[lgb_train, lgb_valid],
                evals_result=evals_result,
                early_stopping_rounds=150)
When I call the eval() or the eval_train() functions after training, it returns an error:
gbm.eval(data=lgb_test,name='test')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-122-7ff5ef5136b8> in <module>()
----> 1 gbm.eval(data=lgb_test,name='test')
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval(self, data, name, feval)
1925 raise TypeError("Can only eval for Dataset instance")
1926 data_idx = -1
-> 1927 if data is self.train_set:
1928 data_idx = 0
1929 else:
AttributeError: 'Booster' object has no attribute 'train_set'
gbm.eval_train()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-123-0ce5fa3139f5> in <module>()
----> 1 gbm.eval_train()
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval_train(self, feval)
1956 List with evaluation results.
1957 """
-> 1958 return self.__inner_eval(self.__train_data_name, 0, feval)
1959
1960 def eval_valid(self, feval=None):
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in __inner_eval(self, data_name, data_idx, feval)
2352 """Evaluate training or validation data."""
2353 if data_idx >= self.__num_dataset:
-> 2354 raise ValueError("Data_idx should be smaller than number of dataset")
2355 self.__get_eval_info()
2356 ret = []
ValueError: Data_idx should be smaller than number of dataset
And when I call the eval_valid() function, it returns an empty list.
Can anyone tell me how to evaluate a LightGBM model and get the nDCG score using the test set properly? Thanks.
If you add keep_training_booster=True as an argument to your lgb.train, the returned booster object will be able to execute eval and eval_train (though eval_valid would still return an empty list for some reason, even when valid_sets is provided in lgb.train).
Documentation says:
keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning.
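For illustration, a minimal sketch of that change, reusing the variables from the question (params, lgb_train, lgb_valid and lgb_test are assumed to be the objects defined above):
import lightgbm as lgb

# Sketch only: params, lgb_train, lgb_valid and lgb_test are assumed to be
# the objects built in the question above.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1500,
                valid_sets=[lgb_train, lgb_valid],
                early_stopping_rounds=150,
                keep_training_booster=True)  # keep the full Booster, per the docs quoted above

# With keep_training_booster=True the returned Booster keeps its datasets,
# so eval() and eval_train() can be called on it after training.
print(gbm.eval(lgb_test, name='test'))  # evaluation results on the test Dataset
print(gbm.eval_train())                 # evaluation results on the training Dataset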

Datomic aggregations: counting related entities without losing results with zero-count

I have a domain model made of Questions, each Question being associated with a number of Comments and Affirmations.
I would like to make a Datalog query which extracts a bunch of content attributes for each Question, as well as the number of associated Comments and Affirmations, including when these relationships are empty (e.g. some Question has no Comment or no Affirmation), in which case the returned count should be 0.
I have seen the following Gist, which shows how to use (sum ...) and (or-join) combined with a 'weight' variable to get a zero-count when the relationship is empty.
However, I do not see how to make this work when there are 2 relationships. I tried the following, but the returned counts are not correct:
(def query '[:find (sum ?comment-weight) (sum ?affirmation-weight) ?text ?time ?source-identifier ?q
:with ?uniqueness
:where
[?q :question/text ?text]
[?q :question/time ?time]
[?q :question/source-identifier ?source-identifier]
(or-join [?q ?uniqueness ?comment-weight ?affirmation-weight]
(and [?comment :comment/question ?q]
[?affirmation :affirmation/question ?q]
['((identity ?comment) (identity ?affirmation)) ?uniqueness]
[(ground 1) ?comment-weight]
[(ground 1) ?affirmation-weight])
(and [?comment :comment/question ?q]
[(identity ?comment) ?uniqueness]
[(ground 1) ?comment-weight]
[(ground 0) ?affirmation-weight])
(and [?affirmation :affirmation/question ?q]
[(identity ?affirmation) ?uniqueness]
[(ground 1) ?affirmation-weight]
[(ground 0) ?comment-weight])
(and [(identity ?q) ?uniqueness]
[(ground 0) ?comment-weight]
[(ground 0) ?affirmation-weight]))])
Originally asked on the Clojurians Slack.
So to summarize, the trick is to consider each Question, Comment and Affirmation as a 'data point', which has a 'weight' of 0 or 1 for each count, and is identified uniquely (so that Datalog counts correctly, see :with). In particular, each Question has a zero weight for all counts.
The posted code was almost correct; it only needs the first clause of the (or-join ...) removed, since that clause creates 'artificial' data points (Comment + Affirmation tuples) that pollute the counts.
The following should work:
[:find (sum ?comment-weight) (sum ?affirmation-weight) ?text ?time ?source-identifier ?q
:with ?uniqueness
:where
[?q :question/text ?text]
[?q :question/time ?time]
[?q :question/source-identifier ?source-identifier]
(or-join [?q ?uniqueness ?comment-weight ?affirmation-weight]
(and
[?comment :comment/question ?q]
[(identity ?comment) ?uniqueness]
[(ground 1) ?comment-weight]
[(ground 0) ?affirmation-weight])
(and
[?affirmation :affirmation/question ?q]
[(identity ?affirmation) ?uniqueness]
[(ground 1) ?affirmation-weight]
[(ground 0) ?comment-weight])
(and
[(identity ?q) ?uniqueness]
[(ground 0) ?comment-weight]
[(ground 0) ?affirmation-weight]))]
You can call arbitrary functions within a query, so consider using datomic.api/datoms directly (or a subquery, or an arbitrary function of your own) to count the comments and affirmations.
Using datomic.api/datoms:
'[:find ?comment-count ?affirmation-count ?text ?time ?source-identifier ?q
:where
[?q :question/text ?text]
[?q :question/time ?time]
[?q :question/source-identifier ?source-identifier]
[(datomic.api/datoms $ :vaet ?q :comment/question) ?comments]
[(count ?comments) ?comment-count]
[(datomic.api/datoms $ :vaet ?q :affirmation/question) ?affirmations]
[(count ?affirmations) ?affirmation-count]]
Or using a subquery:
'[:find ?comment-count ?affirmation-count ?text ?time ?source-identifier ?q
:where
[?q :question/text ?text]
[?q :question/time ?time]
[?q :question/source-identifier ?source-identifier]
[(datomic.api/q [:find (count ?comment) .
:in $ ?q
:where [?comment :comment/question ?q]]
$ ?q) ?comment-count-or-nil]
[(clojure.core/or ?comment-count-or-nil 0) ?comment-count]
[(datomic.api/q [:find (count ?affirmation) .
:in $ ?q
:where [?affirmation :affirmation/question ?q]]
$ ?q) ?affirmation-count-or-nil]
[(clojure.core/or ?affirmation-count-or-nil 0) ?affirmation-count]]
Or using a custom function:
(defn count-for-question [db question kind]
(let [dseq (case kind
:comments (d/datoms db :vaet question :comment/question)
:affirmations (d/datoms db :vaet question :affirmation/question))]
(reduce (fn [x _] (inc x)) 0 dseq)))
'[:find ?comment-count ?affirmation-count ?text ?time ?source-identifier ?q
:where
[?q :question/text ?text]
[?q :question/time ?time]
[?q :question/source-identifier ?source-identifier]
[(user/count-for-question $ ?q :comments) ?comment-count]
[(user/count-for-question $ ?q :affirmations) ?affirmation-count]]

Updating value with cardinality many

I have a schema like this:
[{:db/id #db/id[:db.part/db]
:db/ident :person/name
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/doc "A person's name"
:db.install/_attribute :db.part/db}
{:db/id #db/id[:db.part/db]
:db/ident :person/roles
:db/valueType :db.type/keyword
:db/cardinality :db.cardinality/many
:db/doc "A person's role"
:db.install/_attribute :db.part/db}]
And a code like this:
;; insert new person
(def new-id (-> @(d/transact conn [{:db/id (d/tempid :db.part/user)
                                    :person/name "foo"
                                    :person/roles #{:admin}}])
                (:tempids)
                (vals)
                (first)))
(defn get-roles
  [db eid]
  (d/q '[:find [?roles ...]
         :in $ ?eid
         :where [?eid :person/roles ?roles]]
       db eid))
(get-roles (d/db conn) new-id) ;; => [:admin]
;; update a person
(d/transact conn [{:db/id new-id
:person/roles #{:client}}])
(get-roles (d/db conn) new-id) ;; => [:admin :client]
It seems the default behaviour is that it just adds the new value alongside the existing ones.
How can I get this result, after doing the updating transaction:
(get-roles (d/db conn) new-id) ;; => [:client]
If what you want is to "reset" the list of roles to a new value (an 'absolute' operation, in contrast to the 'relative' operations of just adding or removing roles), you'll have to use a transaction function to perform a diff and retract the values that need it.
Here's a basic generic implementation:
{:db/id (d/tempid :db.part/user),
:db/ident :my.fns/reset-to-many,
:db/fn
(d/function
{:lang :clojure,
:requires '[[datomic.api :as d]],
:params '[db e attr new-vals],
:code
'(let [ent (or (d/entity db e)
(throw (ex-info "Entity not found"
{:e e :t (d/basis-t db)})))
entid (:db/id ent)
old-vals (get ent attr)]
(into
[{:db/id (:db/id ent)
;; adding the new values
attr new-vals}]
;; retracting the old values
(comp
(remove (set new-vals))
(map (fn [v]
[:db/retract entid attr v])))
old-vals)
)})}
;; Usage
(d/transact conn [[:my.fns/reset-to-many new-id :person/roles #{:client}]])
Here is a solution once suggested by Robert Stuttaford:
(defn many-delta-tx
"Produces the transaction necessary to have
`new` be the value for `entity-id` at `attr`
`new` is expected to be a set."
[db entity-id attr new]
(let [current (into #{} (map :v)
(d/datoms db :eavt
entity-id
(d/entid db attr)))]
(concat (for [id (clojure.set/difference new current)]
[:db/add entity-id attr id])
(for [id (clojure.set/difference current new)]
[:db/retract entity-id attr id]))))
For ease of testing, I would like to slightly modify part of the example from the original question.
A change to the schema: I add :db/unique.
[{:db/id #db/id[:db.part/db]
:db/ident :person/name
:db/valueType :db.type/string
:db/cardinality :db.cardinality/one
:db/unique :db.unique/identity
:db/doc "A person's name"
:db.install/_attribute :db.part/db}
{:db/id #db/id[:db.part/db]
:db/ident :person/roles
:db/valueType :db.type/keyword
:db/cardinality :db.cardinality/many
:db/doc "A person's role"
:db.install/_attribute :db.part/db}]
get-roles
(defn get-roles
[db eid]
(d/q '[:find [?roles ...]
:in $ ?eid
:where [?eid :person/roles ?roles]] db eid))
My testing example
;; (many-delta-tx (d/db conn) [:person/name "foo"] :person/roles #{:sales :client})
;; => ([:db/add [:person/name "foo"] :person/roles :sales] [:db/retract [:person/name "foo"] :person/roles :admin])

Django/pyodbc error: not enough arguments for format string

I have a Dictionary model defined in Django (1.6.5). One method (called get_topentities) returns the top names in my dictionary (entity names are defined by the Entity model):
def get_topentities(self, n):
    entities = self.entity_set.select_related().filter(in_dico=True, table_type=0).order_by("rank")[0:n]
    return entities
When I call the function (say with n=2), it returns the top 2 elements, but I cannot access the second one because of this "not enough arguments for format string" error:
In [5]: d = Dictionary.objects.get(code='USA')
In [6]: top2 = d.get_topentities(2)
In [7]: top2
Out[7]: [<Entity: BARACK OBAMA>, <Entity: GOVERNMENT>]
In [8]: top2[0]
Out[8]: <Entity: BARACK OBAMA>
In [9]: top2[1]
.
.
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in as_sql(self, with_limits, with_col_aliases)
172 # Lop off ORDER... and the initial "SELECT"
173 inner_select = _remove_order_limit_offset(raw_sql)
--> 174 outer_fields, inner_select = self._alias_columns(inner_select)
175
176 order = _get_order_limit_offset(raw_sql)[0]
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in _alias_columns(self, sql)
339
340 # store the expanded paren string
--> 341 parens[key] = buf% parens
342 #cannot use {} because IBM's DB2 uses {} as quotes
343 paren_buf[paren_depth] += '(%(' + key + ')s)'
TypeError: not enough arguments for format string
In [10]:
My server backend is MSSQL and I'm using pyodbc as the database driver. If I try the same on a PC with engine sqlserver_ado, it works. Can someone help?
Regards,
Patrick

Optimization of SPARQL query. [ Estimated execution time exceeds the limit of 1500 (sec) ]

I am trying to run this query on http://dbpedia.org/sparql but I get an error that my query is too expensive. When I run the query through http://dbpedia.org/snorql/ I get:
The estimated execution time 25012730 (sec) exceeds the limit of 1500 (sec) ...
When running the query through my python script using SPARQLWrapper I simply get an HTTP 500.
I figure I need to do something to optimize my SPARQL query. I need the data for iterating over educational institutions and importing them into a local database; maybe I am using SPARQL wrong and should do this in a fundamentally different way.
Hope someone can help me!
The query
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?uri
?name
?homepage
?student_count
?native_name
?city
?country
?type
?lat ?long
?image
WHERE {
?uri rdf:type dbpedia-owl:EducationalInstitution .
?uri foaf:name ?name .
OPTIONAL { ?uri foaf:homepage ?homepage } .
OPTIONAL { ?uri dbpedia-owl:numberOfStudents ?student_count } .
OPTIONAL { ?uri dbpprop:nativeName ?native_name } .
OPTIONAL { ?uri dbpprop:city ?city } .
OPTIONAL { ?uri dbpprop:country ?country } .
OPTIONAL { ?uri dbpprop:type ?type } .
OPTIONAL { ?uri geo:lat ?lat . ?uri geo:long ?long } .
OPTIONAL { ?uri foaf:depiction ?image } .
}
ORDER BY ?uri
LIMIT 20 OFFSET 10
Forget it. You won't be able to get that query back from DBpedia with a single SPARQL query. Those OPTIONALs are very expensive.
To work it around you need to first run something like:
SELECT DISTINCT ?uri WHERE {
?uri rdf:type dbpedia-owl:EducationalInstitution .
?uri foaf:name ?name .
} ORDER BY ?uri
LIMIT 20 OFFSET 10
Then iterate over the result set of this query to form single queries for each dbpedia-owl:EducationalInstitution, such as the following (notice the FILTER at the end of the query):
SELECT DISTINCT ?uri
?name
?homepage
?student_count
?native_name
?city
?country
?type
?lat ?long
?image
WHERE {
?uri rdf:type dbpedia-owl:EducationalInstitution .
?uri foaf:name ?name .
OPTIONAL { ?uri foaf:homepage ?homepage } .
OPTIONAL { ?uri dbpedia-owl:numberOfStudents ?student_count } .
OPTIONAL { ?uri dbpprop:nativeName ?native_name } .
OPTIONAL { ?uri dbpprop:city ?city } .
OPTIONAL { ?uri dbpprop:country ?country } .
OPTIONAL { ?uri dbpprop:type ?type } .
OPTIONAL { ?uri geo:lat ?lat . ?uri geo:long ?long } .
OPTIONAL { ?uri foaf:depiction ?image } .
FILTER (?uri = <http://dbpedia.org/resource/%C3%89cole_%C3%A9l%C3%A9mentaire_Marie-Curie>)
}
Where <http://dbpedia.org/resource/%C3%89cole_%C3%A9l%C3%A9mentaire_Marie-Curie> has been obtained from the first query.
... and yes, it will be slow, and you might not be able to run this for an online application. Advice: try to work out some sort of caching mechanism to sit between your app and the DBpedia SPARQL endpoint.
Don't try to get the entire dataset at once! Add a LIMIT and an OFFSET clause and use those to page through the data.
With LIMIT 50 added, I get back a result for your query almost instantly; I managed to push the limit up much higher than that and still get a response, so play with it. Once you've found a page size that works for you, just repeat the query with an OFFSET as well until you get no more results, e.g.
SELECT * WHERE { ... } LIMIT 100
SELECT * WHERE { ... } LIMIT 100 OFFSET 100
...
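Since the question mentions SPARQLWrapper, a rough sketch of that paging loop in Python could look like the following (the page size and the simplified query body are assumptions to adjust as needed):
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch only: page through DBpedia with LIMIT/OFFSET instead of one huge query.
QUERY_BODY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?uri WHERE {
  ?uri rdf:type dbpedia-owl:EducationalInstitution .
} ORDER BY ?uri
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

page_size = 100  # assumed page size; tune it against what the endpoint tolerates
offset = 0
while True:
    sparql.setQuery("%s LIMIT %d OFFSET %d" % (QUERY_BODY, page_size, offset))
    bindings = sparql.query().convert()["results"]["bindings"]
    if not bindings:
        break  # no more results: stop paging
    for row in bindings:
        print(row["uri"]["value"])  # or insert the row into the local database here
    offset += page_size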
If you know the exact URI (e.g. from a previous query), then putting the URI directly in the WHERE clause is faster (at least in my experience) than putting the URI in a FILTER.
e.g., prefer:
WHERE { <http://...> ... }
over
WHERE { ?uri ... FILTER (?uri = <http://...>) }
Also, I've found UNIONs actually perform faster than FILTERs designed to match multiple resources.
Just because we're doing SPARQL now doesn't mean we can forget the nightmares of SQL tuning, welcome to the wonderful world of SPARQL tuning! :)
