Datomic - Get all datoms relevant to an arbitrary query

Given an arbitrary Datomic query q on database d, is it possible to derive a query x from q that, when run against d, returns all the datoms r required to produce the result of q on d? The results of q on d should equal the results of q on r.
I'm trying to sync Datomic with a DataScript client. I know all the queries in advance, and I'd like to create a subset of my Datomic database in DataScript restricted to the datoms relevant to the client-side queries. For simplicity, assume no parameterized queries, although I'd expect that with placeholders one might achieve the same effect for arbitrary query input parameters, and a solution that allowed for them would be preferred.
I know I can get all the entity ids returned by a query by modifying it and running it against the db, and then touching all those entities, but I'm hoping for something more efficient that returns only the subset of entity datoms related to a query, and that can be derived from the query q ALONE without having to run q on d first.
Thanks.

I don’t think so: queries don’t work on datoms directly. They start with datoms if you query the database, but the datoms are converted to sets, and all subsequent operations are done on sets. This lets queries run on arbitrary collections the same way they run on Datomic indexes.
I believe https://github.com/mpdairy/posh tried to do query analysis to figure out which datoms a query touches. Maybe give it a look?

Related

GAE NDB Sorting a multiquery with cursors

In my GAE app I'm doing a query which has to be ordered by date. The query has to contain an IN filter, but this results in the following error:
BadArgumentError: _MultiQuery with cursors requires __key__ order
Now I've read through other SO questions (like this one), which suggest changing to sorting by key (as the error also points out). The problem, however, is that the query then becomes useless for its purpose. It needs to be sorted by date. What would be suggested ways to achieve this?
The Cloud Datastore server doesn't support IN. The NDB client library effectively fakes this functionality by splitting a query with IN into multiple single queries with equality operators. It then merges the results on the client side.
Since the same entity could be returned by more than one of these single queries, merging these values becomes computationally silly*, unless you are ordering by the Key**.
Relatedly, you should read up on the underlying caveats/limitations of cursors to get a better understanding:
Because the NOT_EQUAL and IN operators are implemented with multiple queries, queries that use them do not support cursors, nor do composite queries constructed with the CompositeFilterOperator.or method.
Cursors don't always work as expected with a query that uses an inequality filter or a sort order on a property with multiple values. The de-duplication logic for such multiple-valued properties does not persist between retrievals, possibly causing the same result to be returned more than once.
If the list of values used in IN is a static list rather than determined at runtime, a workaround is to compute this as an indexed Boolean field when you write the Entity. This allows you to use a single equality filter. For example, if you have a bug tracker and you want to see a list of open issues, you might use an IN('new', 'open', 'assigned') restriction on your query. Alternatively, you could set a property called is_open to True instead, so you no longer need the IN condition.
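As a rough illustration of that workaround, here is a minimal plain-Python sketch. It does not use the NDB API: the Issue class, OPEN_STATUSES and the sample data are hypothetical stand-ins, and in NDB something like a ComputedProperty might play the role of is_open. The point is only that the flag is derived once at write time, so the read side needs a single equality test and can still sort by date.

```python
from dataclasses import dataclass, field

# Assumption: the static list of values that used to appear in the IN filter.
OPEN_STATUSES = frozenset({"new", "open", "assigned"})


@dataclass
class Issue:
    """Hypothetical stand-in for a stored entity (not an ndb.Model)."""
    title: str
    status: str
    date: str
    is_open: bool = field(init=False)

    def __post_init__(self):
        # Computed when the entity is written, so reads need no IN filter.
        self.is_open = self.status in OPEN_STATUSES


issues = [
    Issue("a", "new", "2014-03-02"),
    Issue("b", "closed", "2014-03-01"),
    Issue("c", "assigned", "2014-02-27"),
]

# A single equality test on is_open, followed by the desired date ordering.
open_by_date = [
    i.title
    for i in sorted((i for i in issues if i.is_open), key=lambda i: i.date)
]
```

Because only one equality filter remains, the query is no longer split into several sub-queries, which is what made date ordering with cursors a problem in the first place.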
* Computationally silly: Requires doing a linear scan over an unbounded number of preceding values to determine if the current retrieved Entity is a duplicate or not. Also known as conceptually not compatible with Cursors.
** Key works because we can alternate between different single queries retrieving the next set of values and not have to worry about doing a linear scan over the entire preceding result set. This gives us a bounded data set to work with.
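To make the two footnotes concrete, here is a small plain-Python sketch of the merge. It is not how the NDB client is actually implemented: Issue, ISSUES, equality_query and in_query are hypothetical names, and the data is an in-memory list with a multi-valued tags field so that one entity can match several sub-queries. It shows why key ordering keeps de-duplication bounded: duplicates always arrive next to each other.

```python
import heapq
from collections import namedtuple

# Hypothetical in-memory stand-in for a Datastore kind; not the NDB API.
Issue = namedtuple("Issue", ["key", "tags"])

ISSUES = [
    Issue(1, ("new", "open")),
    Issue(2, ("closed",)),
    Issue(3, ("open",)),
    Issue(4, ("assigned",)),
    Issue(5, ("open", "assigned")),
]


def equality_query(tag):
    """One single-value sub-query, returned in key order."""
    return sorted((i for i in ISSUES if tag in i.tags), key=lambda i: i.key)


def in_query(tags):
    """Emulate IN: run one equality query per value, then merge by key."""
    streams = [equality_query(t) for t in tags]
    last_key = None
    for issue in heapq.merge(*streams, key=lambda i: i.key):
        # With key ordering, copies of the same entity are adjacent,
        # so remembering only the last key is enough to drop duplicates.
        if issue.key != last_key:
            yield issue
        last_key = issue.key


merged_keys = [i.key for i in in_query(["new", "open", "assigned"])]
```

If the streams were ordered by date instead, two copies of the same entity could be separated by arbitrarily many other results whenever dates tie or a cursor resumes mid-stream, which is the unbounded scan the first footnote describes.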

Optimize SOLR for retrieving all search results

Sometimes I don't need just the top X results from a SOLR query, but all results (running into millions). This is easily achievable by searching once with 0 rows as a request parameter, and then re-executing the search with the numFound from the result as the number of rows(*).
Of course we can sort the results by e.g. "id asc" to remove relevancy ranking. However, I would like to be able to disable the entire scoring calculation for these queries, as it is probably quite computationally intensive and we just don't need it in these cases.
My question:
Is there a way to make SOLR work in boolean mode and effectively run faster on these often slow queries, when all we need is just all results?
(*) I actually usually simply do a paged query where a script walks through the pages (multi threaded), to prevent timeouts on large result sets, yet keep it fast as possible, but this is not important for the question.
This looks like a related question, but apparently the user asked the wrong question and was only after retrieving all results: Solr remove ranking or modify ranking feature; This question is not answered there.
Use filters instead of queries; there is no score calculation for filters.
There are a couple of things to be aware of:
Solr deep paging allows you to export a large number of results much more quickly.
Using an export format such as CSV could be faster than using an XML format, simply because of the formatting and because it is more compact.
And, as already mentioned, if you are exporting everything, put your queries into filter queries (fq) with caching off.
For very complex queries, if you can split the query into several steps, you can assign different weights to the filters and have them execute in sequence. This lets you use a cheap first filter that gets rid of most of the results and only then apply the more expensive, more precise filters.
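The points above could be combined roughly as follows. This is only a sketch that builds request parameters and sends nothing: export_params, the unique key name id and the example filter clauses are assumptions for illustration, while q, fq, sort, rows, cursorMark and the {!cache=false cost=N} local params are standard Solr request syntax. The response format is left at the default here, since the next cursor mark has to be read back from each response.

```python
from urllib.parse import urlencode


def export_params(filters, unique_key="id", rows=1000, cursor_mark="*"):
    """Build request parameters for a filter-only, cursor-paged export.

    filters is a list of (cost, clause) pairs; unique_key is assumed to be
    the name of the collection's uniqueKey field.
    """
    params = [
        ("q", "*:*"),                    # match everything; restrict via fq only
        ("sort", f"{unique_key} asc"),   # cursor sorts must include the uniqueKey
        ("rows", str(rows)),
        ("cursorMark", cursor_mark),     # "*" starts a new cursor
    ]
    for cost, clause in filters:
        # Non-cached filters with a lower cost are evaluated first.
        params.append(("fq", f"{{!cache=false cost={cost}}}{clause}"))
    return params


# Hypothetical filters: a cheap one first, a more expensive range second.
params = export_params([(10, "type:book"), (50, "price:[10 TO 20]")])
query_string = urlencode(params)
```

For the following pages, the same parameters would be reused with cursor_mark set to the nextCursorMark value returned by the previous response, until that value stops changing.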

Does eventual consistency apply to the set of results of a query? Or the entities themselves that are returned?

I'm using the HRD on Appengine.
Say I have a query that cuts across entity groups (i.e. not an ancestor query). I understand that the set of results returned by this query may not be consistent:
For example, the query may return 4 entities {A, B, C, D} even though a 5th entity, E, matches the query. This makes sense.
However, in the inconsistent query above, is it ALSO the case that any of the results in the set may themselves not be consistent (i.e. their fields are not the freshest)? That is, if A has a property called foo, is foo consistent?
My question boils down to, which part of the query is inconsistent - the set of results, the properties of the returned results, or both?
Eventual consistency applies to both the entities themselves and the indexes. This means that if you modify an entity, then query with a filter that matches only the modified one (not the value before modification), you could get no records. It also means that potentially you could get entities back from a query whose current versions do not match the index criteria they were fetched for.
You can ensure you have the latest copy of an entity by doing a consistent get (though outside a transaction, this is fairly meaningless, since it could have changed the moment you do the get), but there's no equivalent way to do a consistent index lookup.
I think the answer is that inconsistency can occur in both the set of results and the properties of the returned results, because inconsistency occurs when you query a replica (or data center, as in the Google docs) that doesn't yet know about some write you made before. And the write can be anything: creating a new entity or updating an existing one.
So if you have for example the entity A with property x and you:
update x on A to 50 (previously it was 40)
query for entities with x >= 30
Then you certainly get this entity in the result set, but it can have an old value of x (40) if the replica you queried didn't yet know about your update.
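The scenario can be mimicked with a toy model. This is not how the Datastore is implemented: Replica and its methods are hypothetical, and replication lag is reduced to a list of writes the replica has received but not yet applied. It only illustrates that a lagging replica can return an entity with stale properties and can also miss an entity that should now match.

```python
class Replica:
    """Toy replica: entity key -> value of x, plus writes not yet applied."""

    def __init__(self, entities):
        self.entities = dict(entities)
        self.pending = []

    def receive(self, key, x):
        # The write has been acknowledged elsewhere but is not visible here yet.
        self.pending.append((key, x))

    def catch_up(self):
        for key, x in self.pending:
            self.entities[key] = x
        self.pending.clear()

    def query_at_least(self, bound):
        """Non-ancestor style query: answered from whatever this replica has."""
        return {k: x for k, x in self.entities.items() if x >= bound}


replica = Replica({"A": 40})
replica.receive("A", 50)                  # update x on A from 40 to 50

stale_hit = replica.query_at_least(30)    # A is returned, with the old x
stale_miss = replica.query_at_least(45)   # A should match now, but is missing

replica.catch_up()
fresh = replica.query_at_least(45)        # visible once the replica catches up
```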

How long will a query with two "contains" tests take to execute on the appengine datastore?

I have two sets of thirty or forty IDs, set A and set B. I have a kind of entity that has a field idA (an id that might be in set A) and a field idB (an id that might be in set B). I want to find all of the entities with idA in set A and idB in set B.
I could perform a query with filters like "A.contains(idA) && B.contains(idB)," but I worry about how much time this would take. With 30 ids in A, a naive implementation might take 30 comparisons per non-matching entity in the datastore. Or maybe the datastore sorts A and B before it goes looking, and will only take 4 or 5 comparisons per entity in the datastore. Or maybe there's something that Google figured out that I haven't, that could quickly skip over entities.
Basically, I'm trying to figure out what the index for such a query looks like, and if this is a terrible kind of query to run. Maybe it orders by idA, then by idB, and sorts A and B before the query is actually executed?
Main question: with 30-40 elements in A and B, will a query with filters "A.contains(idA) && B.contains(idB)" execute in a reasonable amount of time, or should I try to get this information another way?
You are limited to a list of at most thirty items, so this will not currently run on App Engine; see the Query Filters section.
The contains() operator also performs multiple queries, one for each item in the provided list value where all other filters are the same and the contains() filter is replaced with an equal-to filter. The results are merged, in the order of the items in the list. If a query has more than 1 contains() filter, the query is performed as multiple queries, one for each combination of values in the contains() filters.
A single query containing != or contains() operators is limited to 30 sub-queries.
App Engine will expand your query into 30 * 40 = 1200 queries for individual combinations of idA and idB - or at least, it would, if it weren't limited to 30 sub-queries. Obviously, this isn't going to be very efficient.
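The arithmetic can be checked with a short sketch. This is plain Python rather than the App Engine API: expand_contains is a hypothetical helper that builds one equality-filter combination per pair of values, in the way the quoted documentation describes, and the ids are placeholder integers.

```python
from itertools import product

MAX_SUBQUERIES = 30  # the sub-query limit quoted above


def expand_contains(ids_a, ids_b):
    """One equality-only sub-query per (idA, idB) combination."""
    return [{"idA": a, "idB": b} for a, b in product(ids_a, ids_b)]


# Placeholder ids: 30 values for set A and 40 values for set B.
subqueries = expand_contains(range(30), range(40))
fits = len(subqueries) <= MAX_SUBQUERIES
```

Even two lists of six values each would already produce 36 combinations and exceed the limit, so the sizes of the lists multiply against each other rather than add.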
Alternatives depend on the structure of your datastore. If you tell us what you're trying to achieve, we may be able to suggest alternatives that don't require so many queries.

Querying NHibernate

We are using NHibernate as the ORM for our project, and the application only reads from the database. It will not be updating, deleting or inserting any records; it will just be querying the database for records.
My question is which is the best method to query the database with NHibernate in the scenario explained above.
Are you sure you really need an ORM?
Anyway, there are 3 common options to query database using NHibernate:
HQL.
Criteria API.
Linq.
The easiest is 3, the most powerful is 1.
But I don't really understand the nature of your question, as the query APIs in NHibernate are not mutually exclusive; rather, they complement each other.
So you can use any of them depending on the situation:
For dynamic queries - best is Criteria API.
For complex and never changing - HQL.
For quick and easy - Linq.
Since it is read only, you probably won't have much use for retrieving the query results as mapped objects. A result-set-style return value might be more useful. For that, use session.CreateQuery and then query.List.
Each element of the list will be an object array. Each array element corresponds to one select column.
