datomic query converter from SQL - datomic

I want to represent the following SQL query in Datomic:
SELECT A.a, C.c
FROM A, B, C
WHERE A.id = B.id AND B.index = C.index
What will the Datomic query representation be for this?
Also, will the same Datomic query work if the WHERE conditions are reversed, i.e. we have "B.index = C.index AND A.id = B.id"?

Here is a literal translation:
(d/q '[:find ?a-val ?c-val
       :where
       [?A :a.id ?id]
       [?B :b.id ?id]
       [?B :b.index ?index]
       [?C :c.index ?index]
       [?A :a ?a-val]
       [?C :c ?c-val]]
     db)
I've written it that way so you can see how to translate from the question to the answer.
Hopefully it's clear that using ?index in two places does the same job as the = in the SQL WHERE clause.
The clauses can be reordered arbitrarily and remain logically equivalent. The difference will be in performance: put the most narrowing/specific clauses first for better performance.
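To make the order-independence concrete, here is a toy Datalog evaluator in Python (purely an illustration I'm adding, not how Datomic actually executes queries; the facts mirror the schema guessed at in the answer):

```python
# Toy illustration (not Datomic's engine): evaluate Datalog-style clauses by
# pattern matching over [entity attribute value] facts, to show that clause
# order changes the search, not the result set.
facts = [
    ["A1", ":a.id", 1], ["A1", ":a", "a-val-1"],
    ["B1", ":b.id", 1], ["B1", ":b.index", 7],
    ["C1", ":c.index", 7], ["C1", ":c", "c-val-1"],
]

def query(clauses, find):
    """Return the set of bindings for the `find` variables satisfying every clause."""
    bindings = [{}]                          # start from one empty binding
    for clause in clauses:
        matched = []
        for b in bindings:
            for fact in facts:
                b2 = dict(b)
                ok = True
                for pat, val in zip(clause, fact):
                    if pat.startswith("?"):  # variable: bind it or check the binding
                        ok = ok and b2.setdefault(pat, val) == val
                    else:                    # constant: must match the fact
                        ok = ok and pat == val
                if ok:
                    matched.append(b2)
        bindings = matched
    return {tuple(b[v] for v in find) for b in bindings}

clauses = [["?A", ":a.id", "?id"], ["?B", ":b.id", "?id"],
           ["?B", ":b.index", "?i"], ["?C", ":c.index", "?i"],
           ["?A", ":a", "?a"], ["?C", ":c", "?c"]]

# Reversing the clause order gives exactly the same result:
assert query(clauses, ["?a", "?c"]) == query(clauses[::-1], ["?a", "?c"])
print(query(clauses, ["?a", "?c"]))  # {('a-val-1', 'c-val-1')}
```

Only the amount of intermediate work changes with clause order, which is exactly the performance point made above.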
Some caveats:
Whether you want distinct attributes for the index and id concepts (for example :a.index and :b.index, or a single :index) is down to your requirements. Datomic has good docs and tutorials.
There are many differences between SQL and Datalog, for example there are no tables in Datalog (which is why the variable names ?A, ?B, ?C are arbitrary), but again this is all well documented.

Related

Does SQL server optimize what data needs to be read?

I've been using BigQuery / Spark for a few years, but I'm not very familiar with SQL Server.
If I have a query like
WITH x AS (
    SELECT * FROM bigTable1
),
y AS (
    SELECT * FROM bigTable2
)
SELECT COUNT(1) FROM x
Will SQL Server be "clever" enough to ignore the pointless data fetching?
Note: Due to the configuration of my environment I don't have access to the query planner for troubleshooting.
Like most leading professional DBMSs, SQL Server has a statistics-based optimizer that will indeed eliminate data sources that are never used and cannot affect the result.
Note, however, that this does not apply to certain kinds of errors: if bigTable1 or bigTable2 does not exist (or you cannot access it), the query will throw a compile error, even though it would never actually read those tables.
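You can observe the same elimination in any engine that exposes its plan. Here is a sketch using SQLite rather than SQL Server (so only the principle carries over; table and CTE names are taken from the question):

```python
# SQLite, not SQL Server, but it illustrates the same principle: a CTE that
# is never referenced does not appear in the execution plan at all.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigTable1 (x INTEGER);
    CREATE TABLE bigTable2 (x INTEGER);
""")
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    WITH x AS (SELECT * FROM bigTable1),
         y AS (SELECT * FROM bigTable2)
    SELECT COUNT(1) FROM x
""").fetchall()
plan_text = " ".join(str(row) for row in plan)
print(plan_text)
# The unused CTE y is eliminated: bigTable2 is never scanned.
assert "bigTable2" not in plan_text
```

The compile-time caveat also holds in SQLite: drop bigTable2 and the statement fails to prepare even though the data would never be read.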
SQL Server has one of the most advanced optimizers among professional RDBMSs (IBM Db2, Oracle, ...).
Before optimizing, the algebrizer transforms the query, which is only a declarative "demand" (not executable code), into a mathematical object known as an algebraic tree. This object is a formula of the relational algebra that underpins relational DBMSs (a mathematical theory developed by Edgar F. Codd in the early 1970s).
The very first step of optimization is done at this stage, by simplifying the formula, much as you would with a polynomial expression in x and y (for example: 2x - 3y = 3x² - 5x + 7 <=> y = (7x - 3x² - 7) / 3).
Query example (from Chris Date's "A Cure for Madness"):
Given the table:
CREATE TABLE T_MAD (TYP CHAR(4), VAL VARCHAR(16));
INSERT INTO T_MAD VALUES
('ALFA', 'ted'),('ALFA', 'chris'),('ALFA', 'michael'),
('NUM', '123'),('NUM', '4567'),('NUM', '89');
This query will fail:
SELECT * FROM T_MAD
WHERE VAL > 1000;
because VAL is a string datatype, incompatible with a numeric value in the WHERE comparison.
But our table distinguishes ALFA values from NUM values. By adding a restriction on the TYP column like this:
SELECT * FROM T_MAD
WHERE TYP = 'NUM' AND VAL > 1000;
the query gives the right result...
But so do all of these queries:
SELECT * FROM T_MAD
WHERE VAL > 1000 AND TYP = 'NUM';
SELECT * FROM
(
SELECT * FROM T_MAD WHERE TYP = 'NUM'
) AS T
WHERE VAL > 1000;
SELECT * FROM
(
SELECT * FROM T_MAD WHERE VAL > 1000
) AS T
WHERE TYP = 'NUM';
The last one is the most interesting, because its subquery is the very query that failed on its own...
So why does this subquery suddenly not fail?
In fact the algebrizer rewrites all of these queries into a simpler form, which leads to a similar (not to say identical) formula...
Just have a look at the query execution plans, which turn out to be strictly identical!
NOTE that other DBMSs like MySQL, MariaDB, or PostgreSQL will fail on the last one... Building such an optimizer takes a very large pool of developers and researchers that free/open-source projects cannot match!
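The underlying point is that a SQL WHERE clause carries no evaluation-order guarantee, so the engine may rewrite and reorder predicates freely. A Python analogue I'm adding for illustration makes the contrast clear, because Python's `and` does guarantee left-to-right short-circuiting:

```python
# Python analogue of the "Cure for Madness" example (made-up rows matching
# the T_MAD table above). Unlike SQL's WHERE clause, Python's `and`
# guarantees left-to-right short-circuit evaluation, so the type guard
# reliably protects the numeric comparison.
rows = [("ALFA", "ted"), ("ALFA", "chris"), ("ALFA", "michael"),
        ("NUM", "123"), ("NUM", "4567"), ("NUM", "89")]

# Safe: typ == "NUM" short-circuits before int() ever sees "ted".
hits = [val for typ, val in rows if typ == "NUM" and int(val) > 1000]
print(hits)  # ['4567']

# Unsafe: evaluating the numeric test first blows up on "ted", just as the
# naive SQL query fails. SQL gives no such ordering guarantee either way,
# which is why the optimizer is free to rewrite all four queries alike.
try:
    [val for typ, val in rows if int(val) > 1000 and typ == "NUM"]
    guard_needed = False
except ValueError:
    guard_needed = True
print("guard needed:", guard_needed)  # guard needed: True
```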
Second, the optimizer has heuristic rules that apply essentially at the semantic level. The execution plan is simplified when contradictory conditions appear in the query text...
Just have a look at these two queries:
SELECT * FROM T_MAD WHERE 1 = 2;
SELECT * FROM T_MAD WHERE 1 = 1;
The first returns no rows, while the second returns all the rows of the table... What does the optimizer do? The query execution plan gives the answer:
The term "Analyse de constante" (Constant Scan) in the execution plan means that the optimizer will not access the table at all... Something similar happens with your last subquery...
Note that every constraint (PK, FK, UNIQUE, CHECK) can help the optimizer simplify the query execution plan and improve performance!
Third, the optimizer uses statistics, histograms computed over the data distribution, to predict how many rows will be manipulated at each step of the query execution plan...
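As a rough illustration (a toy model I'm adding, not SQL Server's actual histogram format), a histogram lets the planner estimate a row count without touching the data at plan time:

```python
# Toy model of histogram statistics: estimate the cardinality of a range
# predicate from precomputed bucket boundaries alone, never scanning the
# data at plan time.
from bisect import bisect_left

values = sorted(range(0, 10_000, 3))           # pretend column data
bucket_edges = values[::len(values) // 100]    # ~100 equi-height buckets
rows_per_bucket = len(values) / (len(bucket_edges) - 1)

def estimate_rows(threshold):
    """Estimate COUNT(*) WHERE col < threshold from the histogram alone."""
    full_buckets = bisect_left(bucket_edges, threshold) - 1
    return max(0, full_buckets) * rows_per_bucket

actual = sum(1 for v in values if v < 5000)
estimate = estimate_rows(5000)
print(f"estimated={estimate:.0f} actual={actual}")  # close, not exact
assert abs(estimate - actual) < 0.05 * actual
```

The estimate is approximate by construction, which is exactly why stale statistics can steer a real optimizer toward a bad plan.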
There is much more to say about the SQL Server query optimizer, for example the fact that it works backwards compared to other optimizers, a technique that has let it suggest missing indexes for many years, something other RDBMSs cannot do!
PS: sorry for using the French version of SSMS... I work in France and help professionals optimize their databases!

Oracle spatial SDO_RELATE: why better performance result from UNION ALL/INTERSECT combined individual specified mask

Initially I was trying to find out why it's so slow to do a spatial query with multiple SDO_RELATE calls in a single SELECT statement, like this one:
SELECT * FROM geom_table a
WHERE SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=inside')='TRUE' AND
SDO_RELATE(a.geom_column, SDO_GEOMETRY(...), 'mask=anyinteract')='TRUE';
Note the two SDO_GEOMETRY values may not necessarily be the same, so it's a bit different from SDO_RELATE(a.geom_column, the_same_geometry, 'mask=inside+anyinteract')='TRUE'.
Then I found this paragraph in the Oracle documentation for SDO_RELATE:
Although multiple masks can be combined using the logical Boolean
operator OR, for example, 'mask=touch+coveredby', better performance
may result if the spatial query specifies each mask individually and
uses the UNION ALL syntax to combine the results. This is due to
internal optimizations that Spatial can apply under certain conditions
when masks are specified singly rather than grouped within the same
SDO_RELATE operator call. (There are two exceptions, inside+coveredby
and contains+covers, where the combination performs better than the
UNION ALL alternative.) For example, consider the following query using the logical
Boolean operator OR to group multiple masks:
SELECT a.gid FROM polygons a, query_polys B WHERE B.gid = 1 AND
SDO_RELATE(A.Geometry, B.Geometry,
'mask=touch+coveredby') = 'TRUE';
The preceding query may result in better performance if it is
expressed as follows, using UNION ALL to combine results of multiple
SDO_RELATE operator calls, each with a single mask:
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
AND SDO_RELATE(A.Geometry, B.Geometry,
    'mask=touch') = 'TRUE'
UNION ALL
SELECT a.gid
FROM polygons a, query_polys B
WHERE B.gid = 1
AND SDO_RELATE(A.Geometry, B.Geometry,
    'mask=coveredby') = 'TRUE';
It somewhat answers my question, but it still only says "due to internal optimizations that Spatial can apply under certain conditions". So I have two questions:
What does "internal optimization" mean here; is it something to do with the spatial index? (I'm not sure if I'm too demanding with this question; maybe only developers at Oracle know.)
The Oracle documentation doesn't say anything about my original problem, i.e. SDO_RELATE(..., 'mask=inside') AND SDO_RELATE(..., 'mask=anyinteract') in a single SELECT. Why does it also have very bad performance? Does it work similarly to SDO_RELATE(..., 'mask=inside+anyinteract')?

Query to list all partitions in Datomic

What is a query to list all partitions of a Datomic database?
This should return
[[:db.part/db] [:db.part/tx] [:db.part/user] .... ]
where .... is all of the user defined partitions.
You should be able to get a list of all partitions in the database by searching for all entities associated with the :db.part/db entity via the :db.install/partition attribute:
(ns myns
  (:require [datomic.api :as d]))

(defn get-partitions [db]
  (d/q '[:find ?ident
         :where
         [:db.part/db :db.install/partition ?p]
         [?p :db/ident ?ident]]
       db))
Note
The current version of Datomic (build 0.8.3524) has a shortcoming: :db.part/tx and :db.part/user (two of the three built-in partitions) are treated specially and aren't actually associated with :db.part/db via :db.install/partition, so the result of the above query function won't include those two.
This problem is going to be addressed in a future build of Datomic. In the meantime, you should take care of including :db.part/tx and :db.part/user in the result set yourself.
1st method - using query
=> (q '[:find ?i
        :where
        [:db.part/db :db.install/partition ?p]
        [?p :db/ident ?i]]
      (db conn))
2nd method - from db object
(filter #(instance? datomic.db.Partition %) (:elements (db conn)))
The second method returns a sequence of datomic.db.Partition objects, which may be useful if we want additional info about a partition.
Both methods share a known bug/inconsistency: they don't return the built-in :db.part/tx and :db.part/user partitions.

Rails 3, ActiveRecord, PostgreSQL - ".uniq" command doesn't work?

I have the following query:
Article.joins(:themes => [:users]).where(["articles.user_id != ?", current_user.id]).order("Random()").limit(15).uniq
and it gives me the error
PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
LINE 1: ...s"."user_id" WHERE (articles.user_id != 1) ORDER BY Random() L...
When I update the original query to
Article.joins(:themes => [:users]).where(["articles.user_id != ?", current_user.id]).order("Random()").limit(15)#.uniq
the error is gone... In MySQL .uniq works; in PostgreSQL it doesn't. Is there an alternative?
As the error states, for SELECT DISTINCT, ORDER BY expressions must appear in the select list.
Therefore, you must explicitly select the expression you are ordering by.
Here is an example; it is similar to your case but generalized a bit:
Article.select('articles.*, RANDOM()')
.joins(:users)
.where(:column => 'whatever')
.order('Random()')
.uniq
.limit(15)
So, explicitly include your ORDER BY expression (in this case RANDOM()) using .select(). As shown above, for the query to return the Article attributes, you must also explicitly select them.
I hope this helps; good luck!
Just to enrich the thread with more examples: in case you have nested relations in the query, you can try the following statement.
Person.find(params[:id]).cars.select('cars.*, lower(cars.name)').order("lower(cars.name) ASC")
In this example, you're asking for all the cars of a given person, ordered by model name (Audi, Ferrari, Porsche).
I don't think this is a better way, but it may help to address this kind of situation by thinking in objects and collections instead of in a relational (database) way.
Thanks!
I assume that the .uniq method is translated to a DISTINCT clause in the SQL. PostgreSQL is picky (pickier than MySQL): with DISTINCT, every expression in the ORDER BY (and GROUP BY) clause must also appear in the select list.
It's a little unclear what you are attempting to do (a random ordering?). In addition to posting the full SQL sent, if you could explain your objective, that might help in finding an alternative.
I just upgraded my 100% working and tested application from 3.1.1 to 3.2.7 and now have this same PG::Error.
I am using CanCan...
@users = User.accessible_by(current_ability).order('lname asc').uniq
Removing the .uniq solves the problem, and it was not necessary anyway for this simple query.
I'm still looking through the change notes between 3.1.1 and 3.2.7 to see what caused this to break.

Why Is My Inline Table UDF so much slower when I use variable parameters rather than constant parameters?

I have a table-valued, inline UDF. I want to filter the results of that UDF to get one particular value. When I specify the filter using a constant parameter, everything is great and performance is almost instantaneous. When I specify the filter using a variable parameter, it takes a significantly larger chunk of time, on the order of 500x more logical reads and 20x greater duration.
The execution plan shows that in the variable parameter case the filter is not applied until very late in the process, causing multiple index scans rather than the seeks that are performed in the constant case.
I guess my questions are: Why, since I'm specifying a single filter parameter that is going to be highly selective against an indexed field, does my performance go into the weeds when that parameter is in a variable? Is there anything I can do about this?
Does it have something to do with the analytic function in the query?
Here are my queries:
CREATE FUNCTION fn_test()
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
SELECT DISTINCT GCN_SEQNO, Drug_package_version_ID
FROM
(
    SELECT COALESCE(ndctbla.GCN_SEQNO, ndctblb.GCN_SEQNO) AS GCN_SEQNO,
           dpv.Drug_package_version_ID,
           ROW_NUMBER() OVER (PARTITION BY dpv.Drug_package_version_id
                              ORDER BY ndctbla.GCN_SEQNO DESC) AS Predicate
    FROM dbo.Drug_Package_Version dpv
    LEFT JOIN dbo.NDC ndctbla ON ndctbla.NDC = dpv.Sp_package_code
    LEFT JOIN dbo.NDC ndctblb ON ndctblb.SPC_NDC = dpv.Sp_package_code
) iq
WHERE Predicate = 1
GO
GRANT SELECT ON fn_test TO public
GO
-- very fast
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = 10000
GO
-- comparatively slow
DECLARE @dpvid int
SET @dpvid = 10000
SELECT GCN_SEQNO
FROM dbo.fn_test()
WHERE Drug_package_version_id = @dpvid
Once you create a new projection through a UDF, it can't be expected that your indexes will still apply on the columns that are indexed on the original table and included in the projection. When you filter on the projection (and not in the UDF against the original table with the indexes) the indexes no longer apply.
What you want to do is parameterize the function to take in the parameter.
If you find that you have too many fields that you want to set parameters on, then you might want to take a look at indexed views, as you can create your projection and index it as well and then run queries against that.
Simply put, the constant is easy for the planner to evaluate; the local variable is not, especially with the ranking function and the Predicate = 1 filter.
Paraphrasing casparOne, you need to push the filter as far inwards as possible so that you filter on dpv.Drug_package_version_id inside the iq derived table.
If you do that, then you also have no need for the PARTITION BY because you have only a single dpv.Drug_package_version_id. Then you can do a cleaner ...TOP 1 ... ORDER BY ndctbla.GCN_SEQNO DESC.
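A sketch of that rewrite, using SQLite and a made-up toy schema standing in for the question's tables, shows that once the id filter is pushed inside the derived table, the PARTITION BY becomes unnecessary and both forms return the same row:

```python
# SQLite demo (toy schema, not the real Drug_Package_Version/NDC tables):
# pushing the id filter inside the derived table leaves a single group,
# so the PARTITION BY can be dropped entirely.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dpv (id INTEGER, code TEXT);
    CREATE TABLE ndc (code TEXT, gcn INTEGER);
    INSERT INTO dpv VALUES (10000, 'n1'), (10001, 'n2');
    INSERT INTO ndc VALUES ('n1', 5), ('n1', 9), ('n2', 3);
""")

# Filter applied outside; PARTITION BY needed (mirrors the original UDF):
outer = conn.execute("""
    SELECT gcn FROM (
        SELECT ndc.gcn, dpv.id,
               ROW_NUMBER() OVER (PARTITION BY dpv.id
                                  ORDER BY ndc.gcn DESC) AS rn
        FROM dpv LEFT JOIN ndc ON ndc.code = dpv.code
    ) AS iq WHERE rn = 1 AND id = 10000
""").fetchall()

# Filter pushed inside; one group remains, so no PARTITION BY:
pushed = conn.execute("""
    SELECT gcn FROM (
        SELECT ndc.gcn,
               ROW_NUMBER() OVER (ORDER BY ndc.gcn DESC) AS rn
        FROM dpv LEFT JOIN ndc ON ndc.code = dpv.code
        WHERE dpv.id = 10000
    ) AS iq WHERE rn = 1
""").fetchall()

print(outer, pushed)  # [(9,)] [(9,)]
assert outer == pushed
```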
The responses I got were good, and I learned from them, but I think I've found an answer that satisfies me.
I do think it's the use of the PARTITION BY clause that is causing the problem here. I reformulated the UDF using a variant of the self-join idiom:
SELECT t1.A, t1.B, t1.C
FROM T t1
INNER JOIN
(
    SELECT A, MAX(C) AS C
    FROM T
    GROUP BY A
) t2 ON t1.A = t2.A AND t1.C = t2.C
Ironically, this is more performant than using the SQL 2008-specific query, and also the optimizer doesn't have a problem with joining this version of the query using variables rather than constants. At this point, I'm concluding that the optimizer just doesn't handle the more recent SQL extensions as well as the older stuff. As a bonus, I can make use of the UDF now, in my pre-upgraded SQL 2000 platforms.
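For what it's worth, the self-join idiom can be checked quickly with SQLite (sample table and data are made up for illustration): for each A it keeps the row whose C is the group maximum.

```python
# Verifying the greatest-per-group self-join idiom on a tiny sample table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T (A TEXT, B TEXT, C INTEGER);
    INSERT INTO T VALUES
        ('x', 'b1', 1), ('x', 'b2', 5),
        ('y', 'b3', 2), ('y', 'b4', 9);
""")
rows = conn.execute("""
    SELECT t1.A, t1.B, t1.C
    FROM T t1
    INNER JOIN (
        SELECT A, MAX(C) AS C FROM T GROUP BY A
    ) t2 ON t1.A = t2.A AND t1.C = t2.C
    ORDER BY t1.A
""").fetchall()
print(rows)  # [('x', 'b2', 5), ('y', 'b4', 9)]
```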
Thanks for your help, everyone!
