Database: Intra-row calculations / calculated rows

I plan to design a database model for a Business Intelligence system that stores business figures for a set of locations and a set of years.
Some of these figures should be calculated from other figures for the same year and the same location. In the following text I'll call figures that are not being calculated "basic figures". To store the basic figures, a table design with these columns would make sense:
| year | location_id | goods_costs | marketing_costs | warehouse_costs | administrative_costs |
Using this table I could create a view that calculates all other necessary figures:
CREATE VIEW all_figures AS
SELECT *,
       goods_costs + marketing_costs + warehouse_costs + administrative_costs
       AS total_costs
FROM basic_figures;
This would be great if I didn't run into the following problems:
Most databases (including MySQL, which I'm planning to use [edit: but which I'm not bound to]) have some kind of column count or row size limit. Since I have to store a lot of figures (and have to calculate even more), I'd exceed this limit.
It is not uncommon that new figures have to be added. (Adding a figure would require changing the table design, and since such schema changes usually perform poorly, they would block access to the table for quite a long time.)
I also have to store additional information for each figure, e.g. a description and a unit (all figures are decimal numbers, but some might be in US$/EUR whereas others might be in %). I'd have to make sure that the basic_figures table, the all_figures view and the table containing the figure information are all correctly updated if anything changes. (This is more a data normalization problem than a technical/implementation problem.)
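For illustration, a minimal sketch of such a figure-information table (the name and columns are just placeholders):
CREATE TABLE figure_info (
    figure_id   VARCHAR(50) PRIMARY KEY,  -- e.g. 'goods_costs'
    description VARCHAR(255),
    unit        VARCHAR(10)               -- e.g. 'USD', 'EUR', '%'
);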
~~
Therefore I considered using this table design:
+------+-------------+-------------+-------+
| year | location_id | figure_id   | value |
+------+-------------+-------------+-------+
| 2009 | 1           | goods_costs | 300   |
...
This entity-attribute-value-like design could be a first solution for these three issues. However, it would also have a new downside: Calculations get messy. Really messy.
To build a view similar to the one above, I'd have to use a query like this:
(SELECT * FROM basic_figures_eav)
UNION ALL
(SELECT a.year_id, a.location_id, "total_costs", a.value + b.value + c.value + d.value
FROM basic_figures_eav a
INNER JOIN basic_figures_eav b ON a.year_id = b.year_id AND a.location_id = b.location_id AND b.figure_id = "marketing_costs"
INNER JOIN basic_figures_eav c ON a.year_id = c.year_id AND a.location_id = c.location_id AND c.figure_id = "warehouse_costs"
INNER JOIN basic_figures_eav d ON a.year_id = d.year_id AND a.location_id = d.location_id AND d.figure_id = "administrative_costs"
WHERE a.figure_id = "goods_costs");
Isn't that a beauty? And notice that this is just the query for ONE figure. All other calculated figures (of which there are many, as I wrote above) would also have to be UNIONed with this query.
~~
After this long explanation of my problems, I now conclude with my actual questions:
Which database design would you suggest? / Would you use one of the two designs above? (If yes, which and why? If no, why?)
Do you have a suggestion for a completely different approach? (Which I would very, very much appreciate!)
Should the database actually be the one that does the calculations after all? Does it make more sense to move the calculation to the application logic and simply store the results?
By the way: I already asked a similar question on the MySQL forums. However, since answers were a bit sparse and this is not just a MySQL issue after all, I completely rewrote my question and posted it here. (So this is not a cross-post.) Here's the link to the thread there: http://forums.mysql.com/read.php?125,560752,560752#msg-560752

The question is (at least somewhat) DBMS specific.
If you can consider other DBMSs, you might want to look at PostgreSQL and its hstore datatype, which is essentially a key/value map.
The downside of that is that you lose datatype checking, as everything is stored as a string in the map.
The design that you are aiming at is called "Entity Attribute Value". You might want to find other alternatives as well.
Edit: here is an example of how this could be used:
Table setup
CREATE TABLE basic_figures
(
year_id integer,
location_id integer,
figures hstore
);
insert into basic_figures (year_id, location_id, figures)
values
(1, 1, hstore ('marketing_costs => 200, goods_costs => 100, warehouse_costs => 400')),
(1, 2, hstore ('marketing_costs => 50, goods_costs => 75, warehouse_costs => 250')),
(1, 3, hstore ('administrative_costs => 100'));
Basic select
select year_id,
location_id,
to_number(figures -> 'marketing_costs', 'FM999999') as marketing_costs,
to_number(figures -> 'goods_costs', 'FM999999') as goods_costs,
to_number(figures -> 'warehouse_costs', 'FM999999') as warehouse_costs,
to_number(figures -> 'administrative_costs', 'FM999999') as administrative_costs
from basic_figures bf;
It's probably easier to create a view that hides the conversion of the hstore values. The downside of that is that the view needs to be re-created each time a new cost type is added.
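Such a view could look like this (the view name is made up; the columns follow the sample data above):
create or replace view figures_view as
select year_id,
       location_id,
       to_number(figures -> 'marketing_costs', 'FM999999') as marketing_costs,
       to_number(figures -> 'goods_costs', 'FM999999') as goods_costs,
       to_number(figures -> 'warehouse_costs', 'FM999999') as warehouse_costs,
       to_number(figures -> 'administrative_costs', 'FM999999') as administrative_costs
from basic_figures;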
Getting the totals
To get the sum of all costs for each year_id/location_id you can use the following statement:
SELECT year_id,
location_id,
sum(to_number(value, '99999')) as total
FROM (
SELECT year_id,
location_id,
(each(figures)).key,
(each(figures)).value
FROM basic_figures
) AS data
GROUP BY year_id, location_id;
year_id | location_id | total
---------+-------------+-------
1 | 3 | 100
1 | 2 | 375
1 | 1 | 700
That could be joined to the query above, but it's probably faster and easier to use if you create a function that calculates the total for all keys in a single hstore column:
Function to sum the totals
create or replace function sum_hstore(figures hstore)
returns bigint
as
$body$
declare
result bigint;
figure_values text[];
begin
result := 0;
figure_values := avals(figures);
for i in 1..array_length(figure_values, 1) loop
result := result + to_number(figure_values[i], '999999');
end loop;
return result;
end;
$body$
language plpgsql;
That function can easily be used in the first select:
select bf.year_id,
bf.location_id,
to_number(bf.figures -> 'marketing_costs', '99999999') as marketing_costs,
to_number(bf.figures -> 'goods_costs', '99999999') as goods_costs,
to_number(bf.figures -> 'warehouse_costs', '99999999') as warehouse_costs,
to_number(bf.figures -> 'administrative_costs', '99999999') as administrative_costs,
sum_hstore(bf.figures) as total
from basic_figures bf;
Automatic view creation
The following PL/pgSQL block can be used to (re-)create a view that contains one column for each key in the figures column plus the totals based on the sum_hstore function above:
do
$body$
declare
create_sql text;
types record;
begin
create_sql := 'create or replace view extended_figures as select year_id, location_id ';
for types in SELECT distinct (each(figures)).key as type_name FROM basic_figures loop
create_sql := create_sql || ', to_number(figures -> '''||types.type_name||''', ''9999999'') as '||types.type_name;
end loop;
create_sql := create_sql ||', sum_hstore(figures) as total from basic_figures';
execute create_sql;
end;
$body$
language plpgsql;
After running that block you can simply do:
select *
from extended_figures
and you'll get as many columns as there are different cost types.
Note that there is no error checking at all if the values in the hstore are actually numbers. That could potentially be done with a trigger.
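A rough sketch of such a trigger (function and trigger names are made up; it simply rejects any value that is not a plain number):
create or replace function basic_figures_check_numeric()
returns trigger as
$body$
declare
    v text;
begin
    if new.figures is not null then
        foreach v in array avals(new.figures) loop
            -- reject anything that is not a plain (optionally decimal) number
            if v !~ '^[0-9]+(\.[0-9]+)?$' then
                raise exception 'figure value "%" is not a number', v;
            end if;
        end loop;
    end if;
    return new;
end;
$body$
language plpgsql;

create trigger trg_basic_figures_check
before insert or update on basic_figures
for each row execute procedure basic_figures_check_numeric();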

This is a way to "denormalise" (pivot) an EAV table without needing a PIVOT operator. Note the LEFT JOIN and the COALESCE, which cause non-existent rows to appear as "zero cost".
NOTE: I had to change the quoting of the string literals to single quotes.
CREATE TABLE basic_figures_eav
( year_id INTEGER
, location_id INTEGER
, figure_id varchar
, value INTEGER
);
INSERT INTO basic_figures_eav ( year_id , location_id , figure_id , value ) VALUES
(1,1,'goods_costs', 100)
, (1,1,'marketing_costs', 200)
, (1,1,'warehouse_costs', 400)
, (1,1,'administrative_costs', 800)
, (1,2,'goods_costs', 100)
, (1,2,'marketing_costs', 200)
, (1,2,'warehouse_costs', 400)
, (1,3,'administrative_costs', 800)
;
SELECT x.year_id, x.location_id
, COALESCE (a.value,0) AS goods_costs
, COALESCE (b.value,0) AS marketing_costs
, COALESCE (c.value,0) AS warehouse_costs
, COALESCE (d.value,0) AS administrative_costs
--
, COALESCE (a.value,0)
+ COALESCE (b.value,0)
+ COALESCE (c.value,0)
+ COALESCE (d.value,0)
AS total_costs
-- need this to get all the {year_id,location_id} combinations
-- that have at least one tuple in the EAV table
FROM (
SELECT DISTINCT year_id, location_id
FROM basic_figures_eav
-- WHERE <selection of wanted observations>
) AS x
LEFT JOIN basic_figures_eav a ON a.year_id = x.year_id AND a.location_id = x.location_id AND a.figure_id = 'goods_costs'
LEFT JOIN basic_figures_eav b ON b.year_id = x.year_id AND b.location_id = x.location_id AND b.figure_id = 'marketing_costs'
LEFT JOIN basic_figures_eav c ON c.year_id = x.year_id AND c.location_id = x.location_id AND c.figure_id = 'warehouse_costs'
LEFT JOIN basic_figures_eav d ON d.year_id = x.year_id AND d.location_id = x.location_id AND d.figure_id = 'administrative_costs'
;
Result:
CREATE TABLE
INSERT 0 8
year_id | location_id | goods_costs | marketing_costs | warehouse_costs | administrative_costs | total_costs
---------+-------------+-------------+-----------------+-----------------+----------------------+-------------
1 | 3 | 0 | 0 | 0 | 800 | 800
1 | 2 | 100 | 200 | 400 | 0 | 700
1 | 1 | 100 | 200 | 400 | 800 | 1500
(3 rows)

I just want to point out that the second half of your query is needlessly complicated. You can do:
(SELECT a.year_id, a.location_id, "total_costs",
sum(a.value)
FROM basic_figures_eav a
where a.figure_id in ('marketing_costs', 'warehouse_costs', 'administrative_costs',
'goods_costs')
group by a.year_id, a.location_id
)
Although this uses an aggregation, with a composite index on year_id, location_id, and figure_id, the performance should be similar.
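For example, such an index could be created like this (index name assumed):
CREATE INDEX idx_bfe_year_loc_figure ON basic_figures_eav (year_id, location_id, figure_id);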
As for the rest of your question, there is a problem with databases limiting the number of columns. I would suggest that you put your base data in a table, with an auto-incremented primary key. Then, create summary tables, linked by the same primary key.
In many environments, you can recreate the summary tables once per day or once per night. If you need real time information, you can use stored procedures/triggers to update the data. That is, when data is updated or inserted, then it can be modified in the summary tables.
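A rough sketch of that base-plus-summary split, in MySQL-flavored DDL since that is what the question mentions (all names are made up):
CREATE TABLE figure_base (
    figure_row_id   INT AUTO_INCREMENT PRIMARY KEY,  -- shared key
    year            INT NOT NULL,
    location_id     INT NOT NULL,
    goods_costs     DECIMAL(15,2),
    marketing_costs DECIMAL(15,2)
    -- further basic figures; split over more such tables if the column limit looms
);

CREATE TABLE figure_summary (
    figure_row_id INT PRIMARY KEY,                   -- same key as figure_base
    total_costs   DECIMAL(15,2),
    FOREIGN KEY (figure_row_id) REFERENCES figure_base (figure_row_id)
);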
Also, I tried to find out if calculated/computed columns in SQL Server count against the maximum number of columns in the table (1,024). I wasn't able to find anything definitive. This is easy enough to test, but I'm not near a database right now.

Related

SQL Server: Performance issue: OR statement substitute in WHERE clause

I want to select only the records from table Stock based on the column PostingDate.
The PostingDate should be after the InitDate in another table called InitClient. However, there are currently two clients in both tables (client 1 and client 2), each with a different InitDate.
With the code below I get exactly what I need, based on the sample data also included underneath. However, two problems arise: first, with millions of records the query takes way too long (hours); second, it isn't dynamic at all, so it has to be adjusted every time a new client is added.
A potential option to address the performance issue would be to write two separate queries, one for client 1 and one for client 2, with a UNION in between. Unfortunately, that isn't dynamic enough either, since more clients are possible.
SELECT
Material
,Stock
,Stock.PostingDate
,Stock.Client
FROM Stock
LEFT JOIN (SELECT InitDate FROM InitClient where Client = 1) C1 ON 1=1
LEFT JOIN (SELECT InitDate FROM InitClient where Client = 2) C2 ON 1=1
WHERE
(
(Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
(Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
)
Sample dataset:
CREATE TABLE InitClient
(
Client varchar(300),
InitDate date
);
INSERT INTO InitClient (Client,InitDate)
VALUES
('1', '5/1/2021'),
('2', '1/31/2021');
SELECT * FROM InitClient
CREATE TABLE Stock
(
Material varchar(300),
PostingDate varchar(300),
Stock varchar(300),
Client varchar(300)
);
INSERT INTO Stock (Material,PostingDate,Stock,Client)
VALUES
('322', '1/1/2021', '5', '1'),
('101', '2/1/2021', '5', '2'),
('322', '3/2/2021', '10', '1'),
('101', '4/13/2021', '5', '1'),
('400', '5/11/2021', '170', '2'),
('401', '6/20/2021', '200', '1'),
('322', '7/20/2021', '160', '2'),
('400', '8/9/2021', '93', '2');
SELECT * FROM Stock
Desired result, but then with a substitute for the OR statement to ramp up the performance:
| Material | PostingDate | Stock | Client |
|----------|-------------|-------|--------|
| 322 | 1/1/2021 | 5 | 1 |
| 101 | 2/1/2021 | 5 | 2 |
| 322 | 3/2/2021 | 10 | 1 |
| 101 | 4/13/2021 | 5 | 1 |
| 400 | 5/11/2021 | 170 | 2 |
| 401 | 6/20/2021 | 200 | 1 |
| 322 | 7/20/2021 | 160 | 2 |
| 400 | 8/9/2021 | 93 | 2 |
Any suggestions for a substitute in the above code that keeps performance while making it dynamic?
You can optimize this query quite a bit.
Firstly, those two LEFT JOINs are basically just semi-joins, because you don't actually return any results from them. So we can turn them into a single EXISTS.
You will also get an implicit conversion to int, because Client is varchar and 1,2 is an int. So change that to '1','2', or you could change the column type.
PostingDate is also varchar; it should really be date.
SELECT
s.Material
,s.Stock
,s.PostingDate
,s.Client
FROM Stock s
WHERE s.Client IN ('1','2')
AND EXISTS (SELECT 1
FROM InitClient c
WHERE s.PostingDate > c.InitDate
AND c.Client = s.Client
);
Next you want to look at indexing. For this query (not accounting for any other queries being run), you probably want the following indexes (remove the INCLUDE for a clustered index)
InitClient (Client, InitDate)
Stock (Client) INCLUDE (PostingDate, Material, Stock)
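In concrete DDL that might look like this (index names are made up):
CREATE INDEX IX_InitClient_Client_InitDate ON InitClient (Client, InitDate);
CREATE INDEX IX_Stock_Client ON Stock (Client) INCLUDE (PostingDate, Material, Stock);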
It is possible that even with these indexes you may get a scan on Stock, because IN functions like an OR. This does not always happen; it's worth checking. If so, you can instead rewrite this to use UNION ALL:
SELECT
s.Material
,s.Stock
,s.PostingDate
,s.Client
FROM (
SELECT *
FROM Stock s
WHERE s.Client = '1'
UNION ALL
SELECT *
FROM Stock s
WHERE s.Client = '2'
) s
WHERE EXISTS (SELECT 1
FROM InitClient c
WHERE s.PostingDate > c.InitDate
AND c.Client = s.Client
);
db<>fiddle
There is nothing wrong with expecting your query to be dynamic. However, in order to make it more performant, you may need to reach a compromise between two conflicting expectations. I will present a few ways to optimize your query; some of them involve drastic changes, but ultimately it is you or your client who decides how this needs to be improved. Also, some of the improvements might be ineffective, so do not take anything for granted and test everything. Without further ado, let's see the suggestions.
The query
First I would try to change the query a little, maybe something like this could help you
SELECT
Material
,Stock
,Stock.PostingDate
,C1.InitDate
,C2.InitDate
,Stock.Client
FROM Stock
LEFT JOIN InitClient C1 ON C1.Client = 1
LEFT JOIN InitClient C2 ON C2.Client = 2
WHERE
(
(Stock.Client = 1 AND Stock.PostingDate > C1.InitDate) OR
(Stock.Client = 2 AND Stock.PostingDate > C2.InitDate)
)
Sometimes the simple step of getting rid of subselects does the trick.
The indexes
You may want to speed up your process by creating indexes, for example on Stock.PostingDate.
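For example (index name assumed):
CREATE INDEX IX_Stock_PostingDate ON Stock (PostingDate);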
Helper table
You can create a helper table where you store the relevant data from the Stock records. You perform the slow query only once in a while, maybe once a week or each time a new client enters the stage, and store the results in the helper table. Once the prerequisite calculation is done, you will be able to query only the helper table with its few records, reaching lightning-fast behavior. The idea is to execute the slow query rarely, cache/store the results, and reuse them instead of recalculating every time.
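A rough sketch of that helper-table idea (the table name StockHelper is made up):
-- run this rarely, e.g. from a scheduled job (drop/recreate or truncate-and-insert on each refresh)
SELECT s.Material, s.Stock, s.PostingDate, s.Client
INTO StockHelper
FROM Stock s
JOIN InitClient c ON c.Client = s.Client
WHERE s.PostingDate > c.InitDate;

-- the application then reads only the small helper table
SELECT Material, Stock, PostingDate, Client FROM StockHelper;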
A new column
You could create a column in your Stock table named InitDate and fill that with data for each record periodically. It will take a long while at the first execution, but then you will be able to query only the Stock table without joins and subselects.
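A sketch of that extra-column idea (to be refreshed periodically):
ALTER TABLE Stock ADD InitDate date;

UPDATE s
SET    s.InitDate = c.InitDate
FROM   Stock s
JOIN   InitClient c ON c.Client = s.Client;

-- the filter then becomes a plain comparison, without joins or OR
SELECT Material, Stock, PostingDate, Client
FROM   Stock
WHERE  PostingDate > InitDate;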

T-SQL Verify each character of each value in a column against values of another table

I have a table in a database with a column which has values like XX-xx-cccc-ff-gg. Let's assume this is table ABC and the column is called ABC_FORMAT_STR. In another table, ABC_FORMAT_ELEMENTS, I have a column called CHARS with values like A, B, C, D, ..., X, Y, Z, a, d, f, g, x, y, z, etc. (please don't assume I have all ASCII values there; it's mainly some letters and numbers plus some special characters like *, ;, -, & etc.).
I need to add a constraint on the [ABC].[ABC_FORMAT_STR] column, such that each and every character of every value of that column exists in [ABC_FORMAT_ELEMENTS].[CHARS].
Is this possible? Can someone help me with this?
Thank you very much in advance.
This is an example with simple names, keeping the names of the objects above for clarity:
Example
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Adam
SELECT [CHARS] FROM [ABC_FORMAT_ELEMENTS]
A
G
N
a
c
e
g
i
k
o
r
After the constraint:
SELECT [ABC_FORMAT_STR] FROM [ABC]
Nick
George
Note on the result:
"Adam" cannot be included because "d" and "m" character are not in [ABC_FORMAT_ELEMENTS] table.
Here is a simple and most natural solution based on the TRANSLATE() function.
It will work starting from SQL Server 2017 onwards.
SQL
-- DDL and sample data population, start
DECLARE @ABC TABLE (ABC_FORMAT_STR VARCHAR(50));
INSERT INTO @ABC VALUES
('Nick'),
('George'),
('Adam');
DECLARE @ABC_FORMAT_ELEMENTS TABLE (CHARS CHAR(1));
INSERT INTO @ABC_FORMAT_ELEMENTS VALUES
('A'), ('G'), ('N'),('a'), ('c'), ('e'), ('g'),
('i'), ('k'), ('o'), ('r');
-- DDL and sample data population, end
SELECT a.*
, t1.legitChars
, t2.badChars
FROM @ABC AS a
CROSS APPLY (SELECT STRING_AGG(CHARS, '') FROM @ABC_FORMAT_ELEMENTS) AS t1(legitChars)
CROSS APPLY (SELECT TRANSLATE(a.ABC_FORMAT_STR, t1.legitChars, SPACE(LEN(t1.legitChars)))) AS t2(badChars)
WHERE TRIM(t2.badChars) = '';
Output
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
+----------------+-------------+----------+
Output with WHERE clause commented out
Just to see why the row with the 'Adam' value was filtered out.
+----------------+-------------+----------+
| ABC_FORMAT_STR | legitChars | badChars |
+----------------+-------------+----------+
| Nick | AGNacegikor | |
| George | AGNacegikor | |
| Adam | AGNacegikor | d m |
+----------------+-------------+----------+
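The question asked for an actual constraint. One way to wrap the same TRANSLATE() check into a CHECK constraint might be a scalar UDF (a sketch with made-up object names, against real tables rather than the table variables above; note that SQL Server only evaluates such a constraint on INSERT/UPDATE of ABC, not when ABC_FORMAT_ELEMENTS changes):
CREATE FUNCTION dbo.fn_OnlyAllowedChars (@s VARCHAR(50))
RETURNS bit
AS
BEGIN
    DECLARE @legit VARCHAR(4000);
    SELECT @legit = STRING_AGG(CHARS, '') FROM dbo.ABC_FORMAT_ELEMENTS;

    RETURN CASE
               WHEN TRIM(TRANSLATE(@s, @legit, SPACE(LEN(@legit)))) = '' THEN 1
               ELSE 0
           END;
END;
GO

ALTER TABLE dbo.ABC
    ADD CONSTRAINT CK_ABC_FORMAT_STR CHECK (dbo.fn_OnlyAllowedChars(ABC_FORMAT_STR) = 1);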
Based on your sample data, here's one method to identify valid/invalid rows in ABC. You could easily adapt this to be part of a trigger that checks rows in the inserted pseudo-table and rolls back if any rows violate the criteria.
This uses a tally/numbers table (very often used for splitting strings). This example defines one using a CTE, but a permanent solution would have a permanent numbers table to reuse.
The logic is to split the strings into rows and then count the rows that exist in the lookup table and reject any with a count of rows that is less than the length of the string.
with
numbers (n) as (select top 100 Row_Number() over (order by (select null)) from sys.messages ),
strings as (
select a.ABC_FORMAT_STR, Count(*) over(partition by a.ABC_FORMAT_STR) n
from abc a cross join numbers n
where n.n<=Len(a.ABC_FORMAT_STR)
and exists (select * from ABC_FORMAT_ELEMENTS e where e.chars=Substring(a.ABC_FORMAT_STR,n,1))
)
select ABC_FORMAT_STR
from strings
where Len(ABC_FORMAT_STR)=n
group by ABC_FORMAT_STR
/* change to where Len(ABC_FORMAT_STR) <> n to find rows that aren't allowed */
See this DB Fiddle

How to respect the order of an array in a PostgreSQL select sentence

This is my (extremely simplified) product table and some test data.
drop table if exists product cascade;
create table product (
product_id integer not null,
reference varchar,
price decimal(13,4),
primary key (product_id)
);
insert into product (product_id, reference, price) values
(1001, 'MX-232', 100.00),
(1011, 'AX-232', 20.00),
(1003, 'KKK 11', 11.00),
(1004, 'OXS SUPER', 0.35),
(1005, 'ROR-MOT', 200.00),
(1006, '234PPP', 30.50),
(1007, 'T555-NS', 110.25),
(1008, 'LM234-XS', 101.20),
(1009, 'MOTOR-22', 12.50),
(1010, 'MOTOR-11', 30.00),
(1002, 'XUL-XUL1', 40.00);
In real life, listing product columns is a tough task, full of joins, case-when-end clauses, etc. On the other hand, there is a large number of queries to be fulfilled, such as products by brand, featured products, products by title, by tags, by price range, etc.
I don't want to repeat and maintain the complex product column listings every time I perform a query, so my current approach is to break query processing into two tasks:
encapsulate the query in functions of type select_products_by_xxx(), that return product_id arrays, properly selected and ordered.
encapsulate all the product column complexity in a unique function list_products() that takes a product_id array as a parameter.
execute select * from list_products(select_products_by_xxx()) to obtain the desired result for every xxx function.
For example, to select product_id in reverse order (in case this were a meaningful selection for the application), a function like this would do the trick.
create or replace function select_products_by_inverse ()
returns int[]
as $$
select
array_agg(product_id order by product_id desc)
from
product;
$$ language sql;
It can be tested to work as
select * from select_products_by_inverse();
select_products_by_inverse |
--------------------------------------------------------|
{1011,1010,1009,1008,1007,1006,1005,1004,1003,1002,1001}|
To encapsulate the "listing" part of the query I use this function (again, extremely simplified and without any join or case for the benefit of the example).
create or replace function list_products (
tid int[]
)
returns table (
id integer,
reference varchar,
price decimal(13,4)
)
as $$
select
product_id,
reference,
price
from
product
where
product_id = any (tid);
$$ language sql;
It works, but does not respect the order of products in the passed array.
select * from list_products(select_products_by_inverse());
id |reference|price |
----|---------|--------|
1001|MX-232 |100.0000|
1011|AX-232 | 20.0000|
1003|KKK 11 | 11.0000|
1004|OXS SUPER| 0.3500|
1005|ROR-MOT |200.0000|
1006|234PPP | 30.5000|
1007|T555-NS |110.2500|
1008|LM234-XS |101.2000|
1009|MOTOR-22 | 12.5000|
1010|MOTOR-11 | 30.0000|
1002|XUL-XUL1 | 40.0000|
So the problem is that I am passing a custom-ordered array of product_id values, but the list_products() function does not respect the order inside the array.
Obviously, I could include an order by clause in list_products(), but remember that the ordering must be determined by the select_products_by_xxx() functions to keep list_products() as a single shared function.
Any idea?
EDIT
@adamkg's solution is simple and works: adding a universal order by clause like this:
order by array_position(tid, product_id);
However, this means ordering products twice: first inside select_products_by_xxx() and then inside list_products().
An explain exploration renders the following result:
QUERY PLAN |
----------------------------------------------------------------------|
Sort (cost=290.64..290.67 rows=10 width=56) |
Sort Key: (array_position(select_products_by_inverse(), product_id))|
-> Seq Scan on product (cost=0.00..290.48 rows=10 width=56) |
Filter: (product_id = ANY (select_products_by_inverse())) |
Now I am wondering if there is any other better approach to reduce cost, keeping separability between functions.
I see two promising strategies:
As for the EXPLAIN output and the issue itself, it seems that a complete scan of table product is being done inside list_products(). As there may be thousands of products, a better approach would be to scan the passed array instead.
The xxx functions can be refactored to return setof int instead of int[]. However, a set cannot be passed as a function parameter.
For long arrays you typically get (much!) more efficient query plans by unnesting the array and joining to the main table. In simple cases, this even preserves the original order of the array without adding ORDER BY. Rows are processed in order. But there are no guarantees, and the order may be broken with more joins or with parallel execution etc. To be sure, add WITH ORDINALITY:
CREATE OR REPLACE FUNCTION list_products (tid int[]) -- VARIADIC?
RETURNS TABLE (
id integer,
reference varchar,
price decimal(13,4)
)
LANGUAGE sql STABLE AS
$func$
SELECT product_id, p.reference, p.price
FROM unnest(tid) WITH ORDINALITY AS t(product_id, ord)
JOIN product p USING (product_id) -- LEFT JOIN ?
ORDER BY t.ord
$func$;
Fast, simple, safe. See:
PostgreSQL unnest() with element number
Join against the output of an array unnest without creating a temp table
You might want to throw in the modifier VARIADIC, so you can call the function with an array or a list of IDs (max 100 items by default). See:
Return rows matching elements of input array in plpgsql function
Call a function with composite type as argument from native query in jpa
Function to select column values by comparing against comma separated list
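For instance, if the parameter were declared as VARIADIC tid int[] (an assumption, not shown above), both call styles would work:
SELECT * FROM list_products(VARIADIC ARRAY[1011, 1001]);  -- pass an array explicitly
SELECT * FROM list_products(1011, 1001);                  -- or a plain list of IDs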
I would declare the function with STABLE volatility.
You might use LEFT JOIN instead of JOIN to make sure that all given IDs are returned - with NULL values if a row with given ID has gone missing.
db<>fiddle here
Note a subtle logic difference with duplicates in the array. While product_id is UNIQUE ...
unnest + left join returns exactly one row for every given ID - preserving duplicates in the given IDs if any.
product_id = any (tid) folds duplicates. (One of the reasons it typically results in more expensive query plans.)
If there are no dupes in the given array, there is no difference. If there can be duplicates and you want to fold them, your task is ambiguous, as it's undefined which position to keep.
You're very close, all you need to add is ORDER BY array_position(tid, product_id).
testdb=# create or replace function list_products (
tid int[]
)
returns table (
id integer,
reference varchar,
price decimal(13,4)
)
as $$
select
product_id,
reference,
price
from
product
where
product_id = any (tid)
-- add this:
order by array_position(tid, product_id);
$$ language sql;
CREATE FUNCTION
testdb=# select * from list_products(select_products_by_inverse());
id | reference | price
------+-----------+----------
1011 | AX-232 | 20.0000
1010 | MOTOR-11 | 30.0000
1009 | MOTOR-22 | 12.5000
1008 | LM234-XS | 101.2000
1007 | T555-NS | 110.2500
1006 | 234PPP | 30.5000
1005 | ROR-MOT | 200.0000
1004 | OXS SUPER | 0.3500
1003 | KKK 11 | 11.0000
1002 | XUL-XUL1 | 40.0000
1001 | MX-232 | 100.0000
(11 rows)

ON CONFLICT DO UPDATE/DO NOTHING not working on FOREIGN TABLE

The ON CONFLICT DO UPDATE/DO NOTHING feature arrived in PostgreSQL 9.5.
CREATE SERVER and FOREIGN TABLE support arrived in PostgreSQL 9.2.
When I use ON CONFLICT DO UPDATE on a FOREIGN TABLE it does not work,
but when I run the same query on a normal table it works. The queries are given below.
-- For normal table
INSERT INTO app
(app_id,app_name,app_date)
SELECT
p.app_id,
p.app_name,
p.app_date FROM app p
WHERE p.app_id=2422
ON CONFLICT (app_id) DO
UPDATE SET app_date = excluded.app_date ;
O/P : Query returned successfully: one row affected, 5 msec execution time.
-- For foreign table concept
-- foreign_app is a foreign table and app is a normal table
INSERT INTO foreign_app
(app_id,app_name,app_date)
SELECT
p.app_id,
p.app_name,
p.app_date FROM app p
WHERE p.app_id=2422
ON CONFLICT (app_id) DO
UPDATE SET app_date = excluded.app_date ;
O/P : ERROR: there is no unique or exclusion constraint matching the ON CONFLICT specification
Can anyone explain why this is happening?
There are no constraints on foreign tables, because PostgreSQL cannot enforce data integrity on the foreign server – that is done by constraints defined on the foreign server.
To achieve what you want to do, you'll have to stick with the “traditional” way of doing this (e.g. this code sample).
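For reference, a sketch of that "traditional" upsert against the foreign table (assuming app_id is the logical key; it is not atomic, so wrap it in a transaction and be aware of race conditions):
UPDATE foreign_app f
SET    app_date = p.app_date
FROM   app p
WHERE  p.app_id = 2422
AND    f.app_id = p.app_id;

INSERT INTO foreign_app (app_id, app_name, app_date)
SELECT p.app_id, p.app_name, p.app_date
FROM   app p
WHERE  p.app_id = 2422
AND    NOT EXISTS (SELECT 1 FROM foreign_app f WHERE f.app_id = p.app_id);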
I know this is an old question, but in some cases there is a way to do it with ROW_NUMBER OVER (PARTITION). In my case, my first take was to try ON CONFLICT...DO UPDATE, but that doesn't work on foreign tables (as stated above; hence my finding this question). My problem was very specific, in that I had a foreign table (f_zips) to be populated with the best zip code (postal code) information possible. I also had a local table, postcodes, with very good data and another local table, zips, with lower-quality zip code information but much more of it. For every record in postcodes, there is a corresponding record in zips, but the postal codes may not match. I wanted f_zips to hold the best data.
I solved this with a union, with a value of ind = 0 as the indicator that a record came from the better data set. A value of ind = 1 indicates lesser-quality data. Then I used row_number() over a partition to get the answer (where get_valid_zip5() is a local function that returns either a five-digit zip code or a null value):
insert into f_zips (recnum, postcode)
select s2.recnum, s2.zip5 from (
select s1.recnum, s1.zip5, s1.ind, row_number()
over (partition by recnum order by s1.ind) as rn from (
select recnum, get_valid_zip5(postcode) as zip5, 0 as ind
from postcodes
where get_valid_zip5(postcode) is not null
union
select recnum, get_valid_zip5(zip9) as zip5, 1 as ind
from zips
where get_valid_zip5(zip9) is not null
order by 1, 3) s1
) s2 where s2.rn = 1
;
I haven't run any performance tests, but for me this runs in cron and doesn't directly affect the users.
Verified on more than 900,000 records (SQL formatting omitted for brevity):
/* yes, the preferred data was entered when it existed in both tables */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 join zips t2 on t1.recnum = t2.recnum where t1.postcode is not null and t2.zip9 is not null and t2.zip9 not in ('0') and length(t1.postcode)=5 and length(t2.zip9)=5 and t1.postcode <> t2.zip9 order by 1 limit 5;
recnum | postcode | zip9
----------+----------+-------
12022783 | 98409 | 98984
12022965 | 98226 | 98225
12023113 | 98023 | 98003
select * from f_zips where recnum in (12022783, 12022965, 12023113) order by 1;
recnum | postcode
----------+----------
12022783 | 98409
12022965 | 98226
12023113 | 98023
/* yes, entries came from the less-preferred dataset when they didn't exist in the better one */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 right join zips t2 on t1.recnum = t2.recnum where t1.postcode is null and t2.zip9 is not null and t2.zip9 not in ('0') and length(t2.zip9)= 5 order by 1 limit 3;
recnum | postcode | zip9
----------+----------+-------
12021451 | | 98370
12022341 | | 98501
12022695 | | 98597
select * from f_zips where recnum in (12021451, 12022341, 12022695) order by 1;
recnum | postcode
----------+----------
12021451 | 98370
12022341 | 98501
12022695 | 98597
/* yes, entries came from the preferred dataset when the less-preferred one had invalid values */
select t1.recnum, t1.postcode, t2.zip9 from postcodes t1 left join zips t2 on t1.recnum = t2.recnum where t1.postcode is not null and t2.zip9 is null order by 1 limit 3;
recnum | postcode | zip9
----------+----------+------
12393585 | 98118 |
12393757 | 98101 |
12393835 | 98101 |
select * from f_zips where recnum in (12393585, 12393757, 12393835) order by 1;
recnum | postcode
----------+----------
12393585 | 98118
12393757 | 98101
12393835 | 98101

Keep nulls with two IN()

I'm refactoring very old code. Currently, PHP generates a separate select for every value. Say loc contains 1,2 and data contains a,b, it generates
select val from tablename where loc_id=1 and data_id=a;
select val from tablename where loc_id=1 and data_id=b;
select val from tablename where loc_id=2 and data_id=a;
select val from tablename where loc_id=2 and data_id=b;
...etc., which all return either a single value or nothing. That meant I always had n(loc_id)*n(data_id) results, including nulls, which is necessary for subsequent processing. Knowing the order, this was used to generate an HTML table. Both data_id and loc_id can in theory scale up to a couple of thousand (which is obviously not great in a table, but that's another concern).
+----------+-----------+-----------+
|          | data_id 1 | data_id 2 |
+----------+-----------+-----------+
| loc_id 1 | -         | 999.99    |
+----------+-----------+-----------+
| loc_id 2 | 888.88    | -         |
+----------+-----------+-----------+
To speed things up, I was looking at replacing this with a single query:
select val from tablename where loc_id in (1,2) and data_id in (a,b) order by loc_id asc, data_id asc;
to get a result like (below) and iterate to build my table.
Rownum VAL
------- --------
1 null
2 999.99
3 777.77
4 null
Unfortunately that approach drops the nulls from the resultset so I end up with
Rownum VAL
------- --------
1 999.99
2 777.77
Note that it is possible that neither data_id nor loc_id has any match, in which case I would still need a null, null row.
So I don't know which value matches which. I could match results to the expected loc_id/data_id combination in PHP if I also select loc_id and data_id... but that's getting messy.
Still a novice in SQL in general and that's absolutely the first time I work on PostgreSQL so hopefully that's not too obvious... As I post this I'm looking at two ways to solve this: any in array[] and joins. Will update if anything new is found.
tl;dr question
How do I do a where loc_id in (1,2) and data_id in (a,b) and keep the nulls so that I always get n(loc)*n(data) results?
You can achieve that in a single query with two steps:
Generate a matrix of all desired rows in the output.
LEFT [OUTER] JOIN to actual rows.
You get at least one row for every cell in your table.
If (loc_id, data_id) is unique, you get exactly one row.
SELECT t.val
FROM (VALUES (1), (2)) AS l(loc_id)
CROSS JOIN (VALUES ('a'), ('b')) AS d(data_id) -- generate total grid of rows
LEFT JOIN tablename t USING (loc_id, data_id) -- attach matching rows (if any)
ORDER BY l.loc_id, d.data_id;
Works for any number of columns with any number of values.
For your simple case:
SELECT t.val
FROM (
VALUES
(1, 'a'), (1, 'b')
, (2, 'a'), (2, 'b')
) AS ld (loc_id, data_id) -- total grid of rows
LEFT JOIN tablename t USING (loc_id, data_id) -- attach matching rows (if any)
ORDER BY ld.loc_id, ld.data_id;
where (loc_id in (1,2) or loc_id is null)
and (data_id in (a,b) or data_id is null)
Select the fields you use for filtering, so you know where the values came from:
select loc,data,val from tablename where loc in (1,2) and data in (a,b);
You won't get nulls this way either, but it's not a problem anymore. You know which fields are missing, and you know those are nulls.
