PostgreSQL - Clean way to insert records if they don't exist, update if they do

Here's my situation. I have a table with a bunch of URLs and crawl
dates associated with them. When my program processes a URL, I want
to INSERT a new row with a crawl date. If the URL already exists, I
want to update the crawl date to the current datetime. With MS SQL or
Oracle I'd probably use a MERGE command for this. With MySQL I'd
probably use the ON DUPLICATE KEY UPDATE syntax.
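For reference, the MySQL version would be something like this sketch (column names are hypothetical):
INSERT INTO pages (url, html, last_crawled)
VALUES ('http://www.foo.com', '<html>...</html>', NOW())
ON DUPLICATE KEY UPDATE html = VALUES(html), last_crawled = NOW();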
I could do multiple queries in my program, which may or may not be
thread-safe. I could write a SQL function with various IF...ELSE
logic. However, for the sake of trying out Postgres features I've
never used before, I'm thinking about creating an INSERT rule -
something like this:
CREATE RULE Pages_Upsert AS ON INSERT TO Pages
WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
DO INSTEAD
UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;
This seems to actually work great. It probably loses some points from
a "code readability" standpoint, as someone looking at my code for
the first time would have to magically know about this rule, but I
guess that could be solved with good code comments and
documentation.
Are there any other drawbacks to this idea, or maybe a "your idea
sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if
that matters.
UPDATE: Query plan since someone wanted it :)
"Insert (cost=2.79..2.81 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..0.01 rows=1 width=0)"
" One-Time Filter: ($0 IS NOT TRUE)"
""
"Update (cost=2.79..5.46 rows=1 width=111)"
" InitPlan 1 (returns $0)"
" -> Seq Scan on pages p (cost=0.00..2.79 rows=1 width=0)"
" Filter: ('http://www.foo.com'::text = lower((url)::text))"
" -> Result (cost=0.00..2.67 rows=1 width=111)"
" One-Time Filter: $0"
" -> Seq Scan on pages (cost=0.00..2.66 rows=1 width=111)"
" Filter: ((url)::text = 'http://www.foo.com'::text)"

OK, I managed to create a test case. The result is that the update part is always executed, even on a fresh insert. COPY seems to bypass the rule system.
[For clarity I have put this into a separate reply]
DROP TABLE pages CASCADE;
CREATE TABLE pages
( url VARCHAR NOT NULL PRIMARY KEY
, html VARCHAR
, last TIMESTAMP
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE RULE Pages_Upsert AS ON INSERT TO pages
WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
DO INSTEAD (
UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
);
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00'::timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00'::timestamp );
INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;
COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX stdin 2000-09-18 23:30:00
\.
SELECT * FROM pages;
The result:
url | html | last
-------------------------------+------------+----------------------------
www.example.com://page1 | meuk1 | 2001-09-18 23:30:00
www.example.com://page2 | meuk2 | 2011-09-18 23:48:30.775373
www.example.com://page3 | meuk3 | 2011-09-18 23:48:30.783758
www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
www.example.com://pageX | stdin | 2000-09-18 23:30:00
(7 rows)
UPDATE: Just to prove it can be done:
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);
CREATE RULE Pages_Upsert AS ON INSERT TO vpages
DO INSTEAD (
UPDATE pages p0
SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
;
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
);
CREATE RULE Pages_Indate AS ON UPDATE TO vpages
DO INSTEAD (
INSERT INTO pages (url,html,last)
SELECT NEW.url, NEW.html, NEW.last
WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
;
UPDATE pages p0
SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
;
);
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00'::timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00'::timestamp );
INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;
-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX stdin 2000-09-18 23:30:00
-- \.
SELECT * FROM vpages;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
url | html | last
-------------------------------+------------+---------------------
www.example.com://page2 | meuk2 | 2002-09-18 23:30:00
www.example.com://page3 | meuk3 | 2003-09-18 23:30:00
www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
www.example.com://page1 | meuk1 | 1991-09-18 23:30:00
(6 rows)
The view is necessary to prevent the rewrite system from going into recursion.
Construction of a DELETE rule is left as an exercise to the reader.

Some good points from someone who should know, or who is very near to someone like that ;-)
What are PostgreSQL RULEs good for?
Short story:
Do the rules work well with SERIAL and BIGSERIAL ?
Do the rules work well with the RETURNING clauses of INSERT and UPDATE ?
Do the rules work well with stuff like random()?
All these things boil down to the fact that the rule system is not row-driven but transforms your statements in ways you would never imagine.
Do yourself and your teammates a favour and stop using rules for things like that.
Edit: Your problem is well discussed in the PostgreSQL community. Search keywords are: MERGE, UPSERT.

I don't know if this gets too subjective, but here's what I think about your solution: it's all about semantics. When I do an INSERT, I expect an INSERT, not some fancy logic that maybe does an INSERT and maybe doesn't. That's what functions are for.
First I'd try checking for the URL in your program and then choosing whether to insert or update. If that turned out to be too slow, I'd use a function. If you name it something like insert_or_update_url, you automatically get some documentation for free. The rewrite rule requires the reader to have implicit knowledge, and I generally try to avoid that.
On the plus side: if someone copies the data but forgets rules and functions, your solution might break silently (though that may depend on other constraints), whereas a missing function goes down screaming. Don't get me wrong, I think your solution is very creative and smart. It's just a bit too obscure for my taste.

There's an example of implementing upsert/merge using a simple function in the Postgres documentation.
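Adapted to the pages table from the test case above, that documentation example looks roughly like this (a sketch, assuming the url/html/last columns):
CREATE FUNCTION merge_page(p_url VARCHAR, p_html VARCHAR) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- first try to update an existing row
        UPDATE pages SET html = p_html, last = now() WHERE url = p_url;
        IF found THEN
            RETURN;
        END IF;
        -- not there, so try to insert;
        -- if someone else inserts the same url concurrently,
        -- we could get a unique-key failure
        BEGIN
            INSERT INTO pages(url, html, last) VALUES (p_url, p_html, now());
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing, and loop to try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
Calling SELECT merge_page('www.example.com://page1', 'meuk1'); then upserts safely even under concurrent inserts.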
Never use rules — they're evil.

You cannot refer to tables other than OLD and NEW in the rule qualification.
You should instead do this in the rule body.
This is all because a rule is just a way to inform the rewrite system about what transformations it should and should not perform. Rules are not triggers executing for every row; they give the query planner a fine massage and ask it nicely to rewrite the plan.
From the docs:
What is a rule qualification? It is a restriction that tells when the actions of the rule should be done and when not. This qualification can only reference the pseudorelations NEW and/or OLD, which basically represent the relation that was given as object (but with a special meaning).

Related

STUFF function equivalent in Netezza

What's the STUFF equivalent in Netezza? I am trying to do row-to-column concatenation. I tried GROUP_CONCAT() / STRING_AGG as mentioned in the other question, but I am not able to use either of them.
I've never used STUFF before, but looking at the description, there is no native Netezza function that is its equivalent. However, it seems a simple thing to recreate with SUBSTR. Since I don't really use STUFF, take this with a grain of salt, but give it a try.
If we take an example of using STUFF:
SELECT STUFF('abcdef', 2, 3, 'ijklmn');
---------
aijklmnef
(1 row(s) affected)
If I dummy up the values and STUFF parameters into a subselect for demonstration purposes, here's how you can do it with SUBSTR.
SELECT
  substr(orig_string, 0, start_pos)
  || replace_string ||
  substr(orig_string, start_pos + del_length, LENGTH(orig_string)) new_string,
  orig_string,
  replace_string
FROM
  ( SELECT
      'abcdef' orig_string,
      2 start_pos,
      3 del_length,
      'ijklmn' replace_string ) foo
;
NEW_STRING | ORIG_STRING | REPLACE_STRING
------------+-------------+----------------
aijklmnef | abcdef | ijklmn
(1 row)
You can use GROUP_CONCAT to achieve results similar to STUFF. The GROUP_CONCAT function must be installed from the nzlua examples directory on the Netezza Linux environment.
export NZ_PASSWORD='YOURADMINPASSWORD'
cd /nz/extensions/nz/nzlua/examples
../bin/nzl group_concat.nzl
After this, group_concat is available in the Netezza SYSTEM database. As ADMIN you can execute these commands to make it easily available to all users in a database:
grant execute on system..group_concat(varchar(128)) to public; -- once
create synonym group_concat for system..group_concat; -- in every user database
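Once installed, usage looks something like this (a sketch; the employees table and columns are hypothetical):
SELECT dept_id, group_concat(last_name)
FROM employees
GROUP BY dept_id;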

MSSQL Data type conversion

I have a pair of databases (one MSSQL and one Oracle), run by different teams. Some data is now being synchronized regularly by a stored procedure on the MSSQL side. This stored procedure calls a very large
MERGE [mssqltable].[Mytable] as t
USING THEORACLETABLE.BLA as s
ON t.[R_ID] = s.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
Field Brokenfield was a numeric one until today, and could take the values NULL, 0, 1, ..., 24.
Now the Oracle team has introduced a breaking change for some reason: they changed the type of the column to string, and it now holds the values NULL, "", "ALFA", "BRAVO", ... Of course, the sync broke.
What is the easiest way to fix the sync here? I (MSSQL team lead, frontend expert, but not so much in databases) would usually hand this to one of our database experts, but all of them are ill right now, and the fix must go online today...
I thought of a stored procedure like CONVERT_BROKENFIELD_INT_TO_STRING or so, based on some switch/case logic, which could be called in that MERGE statement, but I'm not sure how to do that.
Edit/Clarification:
What I need is a way to make a chunk of SQL code (a stored procedure), taking an input of "ALFA" and returning 1, "BRAVO" -> 2, etc., which can be reused, to avoid writing huge IFs in more than one place.
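For what it's worth, a minimal sketch of such a reusable mapping function in T-SQL could look like this (the function name and the mapping itself are assumptions; extend the CASE with the real value list):
CREATE FUNCTION dbo.BrokenFieldToInt (@value varchar(32))
RETURNS int
AS
BEGIN
    -- map the new string codes back to the old numeric codes;
    -- unknown values (and NULL) fall through to NULL
    RETURN CASE @value
        WHEN ''      THEN 0   -- assumption: empty string maps to 0
        WHEN 'ALFA'  THEN 1
        WHEN 'BRAVO' THEN 2
        -- ... extend with the remaining codes
        ELSE NULL
    END;
END;
It could then be called in the MERGE as [Brokenfield] = dbo.BrokenFieldToInt(s.[BrokenField]).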
If you cannot simplify the logic for correct values the way @RichardHansell described, you can create a crosswalk table from BrokenField to the correct values. Then you can use a common table expression or subquery with a left join to that crosswalk in the merge.
create table dbo.BrokenField_Crosswalk (
BrokenField varchar(32) not null primary key
, CorrectedValue int
);
insert into dbo.BrokenField_Crosswalk (BrokenField,CorrectedValue) values
('ALFA', 1)
, ('ALPHA', 1)
, ('BRAVO', 2)
...
go
And your code for the merge would look something like this:
;with cte as (
select o.R_ID
, o.Field1
, BrokenField = cast(isnull(c.CorrectedValue,o.BrokenField) as int)
....
from oracle_table.bla as o
left join dbo.BrokenField_Crosswalk as c
on c.BrokenField = o.BrokenField
)
merge into [mssqltable].[Mytable] t
using cte as s
on t.[R_ID] = s.[R_ID]
when matched
then update set
[Field1] = s.[Field1]
, ...
, [Brokenfield] = s.[BrokenField]
when not matched by target
then
If they are using names with a letter at the start that goes in a sequence:
A = 1
B = 2
C = 3
etc.
Then you could do something like this:
MERGE [mssqltable].[Mytable] as t
USING THEORACLETABLE.BLA as s
ON ASCII(LEFT(s.[R_ID], 1)) - ASCII('A') + 1 = t.[R_ID]
WHEN MATCHED THEN UPDATE SET [Field1] = s.[Field1], ..., [Brokenfield] = s.[BrokenField]
WHEN NOT MATCHED BY TARGET THEN
... another big statement
Edit: but actually I re-read your question, and you are talking about [Brokenfield] being the problem column, so my solution wouldn't work.
I don't really understand now, as it seems as though the MERGE statement is updating the Oracle table with numbers, so surely you need the mapping to work the other way, i.e. 1 -> ALFA, 2 -> BRAVO, etc.?

Attempting to run a while loop in my select statement under cases in SQL Server 2012

The Data
Let us say I have a field in SQL Server that consists of multi-line Information; each entry consists of i topics, and each topic consists of m points of information. Topics are prefaced with 'i.' and info points with a dash. It looks something like:
________________________________________________
|Number | Information
|===============================================
|1 | 1. Topic 1.1
| | -Info 1.1.1
| | - ... [more info]
| | 2. Topic 1.2
| | -Info 1.2.1
| | - ...[more info]
| | ... [more topics]
|_______|_____________________________
|2 | 1. Topic 2.1
|....and so on
The Current System
What I am doing with this information is parsing out each topic into its own column, then unpivoting those columns and searching for topics that contain a given keyword @keyword.
Currently the code reads something like:
Select
Number
,Case When Information LIKE '%1. %2. %'
Then substring (Information, charindex('1.',Information),
charindex('2.', Information) -(charindex('1.',Information)+2) )
Else Information
End as [Topic1]
,Case When Information LIKE '%2. %3. %'
Then substring (Information, charindex('2.',Information),
charindex('3.', Information) -(charindex('2.',Information)+2) )
Else 'N/A'
End as [Topic2]
...repeat 2nd case for each set of numbers up to '%20. %21. %'
The only reason the first one is different is that if it doesn't match the pattern, I want to grab the whole field so that I don't miss anything. I then unpivot the Topic fields I just created into a general [Topic] field, and then use a WHERE [Topic] LIKE '%' + @keyword + '%' to pull out any particular topics and their associated case numbers to output as my final table. The cases can have anywhere from 1 to 40+ topics attached, with 1-7 attached info fields per topic.
The Desired Modification
Notice: To make the code easier to read, I will not be writing my substring code in proper syntax, instead opting to write substring(Information, ci(@Iter), ci(@Iter+1) - ci(@Iter)) to denote the substring running from the position given by '(iter).' to the position given by '(iter+1).'
What I would like to do is to perform the following:
Declare @Iter smallint
Declare @Result varchar(max)
Select
Number
, Set @Iter=1
Set @Result = ' '
Case When Information LIKE '%'+@keyword+'%' --keyword chosen at front end
Then While @Iter < @n --@n set by the user from front end
Begin
Case When Information LIKE '%' + cast(@Iter as varchar(5))
+ '. %'+cast((@Iter+1) as varchar(5))+'. %'
and substring(Information,ci(@Iter), ci(@Iter+1)-ci(@Iter) )
LIKE '%'+@keyword+'%'
Then Set @Result = @Result +substring(Information,ci(@Iter),
ci(@Iter+1)-ci(@Iter) )
Else Set @Result = @Result end
Set @Iter = @Iter +1
End
Else ' ' end [Result]
The Explanation
In case what I want isn't clear, I'll run through what I'm trying to accomplish
I want to output a list of case numbers that include Topics that include the keyword.
For each case in the list I want to output only those topics that include the keyword.
I want to allow the end user of the report to choose how many Topics in each case they'll search.
I don't want to have to create a table with a column for each Topic when I can't know how many the user will want to create.
Due to these considerations it feels like a loop would be the best option, but there are problems in trying to accomplish that.
The Problem
SQL Server won't allow me to use a loop in my SELECT statement: "Incorrect syntax near 'While'".
The place where the information comes from prohibits normalization of the information in the table I'm searching.
Even if it didn't, I am barred from creating my own permanent tables at work, so I can't normalize the data as it comes in.
I am also not allowed to write my own stored procedures.
If there is any way (for example, through a CTE) to implement these changes, I'm open to hearing it! I'm mostly looking for ways to make the code less daunting (20 cases to produce 20 fields in my current CTE looks scary, and it then needs 3 more CTEs just to unpack properly [unpivot, removal of certain cases meeting certain conditions, combination into a workable output table]).
Thanks in advance for reading this and helping!
I think you're working too hard.
If all you need are topic names and numbers, isn't it easier to split the Information column by newlines, and then collect all lines that start with a number rather than a dash? By then you will have a list of strings that look like:
Topic 1.1
Topic 2.1
And then it's easy to just match the lines against the keyword?
Something like this untested SQL:
select SUBSTRING(s.Value, 1, PATINDEX('% %', s.Value) - 1) AS topicId
, SUBSTRING(s.Value, PATINDEX('% %', s.Value), LEN(s.Value)) AS topicText
from [table that would make Codd cry] t
cross apply STRING_SPLIT(t.Information, CHAR(13)) s
where s.Value LIKE '[0-9]%' -- starts with a number
AND s.Value LIKE '%' + @keyword + '%' -- matches the keyword
Not sure if you can create functions or whether you have STRING_SPLIT available in your SQL Server version (it only arrived in SQL Server 2016), but if you don't, there are string-splitting CTEs you can find on the net to do the job for you.
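For example, a minimal recursive-CTE splitter might look like the sketch below (assumptions: a dbo.Cases(Number, Information) table, CHAR(13) line endings, and a @keyword variable; adjust for CHAR(10) or CHAR(13)+CHAR(10) as needed):
;WITH split AS (
    -- anchor: peel off the first line; append CHAR(13) so every segment ends with one
    SELECT Number,
           CONVERT(varchar(max), LEFT(Information + CHAR(13),
               CHARINDEX(CHAR(13), Information + CHAR(13)) - 1)) AS line,
           CONVERT(varchar(max), STUFF(Information + CHAR(13), 1,
               CHARINDEX(CHAR(13), Information + CHAR(13)), '')) AS rest
    FROM dbo.Cases
    UNION ALL
    -- recursive step: peel off the next line from the remainder
    SELECT Number,
           CONVERT(varchar(max), LEFT(rest, CHARINDEX(CHAR(13), rest) - 1)),
           CONVERT(varchar(max), STUFF(rest, 1, CHARINDEX(CHAR(13), rest), ''))
    FROM split
    WHERE rest <> ''
)
SELECT Number, LTRIM(line) AS line
FROM split
WHERE LTRIM(line) LIKE '[0-9]%'      -- keep topic lines only
  AND line LIKE '%' + @keyword + '%' -- that match the keyword
OPTION (MAXRECURSION 0);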

Assistance with a 4-table join operation

In an attempt to be as clear as possible: I have 4 tables in my database, as follows.
Join_Contrato_Medidor
ID_Union (identity)
ID_Contrato
ID_Medidor
Omitido (filter?)
Promedios
ID_Contrato
ID_Medidor
ID_Marchamo
{Info I want}
Medidores
ID_Medidor
ID_Dispensario (filter ?)
Marchamo
ID_Marchamo
My current SQL Statement...
SELECT {Promedios.LI_1, Promedios.LF_1, Promedios.Total_1, Promedios.Qva_1, ...}
FROM (((
Join_Contrato_Medidor LEFT OUTER JOIN
Promedios ON Join_Contrato_Medidor.ID_Contrato = Promedios.ID_Contrato)
LEFT OUTER JOIN
Medidores ON Join_Contrato_Medidor.ID_Medidor = Medidores.ID_Medidor)
LEFT OUTER JOIN
Marchamo ON Promedios.ID_Marchamo = Marchamo.ID_Marchamo)
WHERE (Join_Contrato_Medidor.ID_Contrato = ?) AND (Medidores.ID_Dispensario = ?) AND (Join_Contrato_Medidor.Omitido <> TRUE)
The output I'm obtaining:
Information Columns | Omitido | ID_Union
Info | False | 806
Info | False | 806
Info | False | 806
Info | False | 806
*I wanted to include an image but I cannot do so until I have more reputation :( *
I have those 4 tables that I am joining right now. I am currently getting all the desired columns in the query output, but I would only like to get the records where Join_Contrato_Medidor.Omitido <> TRUE, instead of getting ALL records that match the ID_Contrato and ID_Dispensario conditions.
As a sample, I am outputting ID_Union, which is the identity field for Join_Contrato_Medidor. It is marking all the records with a single ID_Union, which happens to be the only one of the 4 records that has Omitido <> TRUE. Also, the last 3 records have their Omitido field set to TRUE in the database; nevertheless, it shows false in the query result.
If the question is unclear, please ask me for clarification.
Thanks in advance
After working on other things until I had to face this issue again, I am back checking it. Your comment led me to try whether switching the order of the tables would do the job, and it did! Thank you very much.
I started by querying the Promedios table first and then performing the rest of the joins. This gave me access to the exact information I wanted. Moreover, I built all my subsequent queries following this order, which led to shorter, better queries.
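For reference, the reordered query looked roughly like this (a sketch: I am assuming the Promedios match should use both ID_Contrato and ID_Medidor, and the column list is abbreviated):
SELECT Promedios.LI_1, Promedios.LF_1, Promedios.Total_1, Promedios.Qva_1
FROM ((Promedios
INNER JOIN Join_Contrato_Medidor
ON Join_Contrato_Medidor.ID_Contrato = Promedios.ID_Contrato
AND Join_Contrato_Medidor.ID_Medidor = Promedios.ID_Medidor)
INNER JOIN Medidores ON Medidores.ID_Medidor = Promedios.ID_Medidor)
LEFT OUTER JOIN Marchamo ON Marchamo.ID_Marchamo = Promedios.ID_Marchamo
WHERE (Join_Contrato_Medidor.ID_Contrato = ?) AND (Medidores.ID_Dispensario = ?) AND (Join_Contrato_Medidor.Omitido <> TRUE)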

How to get last access/modification date of a PostgreSQL database?

On a development server I'd like to remove unused databases. To do that, I need to know whether a database is still used by someone or not.
Is there a way to get the last access or modification date of a given database, schema, or table?
You can do it by checking the last modification time of the table's file.
In PostgreSQL, every table corresponds to one or more OS files, like this:
select relfilenode from pg_class where relname = 'test';
The relfilenode is the file name of table "test". Then you can find the file in the database's directory.
In my test environment:
cd /data/pgdata/base/18976
ls -l -t | head
The last command lists all files ordered by last modification time.
There is no built-in way to do this - and all the approaches that check the file mtime described in other answers here are wrong. The only reliable option is to add triggers to every table that record a change to a single change-history table, which is horribly inefficient and can't be done retroactively.
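For completeness, a sketch of that trigger approach (hypothetical table and function names; a statement-level trigger keeps the overhead lower than a per-row one, though it is still far from free):
CREATE TABLE change_history (
    table_name  text        NOT NULL,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO change_history(table_name) VALUES (TG_TABLE_NAME);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- repeat for every table you want to watch
CREATE TRIGGER pages_log_change
AFTER INSERT OR UPDATE OR DELETE ON pages
FOR EACH STATEMENT EXECUTE PROCEDURE log_change();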
If you only care about "database used" vs "database not used" you can potentially collect this information from the CSV-format database log files. Detecting "modified" vs "not modified" is a lot harder; consider SELECT writes_to_some_table(...).
If you don't need to detect old activity, you can use pg_stat_database, which records activity since the last stats reset, e.g.:
-[ RECORD 6 ]--+------------------------------
datid | 51160
datname | regress
numbackends | 0
xact_commit | 54224
xact_rollback | 157
blks_read | 2591
blks_hit | 1592931
tup_returned | 26658392
tup_fetched | 327541
tup_inserted | 1664
tup_updated | 1371
tup_deleted | 246
conflicts | 0
temp_files | 0
temp_bytes | 0
deadlocks | 0
blk_read_time | 0
blk_write_time | 0
stats_reset | 2013-12-13 18:51:26.650521+08
so I can see that there has been activity on this DB since the last stats reset. However, I don't know anything about what happened before the stats reset, so if I had a DB showing zero activity since a stats reset half an hour ago, I'd know nothing useful.
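For reference, the record above is just the expanded-display output of a query like this (swap in your own database name):
SELECT *
FROM pg_stat_database
WHERE datname = 'regress';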
PostgreSQL 9.5 lets us track the time of the last committed modification.
1. Check whether commit timestamp tracking is on or off using the following query:
show track_commit_timestamp;
2. If it returns "on", go to step 3; otherwise edit postgresql.conf:
cd /etc/postgresql/9.5/main/
vi postgresql.conf
Change track_commit_timestamp = off to track_commit_timestamp = on, then restart PostgreSQL and repeat step 1.
3. Use the following queries to read the last commit timestamp:
SELECT pg_xact_commit_timestamp(xmin), * FROM YOUR_TABLE_NAME;
SELECT pg_xact_commit_timestamp(xmin), * FROM YOUR_TABLE_NAME WHERE COLUMN_NAME = VALUE;
My way to get the modification date of my tables:
Python Function
CREATE OR REPLACE FUNCTION py_get_file_modification_timestamp(afilename text)
RETURNS timestamp without time zone AS
$BODY$
import os
import datetime
return datetime.datetime.fromtimestamp(os.path.getmtime(afilename))
$BODY$
LANGUAGE plpythonu VOLATILE
COST 100;
SQL Query
SELECT
schemaname,
tablename,
py_get_file_modification_timestamp('*postgresql_data_dir*/*tablespace_folder*/'||relfilenode)
FROM
pg_class
INNER JOIN
pg_catalog.pg_tables ON (tablename = relname)
WHERE
schemaname = 'public'
I'm not sure if things like VACUUM can mess with this approach, but in my tests it's a pretty accurate way to find tables that are no longer used, at least for INSERT/UPDATE operations.
I guess you should activate some logging options. You can find information about logging in the PostgreSQL documentation.
