Cassandra DB: Inserting data into multiple tables

I have been reading the Cassandra documentation on DataStax as well as the Apache docs. So far I have learned that we cannot create more than one index (one primary, one secondary index) on a table, and that there should be an individual table for each query.
Comparing this to an SQL table, for example one that we want to query on 4 different fields: in Cassandra we should split this table into 4 tables, right? (Please correct me if I am wrong.)
On these 4 tables I can have the indexes and make the queries.
My question is: how can we insert data into these 4 tables? Should I make 4 consecutive insert requests?
My priority is to avoid secondary indexes.

To keep data in sync across denormalised tables, you need to use CQL BATCH statements.
For example, if you had these tables to maintain:
movies
movies_by_actor
movies_by_genre
then you would group the updates in a CQL BATCH like this:
BEGIN BATCH
INSERT INTO movies (...) VALUES (...);
INSERT INTO movies_by_actor (...) VALUES (...);
INSERT INTO movies_by_genre (...) VALUES (...);
APPLY BATCH;
Note that it is also possible to do UPDATE and DELETE statements as well as conditional writes in a batch.
The above example just illustrates the syntax in cqlsh and is not how you would do it in an application. Here is an example BatchStatement using the Java driver:
SimpleStatement insertMovies =
    SimpleStatement.newInstance(
        "INSERT INTO movies (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByActor =
    SimpleStatement.newInstance(
        "INSERT INTO movies_by_actor (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByGenre =
    SimpleStatement.newInstance(
        "INSERT INTO movies_by_genre (...) VALUES (?, ...)", <some_values>);

BatchStatement batch =
    BatchStatement.builder(DefaultBatchType.LOGGED)
        .addStatement(insertMovies)
        .addStatement(insertMoviesByActor)
        .addStatement(insertMoviesByGenre)
        .build();
For details, see Java driver Batch statements. Cheers!

Cassandra supports secondary indexes and SSTable-Attached Secondary Indexes (SASI). Storage-Attached Indexes (SAI) have been donated to the project but not yet accepted.
You need to create your tables such that you can get all your required data from a single table with a single query, which looks something like this:
SELECT * from keyspace.table_name where key = 'ABC';
So what does that mean for you as a designer? You need to identify all your queries up front and define your data model (tables) on the basis of those queries. So if you think you will need 4 tables to satisfy your queries, then you are right.
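For illustration, here is a hypothetical sketch (column names are assumptions, borrowing the movies example from the answer above) of two query-specific tables, each with a partition key that matches its query's WHERE clause:
CREATE TABLE movies_by_actor (
    actor text,
    release_year int,
    movie_id uuid,
    title text,
    PRIMARY KEY ((actor), release_year, movie_id)
) WITH CLUSTERING ORDER BY (release_year DESC, movie_id ASC);

CREATE TABLE movies_by_genre (
    genre text,
    release_year int,
    movie_id uuid,
    title text,
    PRIMARY KEY ((genre), release_year, movie_id)
);

-- each table serves exactly one query, e.g.
SELECT * FROM movies_by_actor WHERE actor = 'Tom Hanks';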
Since all 4 tables will have to stay in sync if they represent the same data, the best way to write to them is a batch:
BEGIN BATCH
DML_statement1 ;
DML_statement2 ;
DML_statement3 ;
DML_statement4 ;
APPLY BATCH ;
A batch does not guarantee that all statements either succeed or are rolled back. It only informs the client when the group of statements has failed, so the client should retry applying them.
It is better to avoid secondary indexes if you can because of the performance issues that come with them. A general rule of thumb is to only index a column with low cardinality (few distinct values).

Related

Use Sybase triggers to write dynamic statement using all old and new values for creating your own replication transaction statement log?

PROBLEM SUMMARY
I have to write I/U/D-statement-generating triggers for a bucardo/SymmetricDS-inspired, homemade bidirectional replication system between groups of Sybase ADS and PostgreSQL 11 nodes. The system uses BEFORE triggers on any PostgreSQL or Sybase DB to create Insert/Update/Delete commands based on the command entered against a replicated source table. For example, an INSERT INTO PERSON (first_name,last_name,gender,age,ethnicity) VALUES ('John','Doe','M',42,'C') should be turned into a corresponding INSERT statement; an UPDATE should use the OLD and NEW values to dynamically build an UPDATE statement; and the OLD values should be used to build a DELETE command. All of these are then run, command by command, on a destination at some interval.
I know this is difficult and no one does this, but it is for a job, I have no other options, and I can't object or offer a different solution. I have no teammates or other human resources to help outside of SO and something like Codementor, which was not so helpful. My idea/strategy is to copy parts of bucardo/SymmetricDS by capturing OLD and NEW values to generate a statement/command to run on the destination. Right now I am snapshotting the whole table to a CSV rather than working command by command, but going command by command and looping through a table that generates and saves the commands would make the job much easier.
One big issue is that the tables come from Sybase ADS with a mixed key/index structure (many tables have NO PK), and that structure is mirrored in PostgreSQL, so I am trying to write PK-less statements, i.e. all-column commands, to work around the no-PK tables. Also, only certain columns of certain tables will be replicated, so I have a column in a metadata table into which the column names are inserted delimited by ';'; I then split that into an array and link the column names to the values for each statement in order to generate a full I/U/D command. Hopefully. I am open to other strategies, but this is a big solo project and I have gone at it many ways with much difficulty.
I mostly come from a DBA background and have some programming experience with the fundamentals, so I am mostly pseudocoding each major sequence, googling for syntax piece by piece, and adjusting as I go or when I hit a language limitation. I am thankful for any help given, as I am getting a bit desperate and discouraged.
WHAT I HAVE TRIED
I have to do this for Sybase ADS and PostgreSQL, but this question is initially about ADS since it's more challenging and older.
The goal for both platforms is to have one "log" table which tracks row changes for each of the replicated tables and ultimately generates a command dynamically. I am trying to make trigger statements like:
CREATE TRIGGER PERSON_INSERT
ON PERSON
BEFORE
INSERT
BEGIN
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, NewValues) select ID, 'INSERT','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __new;
END;
CREATE TRIGGER PERSON_UPDATE
ON PERSON
BEFORE
UPDATE
BEGIN
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, NewValues) select ID, 'UPDATE','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __new;
UPDATE Backlog SET OldValues=(select ''first_name';'last_name';'gender';'age';'ethnicity'' from __old) where SourceTableID=(select ID from __old);
END;
CREATE TRIGGER PERSON_DELETE
ON PERSON
BEFORE
DELETE
BEGIN
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, OldValues) select ID, 'DELETE','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __old;
END;
but I would like the ''first_name';'last_name';'gender';'age';'ethnicity'' part to come from another table as a value, to make it dynamic, since multiple tables will write their value and statement info to the single log table. Then it can be put into a variable and split to link to the corresponding values, so the I/U/D statements can be built and executed on the destination one at a time.
ATTEMPTED INCOMPLETE SAMPLE TRIGGER CODE
CREATE TRIGGER PERSON_INSERT
ON PERSON
BEFORE
INSERT
BEGIN
--Declare #Columns string
--#Columns=select Columns from metatable where tablename='PERSON'
--String Split(#Columns,';') into array to correspond to new and old VALUES
--#NewValues=#['#Columns='+NEW.#Columns+'']
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, NewValues) select ID, 'INSERT','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __new;
END;
CREATE TRIGGER PERSON_UPDATE
ON PERSON
BEFORE
UPDATE
BEGIN
--Declare #Columns string
--#Columns=select Columns from metatable where tablename='PERSON'
--String Split(#Columns,';') into array to correspond to new and old VALUES
--#NewValues=#['#Columns='+NEW.#Columns+'']
--#OldValues=#['#Columns='+OLD.#Columns+'']
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, NewValues) select ID, 'UPDATE','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __new;
UPDATE Backlog SET OldValues=(select ''first_name';'last_name';'gender';'age';'ethnicity'' from __old) where SourceTableID=(select ID from __old);
END;
CREATE TRIGGER PERSON_DELETE
ON PERSON
BEFORE
DELETE
BEGIN
--Declare #Columns string
--#Columns=select Columns from metatable where tablename='PERSON'
--String Split(#Columns,',') into array to correspond to new and old VALUES
--#OldValues=#['#Columns='+OLD.#Columns+'']
INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, OldValues) select ID, 'DELETE','READY', NOW(),''first_name';'last_name';'gender';'age';'ethnicity'' from __old;
END;
CONCLUSION
For each row inserted, updated, or deleted, I am trying to generate a corresponding 'INSERT INTO PERSON ('+#Columns+') VALUES ('+#NewValues+')'-type statement (or an UPDATE or DELETE) in a COMMAND column of the log table. A foreach service will then run each command value ordered by create time, acting as the main replication service.
To be clear, I am trying to make my sample trigger code write all old and new values to a column dynamically, without hardcoding the columns in each trigger since it will be used for multiple tables, writing the values into a single column delimited by a comma or semicolon.
An even bigger wish or goal behind this is to find a way to save/script each I/U/D command and then be able to run the commands on subscriber servers/DBs on both the PostgreSQL and Sybase platforms, thereby making my own replication from a log.
It is a complex but solvable problem that will take time and careful planning to write. I think what you are looking for is the EXECUTE IMMEDIATE command in ADS SQL syntax. With this command you can build a dynamic statement and then execute it once construction of the SQL string is finished. Save each desired column value to a temp table by carefully constructing the statement as a string and then executing it with EXECUTE IMMEDIATE. For example:
DECLARE TableColumns Cursor ;
DECLARE FldName Char(100) ;
...
OPEN TableColumns AS SELECT *
FROM system.columns
WHERE parent = #cTableName
AND field_type < 21 //ADS_ROWVERSION
AND field_type <> 6 //ADS_BINARY
AND field_type <> 7; //ADS_IMAGE
While Fetch TableColumns DO
FldName = Trim( TableColumns.Name) ;
StrSql = 'SELECT n.[' + Trim( FldName ) + '] newVal' +
         ' INTO #myTmpTable FROM __new n' ;
After constructing the statement as a string, it can then be executed like this:
EXECUTE IMMEDIATE StrSql ;
You can pick up old and new values from the __old and __new temp tables that are always available to triggers. Insert values into the temp table #myTmpTable and then use it to update the target. Remember to drop #myTmpTable at the end.
Furthermore, I would think you can create a function in the data dictionary that can be called from each trigger on the tables you want to track, instead of writing a long trigger for each table, and cTableName can be a parameter sent to the function. That would make maintenance a little easier.
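Since the question also covers the PostgreSQL 11 side, here is a minimal, untested PL/pgSQL sketch of the same idea: one generic trigger function that reads the ';'-delimited column list from a metadata table and writes the delimited OLD/NEW values to the log table. The metatable, Backlog, and id column names are assumptions taken from the pseudocode above, not a finished implementation.
CREATE OR REPLACE FUNCTION log_row_change() RETURNS trigger AS $$
DECLARE
    cols     text[];
    col      text;
    new_vals text := '';
    old_vals text := '';
    src_id   bigint;
BEGIN
    -- column list for this table, stored as 'first_name;last_name;...' in metatable
    SELECT string_to_array(columns, ';') INTO cols
    FROM metatable
    WHERE tablename = TG_TABLE_NAME;

    FOREACH col IN ARRAY cols LOOP
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            new_vals := new_vals || coalesce(to_jsonb(NEW) ->> col, '') || ';';
        END IF;
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            old_vals := old_vals || coalesce(to_jsonb(OLD) ->> col, '') || ';';
        END IF;
    END LOOP;

    IF TG_OP = 'DELETE' THEN
        src_id := OLD.id;
    ELSE
        src_id := NEW.id;
    END IF;

    INSERT INTO Backlog (SourceTableID, TriggerType, Status, CreateTimeDate, NewValues, OldValues)
    VALUES (src_id, TG_OP, 'READY', now(), nullif(new_vals, ''), nullif(old_vals, ''));

    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    ELSE
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- attach the same function to every replicated table, e.g.:
CREATE TRIGGER person_backlog
BEFORE INSERT OR UPDATE OR DELETE ON person
FOR EACH ROW EXECUTE FUNCTION log_row_change();
The to_jsonb(NEW) ->> col trick is what avoids hardcoding column names in each trigger; only the metatable row differs per table.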

How can I insert 100k+ rows of XML into SQL Server 2012 in a fast way

I have the following scenario/requirements and I am not sure of the best way to address them as fast as possible; I am looking for guidance on which features to use, and examples of them if available.
I will receive anywhere between 10k and 100k entities (in XML format) from a web service, which I want to upsert (some rows might already exist, others might not).
Here are some of the requirements:
The source of the XML is a web service that I'm calling from C# code; actually two different methods. For one of the methods, the return schema will be something flat that I can map directly to one of my tables. For the other, it will return an XML representation that I might need to work with in C# in order to map it to flat entities for my tables. In that scenario, would it be best to do the modifications needed and then write the XML to a file to use as the source?
The returned XML can contain up to 150k entities, which may or may not exist in my tables yet, so I'm looking to upsert them. The files, when written to disk, can weigh up to 20 megabytes. I asked if they could do JSON instead of XML, but apparently that's not an option.
The SQL database is on a different server than my IIS server, so I would rather avoid having SQL Server retrieve the XML from a file; I would rather pass it from C# as a string or as a table-valued parameter.
The tables are rather simple and don't have indexes other than the PK ones.
I've never been big on XML, although it got way easier with LINQ to XML, which I was initially using to parse each record and send individual inserts, but the performance was just bad. Based on some research I've been doing, I'm thinking I could:
Do the upserts from SQL Server through MERGE statements.
Pass the whole XML as a parameter and use OPENXML as the source in the MERGE statement.
Or somehow generate a table-valued parameter in C# and pass that to SQL Server to use in the MERGE.
I read in this similar question (which didn't have access to upsert/merge) that instead of trying to upsert directly from the XML, it might be better to insert everything into a temporary table and do the merge/upsert against the temporary table.
Would this work and be considerably fast?
If anyone has had a similar scenario, can you share your thoughts/ideas about what combination of features would be best?
Thanks.
You are on the right track. I have a similar setup using XML to transfer data between an online portal and the client-server application. The rest of the setup is very similar to what you have.
The fact that your tables are not indexed is a bit of a concern if you are comparing any fields that are not PK fields, regardless of how you index the temp tables. It is important to have either one index with all of the fields used in the MERGE match clause, or an index for each of them; I find the former yields better performance. Beyond that, using an XML parameter, OPENXML, and temp tables is the way to go.
The following code has not been tested, so it may need a bit of debugging, but it will put you on the right track. A couple of notes: if all of the fields in the OPENXML WITH clause are attributes, then you can drop the last parameter (i.e. ", 2") and the field source specifiers (i.e. "@id" for the detail table). Although the data in your description is flat, in which case you will only need one table, I often need to import into linked records, so I have included a simple master-detail relationship example in the code below, just for the sake of completeness.
CREATE PROCEDURE usp_ImportFromXML (@data XML) AS
BEGIN
/*
<root>
<data>
<match_field_1>1</match_field_1>
<match_field_2>val2</match_field_2>
<data_1>val3</data_1>
<data_2>val4</data_2>
<detail_records>
<detail_data id="detailID1">
<detail_1>blah1</detail_1>
<detail_2>blah2</detail_2>
</detail_data>
<detail_data id="detailID2">
<detail_1>blah3</detail_1>
<detail_2>blah4</detail_2>
</detail_data>
</detail_records>
</data>
<data>
...
</data>
</root>
*/
DECLARE @iDoc INT
EXEC sp_xml_preparedocument @iDoc OUTPUT, @data
SELECT * INTO #temp
FROM OpenXML(@iDoc, '/root/data', 2) WITH (
match_field_1 INT,
match_field_2 VARCHAR(50),
data_1 VARCHAR(50),
data_2 VARCHAR(50)
)
SELECT * INTO #detail
FROM OpenXML(@iDoc, '/root/data/detail_data', 2) WITH (
match_field_1 INT '../../match_field_1',
match_field_2 VARCHAR(50) '../../match_field_2',
detail_id VARCHAR(50) '@id',
detail_1 VARCHAR(50),
detail_2 VARCHAR(50)
)
EXEC sp_xml_removedocument @iDoc
CREATE INDEX #IX_temp ON #temp(match_field_1, match_field_2)
CREATE INDEX #IX_detail ON #detail(match_field_1, match_field_2, detail_id)
MERGE data_table a
USING #temp ta
ON ta.match_field_1 = a.match_field_1 AND ta.match_field_2 = a.match_field_2
WHEN MATCHED THEN
UPDATE SET data_1 = ta.data_1, data_2 = ta.data_2
WHEN NOT MATCHED THEN
INSERT (match_field_1, match_field_2, data_1, data_2) VALUES (ta.match_field_1, ta.match_field_2, ta.data_1, ta.data_2);
MERGE detail_table a
USING (SELECT d.*, p._key FROM #detail d, data_table p WHERE d.match_field_1 = p.match_field_1 AND d.match_field_2 = p.match_field_2) ta
ON a.id = ta.detail_id AND a.parent_key = ta._key
WHEN MATCHED THEN
UPDATE SET detail_1 = ta.detail_1, detail_2 = ta.detail_2
WHEN NOT MATCHED THEN
INSERT (parent_key, id, detail_1, detail_2) VALUES (ta._key, ta.detail_id, ta.detail_1, ta.detail_2);
DROP TABLE #temp
DROP TABLE #detail
END
Use option (3). Process the data ready for upsert in C#. C# is made for this kind of algorithmic work; it is both the right tool and the faster one here. T-SQL is not the right tool. You do not want to use XML with T-SQL for very high-performance work because it burns CPU like crazy. Instead, use the fast TDS protocol to send a TVP or bulk data.
Then send the data to the server using either a TVP or a bulk insert (SqlBulkCopy) into a temp table. The latter technique is great for very many rows (>10k?). Bulk insert uses special TDS features; it does not use SQL batches to transfer the data. It does not get faster than this.
Then use the MERGE statement as you described. Use big batch sizes, potentially all rows in one batch.
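To make option (3) a bit more concrete, here is a minimal, untested sketch of a table-valued parameter feeding a MERGE; the type, procedure, table, and column names simply reuse the illustrative ones from the stored procedure above and are assumptions, not part of the question:
CREATE TYPE dbo.EntityTvp AS TABLE (
    match_field_1 INT,
    match_field_2 VARCHAR(50),
    data_1        VARCHAR(50),
    data_2        VARCHAR(50)
);
GO
CREATE PROCEDURE dbo.usp_UpsertEntities (@rows dbo.EntityTvp READONLY) AS
BEGIN
    MERGE dbo.data_table AS a
    USING @rows AS ta
        ON a.match_field_1 = ta.match_field_1
       AND a.match_field_2 = ta.match_field_2
    WHEN MATCHED THEN
        UPDATE SET data_1 = ta.data_1, data_2 = ta.data_2
    WHEN NOT MATCHED THEN
        INSERT (match_field_1, match_field_2, data_1, data_2)
        VALUES (ta.match_field_1, ta.match_field_2, ta.data_1, ta.data_2);
END
On the C# side the rows can be loaded into a DataTable and passed as a single SqlParameter with SqlDbType.Structured and TypeName = "dbo.EntityTvp", so the whole batch travels over TDS in one round trip.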
The best way I've found is to bulk insert into a temp table from your C# code, then issue the merge once the data is in SQL Server. I have an example here on my blog SQL Server Bulk Upsert
I use this in production to insert millions of rows daily, and have yet to find a faster way to do it. Give it a try, I think you will be impressed with the performance of the solution.

check existing record before inserting

I want to insert multiple records (~1000) using C# with SQL Server 2000 as the database, but before inserting, how can I check whether the record I'm inserting already exists, and if it does, skip it and insert the next record? The records come from a structured Excel file; I load them into a generic collection, iterate through each item, and perform the insert like this:
// Insert records into database
private void insertRecords() {
try {
// iterate through all records
// and perform insert on each iteration
for (int i = 0; i < Names.Count; i++) {
// clear the parameters from the previous iteration before re-adding them
sCommand.Parameters.Clear();
sCommand.Parameters.AddWithValue("@name", Names[i]);
sCommand.Parameters.AddWithValue("@person", ContactPeople[i]);
sCommand.Parameters.AddWithValue("@number", Phones[i]);
sCommand.Parameters.AddWithValue("@address", Addresses[i]);
// Open the connection
sConnection.Open();
sCommand.ExecuteNonQuery();
sConnection.Close();
}
} catch (SqlException ex) {
throw ex;
}
}
This code uses a stored procedure to insert the records, but how can I check whether the record exists before inserting?
Inside your stored procedure, you can have a check something like this (guessing table and column names, since you didn't specify):
IF EXISTS(SELECT * FROM dbo.YourTable WHERE Name = @Name)
RETURN
-- here, after the check, do the INSERT
You might also want to create a UNIQUE INDEX on your Name column to make sure no two rows with the same value exist:
CREATE UNIQUE NONCLUSTERED INDEX UIX_Name
ON dbo.YourTable(Name)
The easiest way would probably be to have an inner try block inside your loop. Catch any DB errors and re-throw them if they are not a duplicate-record error; if it is a duplicate-record error, don't do anything (swallow the exception).
Within the stored procedure, for each row to be added to the database, first check whether the row is present in the table. If it is present, UPDATE it; otherwise INSERT it. SQL Server 2008 also has the MERGE command, which essentially combines update and insert.
Performance-wise, RBAR (row-by-agonizing-row) is pretty inefficient. If speed is an issue, you'd want to look into the various "insert a lot of rows all at once" processes: BULK INSERT, the bcp utility, and SSIS packages. You still have the either/or issue, but at least it'd perform better.
Edit:
Bulk inserting data into an empty table is easy. Bulk inserting new data in a non-empty table is easy. Bulk inserting data into a table where some of the data (as, presumably, defined by the primary key) is already present is tricky. Alas, the specific steps get detailed quickly and are very dependent upon your system, code, data structures, etc. etc.
The general steps to follow (sketched after this list) are:
- Create a temporary table
- Load the data into the temporary table
- Compare the contents of the temporary table with those of the target table
- Where they match (old data), UPDATE
- Where they don't match (new data), INSERT
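A minimal sketch of those steps for SQL Server 2000 (which has no MERGE); the table and column names are guesses based on the code in the question, not the actual schema:
-- 1) staging table, loaded via BULK INSERT / bcp / SqlBulkCopy
CREATE TABLE #staging (
    Name          VARCHAR(100),
    ContactPerson VARCHAR(100),
    Phone         VARCHAR(50),
    Address       VARCHAR(200)
)

-- 2) update rows that already exist in the target
UPDATE t
SET    t.ContactPerson = s.ContactPerson,
       t.Phone         = s.Phone,
       t.Address       = s.Address
FROM   dbo.YourTable t
JOIN   #staging s ON s.Name = t.Name

-- 3) insert rows that do not exist yet
INSERT INTO dbo.YourTable (Name, ContactPerson, Phone, Address)
SELECT s.Name, s.ContactPerson, s.Phone, s.Address
FROM   #staging s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.YourTable t WHERE t.Name = s.Name)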
I did a quick search on SO for other posts that covered this and stumbled across something I'd never thought of. Try this; not only would it work, it's elegant.
Does your table have a primary key? If so you should be able to check that the key value to be inserted is not already in the table.

error when insert into linked server

I want to insert some data from the local server into a remote server, and used the following SQL:
select * into linkservername.mydbname.dbo.test from localdbname.dbo.test
But it throws the following error
The object name 'linkservername.mydbname.dbo.test' contains more than the maximum number of prefixes. The maximum is 2.
How can I do that?
I don't think the new table created with the INTO clause supports 4 part names.
You would need to create the table first, then use INSERT..SELECT to populate it.
(See note in Arguments section on MSDN: reference)
The SELECT...INTO [new_table_name] statement supports a maximum of 2 prefixes: [database].[schema].[table]
NOTE: it is more performant to pull the data across the link using SELECT INTO vs. pushing it across using INSERT INTO:
SELECT INTO is minimally logged.
SELECT INTO does not implicitly start a distributed transaction, typically.
I say typically, in point #2, because in most scenarios a distributed transaction is not created implicitly when using SELECT INTO. If a profiler trace tells you SQL Server is still implicitly creating a distributed transaction, you can SELECT INTO a temp table first, to prevent the implicit distributed transaction, then move the data into your target table from the temp table.
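For example, a hedged sketch of that temp-table variant (the object names are placeholders):
SELECT *
INTO   #staging
FROM   [server_a].[database].[schema].[table];

INSERT INTO [database].[schema].[target_table]
SELECT * FROM #staging;

DROP TABLE #staging;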
Push vs. Pull Example
In this example we are copying data from [server_a] to [server_b] across a link. This example assumes query execution is possible from both servers:
Push
Instead of connecting to [server_a] and pushing the data to [server_b]:
INSERT INTO [server_b].[database].[schema].[table]
SELECT * FROM [database].[schema].[table]
Pull
Connect to [server_b] and pull the data from [server_a]:
SELECT * INTO [database].[schema].[table]
FROM [server_a].[database].[schema].[table]
I've been struggling with this for the last hour.
I now realise that using the syntax
SELECT orderid, orderdate, empid, custid
INTO [linkedserver].[database].[dbo].[table]
FROM Sales.Orders;
does not work with linked servers. You have to go onto your linked server and manually create the table first, then use the following syntax:
INSERT INTO [linkedserver].[database].[dbo].[table]
SELECT orderid, orderdate, empid, custid
FROM Sales.Orders
WHERE shipcountry = 'UK';
I've experienced the same issue and performed the following workaround:
If you are able to log on to the remote server where you want to insert data (with an MSSQL client or sqlcmd), rebuild your query the other way around:
so from:
SELECT * INTO linkservername.mydbname.dbo.test
FROM localdbname.dbo.test
to the following:
SELECT * INTO localdbname.dbo.test
FROM linkservername.mydbname.dbo.test
In my situation it works well.
@2Toad: For sure INSERT INTO is better / more efficient. However, for small queries and quick operations SELECT * INTO is more flexible because it creates the table on the fly and inserts your data immediately, whereas INSERT INTO requires creating the table (auto-identity options and so on) before you carry out your insert operation.
I may be late to the party, but this was the first post I saw when I searched for the four-part table name insert issue with a linked server. After reading this and a few more posts, I was able to accomplish it by using EXEC with the AT argument (SQL 2008+) so that the query is run on the linked server. For example, I had to insert 4M records into a pseudo-temp table on another server, and an INSERT-SELECT FROM statement took 10+ minutes. Changing it to the following SELECT-INTO statement, which allows the four-part table name in the FROM clause, does it in mere seconds (less than 10 seconds in my case).
EXEC ('USE MyDatabase;
BEGIN TRY DROP TABLE TempID3 END TRY BEGIN CATCH END CATCH;
SELECT Field1, Field2, Field3
INTO TempID3
FROM SourceServer.SourceDatabase.dbo.SourceTable;') AT [DestinationServer]
GO
The query is run on DestinationServer, changes to the right database, ensures the table does not already exist, and selects from SourceServer. Minimally logged, and no fuss. This information may already be out there somewhere, but I hope it helps anyone searching for similar issues.

Hidden Features of PostgreSQL [closed]

I'm surprised this hasn't been posted yet. Any interesting tricks that you know about in Postgres? Obscure config options and scaling/perf tricks are particularly welcome.
I'm sure we can beat the 9 comments on the corresponding MySQL thread :)
Since postgres is a lot more sane than MySQL, there are not that many "tricks" to report on ;-)
The manual has some nice performance tips.
A few other performance related things to keep in mind:
Make sure autovacuum is turned on
Make sure you've gone through your postgresql.conf (effective_cache_size, shared_buffers, work_mem ... lots of options there to tune).
Use pgpool or pgbouncer to keep your "real" database connections to a minimum
Learn how EXPLAIN and EXPLAIN ANALYZE works. Learn to read the output.
CLUSTER sorts data on disk according to an index. Can dramatically improve performance of large (mostly) read-only tables. Clustering is a one-time operation: when the table is subsequently updated, the changes are not clustered.
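For example (table and index names are illustrative):
CLUSTER big_readonly_table USING big_readonly_table_pkey;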
Here are a few things I've found useful that aren't config- or performance-related per se.
To see what's currently happening:
select * from pg_stat_activity;
Search misc functions:
select * from pg_proc WHERE proname ~* '^pg_.*'
Find size of database:
select pg_database_size('postgres');
select pg_size_pretty(pg_database_size('postgres'));
Find size of all databases:
select datname, pg_size_pretty(pg_database_size(datname)) as size
from pg_database;
Find size of tables and indexes:
select pg_size_pretty(pg_relation_size('public.customer'));
Or, to list all tables and indexes (probably easier to make a view of this):
select schemaname, relname,
pg_size_pretty(pg_relation_size(schemaname || '.' || relname)) as size
from (select schemaname, relname, 'table' as type
from pg_stat_user_tables
union all
select schemaname, relname, 'index' as type
from pg_stat_user_indexes) x;
Oh, and you can nest transactions and roll back partial transactions:
test=# begin;
BEGIN
test=# select count(*) from customer where name='test';
count
-------
0
(1 row)
test=# insert into customer (name) values ('test');
INSERT 0 1
test=# savepoint foo;
SAVEPOINT
test=# update customer set name='john';
UPDATE 3
test=# rollback to savepoint foo;
ROLLBACK
test=# commit;
COMMIT
test=# select count(*) from customer where name='test';
count
-------
1
(1 row)
The easiest trick to make PostgreSQL perform a lot better (apart from setting and using proper indexes, of course) is just to give it more RAM to work with (if you have not done so already). On most default installations the value for shared_buffers is way too low (in my opinion). You can set
shared_buffers
in postgresql.conf. Divide this number by 128 to get an approximation of the amount of memory (in MB) Postgres can claim, since the value is counted in 8 kB pages; for example, shared_buffers = 32768 corresponds to 32768 / 128 = 256 MB. If you raise it enough, this will make PostgreSQL fly. Don't forget to restart PostgreSQL afterwards.
On Linux systems, if PostgreSQL won't start again, you have probably hit the kernel.shmmax limit. Set it higher with
sysctl -w kernel.shmmax=xxxx
To make this persist between boots, add a kernel.shmmax entry to /etc/sysctl.conf.
A whole bunch of Postgresql tricks can be found here:
http://www.postgres.cz/index.php/PostgreSQL_SQL_Tricks
Postgres has a very powerful datetime handling facility thanks to its INTERVAL support.
For example:
select NOW(), NOW() + '1 hour';
now | ?column?
-------------------------------+-------------------------------
2009-04-18 01:37:49.116614+00 | 2009-04-18 02:37:49.116614+00
(1 row)
select current_date ,(current_date + interval '1 year')::date;
    date    |    date
------------+------------
 2014-10-17 | 2015-10-17
(1 row)
You can cast many strings to an INTERVAL type.
COPY
I'll start. Whenever I switch to Postgres from SQLite, I usually have some really big datasets. The key is to load your tables with COPY FROM rather than doing INSERTS. See documentation:
http://www.postgresql.org/docs/8.1/static/sql-copy.html
The following example copies a table to the client using the vertical bar (|) as the field delimiter:
COPY country TO STDOUT WITH DELIMITER '|';
To copy data from a file into the country table:
COPY country FROM '/usr1/proj/bray/sql/country_data';
See also here:
Faster bulk inserts in sqlite3?
My favorite by far is generate_series: at last, a clean way to generate dummy row sets.
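For example, ten dummy rows in one line:
select n, md5(n::text) from generate_series(1, 10) as n;
select d::date from generate_series(date '2024-01-01', date '2024-01-07', interval '1 day') as d;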
Ability to use a correlated value in a LIMIT clause of a subquery:
SELECT (
SELECT exp_word
FROM mytable
OFFSET id
LIMIT 1
)
FROM othertable
Ability to use multiple parameters in custom aggregates (not covered by the documentation): see the article on my blog for an example.
One of the things I really like about Postgres is some of the data types supported in columns. For example, there are column types made for storing Network Addresses and Arrays. The corresponding functions (Network Addresses / Arrays) for these column types let you do a lot of complex operations inside queries that you'd have to do by processing results through code in MySQL or other database engines.
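For example, with the inet type you can do subnet arithmetic directly in SQL:
select '192.168.0.10'::inet << '192.168.0.0/24'::cidr;   -- is the address inside the subnet? true
select host('192.168.0.10/24'::inet), masklen('192.168.0.10/24'::inet);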
Arrays are really cool once you get to know them.
Let's say you would like to store some hyperlinks between pages. You might start by thinking about creating a table kind of like this:
CREATE TABLE hyper.links (
tail INT4,
head INT4
);
If you needed to index the tail column, and you had, say, 200,000,000 link rows (like Wikipedia would give you), you would find yourself with a huge table and a huge index.
However, with PostgreSQL, you could use this Table format instead:
CREATE TABLE hyper.links (
tail INT4,
head INT4[],
PRIMARY KEY(tail)
);
To get all heads for a link you could send a command like this (unnest() is standard since 8.4):
SELECT unnest(head) FROM hyper.links WHERE tail = $1;
This query is surprisingly fast when compared with the first option (unnest() is fast and the index is way, way smaller). Furthermore, your table and index will take up much less RAM and disk space, especially when your arrays are so long that they are compressed into a TOAST table. Arrays are really powerful.
Note: while unnest() will generate rows out of an Array, array_agg() will aggregate rows into an Array.
Materialized views are pretty easy to set up:
CREATE VIEW my_view AS SELECT id, AVG(my_col) FROM my_table GROUP BY id;
CREATE TABLE my_matview AS SELECT * FROM my_view;
That creates a new table, my_matview, with the columns and values of my_view. Triggers or a cron script can then be set up to keep the data up to date, or if you're lazy:
TRUNCATE my_matview;
INSERT INTO my_matview SELECT * FROM my_view;
Inheritance... in fact multiple inheritance (as in parent-child "inheritance", not the 1-to-1 relation inheritance which many web frameworks implement when working with Postgres).
PostGIS (spatial extension): a wonderful add-on that offers a comprehensive set of geometry functions and coordinate storage out of the box. Widely used in many open-source geo libraries (e.g. OpenLayers, MapServer, Mapnik, etc.) and definitely way better than MySQL's spatial extensions.
Writing procedures in different languages, e.g. C, Python, Perl, etc. (makes your life easier to code if you're a developer and not a DBA).
Also, all procedures can be stored externally (as modules) and can be called or imported at runtime with specified arguments. That way you can source-control and easily debug the code.
A huge and comprehensive catalogue of all objects implemented in your database (i.e. tables, constraints, indexes, etc.).
I always find it immensely helpful to run a few queries and get all the meta info, e.g. constraint names and the fields they are implemented on, index names, and so on.
For me it all becomes extremely handy when I have to load new data or do massive updates in big tables (I automatically disable triggers and drop indexes, and then recreate them easily after processing has finished). Someone did an excellent job of writing a handful of these queries:
http://www.alberton.info/postgresql_meta_info.html
Multiple schemas under one database: you can use them if your database has a large number of tables; think of schemas as categories. All tables (regardless of their schema) have access to all other tables and functions present in the parent DB.
You don't need to learn how to decipher EXPLAIN ANALYZE output; there is a tool: http://explain.depesz.com
select pg_size_pretty(200 * 1024)
pgcrypto: more cryptographic functions than many programming languages' crypto modules provide, all accessible direct from the database. It makes cryptographic stuff incredibly easy to Just Get Right.
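A couple of hedged examples (after CREATE EXTENSION pgcrypto):
select encode(digest('secret', 'sha256'), 'hex');   -- SHA-256 hash as hex text
select crypt('secret', gen_salt('bf'));             -- bcrypt hash suitable for password storage
-- to verify a password later: crypt(candidate, stored_hash) = stored_hash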
A database can be copied with:
createdb -T old_db new_db
The documentation says:
this is not (yet) intended as a general-purpose "COPY DATABASE" facility
but it works well for me and is much faster than
createdb new_db
pg_dump old_db | psql new_db
Memory storage for throw-away data/global variables
You can create a tablespace that lives in the RAM, and tables (possibly unlogged, in 9.1) in that tablespace to store throw-away data/global variables that you'd like to share across sessions.
http://magazine.redhat.com/2007/12/12/tip-from-an-rhce-memory-storage-on-postgresql/
Advisory locks
These are documented in an obscure area of the manual:
http://www.postgresql.org/docs/9.0/interactive/functions-admin.html
It's occasionally faster than acquiring multitudes of row-level locks, and they can be used to work around cases where FOR UPDATE isn't implemented (such as recursive CTE queries).
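A minimal usage sketch (the lock key is an arbitrary bigint chosen by the application):
select pg_try_advisory_lock(42);   -- true if acquired, false if someone else holds it
-- ... do work that must not run concurrently ...
select pg_advisory_unlock(42);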
This is my list of favorite lesser-known features.
Transactional DDL
Nearly every SQL statement is transactional in Postgres. If you turn off autocommit the following is possible:
drop table customer_orders;
rollback;
select *
from customer_orders;
Range types and exclusion constraint
To my knowledge Postgres is the only RDBMS that lets you create a constraint that checks if two ranges overlap. An example is a table that contains product prices with a "valid from" and "valid until" date:
create table product_price
(
price_id serial not null primary key,
product_id integer not null references products,
price numeric(16,4) not null,
valid_during daterange not null
);
NoSQL features
The hstore extension offers a flexible and very fast key/value store that can be used when parts of the database need to be "schema-less". JSON is another option to store data in a schema-less fashion. Coming back to the range-type table above, inserting and querying prices looks like this:
insert into product_price
(product_id, price, valid_during)
values
(1, 100.0, '[2013-01-01,2014-01-01)'),
(1, 90.0, '[2014-01-01,)');
-- querying is simple and can use an index on the valid_during column
select price
from product_price
where product_id = 42
and valid_during @> date '2014-10-17';
The execution plan for the above on a table with 700,000 rows:
Index Scan using check_price_range on public.product_price (cost=0.29..3.29 rows=1 width=6) (actual time=0.605..0.728 rows=1 loops=1)
Output: price
Index Cond: ((product_price.valid_during @> '2014-10-17'::date) AND (product_price.product_id = 42))
Buffers: shared hit=17
Total runtime: 0.772 ms
To avoid inserting rows with overlapping validity ranges, a simple (and efficient) exclusion constraint can be defined:
alter table product_price
add constraint check_price_range
exclude using gist (product_id with =, valid_during with &&);
Infinity
Instead of requiring a "real" date far in the future, Postgres can compare dates to infinity. E.g. when not using a date range you can do the following:
insert into product_price
(product_id, price, valid_from, valid_until)
values
(1, 90.0, date '2014-01-01', date 'infinity');
Writeable common table expressions
You can delete, insert and select in a single statement:
with old_orders as (
delete from orders
where order_date < current_date - interval '10' year
returning *
), archived_rows as (
insert into archived_orders
select *
from old_orders
returning *
)
select *
from archived_rows;
The above will delete all orders older than 10 years, move them to the archived_orders table and then display the rows that were moved.
1.) When you need to attach a note to a query, you can use an inline comment:
SELECT /* my comments, that I would to see in PostgreSQL log */
a, b, c
FROM mytab;
2.) Remove trailing spaces from all text and varchar fields in a database:
do $$
declare
selectrow record;
begin
for selectrow in
select
'UPDATE '||c.table_name||' SET '||c.COLUMN_NAME||'=TRIM('||c.COLUMN_NAME||') WHERE '||c.COLUMN_NAME||' ILIKE ''% '' ' as script
from (
select
table_name,COLUMN_NAME
from
INFORMATION_SCHEMA.COLUMNS
where
table_name LIKE 'tbl%' and (data_type='text' or data_type='character varying' )
) c
loop
execute selectrow.script;
end loop;
end;
$$;
3.) We can use a window function to remove duplicate rows very effectively:
DELETE FROM tab
WHERE id IN (SELECT id
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), id
FROM tab) x
WHERE x.row_number > 1);
A PostgreSQL-optimized version (using ctid):
DELETE FROM tab
WHERE ctid = ANY(ARRAY(SELECT ctid
FROM (SELECT row_number() OVER (PARTITION BY column_with_duplicate_values), ctid
FROM tab) x
WHERE x.row_number > 1));
4.) When we need to identify the server's state, we can use the function:
SELECT pg_is_in_recovery();
5.) Get a function's DDL command:
select pg_get_functiondef((select oid from pg_proc where proname = 'f1'));
6.) Safely changing column data type in PostgreSQL
create table test(id varchar );
insert into test values('1');
insert into test values('11');
insert into test values('12');
select * from test
--Result--
id
character varying
--------------------------
1
11
12
You can see from the above table that I have used the data type 'character varying' for the 'id' column. But it was a mistake, because I always provide integers as the id, so using varchar here is bad practice. So let's try to change the column type to integer.
ALTER TABLE test ALTER COLUMN id TYPE integer;
But it returns:
ERROR: column "id" cannot be cast automatically to type integer
SQL state: 42804
Hint: Specify a USING expression to perform the conversion
That means we can't simply change the data type because data is already there in the column. Since the data is of type 'character varying', Postgres cannot treat it as integer even though we entered integers only. So now, as Postgres suggested, we can use the USING expression to cast our data to integers:
ALTER TABLE test ALTER COLUMN id TYPE integer USING (id ::integer);
It Works.
7.) Know who is connected to the Database
This is more or less a monitoring command. To know which user is connected to which database, including their IP and port, use the following SQL:
SELECT datname,usename,client_addr,client_port FROM pg_stat_activity ;
8.) Reloading PostgreSQL Configuration files without Restarting Server
PostgreSQL configuration parameters are located in special files like postgresql.conf and pg_hba.conf. Often you may need to change these parameters, but for some of them to take effect the configuration file has to be reloaded. Of course, restarting the server will do it, but in a production environment it is not desirable to restart a database that is being used by thousands of users just to set some parameters. In such situations, we can reload the configuration files without restarting the server by using the following function:
select pg_reload_conf();
Remember, this won't work for all parameters; some parameter changes need a full restart of the server to take effect.
9.) Getting the data directory path of the current Database cluster
It is possible that multiple instances (clusters) of PostgreSQL are set up on the same system, generally on different ports. In such cases, finding which directory (physical storage directory) is used by which instance is a hectic task. In such cases, we can use the following command in any database of the cluster of interest to get the directory path:
SHOW data_directory;
The same setting can be used to change the data directory of the cluster, but that requires a server restart:
SET data_directory to new_directory_path;
10.) Find out whether a CHAR value is a DATE or not
create or replace function is_date(s varchar) returns boolean as $$
begin
perform s::date;
return true;
exception when others then
return false;
end;
$$ language plpgsql;
Usage: the following will return True
select is_date('12-12-2014')
select is_date('12/12/2014')
select is_date('20141212')
select is_date('2014.12.12')
select is_date('2014,12,12')
11.) Change the owner in PostgreSQL
REASSIGN OWNED BY sa TO postgres;
12.) PGADMIN PLPGSQL DEBUGGER
Well explained here
It's more convenient to rename a database in PostgreSQL than it is in MySQL. Just use the following command:
ALTER DATABASE name RENAME TO new_name;
