We had a performance problem with a table, and I found that relallvisible was 0 and relpages was around 1,000,000. I ran VACUUM ANALYZE and everything started working fine, but after one day relallvisible for this table went back to 0. Why did this happen? We keep this table as an archive, so only once a day do we move data from the original table into it using a simple function. One solution may be to run VACUUM ANALYZE after the function, but I don't want to do that; I think Postgres should handle this problem on its own. I understand that after some time the table will again have relallvisible = 0, but not after one day, when the table is around 10M rows and we copy only 15-20k rows per day.
That is not normal. Most likely someone did something to the table, like VACUUM FULL or CLUSTER or a bulk UPDATE, which cleared the visibility map.
It is possible each block had a bit of free space in it, and that free space got filled up by the next bulk insert. But if that is the case, now that the space is full it shouldn't happen again, unless whatever created the scattered free space happens again.
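If you want to keep an eye on this, a quick check along these lines shows the counters the planner sees (the table name archive_table is a placeholder, not from the original post), and a plain VACUUM ANALYZE refreshes them without rewriting the table:

SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'archive_table';   -- placeholder table name

VACUUM ANALYZE archive_table;      -- refreshes the visibility map and statistics; no table rewrite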
description
I use Postgres together with Python 3.
There are 17 million rows in the table, and the maximum ID is over 3 billion.
My task is to run select id, link from table where data is null;, do some processing in my code, and then run update table set data = %s where id = %s for each row.
I tested that a single update takes about 0.1 s.
my thoughts
Here is my idea:
Try a different database; I have heard Redis is fast, but I don't know how to use it.
In addition, what is the best number of connections?
I used to open 5-6 connections.
Now I use only two connections, and it works better: in one hour I updated 2 million rows.
If there is any way you can push the calculation of the new value into the database, i.e. issue a single large UPDATE statement like
UPDATE "table"
SET data = [calculation here]
WHERE data IS NULL;
you would be much faster.
But for the rest of this discussion I'll assume that you have to calculate the new values in your code, i.e. run one SELECT to get all the rows where data IS NULL and then issue a lot of UPDATE statements, each targeting a single row.
In that case, there are two ways you can speed up processing considerably:
Avoid index updates
Updating an index is more expensive than adding a tuple to the table itself (the appropriately so-called heap, onto which it is quick and easy to pile up entries). So by avoiding index updates, you will be much faster.
There are two ways to avoid index updates:
Drop all indexes after selecting the rows to change and before the UPDATEs and recreate them after processing is completed.
This will be a net win if you update enough rows.
Make sure that there is no index on data and that the table has been created with a fillfactor of less than 50. Then there is enough room in the data pages to write the update into the same page as the original row version, which obviates the need to update the index (this is known as a HOT update).
This is probably not an option for you, since you probably didn't create the table with a fillfactor like that, but I wanted to add it for completeness' sake; a sketch of both approaches follows below.
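Here is that sketch, using placeholder names (mytable, link, and ix_mytable_link are not from the original question); treat it as an outline rather than a drop-in script:

-- Option 1: drop the secondary indexes before the single-row UPDATEs and recreate them afterwards.
-- Keep the index that the UPDATEs' WHERE clause needs (here, the primary key on id),
-- otherwise each single-row UPDATE has to scan the whole table.
DROP INDEX IF EXISTS ix_mytable_link;
-- ... run the single-row UPDATE statements here ...
CREATE INDEX ix_mytable_link ON mytable (link);

-- Option 2: a low fillfactor (below 50) leaves room in every page for HOT updates,
-- but it has to be set when the table is created and filled.
CREATE TABLE mytable (
    id   bigint PRIMARY KEY,
    link text,
    data text
) WITH (fillfactor = 45);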
Bundle many updates in a single transaction
By default, each UPDATE will run in its own transaction, which is committed at the end of the statement. However, each COMMIT forces the transaction log (WAL) to be written out to disk, which slows down processing considerably.
You do that by explicitly issuing a BEGIN before the first UPDATE and a COMMIT after the last one. That will also make the whole operation atomic, so that all changes are undone automatically if processing is interrupted.
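For example, a batched version might look like this (the ids and values are placeholders):

BEGIN;
UPDATE "table" SET data = 'computed value 1' WHERE id = 101;
UPDATE "table" SET data = 'computed value 2' WHERE id = 102;
-- ... thousands more single-row UPDATEs ...
COMMIT;  -- one WAL flush for the whole batch instead of one per row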
A recent employee of our company had a stored procedure go haywire, which caused mass inserts into a debug table of his. The table is unindexed, is now at close to 1.7 billion rows, and is taking up so much space that the backup no longer fits on the backup drive (backups now reach close to 250 GB).
I haven't really seen anything like this, so I'm seeking advice from the MSSQL Gurus out here.
I know I could nibble away at the table, but being unindexed, the DELETE FROM [TABLE] WHERE ID IN (SELECT TOP 10000 [ID] FROM [TABLE]) nearly locks up the server searching for them.
I also don't want my log file to get massive, it's currently sitting at 480GB on a 1TB drive. If I delete this table, will I be able to shrink it back down? (My recovery mode is simple)
We could index the id field on the table, though we only have around 9 hours downtime a day, and during business hours we can't be locking up the database.
Just looking for advice here, and a point in the right direction.
Thanks.
You may want to consider TRUNCATE
MSDN reference: http://technet.microsoft.com/en-us/library/aa260621(v=sql.80).aspx
Removes all rows from a table without logging the individual row deletes.
Syntax:
TRUNCATE TABLE [YOUR_TABLE]
As @Rahul suggests in the comments, you could also use DROP TABLE [YOUR_TABLE] if you no longer plan to use the table in question. The TRUNCATE option would simply empty the table but leave it in place if you wanted to continue to use it.
With regards to the space issue, both of these operations will be comparatively quick and the space will be reclaimed, but it won't happen instantly. When using TRUNCATE, the data still has to be deleted, but SQL Server will simply deallocate the data pages used by the table and use a background process to actually perform the clean up afterwards.
This post should provide some useful information.
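Since shrinking was part of the question too, here is a rough sketch of the follow-up, assuming SIMPLE recovery; the logical log file name is a placeholder you would look up in sys.database_files:

TRUNCATE TABLE [YOUR_TABLE];              -- deallocates the table's pages, minimally logged

-- Under SIMPLE recovery the freed log space becomes reusable after a checkpoint,
-- but the physical files only get smaller if you shrink them explicitly:
DBCC SHRINKFILE (N'YourDb_log', 10240);   -- target size in MB; placeholder logical name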
One suggestion would be to take a backup of only that 1.7-billion-row table (probably to a tape drive or somewhere with enough space) and then drop the table with drop table table_name.
That way, if that debug table data is ever needed in the future, you have a copy and can restore it from the backup.
I would remove the logging for this table and launch a delete stored procedure that would commit every 1000 rows.
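A sketch of that kind of procedure, assuming SQL Server 2005 or later and reusing the [YOUR_TABLE] placeholder (the procedure name is made up; note that SQL Server has no per-table switch to turn logging off, so the small batches are what keep the log manageable):

CREATE PROCEDURE dbo.PurgeDebugTable
AS
BEGIN
    SET NOCOUNT ON;
    -- Each DELETE is its own implicit transaction, i.e. a commit roughly every 1000 rows.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (1000) FROM [YOUR_TABLE];
        IF @@ROWCOUNT = 0 BREAK;
    END;
END;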
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This runs for 4 hours but never completes.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 of them, then the next 1000, and so on, normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table, or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there in the orders of billions of records. I can't imagine it takes a computer hours and hours (only to fail) at copying a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE, which simply copies one column into another, run for 4 hours and then fail, while the CREATE TABLE, which copies the entire table, takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, rollback (undo) information must be recorded for every changed row in case a rollback is needed. When creating a new table from a query, that isn't necessary.
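To make that concrete, here is a sketch of the full create-and-swap; the column names come from the question, but any other columns, indexes, constraints, and grants on MYTABLE would have to be carried over as well:

CREATE TABLE MYTABLE_2 AS
    SELECT COLUMN_B, COLUMN_B AS COLUMN_A   -- plus every other column you need to keep
    FROM MYTABLE;

DROP TABLE MYTABLE;
ALTER TABLE MYTABLE_2 RENAME TO MYTABLE;
-- Recreate the primary key, indexes, triggers, and grants on the new table here.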
I have a database where one of the tables has become over-populated (465,025,579 records). What is the best way to delete records and keep only the last 3 months of them, without the device hanging?
Delete them in batches based on date, earliest first. Sure, it'll take some time, but it's safer (since you are defining which rows to delete) and not so resource-intensive. It also means you can shrink the database in batches too, instead of in one big hit (which is quite resource-intensive).
Yeah, it might fragment the database a little, but until you've got the actual data down to a manageable level, there isn't that much you can do.
To be fair, 200G of data isn't that much on a decent machine these days.
All this said, I'm presuming you want the database to remain 'online'.
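A sketch of the batched, date-based delete, assuming SQL Server; BigTable and CreatedDate are placeholder names:

DECLARE @cutoff datetime;
SET @cutoff = DATEADD(MONTH, -3, GETDATE());

-- An index on CreatedDate makes each batch cheap to locate.
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM BigTable
    WHERE CreatedDate < @cutoff;

    IF @@ROWCOUNT = 0 BREAK;
    -- Optionally shrink or back up between batches here, as suggested above.
END;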
If you don't need the database to be available whilst you're doing this, the easiest thing to do is usually to select the rows that you want to keep into a different table, run a TRUNCATE on this table, and then copy the saved rows back in.
From TRUNCATE:
TRUNCATE TABLE is similar to the DELETE statement with no WHERE clause; however, TRUNCATE TABLE is faster and uses fewer system and transaction log resources.
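A sketch of that approach, again assuming SQL Server with placeholder names (BigTable, CreatedDate):

-- 1. Save the rows you want to keep (the last 3 months).
SELECT *
INTO BigTable_keep
FROM BigTable
WHERE CreatedDate >= DATEADD(MONTH, -3, GETDATE());

-- 2. Empty the big table cheaply.
TRUNCATE TABLE BigTable;

-- 3. Put the kept rows back and drop the working copy.
--    (If BigTable has an IDENTITY column, use SET IDENTITY_INSERT BigTable ON
--     and list the columns explicitly.)
INSERT INTO BigTable SELECT * FROM BigTable_keep;
DROP TABLE BigTable_keep;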
I have a table that gets updated from an outside source. It normally sits empty until they push data to me. With this data I am supposed to add, update, or delete records in two other tables (linked by a primary/foreign key). Data is pushed to me one row at a time, and occasionally in a large download twice a year. They want me to update my tables in real time. Should I use a trigger and have it read line by line, or merge the tables?
I'd have a scheduled job that runs a sproc to check for work to do in that table, and then process it in batches. Have a column on the import/staging table that you can update with a batch number or timestamp, so if something goes wrong (like they have pushed you some goofy data) you know where to restart from and can identify which row caused the problem.
If you use a trigger, not only might it slow down them feeding you a large batch of data, but you'll also possibly lose the ability to keep a record of where the process got to if it fails.
If it was always one row at a time, then I think the trigger method would be an okay option.
Edit: Just to clarify the point about batch number/timestamp, this is so if you have new/unexpected data which crashes your import, you can alter the code and re-run the process as much as you like without having to ask for a fresh import.
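A sketch of the staging-table bookkeeping described above, assuming SQL Server; every object name here is made up:

-- A nullable batch marker on the import/staging table.
ALTER TABLE ImportStaging ADD BatchId int NULL;

-- The scheduled job stamps the unprocessed rows, processes just that batch,
-- and leaves the stamp in place so a failed run can be inspected and re-run.
DECLARE @batch int;
SELECT @batch = ISNULL(MAX(BatchId), 0) + 1 FROM ImportStaging;

UPDATE ImportStaging
SET BatchId = @batch
WHERE BatchId IS NULL;

-- ... apply the adds/updates/deletes to the two target tables
--     from the rows WHERE BatchId = @batch ...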