I am developing a test application that requires me to insert 1 million records in a Postgresql database but at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records. I've read that databases have a size cap, which is around 4 Gb, but I'm sure my database didn't even come close to this value.
So, what other reasons could be for why insertion stopped?
It happened a few times, once capping at 170872 records, another time at 25730 records.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
Thanks in advance!
JUST A QUICK UPDATE:
Indeed the problem isn't the database cap, here are the official figures for PostgreSQL:
- Maximum Database Size Unlimited
- Maximum Table Size 32 TB
- Maximum Row Size 1.6 TB
- Maximum Field Size 1 GB
- Maximum Rows per Table Unlimited
- Maximum Columns per Table 250 - 1600 depending on column types
- Maximum Indexes per Table Unlimited
Update:
Error in log file:
2012-03-26 12:30:12 EEST WARNING: there is no transaction in progress
So I'm looking up for an answer that fits this issue. If you can give any hints I would be very grateful.
I've read that databases have a size cap, which is around 4 Gb
I rather doubt that. It's certainly not true about PostgreSQL.
[...]at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records
Again, I'm afraid I doubt this. Unless your application has become self aware it's refusing to do nothing. It might be crashing, or locking, or waiting for something to happen though.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
I don't think you've looked hard enough. Obvious things to check:
Are you getting any errors in the PostgreSQL logs?
If not, are you sure you're logging errors? Issue a bad query to check.
Are you getting any errors in the application?
If not,. are you sure you're checking? Again, check
What is/are the computer(s) up to? How much CPU/RAM/Disk IO is in use? Any unusual activity?
Any unusual locks begin taken (check the pg_locks view).
If you asked the question having checked the above then there's someone who'll be able to help. Probably though, you'll figure it out yourself once you've got the facts in front of you.
OK - if you're getting "no transaction in progress" that means you're issuing a commit/rollback but outside of an explicit transaction. If you don't issue a "BEGIN" then each statement gets its own transaction.
This is unlikely to be the cause of the problem.
Something is causing the inserts to stop, and you've still not told us what. You said earlier you weren't getting any errors inside the application. That shouldn't be possible if PostgreSQL is returning an error you should be picking it up in the application.
It's difficult to be more helpful without more accurate information. Every statement you send to PostgreSQL will return a status code. If you get an error inside a multi-statement transaction then all the statements in that transaction will be rolled back. You've either got some confused transaction control in the application or it is falling down for some other reason.
One of the possibilities is that the OP is using ssl, and the ssl_renegotiation_limit is reached. In any case: set the log_connections / log_disconnections to "On" and check the logfile.
I found out what was the problem with my insert command, and although it might seem funny it's one of those things you never thought could go wrong.
My application is developed in Django and has a command that simply calls for the file that does the insert operations into the tables.
i.e. in the command line terminal I just write:
time python manage.py populate_sql
The reason for which I use the time command is because I want to see how long it takes for the insertion to execute. Well, the problem was here. That time command issued an error, a Out of memory error which stopped the insertion into the database. I found this little code while running the command with the --verbose option which lets you see all the details of the command.
I would like to thank you all for your answers, for the things that I have learned from them and for the time you used trying to help me.
EDIT:
If you have a Django application in which you make a lot of operations with the database, then my advice to you is to set the 'DEBUG' variable in settings.py to 'FALSE' because it eats up a lot of your memory in time.
So,
DEBUG = False
And in the end, thank you again for the support Richard Huxton!
Related
I attempted to perform a full Solr reindex for our Cassandra cluster this past weekend. It seemed that two nodes were taking a lot longer than the other three, in fact they keep indexing for hours after the others were done. Finally it seemed they had finished, at least in the web console they both said "no" for indexing field in the web console.
Unfortunately about an hour later one of those two nodes became completely unresponsive, and ultimately had to be restarted.
Today I'm looking at the nodes, and the 3 that didn't seem to have any problems all claim to have about 14.8 million docs or so, which is about what it should be. However the two that were stuck, or took forever (including the one that ulimately became unresponsive) have only 9 and 7 million respectively. That is a huge discrepancy which tells me that they didn't complete correctly.
So, to resolve the issue I have two questions:
1) Since this was a full reindex, are the changes that were implemented to the schema and hence the reason for the full index, good? In other words is it only the indexing part that didn't finish, so can I just run a regular in place reindex to get everything back to the way it should be?
2) Assuming I don't have to run a full reindex, can I just run an in place reindex on the two nodes that are out of whack? From a time perspective this would be ideal as I'd have to do it after hours anyway, and it would hopefully finish overnight.
Just wondering how to proceed, as I haven't had this issue in the past.
Regarding your questions:
1) Yes, you can do a reload with in-place reindex by setting reindex=true, deleteAll=false.
2) Yes, you can run an in-place reindex on the failed nodes only by invoking a reload on each node and setting reindex=true, deleteAll=false, distributed=false.
Have a look at: http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchReldCore.html
Anyway, it would be good to first understand why those nodes failed: that kind of behaviour looks like an out of memory error, but are there any exceptions in your logs?
Got the error upon create/delete/update queries:
ERROR: database is not accepting commands to avoid wraparound data
loss in database "mydb" HINT: Stop the postmaster and use a
standalone backend to vacuum that database. You might also need to
commit or roll back old prepared transactions.
So, the database is blocked and it is only possible to perform SELECT queries.
Database's size 350 GB. 1 table(my_table) has ~1 billion rows.
system: "PostgreSQL 9.3.4 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4), 64-bit"
postgresq.conf some settings:
effective_io_concurrency = 15 # 1-1000; 0 disables prefetching
autovacuum_vacuum_cost_delay = -1
#vacuum_cost_delay = 0 # 0-100 milliseconds
#vacuum_cost_page_hit = 1 # 0-10000 credits
#vacuum_cost_page_miss = 10 # 0-10000 credits
#vacuum_cost_page_dirty = 20 # 0-10000 credits
#vacuum_cost_limit = 200
I do not use prepared transactions. But basic stored procedures are used(which means, automatic tranactions, right?) 50mln times per day.
Сurrently "autovacuum: VACUUM ANALYZE public.my_table (to prevent wraparound)" is perforing, it is almost 12 hours of that query activity.
As far as I understand, the problem with not-vacuumed dead touples, right?
How to resolve this problem and prevent this in the future? Please, help :)
The end of story( ~one month later)
Now my big table is partitioned by thousands of tables. Each small table is vacuumed much faster. Autovacuum configuration was set more closer to default. If needed, i could be set to more agressive again, but so far database with billions of rows works pretty well.
So, the problem of the topic should not appear again.
ps now i'm looking at Postgres-XL as a next step of data scalability.
The problem isn't dead tuples, it's transaction ids, which control row visibility. Each transaction gets a sequential XID, since they're 32 bit ints, they will eventually wrap around.
See here for more detail: http://www.postgresql.org/docs/9.3/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND, but the short version is that all tables need to be VACUUMed (either manually or with autovacuum) at least every 2 billion transactions. The longer you go without vacuuming the longer it takes.
To fix your current problem you don't need to do a VACUUM ANALYZE, just a VACUUM - I am not sure how much of a speed difference there is, but it should be faster.
What kind of hardware is this running on, and what's your maintenance_work_mem set to? You may want to raise it (possibly temporarily) to complete the VACUUM faster.
In the future, you basically just need to VACUUM more: either increase autovacuum frequency (see here: https://dba.stackexchange.com/questions/21068/aggressive-autovacuum-on-postgresql, for example) or even schedule manual VACUUMs with cron. Also look at vacuum_freeze_min_age and related settings.
What kind of data is it, and what kind of transactions are you running? That's a pretty big table, can it be partitioned (by date, for instance)?
Edit
You may also want to enable log_autovacuum_min_duration (set it to a small value), to see what autovacuum is actually doing when the database is live, and if there are locking issues preventing it from running.
Responding to Comments
You don't have to run VACUUM standalone, you can run it now, unless that will interfere too much with your other databases. Just need to do it as superuser, so system tables are also vacuumed.
Doing a dump/restore seems drastic, and I can't imagine it would be faster than completing the VACUUM.
Switching away from stored procedures will not help: any queries that modify data will generate XIDs, it doesn't matter if you use transactions explicitly, they're still transactions.
You're on the right way - getting autovacuum to keep up with your inserts/updates is the best solution (logging its activity should help understand what's going wrong now).
Judging by your table structure, this may be the classic case for table partitioning (http://www.postgresql.org/docs/9.3/static/ddl-partitioning.html) - am I right in thinking that it's all inserts, rather than updates/deletes? If you're always writing to one small partition, you can vacuum it more aggressively (autovacuum can be configured per table), and VACUUM FREEZE the others.
I think you have no choice but to stop the database, restart in standalone mode, and do a vacuum. Letting the autovac complete will not help, because once it completes it will go to update the system catalog to reflect that completion, and that update will be rejected because it cannot acquire the needed transaction ID. At least that was my experience.
As for preventing it in the future, do you restart your database on a regular basis? If you restart your database every 24 hours, but you have a table that takes 30 hours to vacuum, then that table will never be vacuumed successfully, and you will get into trouble eventually.
I'm running on SQL Server 2008 R2 and am trying to fine-tune performance. I did everything I could from:
Code review of SQL code
Create or remove indexes as I think appropriate
Auto create stats ON
Auto update stats ON
Auto update stats async ON
I have a 24/7 system that constantly stores data. Sometimes we do reads and that's where the issue is. Sometimes the reads take a couple of seconds or less (which would be expected and acceptable to us). Other times, the reads take several seconds that could amount to a minute before the stored procedure completes and we render data on the UI.
If we do the read again, it would be faster. The SQL profiler would trace the particular stored procedure or query that took several seconds. We would zoom into that stored procedure, and do everything we can do to optimize it if we can.
I also traced the auto stats event and the recompile event. It's hard to tell if a stat is being updated causing the read to take a long time, or if a recompile caused it. Sometimes, I see that the profiler traced a recompile of the read query that took several unacceptable minutes, other times it doesn't trace a recompile.
I tried to prevent the query optimizer from blocking the read until it recompiles or updates stats by using option use plan XML, etc. But I ran into compile errors complaining that the query plan XML isn't valid; that could be true because the query is quiet involved: select + joins that involve a local table var. I sort of hacked the XML and maybe that's why it deemed it invalid. So I gave up on using plan hint.
We tried periodic (every 15 minutes) manual running update stats in order to keep stats up-to-date as much as we can, but that hurt performance. updatestats blocks writes, and I'm sure even reads; updatestats seemed to maintain a bunch of statistics and on average it was taking around 80-90 seconds. A read that waits that long is unacceptable.
So the idea is to let the reads happen and prevent a situation when a recompile/update stat blocks it, correct? Does it make sense to disable auto statistics altogether? Or perhaps disable auto create statistics after deleting all the auto created stats?
This goes against Microsoft recommendations perhaps, since they enable auto create statistics and auto update statistics by default, and performance may suffer, but any ideas/hints you can give would be appreciated.
From what you are explaining, it looks like the below (all or some) might be happening.
You are doing physical reads. The quick way you avoid this is by increasing the amount of RAM you throw at the box. You haven't mentioned the hardware specs of your server. Please add details.
If you trace the SQL calls then you can easily figure out why the RECOMPILE happened. Look at the EventSubClass to figure out the reason and work towards resolving that.
ref: http://msdn.microsoft.com/en-us/library/ms187105.aspx
You mentioned table variables. These are notorious for causing performance issues when NOT using at the right place. If you use table variables in a JOIN, parallel plan is out of the question and no stats also. I am NOT sure how and where you are using but try replacing them with temp tables. And starting from SQL Server 2005, you will get only STMT recompilation at best and NOT the complete SP recompile as it happened in 2000.
You mentioned Update Stats ASYNC option and this won't block the query.
What are the TOP WAIT STATS on this server? Have you identified the expensive procedures based on CPU, Logical reads & execution count?
Have you looked the Page Life Expectancy, amount of IO using virtual file stats DMV?
Updating Stats every 15 minutes is NOT a good plan. How often is data inserted into the system? What is the sample rate you are using? What is your index maintenance strategy?
Have you looked at the missing indexes DMV?
There are a bunch of good queries to identify problems in more granular fashion using the below queries.
ref: http://dl.dropbox.com/u/13748067/SQL%20Server%202008%20Diagnostic%20Information%20Queries%20%28April%202011%29.sql
There are so many other things to look at but the above is a good starting point.
OK, here is my IMHO catch on this:
DBCC INDEXDEFRAG is worth trying and is an ONLINE function hence can be used on a live system
You could be reaching the maximum capacity of your architectural design. You can scale up which can always help but more likely you have to change the architecture to achieve better scalability sacrificing simplicity
A common trick is partitioning. You are writing to a table whose index distribution looks nothing like it was a few hours ago - hence degrading performance. This is a massive write, such a table could be divided to daily write and the rest of the data with nightly batches of moving stuff across.
More and more, people are being converting to CQRS. You might be the next. This solves the problem by separating reads from writes (a very simplistic explanation).
I am investigating an issue relating to a large log expansion during an ETL process, even though the database is set in bulk logged mode (and it is not running in psuedo simple but truely bulk logged)
Using the ::fn_dblog(null,null) function to examine the transaction log operations and the context of the operation, the log expansion is pretty much entirely down to the logging of a LOP_FORMAT_PAGE operation, on a LCX_Heap context. (97% of the expansion is that operation, appearing in the log over 600k times for a single data load.)
The question is, what is the lop_format_page doing / recording that SQL has done?
Given that, I should be able to reverse the logic and understand what the cause / effect chain is that results in this and be able to alter the ETL if appropriate.
I'm not expecting many people have come across this one, the level of available detail on the operations and context is minimal to none.
You're correct that this is very thinly (AKA not!) documented. I've done a little poking around inside logs and have done a lot of log-reduction work (mostly by ensuring bulk inserts were actually being done in bulk!). So I know this can be challenging to track down.
My best guess, having seen LOP_FORMAT_PAGE used in context, is that it's clearing out a new page-- for example when splitting an index page once that page is full and another entry needs to be created. So, if this assumption is correct, you may want to track down what may be causing a whole bunch of new pages to get allocated.
Do you know which operations are going on in the ETL while you're seeing the log expansion? It would be helpful to understand this context-- please add that info to your question if possible.
Also, are you able to run and vary your ETL code in a test environment? Instead of figuring out this inscrutable log record definition, it may be easier to isolate the problem by running your ETL while commenting out some steps (or limiting the number of rows affected) and then seeing which change makes the problem go away.
I think you and Justin are onto the answer, but it is not all that complicated.
The ETL process (Extract, transform, load) is loading data into the db. Naturally, as pages fill up, new ones need to be allocated on the heap.
I thought that LOP_FORMAT_PAGE only formatting page too. But it contains either full page data if count of arrays is 1 or part of page with data (header plus records) and offsets to records from the end of page in second array.
I'm not sure I 100% understand what the database does. If I just have some misconception, please point it out.
Let's say I have a function that wants to create 100 new entry in the database with has 100,000 entries.
It seems a lot faster when those 100 entries get create and the commit is made after the last entry is created.
Now, if those 100 entries get created by different users, is there a easy way to commit only after 100 entries are created?
Edit:
Should I maybe write some sort of buffer?
Databases are optimized for set-based operations, so yes it wouldbe faster to insert 100 records in a set than one at a time. However, when you are talking about users entering records one ata atime, you would not want to group them together under any circumstances that I can think of. Why?
First, if there was one bad record, the others would fail. This would make for 99 cranky users out of 100 (actually 100, but one would not really have reason to be cranky becasue he did the bad data entry to begin with).
Second, users would not see the records immediately after being entered. It is also true that they would not be able to do something further with those records until they are entered such as enter data into related tables. Having a delay like this would make users cranky. If users are entering data from customers through a phone call, they will be especially cranky at the wait (I worked at a call center with a horribly slow commercial product and believe me I know how upset the users used to get!)
Third, users will have gone on to something else and would not realize that their data was rejected for bad information, not a good thing at all.
How long are you going to wait to get your set number of records? 5 seconds, ten minutes?
What happens if for some reason the netwrok connection is lost during that time, wouldn;t the users lose the data they entered.
You might be able to hack something like that together, but you really shouldn't, because it wrecks your data integrity, which is the whole point of using transactions.
In your proposed solution, a problem with any insert in the batch would cause all the other (possibly totally valid) inserts from completely different users to fail. Also, users wouldn't be able to see the data they just tried to insert because the system was waiting to do the insert until the batch was full.
P.S. Here's a quick intro to transaction processing.
I think you do have a misconception. It sounds like you're looking at the database as something that is only for some sort of "long-term" memory. This is a bad concept; the database is the only memory your application has. Even when this isn't true, it's best to pretend that it is.
To go a little deeper, your application has:
scoped memory: variables that you define within view functions, for example. These all get destroyed when flow leaves the function.
globals: variables that are defined in the outermost part of your code. It is really important not to use these for any sort of state except perhaps configuration constants. The important thing is that you should rely on any dynamic behavior. Otherwise you will have to battle concurrency and forked processes (depending on server gateway) that aren't aware of each other. Just don't do it.
a caching scheme, if you choose to implement one. This is entirely optional in django, and there are many ways to do it. However, one typically uses some scheme to ensure that even if the cache crashes, the database reflects the current state of the data accurately.
your local filesystem. From a design point of view, most ways of taking advantage of this will either resemble a caching system (above) or be clumsy and fragile. From a performance point of view, it might be about as slow as a database.
your database.
So you see that there's not much place for you to put your data besides the database.