SQL Server 2005 Transaction Log Entry : LOP_Format_Page - sql-server

I am investigating an issue relating to a large log expansion during an ETL process, even though the database is set to bulk-logged mode (and it is not running in pseudo-simple but is truly bulk-logged).
Using the ::fn_dblog(null, null) function to examine the transaction log operations and the context of each operation, the log expansion is almost entirely down to the logging of a LOP_FORMAT_PAGE operation in an LCX_HEAP context (97% of the expansion is that operation, which appears in the log over 600k times for a single data load).
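For reference, a minimal sketch of the kind of aggregation used here, assuming pyodbc and sufficient rights to run fn_dblog (the server and database names are placeholders):

import pyodbc

# Placeholder connection string; adjust server, database and authentication.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=etl_db;Trusted_Connection=yes"
)

# Sum the log record length per operation/context pair to see what is
# actually consuming the log during the load.
query = """
SELECT Operation, Context,
       COUNT(*)                 AS record_count,
       SUM([Log Record Length]) AS total_bytes
FROM ::fn_dblog(NULL, NULL)
GROUP BY Operation, Context
ORDER BY total_bytes DESC;
"""

for row in conn.cursor().execute(query):
    print(row.Operation, row.Context, row.record_count, row.total_bytes)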
The question is: what is LOP_FORMAT_PAGE doing / recording about what SQL Server has done?
Given that, I should be able to reverse the logic, understand the cause / effect chain that results in this, and alter the ETL if appropriate.
I'm not expecting that many people have come across this one; the level of available detail on the operations and contexts is minimal to none.

You're correct that this is very thinly (AKA not!) documented. I've done a little poking around inside logs and have done a lot of log-reduction work (mostly by ensuring bulk inserts were actually being done in bulk!). So I know this can be challenging to track down.
My best guess, having seen LOP_FORMAT_PAGE used in context, is that it's clearing out a new page-- for example when splitting an index page once that page is full and another entry needs to be created. So, if this assumption is correct, you may want to track down what may be causing a whole bunch of new pages to get allocated.
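If that guess is right, one quick way to see which object is getting all the new pages is to group the LOP_FORMAT_PAGE records by allocation unit. A rough sketch along the same lines (pyodbc assumed; connection details are placeholders):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=etl_db;Trusted_Connection=yes"
)

# Which allocation units (i.e. which heaps/indexes) are the newly formatted pages for?
query = """
SELECT AllocUnitName, COUNT(*) AS pages_formatted
FROM ::fn_dblog(NULL, NULL)
WHERE Operation = 'LOP_FORMAT_PAGE'
GROUP BY AllocUnitName
ORDER BY pages_formatted DESC;
"""

for row in conn.cursor().execute(query):
    print(row.AllocUnitName, row.pages_formatted)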
Do you know which operations are going on in the ETL while you're seeing the log expansion? It would be helpful to understand this context-- please add that info to your question if possible.
Also, are you able to run and vary your ETL code in a test environment? Instead of figuring out this inscrutable log record definition, it may be easier to isolate the problem by running your ETL while commenting out some steps (or limiting the number of rows affected) and then seeing which change makes the problem go away.

I think you and Justin are onto the answer, but it is not all that complicated.
The ETL process (Extract, transform, load) is loading data into the db. Naturally, as pages fill up, new ones need to be allocated on the heap.

I thought that LOP_FORMAT_PAGE only formatted a page too. But it actually contains either the full page data (if the count of arrays is 1), or part of the page with data (the header plus records) and, in a second array, the offsets to the records from the end of the page.


Using SQLAlchemy sessions and transactions

While learning SQLAlchemy I came across two ways of dealing with SQLAlchemy's sessions.
One was creating the session once globally while initializing my database like:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
and importing this DBSession instance in all my requests (all my insert/update operations) that follow.
When I do this, my DB operations have the following structure:
import transaction  # the Zope transaction package

with transaction.manager:
    for each_item in huge_file_of_million_rows:
        DBSession.add(each_item)
        # More create, read, update and delete operations
I do not commit, flush, or roll back anywhere, assuming my Zope transaction manager takes care of it for me (it commits at the end of the transaction or rolls back if it fails).
The second way, and the one most frequently mentioned on the web, was:
create a DBSession once like
DBSession = sessionmaker(bind=engine)
and then create a session instance of this per transaction:
session = DBSession()
for row in huge_file_of_million_rows:
    for item in row:
        try:
            session.add(item)
            # More create, read, update and delete operations
            session.flush()
            session.commit()
        except:
            session.rollback()
session.close()
I do not understand which is BETTER (in terms of memory usage, performance, and overall health) and why.
In the first method, I accumulate all the objects in the session and the commit happens at the end. For a bulky insert operation, does adding objects to the session mean they are held in memory (RAM) or elsewhere? Where do they get stored, and how much memory is consumed?
Both ways tend to be very slow when I have about a million inserts and updates. Trying SQLAlchemy Core also takes the same time to execute. 100K rows of selects, inserts and updates take about 25-30 minutes. Is there any way to reduce this?
Please point me in the right direction. Thanks in advance.
Here you have a very generic answer, with the warning that I don't know much about Zope. Just some simple database heuristics. Hope it helps.
How to use SQLAlchemy sessions:
First, take a look at their own explanation here.
As they say:
The calls to instantiate Session would then be placed at the point in the application where database conversations begin.
I'm not sure I understand what you mean with method 1; just in case, a warning: you should not have just one session for the whole application. You instantiate Session when the database conversations begin, but you surely have several points in the application where different conversations begin. (I'm not sure from your text whether you have different users.)
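As a rough illustration of one session per conversation (a sketch only; the engine URL and the load_items function are invented for the example):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder URL
Session = sessionmaker(bind=engine)

def load_items(items):
    # One conversation: its own session, committed or rolled back as a unit.
    session = Session()
    try:
        for item in items:
            session.add(item)
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()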
One commit at the end of a huge number of operations is not a good idea
Indeed it will consume memory, probably in the Session object of your python program, and surely in the database transaction. How much space? That's difficult to say with the information you provide; it will depend on the queries, on the database...
You could easily estimate it with a profiler. Take into account that if you run out of resources everything will go slower (or halt).
One commit per record is also not a good idea when processing a bulk file
It means you are asking the database to persist changes every time, for every row. Certainly too much. Try an intermediate number: commit every few hundred rows. But then it gets more complicated; one commit at the end of the file assures you that the file is either fully processed or not processed at all, while intermediate commits force you to take into account, when something fails, that your file is halfway through and you will need to reposition.
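A sketch of that intermediate approach, assuming DBSession is the sessionmaker from the question and huge_file_of_million_rows yields mapped objects (the batch size of 500 is arbitrary):

BATCH_SIZE = 500  # arbitrary; tune for your data and your database

session = DBSession()
try:
    for i, item in enumerate(huge_file_of_million_rows, start=1):
        session.add(item)
        if i % BATCH_SIZE == 0:
            session.commit()   # persist every BATCH_SIZE rows
    session.commit()           # commit the final partial batch
except Exception:
    session.rollback()
    raise                      # or record the position so you can reposition and resume
finally:
    session.close()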
As for the times you mention, it is very difficult to judge with only the information you provide, plus whatever your database and machine are. Anyway, the order of magnitude of your numbers, a select+insert+update every 15 ms, probably plus a commit, sounds pretty high but is more or less in the expected range (again, it depends on the queries, the database and the machine)... If you frequently have to insert so many records you could consider other database solutions; that will depend on your scenario, probably on dialects, and may not be provided by an ORM like SQLAlchemy.

Improve speed of BCP - command line options?

I'm importing a huge amount of data using BCP, and I was curious about what options it has to improve the speed of this operation.
I read about parallel loading in a couple of places, and also saw that you can tell it to not bother to sort the data or check constraints (which are all viable options for me since the source is another database with good integrity).
I haven't seen examples of these options being used though (as in, I don't know which command line switch enables parallel loading or disables constraint checking).
Does anyone know a good resource for learning this, or can someone give me a couple trivial examples? And please don't point me to the BCP parameters help page, I couldn't make heads or tails of it with regard to these specific options.
Any help is greatly appreciated!
You need to read The Data Loading Performance Guide. There is no magical 'load faster' command line switch; it is a very complicated balance of doing the right thing in the right context. It depends on whether you load a heap or a B-Tree, whether there is already data or the table is empty, whether it has secondary indexes, whether minimal logging is possible under the database recovery model, whether the table is partitioned or not, whether the data is pre-sorted or not, and that is just the surface. The linked white paper has all the details.
It looks like the parallel loading you are talking about is just running multiple instances of the BCP utility against the same table. You would be responsible for partitioning the data beforehand. You use it by specifying the TABLOCK table hint. From MSDN:
Bulk update (BU) locks allow processes to bulk copy data concurrently into the same table while preventing other processes that are not bulk copying data from accessing the table.
So it's really just a special lock for BCP.
To further increase performance, you may read further into the -a flag on the BCP parameters page.
-a allows you to specify a larger packet size (between 4096 and 65535 bytes) to increase the amount of data sent to the server per network packet.
I would also suggest using the -e flag with an error file if you intend to run multiple BCP processes, to help keep track of any errors encountered.
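To make those switches concrete, here is a hedged sketch of one of several parallel BCP invocations, driven from Python; the server, database, table and file names are placeholders:

import subprocess

# One of N parallel loads; each process gets its own pre-partitioned data file.
cmd = [
    "bcp", "MyDatabase.dbo.TargetTable", "in", r"C:\load\part1.dat",
    "-S", "myserver",            # target server (placeholder)
    "-T",                        # trusted (Windows) authentication
    "-n",                        # native data format
    "-h", "TABLOCK",             # request the bulk-update lock so loads can run concurrently
    "-a", "65535",               # larger network packet size
    "-b", "100000",              # commit in batches of 100k rows
    "-e", r"C:\load\part1.err",  # write rejected rows / errors here
]
subprocess.run(cmd, check=True)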

Postgresql inserts stop at random number of records

I am developing a test application that requires me to insert 1 million records into a PostgreSQL database, but at random points the insert stops, and if I try to restart the insertion process the application refuses to populate the table with any more records. I've read that databases have a size cap of around 4 GB, but I'm sure my database didn't come anywhere close to that value.
So, what other reasons could be for why insertion stopped?
It happened a few times, once capping at 170872 records, another time at 25730 records.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
Thanks in advance!
JUST A QUICK UPDATE:
Indeed the problem isn't a database cap; here are the official limits for PostgreSQL:
- Maximum Database Size Unlimited
- Maximum Table Size 32 TB
- Maximum Row Size 1.6 TB
- Maximum Field Size 1 GB
- Maximum Rows per Table Unlimited
- Maximum Columns per Table 250 - 1600 depending on column types
- Maximum Indexes per Table Unlimited
Update:
Error in log file:
2012-03-26 12:30:12 EEST WARNING: there is no transaction in progress
So I'm looking for an answer that fits this issue. If you can give any hints I would be very grateful.
I've read that databases have a size cap, which is around 4 Gb
I rather doubt that. It's certainly not true about PostgreSQL.
[...]at random points the insert stops and if I try to restart the insertion process, the application refuses to populate the table with more records
Again, I'm afraid I doubt this. Unless your application has become self-aware, it isn't really "refusing" to do anything. It might be crashing, locking, or waiting for something to happen, though.
I know the question might sound silly but I can't find any other reasons for why it stops inserting.
I don't think you've looked hard enough. Obvious things to check:
Are you getting any errors in the PostgreSQL logs?
If not, are you sure you're logging errors? Issue a bad query to check.
Are you getting any errors in the application?
If not, are you sure you're checking? Again, check.
What is/are the computer(s) up to? How much CPU/RAM/Disk IO is in use? Any unusual activity?
Any unusual locks being taken? (Check the pg_locks view.)
If you asked the question having checked the above then there's someone who'll be able to help. Probably though, you'll figure it out yourself once you've got the facts in front of you.
OK - if you're getting "no transaction in progress" that means you're issuing a commit/rollback but outside of an explicit transaction. If you don't issue a "BEGIN" then each statement gets its own transaction.
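By way of illustration, a minimal psycopg2 sketch (the DSN and table are made up) that keeps the inserts inside one explicit transaction, so a COMMIT is only ever issued while a transaction is actually open:

import psycopg2

conn = psycopg2.connect("dbname=testdb user=tester")  # placeholder DSN

with conn:                          # one transaction: COMMIT on success, ROLLBACK on exception
    with conn.cursor() as cur:
        for i in range(1000000):
            cur.execute(
                "INSERT INTO test_records (id, payload) VALUES (%s, %s)",  # hypothetical table
                (i, "row %d" % i),
            )
# A bare COMMIT sent when no transaction is open is what produces
# "WARNING: there is no transaction in progress".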
This is unlikely to be the cause of the problem.
Something is causing the inserts to stop, and you've still not told us what. You said earlier you weren't getting any errors inside the application. That shouldn't be possible: if PostgreSQL is returning an error, you should be picking it up in the application.
It's difficult to be more helpful without more accurate information. Every statement you send to PostgreSQL will return a status code. If you get an error inside a multi-statement transaction then all the statements in that transaction will be rolled back. You've either got some confused transaction control in the application or it is falling down for some other reason.
One possibility is that the OP is using SSL and the ssl_renegotiation_limit is being reached. In any case: set log_connections / log_disconnections to "on" and check the log file.
I found out what the problem with my insert command was, and although it might seem funny, it's one of those things you would never think could go wrong.
My application is developed in Django and has a management command that simply runs the file that does the insert operations into the tables.
i.e. in the command line terminal I just write:
time python manage.py populate_sql
The reason I use the time command is that I want to see how long the insertion takes to execute. Well, the problem was here. The time command issued an error, an out-of-memory error, which stopped the insertion into the database. I found this while running the command with the --verbose option, which lets you see all the details of the command.
I would like to thank you all for your answers, for the things that I have learned from them and for the time you used trying to help me.
EDIT:
If you have a Django application in which you run a lot of database operations, then my advice is to set the DEBUG variable in settings.py to False, because DEBUG mode eats up a lot of your memory over time.
So,
DEBUG = False
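If for some reason you must keep DEBUG = True during a big import, a possible workaround (a sketch, not part of the original fix) is to clear Django's stored query log periodically, since that log is what grows with every query:

from django import db

for i, record in enumerate(records_to_insert):  # records_to_insert: your own iterable of model instances
    record.save()
    if i % 10000 == 0:
        db.reset_queries()  # drop the per-connection query log that DEBUG=True accumulates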
And in the end, thank you again for the support Richard Huxton!

Terrible SQL reads performance (culprit update stats?)

I'm running on SQL Server 2008 R2 and am trying to fine-tune performance. I did everything I could from:
Code review of SQL code
Create or remove indexes as I think appropriate
Auto create stats ON
Auto update stats ON
Auto update stats async ON
I have a 24/7 system that constantly stores data. Sometimes we do reads and that's where the issue is. Sometimes the reads take a couple of seconds or less (which would be expected and acceptable to us). Other times, the reads take several seconds that could amount to a minute before the stored procedure completes and we render data on the UI.
If we do the read again, it would be faster. The SQL profiler would trace the particular stored procedure or query that took several seconds. We would zoom into that stored procedure, and do everything we can do to optimize it if we can.
I also traced the auto stats event and the recompile event. It's hard to tell if a stat is being updated causing the read to take a long time, or if a recompile caused it. Sometimes, I see that the profiler traced a recompile of the read query that took several unacceptable minutes, other times it doesn't trace a recompile.
I tried to prevent the query optimizer from blocking the read until it recompiles or updates stats by using OPTION (USE PLAN) with the plan XML, etc. But I ran into compile errors complaining that the query plan XML isn't valid; that could be true, because the query is quite involved: a select plus joins that involve a local table variable. I sort of hacked the XML and maybe that's why it was deemed invalid. So I gave up on using the plan hint.
We tried periodically (every 15 minutes) running UPDATE STATISTICS manually in order to keep the stats as up to date as we can, but that hurt performance. UPDATE STATISTICS blocks writes, and I'm sure even reads; it seemed to maintain a bunch of statistics, and on average it was taking around 80-90 seconds. A read that waits that long is unacceptable.
So the idea is to let the reads happen and prevent a situation where a recompile or stats update blocks them, correct? Does it make sense to disable auto statistics altogether? Or perhaps disable auto create statistics after deleting all the auto-created stats?
This goes against Microsoft recommendations perhaps, since they enable auto create statistics and auto update statistics by default, and performance may suffer, but any ideas/hints you can give would be appreciated.
From what you are explaining, it looks like the below (all or some) might be happening.
You are doing physical reads. The quick way to avoid this is by increasing the amount of RAM you throw at the box. You haven't mentioned the hardware specs of your server; please add those details.
If you trace the SQL calls then you can easily figure out why the RECOMPILE happened. Look at the EventSubClass to figure out the reason and work towards resolving that.
ref: http://msdn.microsoft.com/en-us/library/ms187105.aspx
You mentioned table variables. These are notorious for causing performance issues when NOT used in the right place. If you use table variables in a JOIN, a parallel plan is out of the question and there are no statistics either. I am NOT sure how and where you are using them, but try replacing them with temp tables. And starting with SQL Server 2005, you will get only statement-level recompilation at most, NOT the complete stored procedure recompile that happened in 2000.
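As a purely illustrative sketch of that swap (object names are invented, and pyodbc is used here only to keep the examples in one language), the intermediate rows are staged in a temp table instead of a table variable so the optimizer gets statistics for the join:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cur = conn.cursor()

# Stage rows in a temp table (which gets statistics) rather than a table variable (which does not).
cur.execute("""
    SET NOCOUNT ON;
    CREATE TABLE #staging (id INT PRIMARY KEY, val VARCHAR(50));
    INSERT INTO #staging (id, val)
    SELECT id, val FROM dbo.SourceTable;      -- hypothetical source table
""")

# The join against the temp table can now use statistics and a parallel plan.
cur.execute("""
    SELECT t.id, s.val
    FROM dbo.TargetTable AS t                 -- hypothetical target table
    JOIN #staging AS s ON s.id = t.id;
""")
for row in cur.fetchall():
    print(row.id, row.val)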
You mentioned the Update Stats ASYNC option; this won't block the query.
What are the TOP WAIT STATS on this server? Have you identified the expensive procedures based on CPU, Logical reads & execution count?
Have you looked at the Page Life Expectancy and the amount of IO, using the virtual file stats DMV? (A quick starting sketch follows these questions.)
Updating Stats every 15 minutes is NOT a good plan. How often is data inserted into the system? What is the sample rate you are using? What is your index maintenance strategy?
Have you looked at the missing indexes DMV?
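For the wait-stats and file-stats questions above, a minimal starting sketch (pyodbc assumed; connection details are placeholders):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cur = conn.cursor()

# Top waits accumulated since the last service restart (or stats clear).
cur.execute("""
    SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;
""")
for row in cur.fetchall():
    print(row.wait_type, row.wait_time_ms, row.waiting_tasks_count)

# Per-file IO stalls for the current database.
cur.execute("""
    SELECT DB_NAME(database_id) AS database_name, file_id, io_stall_read_ms, io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL);
""")
for row in cur.fetchall():
    print(row.database_name, row.file_id, row.io_stall_read_ms, row.io_stall_write_ms)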
There are a bunch of good queries at the link below to identify problems in a more granular fashion.
ref: http://dl.dropbox.com/u/13748067/SQL%20Server%202008%20Diagnostic%20Information%20Queries%20%28April%202011%29.sql
There are so many other things to look at but the above is a good starting point.
OK, here is my take on this, IMHO:
DBCC INDEXDEFRAG is worth trying; it is an ONLINE operation and hence can be used on a live system.
You could be reaching the maximum capacity of your architectural design. You can scale up, which can always help, but more likely you will have to change the architecture to achieve better scalability, sacrificing simplicity.
A common trick is partitioning. You are writing to a table whose index distribution looks nothing like it did a few hours ago, hence degrading performance. With a massive write like this, such a table could be split into one table for the daily writes and another for the rest of the data, with nightly batches moving stuff across.
More and more, people are converting to CQRS. You might be next. This solves the problem by separating reads from writes (a very simplistic explanation).

Django data creation and commits

I'm not sure I 100% understand what the database does. If I just have some misconception, please point it out.
Let's say I have a function that wants to create 100 new entries in a database table that already has 100,000 entries.
It seems a lot faster when those 100 entries get created and the commit is made only after the last entry is created.
Now, if those 100 entries get created by different users, is there an easy way to commit only after 100 entries have been created?
Edit:
Should I maybe write some sort of buffer?
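For the single-function case (one function that creates 100 entries), here is a sketch of the two usual ways to avoid 100 separate commits, assuming a reasonably recent Django and a hypothetical model named Entry. Note that the answers below explain why you would not want to batch up entries coming from different users:

from django.db import transaction
from myapp.models import Entry   # hypothetical app and model

def create_entries(rows):
    # Option 1: a single INSERT for many rows.
    Entry.objects.bulk_create([Entry(**row) for row in rows])

def create_entries_one_by_one(rows):
    # Option 2: individual saves, but one commit at the end.
    with transaction.atomic():
        for row in rows:
            Entry(**row).save()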
Databases are optimized for set-based operations, so yes, it would be faster to insert 100 records as a set than one at a time. However, when you are talking about users entering records one at a time, you would not want to group them together under any circumstances that I can think of. Why?
First, if there was one bad record, the others would fail. This would make for 99 cranky users out of 100 (actually 100, but one would not really have reason to be cranky because he did the bad data entry to begin with).
Second, users would not see the records immediately after being entered. It is also true that they would not be able to do something further with those records until they are entered such as enter data into related tables. Having a delay like this would make users cranky. If users are entering data from customers through a phone call, they will be especially cranky at the wait (I worked at a call center with a horribly slow commercial product and believe me I know how upset the users used to get!)
Third, users will have gone on to something else and would not realize that their data was rejected for bad information, not a good thing at all.
How long are you going to wait to get your set number of records? 5 seconds, ten minutes?
What happens if for some reason the network connection is lost during that time? Wouldn't the users lose the data they entered?
You might be able to hack something like that together, but you really shouldn't, because it wrecks your data integrity, which is the whole point of using transactions.
In your proposed solution, a problem with any insert in the batch would cause all the other (possibly totally valid) inserts from completely different users to fail. Also, users wouldn't be able to see the data they just tried to insert because the system was waiting to do the insert until the batch was full.
P.S. Here's a quick intro to transaction processing.
I think you do have a misconception. It sounds like you're looking at the database as something that is only for some sort of "long-term" memory. This is a bad concept; the database is the only memory your application has. Even when this isn't true, it's best to pretend that it is.
To go a little deeper, your application has:
scoped memory: variables that you define within view functions, for example. These all get destroyed when flow leaves the function.
globals: variables that are defined in the outermost part of your code. It is really important not to use these for any sort of state except perhaps configuration constants. The important thing is that you should not rely on them for any dynamic behavior. Otherwise you will have to battle concurrency and forked processes (depending on the server gateway) that aren't aware of each other. Just don't do it.
a caching scheme, if you choose to implement one. This is entirely optional in django, and there are many ways to do it. However, one typically uses some scheme to ensure that even if the cache crashes, the database reflects the current state of the data accurately.
your local filesystem. From a design point of view, most ways of taking advantage of this will either resemble a caching system (above) or be clumsy and fragile. From a performance point of view, it might be about as slow as a database.
your database.
So you see that there's not much place for you to put your data besides the database.
