How to reduce SQLite memory consumption? - c

I'm looking for ways to reduce memory consumption by SQLite3 in my application.
At each execution it creates a table with the following schema:
(main TEXT NOT NULL PRIMARY KEY UNIQUE, count INTEGER DEFAULT 0)
After that, the database is filled with 50k operations per second. Write only.
When an item already exists, it updates "count" using an update query (I think this is called UPSERT). These are my queries:
INSERT OR IGNORE INTO table (main) VALUES (#SEQ);
UPDATE tables SET count=count+1 WHERE main = #SEQ;
This way, with 5 million operations per transaction, I can write really fast to the DB.
I don't really care about disk space for this problem, but I have a very limited RAM space. Thus, I can't waste too much memory.
sqlite3_user_memory() informs that its memory consumption grows to almost 3GB during the execution. If I limit it to 2GB through sqlite3_soft_heap_limit64(), database operations' performance drops to almost zero when reaching 2GB.
I had to raise cache size to 1M (page size is default) to reach a desirable performance.
What can I do to reduce memory consumption?

It seems that the high memory consumption may be caused by the fact that too many operations are concentrated in one big transaction. Trying to commit smaller transaction like per 1M operations may help. 5M operations per transaction consumes too much memory.
However, we'd balance the operation speed and memory usage.
If smaller transaction is not an option, PRAGMA shrink_memory may be a choice.
Use sqlite3_status() with SQLITE_STATUS_MEMORY_USED to trace the dynamic memory allocation and locate the bottleneck.

I would:
prepare the statements (if you're not doing it already)
lower the amount of INSERTs per transaction (10 sec = 500,000 sounds appropriate)
use PRAGMA locking_mode = EXCLUSIVE; if you can
Also, (I'm not sure if you know) the PRAGMA cache_size is in pages, not in MBs. Make sure you define your target memory in as PRAGMA cache_size * PRAGMA page_size or in SQLite >= 3.7.10 you can also do PRAGMA cache_size = -kibibytes;. Setting it to 1 M(illion) would result in 1 or 2 GB.
I'm curious how cache_size helps in INSERTs though...
You can also try and benchmark if the PRAGMA temp_store = FILE; makes a difference.
And of course, whenever your database is not being written to:
PRAGMA shrink_memory;
VACUUM;
Depending on what you're doing with the database, these might also help:
PRAGMA auto_vacuum = 1|2;
PRAGMA secure_delete = ON;
I ran some tests with the following pragmas:
busy_timeout=0;
cache_size=8192;
encoding="UTF-8";
foreign_keys=ON;
journal_mode=WAL;
legacy_file_format=OFF;
synchronous=NORMAL;
temp_store=MEMORY;
Test #1:
INSERT OR IGNORE INTO test (time) VALUES (?);
UPDATE test SET count = count + 1 WHERE time = ?;
Peaked ~109k updates per second.
Test #2:
REPLACE INTO test (time, count) VALUES
(?, coalesce((SELECT count FROM test WHERE time = ? LIMIT 1) + 1, 1));
Peaked at ~120k updates per second.
I also tried PRAGMA temp_store = FILE; and the updates dropped by ~1-2k per second.
For 7M updates in a transaction, the journal_mode=WAL is slower than all the others.
I populated a database with 35,839,987 records and now my setup is taking nearly 4 seconds per each batch of 65521 updates - however, it doesn't even reach 16 MB of memory consumption.
Ok, here's another one:
Indexes on INTEGER PRIMARY KEY columns (don't do it)
When you create a column with INTEGER PRIMARY KEY, SQLite uses this
column as the key for (index to) the table structure. This is a hidden
index (as it isn't displayed in SQLite_Master table) on this column.
Adding another index on the column is not needed and will never be
used. In addition it will slow INSERT, DELETE and UPDATE operations
down.
You seem to be defining your PK as NOT NULL + UNIQUE. PK is UNIQUE implicitly.

Assuming that all the operations in one transaction are distributed all over the table so that all pages of the table need to be accessed, the size of the working set is:
about 1 GB for the table's data, plus
about 1 GB for the index on the main column, plus
about 1 GB for the original data of all the table's pages changed in the transaction (probably all of them).
You could try to reduce the amount of data that gets changed for each operation by moving the count column into a separate table:
CREATE TABLE main_lookup(main TEXT NOT NULL UNIQUE, rowid INTEGER PRIMARY KEY);
CREATE TABLE counters(rowid INTEGER PRIMARY KEY, count INTEGER DEFAULT 0);
Then, for each operation:
SELECT rowid FROM main_lookup WHERE main = #SEQ;
if not exists:
INSERT INTO main_lookup(main) VALUES(#SEQ);
--read the inserted rowid
INSERT INTO counters VALUES(#rowid, 0);
UPDATE counters SET count=count+1 WHERE rowid = #rowid;
In C, the inserted rowid is read with sqlite3_last_insert_rowid.
Doing a separate SELECT and INSERT is not any slower than INSERT OR IGNORE; SQLite does the same work in either case.
This optimization is useful only if most operations update a counter that already exists.

In the spirit of brainstorming I will venture an answer. I have not done any testing like this fellow:
Improve INSERT-per-second performance of SQLite?
My hypothesis is that the index on the text primary key might be more RAM-intensive than a couple of indexes on two integer columns (what you'd need to simulate a hashed-table).
EDIT: Actually, you dont' even need a primary key for this:
create table foo( slot integer, myval text, occurrences int);
create index ix_foo on foo(slot); // not a unique index
An integer primary key (or a non-unique index on slot) would leave you with no quick way to determine if your text value were already on file. So to address that requirement, you might try implementing something I suggested to another poster, simulating a hashed-key:
SQLite Optimization for Millions of Entries?
A hash-key-function would allow you to determine where the text-value would be stored if it did exist.
http://www.cs.princeton.edu/courses/archive/fall08/cos521/hash.pdf
http://www.fearme.com/misc/alg/node28.html
http://cs.mwsu.edu/~griffin/courses/2133/downloads/Spring11/p677-pearson.pdf

Related

weird phenomena during deadlocks involving IMAGE or TEXT columns

This is something very disturbing I stumbled upon while stress-testing an application using Sybase ASE 15.7.
We have the following table:
CREATE TABLE foo
(
i INT NOT NULL,
blob IMAGE
);
ALTER TABLE foo ADD PRIMARY KEY (i);
The table has, even before starting the test, a single row with some data in the IMAGE column. No rows are either deleted or inserted during the test. So the table always contains a single row. Column blob is only updated (in transaction T1 below) to some value (not NULL).
Then, we have the following two transactions:
T1: UPDATE foo SET blob=<some not null value> WHERE i=1
T2: SELECT * FROM foo WHERE i=1
For some reason, the above transactions may deadlock under load (approx. 10 threads doing T1 20 times in a loop and another 10 threads doing T2 20 times in loop).
This is already weird enough, but there's more to come. T1 is always chosen as the deadlock victim. So, the application logic, on the event of a deadlock (error code 1205) simply retries T1. This should work and should normally be the end of the story. However …
… it happens that sometimes T2 will retrieve a row in which the value of the blob column is NULL! This is even though the table already starts with a row and the updates simply reset the previous (non-NULL) value to some other (non-NULL) value. This is 100% reproducible in every test run.
This is observed with the READ COMMITTED serialization level.
I verified that the above behavior also occurs with the TEXT column type but not with VARCHAR.
I've also verified that obtaining an exlusive lock on table foo in transaction T1 makes the issue go away.
So I'd like to understand how can something that so fundamentally breaks transaction isolation be even possible? In fact, I think this is worse than transaction isolation as T1 never sets the value of the blob column to NULL.
The test code is written in Java using the jconn4.jar driver (class com.sybase.jdbc4.jdbc.SybDriver) so I don't rule out that this may be a JDBC driver bug.
update
This is reproducible simply using isql and spawning several shells in parallel that continuously execute T1 in a loop. So I am removing the Java and JDBC tags as this is definitely server-related.
Your example create table code by default would create an allpages locked table unless your DBA has changed the system-wide 'lock scheme' parameter via sp_configure to another value(you can check this yourself as anyone via sp_configure 'lock scheme'.
Unless you have a very large number of rows they are all going to be sat on a single data page because an int is only 4 bytes long and the blob data is stored at the end of the table (unless you use the in-row LOB functionality in ASE15.7 and up). This is why you are getting deadlocks. You have by definition created a single hotspot where all the data is being accessed at the page level. This is even more likely where larger page sizes > 2k are used, since by their nature they will have even more rows per page and with allpages locking, even more likelihood of contention.
Change your locking scheme to datarows (unless you are planning to have very high rowcounts) as has been said above and your problem should go away. I will add that your blob column looks to allow nulls from your code, so you should also consider setting the 'dealloc_first_txtpg' attribute for your table to avoid wasted space if you have nulls in your image column.
We've seen all kinds of weird stuff with isolation level 1. I'm under the impression that when T2 is in progress, T1 can change data and T2 might return intermediate result of T1.
Try isolation level 2 and see if it helps (does for us).

Improve performance of insert?

I ran the Performance – Top Queries by Total IO (I am trying to improve this process).
The top #1 is this code:
DECLARE #LeadsVS3 AS TT_LEADSMERGE
DECLARE #LastUpdateDate DATETIME
SELECT #LastUpdateDate = MAX(updatedate)
FROM [BUDatamartsource].[dbo].[salesforce_lead]
INSERT INTO #LeadsVS3
SELECT
Lead_id,
(more columns…)
OrderID__c,
City__c
FROM
[ReplicatedVS3].[dbo].[Lead]
WHERE
UpdateDate > #LastUpdateDate
(the code is a piece of a larger SP)
This is in a job that runs every 15 minutes... Other than running the job less frequently is there any other improvement I could make?
Make a try with a local hash table like #LeadsVS3, it is faster than udtt in most cases
Also there is another trick you may do.
On those cases where you always get all 'recent' rows, you may get locked for 1 row, the latest, waiting to commit. You may sacrifice a small part e.g. 1 minute that is to ignore last minute records ( current datetime - 1 minute ). You get those to the next run and save yourself any transaction (or replication) lock waits.
The execution plan that you posted appears to be the estimated execution plan. (the actual execution plan includes the actual number of rows). Without the actual plan it's impossible to tell what's really going on.
The obvious improvement would be to add a covering nonclustered index on Lead.leadid that includes the other columns in your SELECT statement. Right now your scanning a the widest possible index (your clustered index) to retrieve a presumably small percentage of records. Turning that clustered scan into a non-clustered seek will be huge.
On that same note you could make that index a filtered index that's only includes records for dates greater than your last UpdateDate. Then setup a regular SQL Job that periodically rebuilds it to filter on a more current date.
Other things you can do to increase insert performance:
Drop any constraints and/or indexes before the insert then rebuild
them after.
Use smaller data types

Update query is taking 2 seconds to update the table

I have a table in SQL DB Server. (Table Name : Material, 6 columns). It contains 2.6 million records. I need to update this table based on two column values. For update, system is taking 2 seconds.
Please help me how optimize below query.
UPDATE Material
SET Value = #Value,
Format = #Format,
SValue = #SValue,
CGroup = #CGroup
WHERE
SM = #SM
AND Characteristic = #Characteristic
You really need to provide the query plan before we can tell you with any certainty what, if anything, might help.
Having said that, the first thing I would check is whether the plan is showing a great deal of time doing a table scan. If so, you could improve performance substantially if it is a large table by adding an index on SM and Characteristic - that will allow the profiler to use the index to perform an index seek instead of a table scan, and could improve performance dramatically.
As you got big data few tweaks can increase query performance
(1) If column to be updated is indexed, remove index
(2) Executing the update in smaller batches
DECLARE #i INT=1
WHILE( #i <= 10 )
BEGIN
UPDATE TOP(20000) Material
SET Value = #Value,
Format = #Format,
SValue = #SValue,
CGroup = #CGroup
WHERE
SM = #SM
AND Characteristic = #Characteristic
SET #i=#i + 1
END
(3) Disabling Delete triggers (if any)
Hope this helps !
Try to put composite index for SM & Characteristic .By doing this, the sql server will be able to select records more easily. Operational wise, Update is a combination of insert & delete.If your table is having more columns, it may slow down your update even if you are not try to update all the columns.
Steps i prefer
Try to put composite index with SM & Characteristic
Try to re create a table with required columns & use joins where ever needed.
2.6 mil rows is not that much. 2 secs for an update is probably too much.
Having said that, the update times could depend on two things.
First, how many rows are being updated with a single update command, ie is it just one row or some larger set? You can't really do much about that, just saying it should be taken into consideration.
The other thing are indexes - you could either have too many of then or not enough.
If the table is missing an index on (SM, Characteristic) -- or (Characteristic, SM), depending on the selectivity -- then it's probably a full table scan every time. If the update touches only a couple of rows, it's waste of time. So, it's the first thing to check.
If there are too many indexes on the affected columns, this could slow down updates as well, because those indexes have to be maintained with every change of data. You can check the usefulness of indexes by querying the sys.dm_db_index_usage_stats DMV (plenty of explanation on the internet, so I won't get into it here) and remove the unused ones. Just be carefull with this and test thoroughly.
One other thing to check is whether the affected columns are part of some foreign key constraint. In that case, the engine must check the validity of the constraint every time (iow, check if the new value exists in the referenced table, or check if there's data in referencing tables, depending on which side of the FK the column is). If there are no supporting indexes for this check, it would cause (again) a scan on the other tables involved.
But to really make sure, check the exec plan and IO stats (SET STATISTICS IO ON), it will tell you exactly what is going on.

Efficient DELETE TOP?

Is it more efficient and ultimately FASTER to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approx 3 million rows from many tables. I first did the deletes in blocks of 100K rows but the performance wasn't looking good. I changed to 10000 and seem to be removing faster. Wondering if even smaller like 1K per DELETE statement is even better.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server though. I mean, last time I did that i was using this approeach to delete things in 64 million increments (on a table that had at that point around 14 billion rows, 80% Of which got ultimately deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going moreg granular is more work but it means less waiting for tx logs for other things operating on the table. You have to try out and find where you are comfortable - there is no ultimate answer because it is totally dependend on usage of the table and hardware.
We used Table Partitioning to remove 5 million rows in less than a sec but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value. The new data cannot be intertwined within the currently partitioned data. If removing data, you have to remove an entire partition or partitions. You will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.
Yes, no, it depends on the usage of table due to locking. I would try to delete the records in a slower pace. So the opposite of the op's question.
set rowcount 10000
while ##rowcount > 0
begin
waitfor delay '0:0:1'
delete
from table
where date < convert(datetime, '20120101', 112)
end
set rowcount 0

When should I use a table variable vs temporary table in sql server?

I'm learning more details in table variable. It says that temp tables are always on disk, and table variables are in memory, that is to say, the performance of table variable is better than temp table because table variable uses less IO operations than temp table.
But sometimes, if there are too many records in a table variable that can not be contained in memory, the table variable will be put on disk like the temp table.
But I don't know what the "too many records" is. 100,000 records? or 1000,000 records? How can I know if a table variable I'm using is in memory or is on disk? Is there any function or tool in SQL Server 2005 to measure the scale of the table variable or letting me know when the table variable is put on disk from memory?
Your question shows you have succumbed to some of the common misconceptions surrounding table variables and temporary tables.
I have written quite an extensive answer on the DBA site looking at the differences between the two object types. This also addresses your question about disk vs memory (I didn't see any significant difference in behaviour between the two).
Regarding the question in the title though as to when to use a table variable vs a local temporary table you don't always have a choice. In functions, for example, it is only possible to use a table variable and if you need to write to the table in a child scope then only a #temp table will do
(table-valued parameters allow readonly access).
Where you do have a choice some suggestions are below (though the most reliable method is to simply test both with your specific workload).
If you need an index that cannot be created on a table variable then you will of course need a #temporary table. The details of this are version dependant however. For SQL Server 2012 and below the only indexes that could be created on table variables were those implicitly created through a UNIQUE or PRIMARY KEY constraint. SQL Server 2014 introduced inline index syntax for a subset of the options available in CREATE INDEX. This has been extended since to allow filtered index conditions. Indexes with INCLUDE-d columns or columnstore indexes are still not possible to create on table variables however.
If you will be repeatedly adding and deleting large numbers of rows from the table then use a #temporary table. That supports TRUNCATE (which is more efficient than DELETE for large tables) and additionally subsequent inserts following a TRUNCATE can have better performance than those following a DELETE as illustrated here.
If you will be deleting or updating a large number of rows then the temp table may well perform much better than a table variable - if it is able to use rowset sharing (see "Effects of rowset sharing" below for an example).
If the optimal plan using the table will vary dependent on data then use a #temporary table. That supports creation of statistics which allows the plan to be dynamically recompiled according to the data (though for cached temporary tables in stored procedures the recompilation behaviour needs to be understood separately).
If the optimal plan for the query using the table is unlikely to ever change then you may consider a table variable to skip the overhead of statistics creation and recompiles (would possibly require hints to fix the plan you want).
If the source for the data inserted to the table is from a potentially expensive SELECT statement then consider that using a table variable will block the possibility of this using a parallel plan.
If you need the data in the table to survive a rollback of an outer user transaction then use a table variable. A possible use case for this might be logging the progress of different steps in a long SQL batch.
When using a #temp table within a user transaction locks can be held longer than for table variables (potentially until the end of transaction vs end of statement dependent on the type of lock and isolation level) and also it can prevent truncation of the tempdb transaction log until the user transaction ends. So this might favour the use of table variables.
Within stored routines, both table variables and temporary tables can be cached. The metadata maintenance for cached table variables is less than that for #temporary tables. Bob Ward points out in his tempdb presentation that this can cause additional contention on system tables under conditions of high concurrency. Additionally, when dealing with small quantities of data this can make a measurable difference to performance.
Effects of rowset sharing
DECLARE #T TABLE(id INT PRIMARY KEY, Flag BIT);
CREATE TABLE #T (id INT PRIMARY KEY, Flag BIT);
INSERT INTO #T
output inserted.* into #T
SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY ##SPID), 0
FROM master..spt_values v1, master..spt_values v2
SET STATISTICS TIME ON
/*CPU time = 7016 ms, elapsed time = 7860 ms.*/
UPDATE #T SET Flag=1;
/*CPU time = 6234 ms, elapsed time = 7236 ms.*/
DELETE FROM #T
/* CPU time = 828 ms, elapsed time = 1120 ms.*/
UPDATE #T SET Flag=1;
/*CPU time = 672 ms, elapsed time = 980 ms.*/
DELETE FROM #T
DROP TABLE #T
Use a table variable if for a very small quantity of data (thousands of bytes)
Use a temporary table for a lot of data
Another way to think about it: if you think you might benefit from an index, automated statistics, or any SQL optimizer goodness, then your data set is probably too large for a table variable.
In my example, I just wanted to put about 20 rows into a format and modify them as a group, before using them to UPDATE / INSERT a permanent table. So a table variable is perfect.
But I am also running SQL to back-fill thousands of rows at a time, and I can definitely say that the temporary tables perform much better than table variables.
This is not unlike how CTE's are a concern for a similar size reason - if the data in the CTE is very small, I find a CTE performs as good as or better than what the optimizer comes up with, but if it is quite large then it hurts you bad.
My understanding is mostly based on http://www.developerfusion.com/article/84397/table-variables-v-temporary-tables-in-sql-server/, which has a lot more detail.
Microsoft says here
Table variables does not have distribution statistics, they will not trigger recompiles. Therefore, in many cases, the optimizer will build a query plan on the assumption that the table variable has no rows. For this reason, you should be cautious about using a table variable if you expect a larger number of rows (greater than 100). Temp tables may be a better solution in this case.
I totally agree with Abacus (sorry - don't have enough points to comment).
Also, keep in mind it doesn't necessarily come down to how many records you have, but the size of your records.
For instance, have you considered the performance difference between 1,000 records with 50 columns each vs 100,000 records with only 5 columns each?
Lastly, maybe you're querying/storing more data than you need? Here's a good read on SQL optimization strategies. Limit the amount of data you're pulling, especially if you're not using it all (some SQL programmers do get lazy and just select everything even though they only use a tiny subset). Don't forget the SQL query analyzer may also become your best friend.
Variable table is available only to the current session, for example, if you need to EXEC another stored procedure within the current one you will have to pass the table as Table Valued Parameter and of course this will affect the performance, with temporary tables you can do this with only passing the temporary table name
To test a Temporary table:
Open management studio query editor
Create a temporary table
Open another query editor window
Select from this table "Available"
To test a Variable table:
Open management studio query editor
Create a Variable table
Open another query editor window
Select from this table "Not Available"
something else I have experienced is: If your schema doesn't have GRANT privilege to create tables then use variable tables.
writing data in tables declared declare #tb and after joining with other tables, I realized that the response time compared to temporary tables tempdb .. # tb is much higher.
When I join them with #tb the time is much longer to return the result, unlike #tm, the return is almost instantaneous.
I did tests with a 10,000 rows join and join with 5 other tables

Resources