SQLite vs serializing to disk - database

I'm doing a performance comparison to decide whether to serialize my data or store it in a DB. The application receives a hell of a lot of data (x GB) that needs to be persisted at a minimum rate of 18 MB/s (for now).
Storing in a DB offers easier searching and later access, data snapshots, data migration, etc., but my tests so far show a huge difference in performance.
The test saves 1000 objects (about 700-something KB each), either to their respective columns in a table or to disk by serializing them as a generic List. (The SQLite file ends up a bit larger.)
Saving to SQLite v3, total size 745 MB: 30.7 seconds (~24.3 MB/s)
Serializing to disk, total size 741 MB: 0.33 seconds (~2245 MB/s)
I haven't done any performance tweaks to SQLite; I just use it out of the box with Fluent NHibernate and the SQLite.Data adapter (no transaction), but at first glance that is a huge difference.
Obviously I know that going through an ORM mapper and a DB to write to disk adds overhead compared to serializing, but that is a lot.
Another consideration is that I have to persist the data right away as I receive it: if there is a power failure, I need the last data received.
Any thoughts?
----- Updates (as I continue to investigate solutions) ------
Wrapping the 1000 inserts in a single transaction, the time is now ~14 s (~53 MB/s); however, if an exception is thrown halfway through, I lose all my data (see the chunked-commit sketch after these notes).
Using an IStatelessSession seems to improve the time by 0.5-1 s.
I didn't see any performance gain from assigning the ID to the entity instead of having it automatically assigned by the table, and thereby getting rid of the (select row_generatedid()) issued for every INSERT: Id(x => x.Id).GeneratedBy.Assigned();
The nosync option in SQLite is not an option for us, as the DB might be corrupted in case of a power failure.
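A middle ground between one transaction per insert and one giant transaction is to commit in chunks, so an exception or crash costs at most the current chunk. A rough sketch of that pattern in plain JDBC (the blobs table, the chunk size of 100, and the use of JDBC itself are only illustrative; the real code goes through Fluent NHibernate):

import java.sql.*;
import java.util.List;

public class ChunkedInserts {
    // Persist the incoming objects, committing every 100 rows.
    static void persist(Connection con, List<byte[]> objects) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO blobs (data) VALUES (?)")) {   // 'blobs' is a made-up table
            int inChunk = 0;
            for (byte[] obj : objects) {
                ps.setBytes(1, obj);
                ps.executeUpdate();
                if (++inChunk == 100) {
                    con.commit();       // far fewer fsyncs than one commit per row,
                    inChunk = 0;        // and a failure costs at most the current chunk
                }
            }
            con.commit();               // flush the last partial chunk
        }
    }
}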

I had a similar problem once and I suggest you go the SQLite route.
As for your performance issues, I'm pretty sure you'll get a very significant boost if you:
execute all INSERTs in a single transaction - write queries must acquire (and release) a lock on the SQLite file, which is very expensive in terms of disk I/O, and you should notice a huge boost***
consider using multi-row INSERTs (this probably won't work for you since you rely on an ORM)
as #user896756 mentioned, you should also prepare your statements (see the sketch after the benchmark numbers below)
Test 1: 1000 INSERTs
CREATE TABLE t1(a INTEGER, b INTEGER, c VARCHAR(100));
INSERT INTO t1 VALUES(1,13153,'thirteen thousand one hundred fifty three');
INSERT INTO t1 VALUES(2,75560,'seventy five thousand five hundred sixty');
... 995 lines omitted
INSERT INTO t1 VALUES(998,66289,'sixty six thousand two hundred eighty nine');
INSERT INTO t1 VALUES(999,24322,'twenty four thousand three hundred twenty two');
INSERT INTO t1 VALUES(1000,94142,'ninety four thousand one hundred forty two');
PostgreSQL: 4.373
MySQL: 0.114
SQLite 2.7.6: 13.061
SQLite 2.7.6 (nosync): 0.223
Test 2: 25000 INSERTs in a transaction
BEGIN;
CREATE TABLE t2(a INTEGER, b INTEGER, c VARCHAR(100));
INSERT INTO t2 VALUES(1,59672,'fifty nine thousand six hundred seventy two');
... 24997 lines omitted
INSERT INTO t2 VALUES(24999,89569,'eighty nine thousand five hundred sixty nine');
INSERT INTO t2 VALUES(25000,94666,'ninety four thousand six hundred sixty six');
COMMIT;
PostgreSQL: 4.900
MySQL: 2.184
SQLite 2.7.6: 0.914
SQLite 2.7.6 (nosync): 0.757
*** These benchmarks are for SQLite 2, SQLite 3 should be even faster.
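To make the first two suggestions concrete, here is a rough sketch with the sqlite-jdbc driver: one transaction around all 1000 rows and one prepared statement bound repeatedly. The question's code goes through Fluent NHibernate, so this only illustrates the pattern, not the original setup:

import java.sql.*;

public class BulkInsertSqlite {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:test.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS t1(a INTEGER, b INTEGER, c VARCHAR(100))");
            }
            con.setAutoCommit(false);              // BEGIN: one transaction for all rows
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO t1(a, b, c) VALUES (?, ?, ?)")) {
                for (int i = 1; i <= 1000; i++) {  // prepared once, bound 1000 times
                    ps.setInt(1, i);
                    ps.setInt(2, i * 13);
                    ps.setString(3, "row number " + i);
                    ps.executeUpdate();
                }
            }
            con.commit();                          // COMMIT: one fsync instead of 1000
        }
    }
}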

You should consider using compiled statements for SQLite.
Check this.
On insert/update queries there is a huge performance boost; I managed to get 2x to 10x faster execution times using compiled statements, although from 33 s to 0.3 s is a long way.
On the other hand, SQLite execution speed depends on the schema of the table you are using; e.g. an index over large data will slow down inserts.

After investigating further, the answer lies in some confusion in the initial results.
While testing with larger amounts of data I got different results.
The disk transfer rate is limited to 126 MB/s by the manufacturer, so how could I write 750 MB in a split second? I'm not sure. But when I increased the amount of data, the transfer rate quickly dropped to ~136 MB/s.
As for the database, using a transaction I got speeds up to 90 MB/s with the IStatelessSession and large amounts of data (5-10 GB). This is good enough for our purposes, and I'm sure it can still be tweaked further with compiled SQL statements and the like if needed.

Related

Is the number of result sets limited in SQL Server?

Is the number of result sets that a stored procedure can return limited in SQL Server? Or is there any other component between the server and a .NET client using sqlncli11 that limits it? I'm thinking of really large numbers like 100000 result sets.
I couldn't find a specific answer to this in the Microsoft docs or here on SO.
My use case:
A stored procedure that iterates over a cursor and produces around 100 rows per iteration. I could collect all the rows in a temp table first, but since this is a long-running operation I want the client to start processing the results sooner. Also, the temp table can get quite large, and the execution plan shows 98% of the cost on the INSERT INTO part.
I'm thinking of really large numbers like 100000 result sets.
Ah, I hope you have a LOT of time.
100k result sets means 100k SELECT statements.
Just switching from one result set to the next will, in total, take a long time. 1 ms each? That is 100 seconds.
Is the number of result sets that a stored procedure can return limited in SQL Server?
Not to my knowledge. Remember, those are not part of any real metadata - there is a stream of data, an end marker, then the next stream. The number of result sets a procedure returns is not defined (it can vary).
Also the temp table can get quite large
I have seen temp tables with hundreds of GB.
and the execution plans shows 98% cost on the INSERT INTO part.
That basically indicates that there is otherwise not a lot happening. Note that unless you are doing optimization work, the relative cost is not what matters; the absolute cost is.
Have you considered a middle ground? Collect the data and return it grouped, e.g. into 100 result sets.
But yes, staging into a temp table has a lot of overhead. It also means you cannot start returning data before all processing is finished, which can be a bummer. Your approach allows processing to start while the SP is still working on more data.
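For the client side, this is roughly how a consumer can walk through an arbitrary number of result sets as they arrive. The sketch uses Java/JDBC with a made-up procedure name and connection string; the question uses a .NET client via sqlncli11, but the loop looks much the same there:

import java.sql.*;

public class MultiResultSetReader {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=Test;encrypt=false";
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             CallableStatement cs = con.prepareCall("{call dbo.ExportChunks}")) {
            boolean isResultSet = cs.execute();
            int count = 0;
            while (true) {
                if (isResultSet) {
                    try (ResultSet rs = cs.getResultSet()) {
                        while (rs.next()) {
                            // process each row as soon as it arrives
                        }
                    }
                    count++;
                } else if (cs.getUpdateCount() == -1) {
                    break;                     // no more results of any kind
                }
                isResultSet = cs.getMoreResults();
            }
            System.out.println("Consumed " + count + " result sets");
        }
    }
}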

Speed of Insert

This is a newbie question.
I have two tables in a SQL database, both simply a dozen columns of string, int or date, with no indexes and no stored procedures. With a SELECT * FROM statement I get ~30,000 rows per second, but with INSERT INTO ... I get fewer than 1,000 inserts per second.
Is this factor what I should expect? (I actually expected comparable speed on the insert side.)
Insert speed varies wildly with the method of inserting. There is a disk component and a CPU component. From the low insert speed, I guess that you are inserting rows one by one, each in a separate transaction. This pretty much maximizes CPU and disk usage: each insert is a write to disk.
Make yourself familiar with efficient ways of inserting. There are plenty, with varying degrees of performance and of development time required to program them.
To get you started with something simple: enclose many (100+) inserts in one transaction. Insert in batches.
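A rough sketch of that advice in JDBC; the table, column names and batch size of 500 are made up:

import java.sql.*;

public class BatchedInserts {
    // Insert 'total' rows, batching 500 at a time inside explicit transactions.
    static void insertRows(Connection con, int total) throws SQLException {
        con.setAutoCommit(false);                       // no longer one transaction per row
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO measurements (taken_at, value) VALUES (?, ?)")) {   // made-up table
            for (int i = 0; i < total; i++) {
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setInt(2, i);
                ps.addBatch();                          // queue the row on the client
                if (i % 500 == 499) {
                    ps.executeBatch();                  // one round trip for 500 rows
                    con.commit();                       // one log flush for 500 rows
                }
            }
            ps.executeBatch();                          // send and commit the remainder
            con.commit();
        }
    }
}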

What is the normal insert time in a medium size database on an indexed string column?

I have a sqlite database with only one table (around 50,000 rows) and I recurrently perform update-otherwise-insert operations on it using Java and sqlitejdbc (i.e. I try to update rows if they exist and insert new rows otherwise). My table is similar to a word frequency table with "word" and "frequency" columns, and without a primary key!
The problem is that I perform this update-otherwise-insert operation hundreds of thousands of times, and on average each insert or update takes more than 2 ms. There are even times when an insert takes some 20 milliseconds. I should also mention that the table has an index on the column I use in the WHERE clause of my update-otherwise-insert operations (the "word" column), which naturally makes inserts more expensive.
Firstly, I want to make sure that 2 ms per insert on an indexed table with 50,000 rows is normal and that there isn't anything I've missed; after that, any suggestion to improve the performance is more than welcome. It struck me that dropping the index before performing large batches of inserts and recreating it afterwards is considered good practice, but I can't do that here because I need to check whether a row with the same word already exists.
I know all the caveats about "it depends on the hardware" and "it depends on the rest of your code", but I really think one CAN have an idea of how long an insert operation should take on an average PC.
I partially solved my problem. For anyone interested in an answer to this, this link will be helpful. In short, turning off journal mode in SQLite ("pragma journal_mode=OFF") improves insert performance significantly (almost four times the previous speed in my case), at the cost of making the code prone to data loss in case of an unexpected shutdown.
As for the normal insert speed, it is way faster than 2ms/operation. It can reach as high as hundreds of thousands of insert operations per second using the right pragma instructions, making best use of transactions, etc.
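For reference, the update-otherwise-insert loop itself can be kept cheap by reusing two prepared statements inside a single transaction. A sketch with the sqlite-jdbc driver (the freq table name and the sample input are made up):

import java.sql.*;
import java.util.List;

public class WordFrequencies {
    public static void main(String[] args) throws SQLException {
        List<String> words = List.of("alpha", "beta", "alpha");   // stand-in for the real input
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:words.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS freq (word TEXT, frequency INTEGER)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_freq_word ON freq(word)");
            }
            con.setAutoCommit(false);                 // one transaction for the whole run
            try (PreparedStatement upd = con.prepareStatement(
                     "UPDATE freq SET frequency = frequency + 1 WHERE word = ?");
                 PreparedStatement ins = con.prepareStatement(
                     "INSERT INTO freq (word, frequency) VALUES (?, 1)")) {
                for (String w : words) {
                    upd.setString(1, w);
                    if (upd.executeUpdate() == 0) {   // nothing updated: the word is new
                        ins.setString(1, w);
                        ins.executeUpdate();
                    }
                }
            }
            con.commit();
        }
    }
}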

Is it better to use one complex query or several simpler ones?

Which option is better:
Writing a very complex query with a large number of joins, or
Writing two queries one after the other, applying the result set obtained from the first query to the second.
Generally, one query is better than two, because the optimizer has more information to work with and may be able to produce a more efficient query plan than for either query separately. Additionally, using two (or more) queries typically means you'll be running the second query multiple times, and the DBMS might have to generate its query plan repeatedly (though not if you prepare the statement and pass the parameters as placeholders when the query is (re)executed). A single query also means fewer back-and-forth exchanges between the program and the DBMS; if your DBMS is on a server on the other side of the world (or country), this can be a big factor.
Arguing against combining the two queries, you might end up shipping a lot of repetitive data between the DBMS and the application. If each of 10,000 rows in table T1 is joined with an average of 30 rows from table T2 (so there are 300,000 rows returned in total), then you might be shipping a lot of data repeatedly back to the client. If the row size of (the relevant projection of) T1 is relatively small and the data from T2 is relatively large, then this doesn't matter. If the data from T1 is large and the data from T2 is small, then this may matter; measure before deciding.
When I was a junior DB person I once worked for a year in a marketing dept where I had so much free time I did each task 2 or 3 different ways. I made a habit of writing one mega-select that grabbed everything in one go and comparing it to a script that built interim tables of selected primary keys and then, once I had the correct keys, went and got the data values.
In almost every case the second method was faster. The cases where it wasn't involved a small number of small tables. Where it was most noticeably faster was, of course, with large tables and multiple joins.
I got into the habit of selecting the required primary keys from tableA, selecting the required primary keys from tableB, etc., joining them to select the final set of primary keys, and then using those keys to go back to the tables and get the data values.
As a DBA I now understand that this method resulted in less purging of the data cache and played nicer with others using the DB (as mentioned by Amir Raminfar).
It does, however, require the use of temporary tables, which some places/DBAs don't like (unfairly, in my view).
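A rough sketch of that keys-first pattern, done here with a temporary key table over a made-up PostgreSQL orders/customers schema (the original work used scripted interim tables, so treat this purely as an illustration):

import java.sql.*;

public class KeysFirstFetch {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost/shop";   // hypothetical database
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             Statement st = con.createStatement()) {
            // Step 1: build an interim table holding only the primary keys we need.
            st.execute("CREATE TEMP TABLE wanted_ids AS " +
                       "SELECT o.id FROM orders o JOIN customers c ON c.id = o.customer_id " +
                       "WHERE c.region = 'EU' AND o.created_at >= DATE '2024-01-01'");
            // Step 2: go back for the wide data values, but only for those keys.
            try (ResultSet rs = st.executeQuery(
                     "SELECT o.* FROM orders o JOIN wanted_ids w ON w.id = o.id")) {
                while (rs.next()) {
                    // process the row
                }
            }
        }
    }
}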
Depends a lot on the actual query and the actual database, e.g. SQL Server, Oracle, MySQL.
At large companies they prefer option 2, because option 1 can hog the database CPU, which makes all other connections slow and turns everything into a bottleneck. That being said, it all depends on your data and the amount you are joining: if you join 10,000 rows to 1,000 rows you could get back up to 10,000 x 1,000 records in the worst case (assuming an inner join where everything matches).
Possible duplicate: MySQL JOIN Abuse? How bad can it get?
Assuming "better" means "faster", you can easily test these scenarios in a junit test. Note that a determining factor that you may not be able to get from a unit test is network latency. If the database sits right next to your machine where you run the unit test, you may see no difference in performance that is attributed to the network. If your production servers are in another town, country, or continent from the database, network traffic becomes more of a bottleneck. You do not want to go back and forth across the wire- you more likely want to make one round trip and get everything at once.
Again, it all depends :)
It could depend on many things:
the indexes you have set up
how many tables are involved
what the actual query is
how big the data set is
what the underlying DB is
what table engine you are using
The best thing to do would probably be to test both methods on a variety of test data and see which one bottlenecks.
If you are using MySQL (and maybe Oracle too?) you can use
EXPLAIN SELECT .....
and it will give you a lot of information on how it will execute the query, and therefore how you can improve it.
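For example, the plan rows can be read back like any other result set; a small sketch over a made-up MySQL schema (the connection details and table names are hypothetical):

import java.sql.*;

public class ExplainQuery {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/shop", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "EXPLAIN SELECT o.* FROM orders o JOIN customers c ON c.id = o.customer_id")) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                // Each row describes how one table in the join will be accessed
                // (access type, possible keys, chosen key, estimated rows, ...).
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    System.out.print(md.getColumnLabel(i) + "=" + rs.getString(i) + "  ");
                }
                System.out.println();
            }
        }
    }
}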

Database that can handle >500 million rows

I am looking for a database that can handle more than 500 million rows (i.e. create an index on a column in a reasonable time and return results for select queries in less than 3 s). Would Postgresql or Msql on a low-end machine (Core 2 CPU 6600, 4 GB, 64-bit system, Windows Vista) handle such a large number of rows?
Update: Asking this question, I am looking for information on which database I should use on a low-end machine in order to provide results for select queries with one or two fields specified in the WHERE clause. No joins. I need to create indexes (it cannot take ages as it does on MySQL) to achieve sufficient performance for my select queries. This machine is a test PC used to run an experiment.
The table schema:
create table mapper (
  key VARCHAR(1000),
  attr1 VARCHAR(100),
  attr1 INT,
  attr2 INT,
  value VARCHAR(2000),
  PRIMARY KEY (key),
  INDEX (attr1),
  INDEX (attr2)
)
MSSQL can handle that many rows just fine. The query time is completely dependent on a lot more factors than just simple row count.
For example, it's going to depend on:
how many joins those queries do
how well your indexes are set up
how much ram is in the machine
speed and number of processors
type and spindle speed of hard drives
size of the row/amount of data returned in the query
Network interface speed / latency
It's very easy to have a small table (fewer than 10,000 rows) that takes a couple of minutes to execute a query against - for example, by using lots of joins, functions in the WHERE clause, and zero indexes on an Atom processor with 512 MB of total RAM. ;)
It takes a bit more work to make sure all of your indexes and foreign key relationships are good, that your queries are optimized to eliminate needless function calls and only return the data you actually need. Also, you'll need fast hardware.
It all boils down to how much money you want to spend, the quality of the dev team, and the size of the data rows you are dealing with.
UPDATE
Updating due to changes in the question.
The amount of information here is still not enough to give a real world answer. You are going to just have to test it and adjust your database design and hardware as necessary.
For example, I could very easily have 1 billion rows in a table on a machine with those specs, run a "select top(1) id from tableA (nolock)" query, and get an answer in milliseconds. By the same token, you can execute a "select * from tablea" query and it will take a while: although the query executes quickly, transferring all that data across the wire takes some time.
Point is, you have to test. That means setting up the server, creating some of your tables, and populating them. Then you have to go through performance tuning to get your queries and indexes right. As part of the performance tuning you're going to uncover not only how the queries need to be restructured but also exactly what parts of the machine might need to be replaced (i.e. disk, more RAM, CPU, etc.) based on the lock and wait types.
I'd highly recommend you hire (or contract) one or two DBAs to do this for you.
Most databases can handle this; it's about what you are going to do with this data and how you do it. Lots of RAM will help.
I would start with PostgreSQL: it's free, has no limit on RAM (unlike SQL Server Express), and no potential problems with licences (too many processors, etc.). But it's also my work :)
Pretty much every non-stupid database can handle a billion rows today easily. 500 million is doable even on 32 bit systems (albeit 64 bit really helps).
The main problems are:
You need enough RAM. How much is enough depends on your queries.
You need a good enough disk subsystem. This pretty much means that if you want to do large selects, a single platter for everything is totally out of the question; many spindles (or an SSD) are needed to handle the I/O load.
Both Postgres and MySQL can easily handle 500 million rows on proper hardware.
What you want to look at is the table size limit the database software imposes. For example, as of this writing, MySQL InnoDB has a limit of 64 TB per table, while PostgreSQL has a limit of 32 TB per table; neither limits the number of rows per table. If correctly configured, these database systems should not have trouble handling tens or hundreds of billions of rows (if each row is small enough), let alone 500 million rows.
For best performance handling extremely large amounts of data, you should have sufficient disk space and good disk performance—which can be achieved with disks in an appropriate RAID—and large amounts of memory coupled with a fast processor(s) (ideally server-grade Intel Xeon or AMD Opteron processors). Needless to say, you'll also need to make sure your database system is configured for optimal performance and that your tables are indexed properly.
The following article discusses the import and use of a 16 billion row table in Microsoft SQL Server.
https://www.itprotoday.com/big-data/adventures-big-data-how-import-16-billion-rows-single-table.
From the article:
Here are some distilled tips from my experience:
The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
If you want to export your table to the smallest possible file, make it native format. This works best with tables containing mostly numeric columns because they’re more compactly represented in binary fields than character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.
Even that is small compared to the multi-petabyte Nasdaq OMX database, which houses tens of petabytes (thousands of terabytes) and trillions of rows on SQL Server.
Have you checked out Cassandra? http://cassandra.apache.org/
As mentioned, pretty much all DBs today can handle this situation - what you want to concentrate on is your disk I/O subsystem. You need to configure a RAID 0 or RAID 0+1 setup, throwing as many spindles at the problem as you can. Also, split your log/temp/data logical drives for performance.
For example, say you have 12 drives - in your RAID controller I'd create 3 RAID 0 partitions of 4 drives each. In Windows (say) format each group as a logical drive (G, H, I) - then when configuring SQL Server (say) assign tempdb to G, the log files to H and the data files to I.
I don't have much input on which is the best system to use, but perhaps this tip could help you get some of the speed you're looking for.
If you're going to be doing exact matches of long varchar strings, especially ones that are longer than allowed for an index, you can do a sort of pre-calculated hash:
CREATE TABLE BigStrings (
BigStringID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED,
Value varchar(6000) NOT NULL,
Chk AS (CHECKSUM(Value))
);
CREATE NONCLUSTERED INDEX IX_BigStrings_Chk ON BigStrings(Chk);
--Load 500 million rows in BigStrings
DECLARE @S varchar(6000);
SET @S = '6000-character-long string here';
-- nasty, slow table scan:
SELECT * FROM BigStrings WHERE Value = @S
-- super fast nonclustered seek followed by very fast clustered index range seek:
SELECT * FROM BigStrings WHERE Value = @S AND Chk = CHECKSUM(@S)
This won't help you if you aren't doing exact matches, but in that case you might look into full-text indexing. This will really change the speed of lookups on a 500-million-row table.
I need to create indices (that does not take ages like on mysql) to achieve sufficient performance for my select queries
I'm not sure what you mean by "creating" indexes. That's normally a one-time thing. Now, it's typical when loading a huge amount of data as you might do, to drop the indexes, load your data, and then add the indexes back, so the data load is very fast. Then as you make changes to the database, the indexes would be updated, but they don't necessarily need to be created each time your query runs.
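A rough sketch of that load pattern (drop the secondary indexes, bulk load, recreate the indexes) in JDBC against PostgreSQL, assuming a mapper(key, attr1, attr2, value) table along the lines of the question's schema with attr1 and attr2 as INT; the index names and generated data are made up:

import java.sql.*;

public class LoadMapper {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/test", "user", "pass")) {
            try (Statement st = con.createStatement()) {
                st.execute("DROP INDEX IF EXISTS idx_mapper_attr1");   // made-up index names
                st.execute("DROP INDEX IF EXISTS idx_mapper_attr2");
            }
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO mapper (key, attr1, attr2, value) VALUES (?, ?, ?, ?)")) {
                for (int i = 0; i < 1_000_000; i++) {                  // stand-in for the real data source
                    ps.setString(1, "key-" + i);
                    ps.setInt(2, i % 100);
                    ps.setInt(3, i % 1000);
                    ps.setString(4, "value-" + i);
                    ps.addBatch();
                    if (i % 10_000 == 9_999) ps.executeBatch();        // keep client memory bounded
                }
                ps.executeBatch();
            }
            con.commit();
            con.setAutoCommit(true);
            try (Statement st = con.createStatement()) {
                // Recreate the secondary indexes once, after the data is in place.
                st.execute("CREATE INDEX idx_mapper_attr1 ON mapper(attr1)");
                st.execute("CREATE INDEX idx_mapper_attr2 ON mapper(attr2)");
            }
        }
    }
}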
That said, databases do have query optimization engines that will analyze your query, determine the best plan to retrieve the data, work out how to join the tables (not relevant in your scenario), and decide which indexes to use. Obviously you'd want to avoid a full table scan, so performance tuning and reviewing the query plan are important, as others have already pointed out.
The point above about a checksum looks interesting, and that could even be an index on attr1 in the same table.
