I'm using R to do a statistical analysis on a SQL Server 2008 R2 database. My database driver is JDBC, so I'm using the RJDBC package.
My query is pretty simple, and I'm sure it returns a lot of rows (about 2 million).
SELECT * FROM [maindb].[dbo].[users]
My R script is as follows.
library(RJDBC);
javaPackageName <- "com.microsoft.sqlserver.jdbc.SQLServerDriver";
clientJarFile <- "/home/abforce/mystuff/sqljdbc_3.0/enu/sqljdbc4.jar";
driver <- JDBC(javaPackageName, clientJarFile);
conn <- dbConnect(driver, "jdbc:sqlserver://192.168.56.101", "username", "password");
query <- "SELECT * FROM [maindb].[dbo].[users]";
result <- dbSendQuery(conn, query);
dbHasCompleted(result)
In the code above, the last line always returns TRUE. What could be wrong here?
dbHasCompleted always returning TRUE seems to be a known issue: I've found other places on the Internet where people were struggling with it.
So I came up with a workaround. Instead of calling dbHasCompleted, we can fetch a chunk and test the condition nrow(chunk) == 0.
For example:
result <- dbSendQuery(conn, query);
repeat {
chunk <- dbFetch(result, n = 10);
if(nrow(chunk) == 0){
break;
}
# Do something with 'chunk';
}
dbClearResult(result);
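The same "fetch until an empty chunk comes back" pattern is not specific to RJDBC. As a rough, language-neutral sketch, here it is in Python with the standard sqlite3 module; the in-memory table and the helper name fetch_in_chunks are made up for illustration and stand in for the real SQL Server connection.

```python
import sqlite3

def fetch_in_chunks(cursor, chunk_size):
    """Yield lists of rows until the driver returns an empty chunk."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:  # empty chunk: the result set is exhausted
            break
        yield chunk

# Hypothetical stand-in data: an in-memory table with 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("user%d" % i,) for i in range(25)])

cursor = conn.execute("SELECT id, name FROM users")
chunk_sizes = [len(chunk) for chunk in fetch_in_chunks(cursor, 10)]
conn.close()
```

The loop never asks the driver whether the query has completed; it relies only on the size of what was actually fetched, which is why it sidesteps an unreliable completion flag.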
I'm connecting to a large PostgreSQL database (300 million records) with libdbi and performing a SELECT * query, then showing the result line by line. Memory and swap fill up completely, so it seems autocommit is enabled and the whole result set is loaded into memory. Is there any option to disable autocommit, or at least to hold cursors after commit like ResultSet.HOLD_CURSORS_OVER_COMMIT in Java? I didn't find any option for dbi_conn_set_option to do it. Here is my code:
dbi_conn conn;
dbi_result result;
int64_t id;
dbi_initialize(NULL);
conn = dbi_conn_new("pgsql");
if (conn == NULL)
{
printf("connection error.\n");
return EXIT_FAILURE;
}
dbi_conn_set_option(conn, "host", "127.0.0.1");
dbi_conn_set_option(conn, "username", "postgres");
dbi_conn_set_option(conn, "password", "123456");
dbi_conn_set_option(conn, "dbname", "backup");
if (dbi_conn_connect(conn) < 0)
{
printf("could not connect to database.\n");
return EXIT_FAILURE;
}
result = dbi_conn_query(conn, "SELECT * FROM tbl");
if (result)
{
while (dbi_result_next_row(result))
{
id = dbi_result_get_longlong(result, "_id");
printf("This is _id: %lld\n", (long long) id);
}
dbi_result_free(result);
}
dbi_conn_close(conn);
dbi_shutdown();
It's not related to autocommit.
The solution to the out-of-memory problem is indeed to use a cursor to fetch N results at a time as opposed to all results in one step.
libdbi doesn't provide an abstraction for SQL cursors, so it needs to be done with SQL queries.
The PostgreSQL documentation page on FETCH has a complete sequence of queries in its example that shows how it's done. You need to issue these queries through libdbi in C with two loops: an outer loop calling FETCH N FROM cursor_name until there's nothing left to fetch, and an inner loop processing the results of that FETCH, just like your current code processes the results of the SELECT * itself.
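The two-loop structure can be sketched as follows. This is only an outline in Python, not libdbi code: run_query is a hypothetical callable that sends one SQL string and returns the resulting rows as a list, make_fake_backend is a made-up in-memory stand-in so the sketch runs without a server, and the cursor name big_cur is arbitrary. Error handling is left out.

```python
def stream_rows(run_query, select_sql, batch_size, cursor_name="big_cur"):
    """Yield rows one by one while holding at most one batch in memory.

    run_query is a hypothetical callable: it takes an SQL string and returns
    a list of rows (an empty list for statements that produce no rows).
    """
    run_query("BEGIN")  # a cursor declared without WITH HOLD needs a transaction
    run_query("DECLARE %s CURSOR FOR %s" % (cursor_name, select_sql))
    while True:  # outer loop: one FETCH per batch
        batch = run_query("FETCH %d FROM %s" % (batch_size, cursor_name))
        if not batch:  # nothing left to fetch
            break
        for row in batch:  # inner loop: process the rows of this batch
            yield row
    run_query("CLOSE %s" % cursor_name)
    run_query("COMMIT")

def make_fake_backend(rows):
    """Hypothetical in-memory stand-in for a real database connection."""
    remaining = list(rows)
    log = []
    def run_query(sql):
        log.append(sql)
        if sql.startswith("FETCH"):
            n = int(sql.split()[1])
            batch = remaining[:n]
            del remaining[:n]
            return batch
        return []
    return run_query, log

run_query, log = make_fake_backend(range(7))
streamed = list(stream_rows(run_query, "SELECT * FROM tbl", 3))
```

In the C program, each run_query call would correspond to a dbi_conn_query call, and the inner loop to the existing dbi_result_next_row loop followed by dbi_result_free.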
SQL Server CE 4 (SQL Server Compact Edition 4.0) is no longer news (if it is to you, you could read this article).
But it would be very interesting to see how SQL Server CE 4 performance compares to other databases.
Especially with:
SQLite
SQL Server (1)
SQL Server Express (1)
maybe Firebird
(1) for applications where functionality is comparable.
Unfortunately, Google does not turn up many links on the subject right now. Actually, I was unable to find any (for the proper SQL CE version).
If anyone can find or share such information, let's collect it here for future humanity.
In my opinion, it is incorrect to compare an embedded database (like SQL CE) with a server-side relational database (like all the rest, except for SQLite and the Embedded version of Firebird).
The main difference between them is that general-purpose server-side relational databases (like MS SQL, MySQL, Firebird Classic and SuperServer, etc.) are installed as an independent service and run outside the scope of your main application. That is why they can perform much better: they have intrinsic support for multi-core and multi-CPU architectures, they use OS features like pre-caching, VSS, etc. to increase throughput under intensive database load, and they can claim as much memory as your OS can provide for a single service/application. It also means that their performance indicators are more or less independent of your application and largely depend on your hardware. In this respect I would say that the server versions of any database are always more performant than the embedded ones.
SQL CE (along with Firebird Embedded, SQLite, TurboSQL and some others) is an embedded DB engine, meaning that the complete database is packed into a single (or at most 2) DLL files that are distributed together with your application. Apart from the evident size limitations (would you like to distribute a 30 MB DLL together with your 2-3 MB application?), these engines also run directly in the context of your application, so the total memory and performance available for data access operations are shared with the other parts of your application: that applies to available memory, CPU time, disk throughput, etc. Having computation-intensive threads running in parallel with your data access thread might lead to a dramatic decrease in your database performance.
Because of their different areas of application, these databases offer a different palette of options: server databases provide extensive user and rights management and support for views and stored procedures, whereas embedded databases normally lack any support for user and rights management and have limited support for views and stored procedures (the latter lose most of the benefits they have when running on the server side). Data throughput is a common bottleneck of an RDBMS: server versions are usually installed on striped RAID volumes, whereas embedded DBs are often memory-oriented (they try to keep all the actual data in memory) and minimize data storage access operations.
So what would probably make sense is to compare different embedded RDBMSs for .Net for their performance, like MS SQL CE 4.0, SQLite, Firebird Embedded, TurboSQL. I wouldn't expect drastic differences during usual non-peak operation, whereas some databases may provide better support for large BLOBs due to better integration with the OS.
-- update --
I have to take back my last words, for my quick implementation shows very interesting results.
I wrote a short console application to test both data providers, here is the source code for you if you want to experiment with them on your own.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.SQLite;
using System.Data.SqlServerCe;
using System.Data.Common;
namespace TestSQL
{
class Program
{
const int NUMBER_OF_TESTS = 1000;
private static string create_table;
private static string create_table_sqlce = "CREATE TABLE Test ( id integer not null identity primary key, textdata nvarchar(500));";
private static string create_table_sqlite = "CREATE TABLE Test ( id integer not null primary key, textdata nvarchar(500));";
private static string drop_table = "DROP TABLE Test";
private static string insert_data = "INSERT INTO Test (textdata) VALUES ('{0}');";
private static string read_data = "SELECT textdata FROM Test WHERE id = {0}";
private static string update_data = "UPDATE Test SET textdata = '{1}' WHERE id = {0}";
private static string delete_data = "DELETE FROM Test WHERE id = {0}";
static Action<DbConnection> ACreateTable = (a) => CreateTable(a);
static Action<DbConnection> ATestWrite = (a) => TestWrite(a, NUMBER_OF_TESTS);
static Action<DbConnection> ATestRead = (a) => TestRead(a, NUMBER_OF_TESTS);
static Action<DbConnection> ATestUpdate = (a) => TestUpdate(a, NUMBER_OF_TESTS);
static Action<DbConnection> ATestDelete = (a) => TestDelete(a, NUMBER_OF_TESTS);
static Action<DbConnection> ADropTable = (a) => DropTable(a);
static Func<Action<DbConnection>,DbConnection, TimeSpan> MeasureExecTime = (a,b) => { var start = DateTime.Now; a(b); var finish = DateTime.Now; return finish - start; };
static Action<string, TimeSpan> AMeasureAndOutput = (a, b) => Console.WriteLine(a, b.TotalMilliseconds);
static void Main(string[] args)
{
// opening databases
SQLiteConnection.CreateFile("sqlite.db");
SQLiteConnection sqliteconnect = new SQLiteConnection("Data Source=sqlite.db");
SqlCeConnection sqlceconnect = new SqlCeConnection("Data Source=sqlce.sdf");
sqlceconnect.Open();
sqliteconnect.Open();
Console.WriteLine("=Testing CRUD performance of embedded DBs=");
Console.WriteLine(" => Samplesize: {0}", NUMBER_OF_TESTS);
create_table = create_table_sqlite;
Console.WriteLine("==Testing SQLite==");
DoMeasures(sqliteconnect);
create_table = create_table_sqlce;
Console.WriteLine("==Testing SQL CE 4.0==");
DoMeasures(sqlceconnect);
Console.ReadKey();
}
static void DoMeasures(DbConnection con)
{
AMeasureAndOutput("Creating table: {0} ms", MeasureExecTime(ACreateTable, con));
AMeasureAndOutput("Writing data: {0} ms", MeasureExecTime(ATestWrite, con));
AMeasureAndOutput("Updating data: {0} ms", MeasureExecTime(ATestUpdate, con));
AMeasureAndOutput("Reading data: {0} ms", MeasureExecTime(ATestRead, con));
AMeasureAndOutput("Deleting data: {0} ms", MeasureExecTime(ATestDelete, con));
AMeasureAndOutput("Dropping table: {0} ms", MeasureExecTime(ADropTable, con));
}
static void CreateTable(DbConnection con)
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = create_table;
sqlcmd.ExecuteNonQuery();
}
static void TestWrite(DbConnection con, int num)
{
for (; num-- > 0; )
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = string.Format(insert_data,Guid.NewGuid().ToString());
sqlcmd.ExecuteNonQuery();
}
}
static void TestRead(DbConnection con, int num)
{
Random rnd = new Random(DateTime.Now.Millisecond);
for (var max = num; max-- > 0; )
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = string.Format(read_data, rnd.Next(1,num-1));
sqlcmd.ExecuteNonQuery();
}
}
static void TestUpdate(DbConnection con, int num)
{
Random rnd = new Random(DateTime.Now.Millisecond);
for (var max = num; max-- > 0; )
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = string.Format(update_data, rnd.Next(1, num - 1), Guid.NewGuid().ToString());
sqlcmd.ExecuteNonQuery();
}
}
static void TestDelete(DbConnection con, int num)
{
Random rnd = new Random(DateTime.Now.Millisecond);
var order = Enumerable.Range(1, num).ToArray<int>();
Action<int[], int, int> swap = (arr, a, b) => { int c = arr[a]; arr[a] = arr[b]; arr[b] = c; };
// shuffling the array
for (var max=num; max-- > 0; ) swap(order, rnd.Next(0, num - 1), rnd.Next(0, num - 1));
foreach(int index in order)
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = string.Format(delete_data, index);
sqlcmd.ExecuteNonQuery();
}
}
static void DropTable(DbConnection con)
{
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = drop_table;
sqlcmd.ExecuteNonQuery();
}
}
}
Necessary disclaimer:
I got these results on my machine: Dell Precision WorkStation T7400 equipped with 2 Intel Xeon E5420 CPUs and 8GB of RAM, running 64bit Win7 Enterprise.
I used the default settings for both DBs with connection string "Data Source=database_file_name".
I used the latest versions of both SQL CE 4.0 and SQLite/System.Data.SQLite (from today, June 3rd 2011).
Here are the results for two different samples:
> =Testing CRUD performance of embedded DBs=
> => Samplesize: 200
> ==Testing SQLite==
> Creating table: 396.0396 ms
> Writing data: 22189.2187 ms
> Updating data: 23591.3589 ms
> Reading data: 21.0021 ms
> Deleting data: 20963.0961 ms
> Dropping table: 85.0085 ms
> ==Testing SQL CE 4.0==
> Creating table: 16.0016 ms
> Writing data: 25.0025 ms
> Updating data: 56.0056 ms
> Reading data: 28.0028 ms
> Deleting data: 53.0053 ms
> Dropping table: 11.0011 ms
... and a bigger sample:
=Testing CRUD performance of embedded DBs=
=> Samplesize: 1000
==Testing SQLite==
Creating table: 93.0093 ms
Writing data: 116632.6621 ms
Updating data: 104967.4957 ms
Reading data: 134.0134 ms
Deleting data: 107666.7656 ms
Dropping table: 83.0083 ms
==Testing SQL CE 4.0==
Creating table: 16.0016 ms
Writing data: 128.0128 ms
Updating data: 307.0307 ms
Reading data: 164.0164 ms
Deleting data: 306.0306 ms
Dropping table: 13.0013 ms
So, as you can see, any writing operation (insert, update, delete) requires several hundred times more time in SQLite than in SQLCE (roughly 350x to 900x in these runs). It does not necessarily reflect generally bad performance of this database and might be due to the following:
The data provider I use for SQLite is System.Data.SQLite, which is a mixed assembly containing both managed and unmanaged code (SQLite is originally written completely in C and the DLL only provides bindings). P/Invoke and data marshaling probably eat up a good part of the operation time.
Most likely SQLCE 4.0 caches all the data in memory by default, whereas SQLite flushes most data changes directly to disk storage every time a change happens. One can supply hundreds of parameters for both databases via the connection string and tune them appropriately.
I used a series of single queries to test the DB. At least SQLCE supports bulk operations via special .Net classes that would be better suited here. If SQLite supports them too (sorry, I am not an expert here and my quick search yielded nothing promising), it would be nice to compare them as well.
I have observed many problems with SQLite on x64 machines (using the same .net adapter): from the data connection being closed unexpectedly to database file corruption. I presume there are some stability problems either with the data adapter or with the library itself.
Here is my freshly baked article about the benchmarking on CodeProject webpage:
Benchmarking the performance of embedded DB for .Net: SQL CE 4.0 vs SQLite
(the article has a pending status now, you need to be logged-in on CodeProject to access its content)
P.S.: I mistakenly marked my previous answer as a community wiki entry and won't get any reputation for it. This encouraged me to write the article for Code Project on this topic, with somewhat optimized code, additional information about embedded DBs, and a statistical analysis of the results. So please vote this answer up if you like the article and my second answer here.
Because I'm having a real hard time with Alaudo's tests, test results, and ultimately with his conclusion, I went ahead and played around a bit with his program and came up with a modified version.
It tests each of the following 10 times and outputs the average times:
sqlite without transactions, using default journal_mode
sqlite with transactions, using default journal_mode
sqlite without transactions, using WAL journal_mode
sqlite with transactions, using WAL journal_mode
sqlce without transactions
sqlce with transactions
Here is the program (it's a class actually):
using System;
using System.Collections.Generic;
using System.Data.Common;
using System.Data.SqlServerCe;
using System.Data.SQLite;
using System.Diagnostics;
using System.IO;
using System.Linq;
class SqliteAndSqlceSpeedTesting
{
class Results
{
public string test_details;
public long create_table_time, insert_time, update_time, select_time, delete_time, drop_table_time;
}
enum DbType { Sqlite, Sqlce };
const int NUMBER_OF_TESTS = 200;
const string create_table_sqlite = "CREATE TABLE Test (id integer not null primary key, textdata nvarchar(500));";
const string create_table_sqlce = "CREATE TABLE Test (id integer not null identity primary key, textdata nvarchar(500));";
const string drop_table = "DROP TABLE Test";
const string insert_data = "INSERT INTO Test (textdata) VALUES ('{0}');";
const string read_data = "SELECT textdata FROM Test WHERE id = {0}";
const string update_data = "UPDATE Test SET textdata = '{1}' WHERE id = {0}";
const string delete_data = "DELETE FROM Test WHERE id = {0}";
public static void RunTests()
{
List<Results> results_list = new List<Results>();
for (int i = 0; i < 10; i++) {
results_list.Add(RunTest(DbType.Sqlite, false, false));
results_list.Add(RunTest(DbType.Sqlite, false, true));
results_list.Add(RunTest(DbType.Sqlite, true, false));
results_list.Add(RunTest(DbType.Sqlite, true, true));
results_list.Add(RunTest(DbType.Sqlce, false));
results_list.Add(RunTest(DbType.Sqlce, true));
}
foreach (var test_detail in results_list.GroupBy(r => r.test_details)) {
Console.WriteLine(test_detail.Key);
Console.WriteLine("Creating table: {0} ms", test_detail.Average(r => r.create_table_time));
Console.WriteLine("Inserting data: {0} ms", test_detail.Average(r => r.insert_time));
Console.WriteLine("Updating data: {0} ms", test_detail.Average(r => r.update_time));
Console.WriteLine("Selecting data: {0} ms", test_detail.Average(r => r.select_time));
Console.WriteLine("Deleting data: {0} ms", test_detail.Average(r => r.delete_time));
Console.WriteLine("Dropping table: {0} ms", test_detail.Average(r => r.drop_table_time));
Console.WriteLine();
}
}
static Results RunTest(DbType db_type, bool use_trx, bool use_wal = false)
{
DbConnection conn = null;
if (db_type == DbType.Sqlite)
conn = GetConnectionSqlite(use_wal);
else
conn = GetConnectionSqlce();
Results results = new Results();
results.test_details = string.Format("Testing: {0}, transactions: {1}, WAL: {2}", db_type, use_trx, use_wal);
results.create_table_time = CreateTable(conn, db_type);
results.insert_time = InsertTime(conn, use_trx);
results.update_time = UpdateTime(conn, use_trx);
results.select_time = SelectTime(conn, use_trx);
results.delete_time = DeleteTime(conn, use_trx);
results.drop_table_time = DropTableTime(conn);
conn.Close();
return results;
}
static DbConnection GetConnectionSqlite(bool use_wal)
{
SQLiteConnection conn = new SQLiteConnection("Data Source=sqlite.db");
if (!File.Exists(conn.Database))
SQLiteConnection.CreateFile("sqlite.db");
conn.Open();
if (use_wal) {
var command = conn.CreateCommand();
command.CommandText = "PRAGMA journal_mode=WAL";
command.ExecuteNonQuery();
}
return conn;
}
static DbConnection GetConnectionSqlce()
{
SqlCeConnection conn = new SqlCeConnection("Data Source=sqlce.sdf");
if (!File.Exists(conn.Database))
using (var sqlCeEngine = new SqlCeEngine("Data Source=sqlce.sdf"))
sqlCeEngine.CreateDatabase();
conn.Open();
return conn;
}
static long CreateTable(DbConnection con, DbType db_type)
{
Stopwatch sw = Stopwatch.StartNew();
var sqlcmd = con.CreateCommand();
if (db_type == DbType.Sqlite)
sqlcmd.CommandText = create_table_sqlite;
else
sqlcmd.CommandText = create_table_sqlce;
sqlcmd.ExecuteNonQuery();
return sw.ElapsedMilliseconds;
}
static long DropTableTime(DbConnection con)
{
Stopwatch sw = Stopwatch.StartNew();
var sqlcmd = con.CreateCommand();
sqlcmd.CommandText = drop_table;
sqlcmd.ExecuteNonQuery();
return sw.ElapsedMilliseconds;
}
static long InsertTime(DbConnection con, bool use_trx)
{
Stopwatch sw = Stopwatch.StartNew();
var sqlcmd = con.CreateCommand();
DbTransaction trx = null;
if (use_trx) {
trx = con.BeginTransaction();
sqlcmd.Transaction = trx;
}
for (int i = 0; i < NUMBER_OF_TESTS; i++) {
sqlcmd.CommandText = string.Format(insert_data, Guid.NewGuid().ToString());
sqlcmd.ExecuteNonQuery();
}
if (trx != null)
trx.Commit();
return sw.ElapsedMilliseconds;
}
static long SelectTime(DbConnection con, bool use_trx)
{
Stopwatch sw = Stopwatch.StartNew();
var sqlcmd = con.CreateCommand();
DbTransaction trx = null;
if (use_trx) {
trx = con.BeginTransaction();
sqlcmd.Transaction = trx;
}
Random rnd = new Random(DateTime.Now.Millisecond);
for (var max = NUMBER_OF_TESTS; max-- > 0; ) {
sqlcmd.CommandText = string.Format(read_data, rnd.Next(1, NUMBER_OF_TESTS - 1));
sqlcmd.ExecuteNonQuery();
}
if (trx != null)
trx.Commit();
return sw.ElapsedMilliseconds;
}
static long UpdateTime(DbConnection con, bool use_trx)
{
Stopwatch sw = Stopwatch.StartNew();
var sqlcmd = con.CreateCommand();
DbTransaction trx = null;
if (use_trx) {
trx = con.BeginTransaction();
sqlcmd.Transaction = trx;
}
Random rnd = new Random(DateTime.Now.Millisecond);
for (var max = NUMBER_OF_TESTS; max-- > 0; ) {
sqlcmd.CommandText = string.Format(update_data, rnd.Next(1, NUMBER_OF_TESTS - 1), Guid.NewGuid().ToString());
sqlcmd.ExecuteNonQuery();
}
if (trx != null)
trx.Commit();
return sw.ElapsedMilliseconds;
}
static long DeleteTime(DbConnection con, bool use_trx)
{
Stopwatch sw = Stopwatch.StartNew();
Random rnd = new Random(DateTime.Now.Millisecond);
var order = Enumerable.Range(1, NUMBER_OF_TESTS).ToArray<int>();
Action<int[], int, int> swap = (arr, a, b) => { int c = arr[a]; arr[a] = arr[b]; arr[b] = c; };
var sqlcmd = con.CreateCommand();
DbTransaction trx = null;
if (use_trx) {
trx = con.BeginTransaction();
sqlcmd.Transaction = trx;
}
// shuffling the array
for (var max = NUMBER_OF_TESTS; max-- > 0; ) swap(order, rnd.Next(0, NUMBER_OF_TESTS - 1), rnd.Next(0, NUMBER_OF_TESTS - 1));
foreach (int index in order) {
sqlcmd.CommandText = string.Format(delete_data, index);
sqlcmd.ExecuteNonQuery();
}
if (trx != null)
trx.Commit();
return sw.ElapsedMilliseconds;
}
}
Here are the numbers I get:
Testing: Sqlite, transactions: False, WAL: False
Creating table: 24.4 ms
Inserting data: 3084.7 ms
Updating data: 3147.8 ms
Selecting data: 30 ms
Deleting data: 3182.6 ms
Dropping table: 14.5 ms
Testing: Sqlite, transactions: False, WAL: True
Creating table: 2.3 ms
Inserting data: 14 ms
Updating data: 12.2 ms
Selecting data: 6.8 ms
Deleting data: 11.7 ms
Dropping table: 0 ms
Testing: Sqlite, transactions: True, WAL: False
Creating table: 13.5 ms
Inserting data: 20.3 ms
Updating data: 24.5 ms
Selecting data: 7.8 ms
Deleting data: 22.3 ms
Dropping table: 16.7 ms
Testing: Sqlite, transactions: True, WAL: True
Creating table: 3.2 ms
Inserting data: 5.8 ms
Updating data: 4.9 ms
Selecting data: 4.4 ms
Deleting data: 3.8 ms
Dropping table: 0 ms
Testing: Sqlce, transactions: False, WAL: False
Creating table: 2.8 ms
Inserting data: 24.4 ms
Updating data: 42.8 ms
Selecting data: 30.4 ms
Deleting data: 38.3 ms
Dropping table: 3.3 ms
Testing: Sqlce, transactions: True, WAL: False
Creating table: 2.1 ms
Inserting data: 24.6 ms
Updating data: 44.2 ms
Selecting data: 32 ms
Deleting data: 37.8 ms
Dropping table: 3.2 ms
~3 seconds for 200 inserts or updates using sqlite might still seem a little high, but at least it's more reasonable than 23 seconds. Conversely, one might be worried that SqlCe takes so little time to complete the same 200 inserts or updates, especially since there seems to be no real speed difference between having each SQL query in an individual transaction and having them all together in one transaction. I don't know enough about SqlCe to explain this, but it worries me. Would it mean that when .Commit() returns, you are not assured that the changes are actually written to disk?
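For readers who want to reproduce the SQLite side of this comparison without the .NET providers, the four SQLite configurations can be sketched with Python's standard sqlite3 module. This is a minimal sketch under assumptions, not a port of the class above: the helper name timed_inserts and the file names are made up, only inserts are measured, and the timings will depend heavily on the disk and OS, so no numbers are claimed here.

```python
import os
import sqlite3
import tempfile
import time

def timed_inserts(path, count, one_transaction, use_wal):
    """Insert `count` rows and return (elapsed_seconds, rows_in_table)."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so
    # transactions are controlled only by the explicit BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    if use_wal:
        conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE Test (id INTEGER PRIMARY KEY, textdata TEXT)")
    start = time.perf_counter()
    if one_transaction:
        conn.execute("BEGIN")
    for i in range(count):
        conn.execute("INSERT INTO Test (textdata) VALUES (?)", ("row %d" % i,))
    if one_transaction:
        conn.execute("COMMIT")
    elapsed = time.perf_counter() - start
    rows = conn.execute("SELECT COUNT(*) FROM Test").fetchone()[0]
    conn.close()
    return elapsed, rows

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for use_wal in (False, True):
        for one_transaction in (False, True):
            # One fresh database file per configuration.
            name = "bench_%d_%d.db" % (use_wal, one_transaction)
            results[(use_wal, one_transaction)] = timed_inserts(
                os.path.join(tmp, name), 200, one_transaction, use_wal)

for (use_wal, one_transaction), (elapsed, rows) in sorted(results.items()):
    print("WAL: %s, one transaction: %s, %d rows in %.1f ms"
          % (use_wal, one_transaction, rows, elapsed * 1000))
```

Without an explicit transaction, every INSERT is committed on its own, which is the case that was slow in the numbers above; wrapping the loop in a single BEGIN/COMMIT pays the commit cost once.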
I have recently worked on a project using SQL CE 4 and NHibernate, and I found the performance to be really good. With SQL CE 4 we were able to insert 8000 records in a second. With Oracle over the network we were only able to insert 100 records per second, even when the batch-size and seqhilo approaches were used.
I did not test it, but judging by some of the performance reports for NoSQL products for .NET, SQL CE 4 seems to be one of the best solutions for stand-alone applications.
Just avoid using Identity columns: we noticed the performance was 40 times better when they are not used. The same 8000 records took 40 seconds to insert when an Identity column was used as the PK.