How can I merge many SQLite databases? - database

If I have a large number of SQLite databases, all with the same schema, what is the best way to merge them together in order to perform a query on all databases?
I know it is possible to use ATTACH to do this but it has a limit of 32 and 64 databases depending on the memory system on the machine.

To summarize from the Nabble post in DavidM's answer:
attach 'c:\test\b.db3' as toMerge;
BEGIN;
insert into AuditRecords select * from toMerge.AuditRecords;
COMMIT;
detach toMerge;
Repeat as needed.
Note: added detach toMerge; as per mike's comment.

Although a very old thread, this is still a relevant question in today's programming needs. I am posting this here because none of the answers provided yet is concise, easy, and straight-to-point. This is for sake of Googlers that end up on this page. GUI we go:
Download Sqlitestudio
Add all your database files by using the Ctrl + O keyboard shortcut
Double-click each now-loaded db file to open/activate/expand them all
Fun part: simply right-click on each of the tables and click on Copy, and then go to the target database in the list of the loaded database files (or create new one if required) and right-click on the target db and click on Paste
I was wowed to realize that such a daunting task can be solved using the ancient programming skill called: copy-and-paste :)

Here is a simple python code to either merge two database files or scan a directory to find all database files and merge them all together (by simply inserting all data in other files to the first database file found).Note that this code just attaches the databases with the same schema.
import sqlite3
import os
def merge_databases(db1, db2):
con3 = sqlite3.connect(db1)
con3.execute("ATTACH '" + db2 + "' as dba")
con3.execute("BEGIN")
for row in con3.execute("SELECT * FROM dba.sqlite_master WHERE type='table'"):
combine = "INSERT OR IGNORE INTO "+ row[1] + " SELECT * FROM dba." + row[1]
print(combine)
con3.execute(combine)
con3.commit()
con3.execute("detach database dba")
def read_files(directory):
fname = []
for root,d_names,f_names in os.walk(directory):
for f in f_names:
c_name = os.path.join(root, f)
filename, file_extension = os.path.splitext(c_name)
if (file_extension == '.sqlitedb'):
fname.append(c_name)
return fname
def batch_merge(directory):
db_files = read_files(directory)
for db_file in db_files[1:]:
merge_databases(db_files[0], db_file)
if __name__ == '__main__':
batch_merge('/directory/to/database/files')

Late answer, but you can use:
#!/usr/bin/python
import sys, sqlite3
class sqlMerge(object):
"""Basic python script to merge data of 2 !!!IDENTICAL!!!! SQL tables"""
def __init__(self, parent=None):
super(sqlMerge, self).__init__()
self.db_a = None
self.db_b = None
def loadTables(self, file_a, file_b):
self.db_a = sqlite3.connect(file_a)
self.db_b = sqlite3.connect(file_b)
cursor_a = self.db_a.cursor()
cursor_a.execute("SELECT name FROM sqlite_master WHERE type='table';")
table_counter = 0
print("SQL Tables available: \n===================================================\n")
for table_item in cursor_a.fetchall():
current_table = table_item[0]
table_counter += 1
print("-> " + current_table)
print("\n===================================================\n")
if table_counter == 1:
table_to_merge = current_table
else:
table_to_merge = input("Table to Merge: ")
return table_to_merge
def merge(self, table_name):
cursor_a = self.db_a.cursor()
cursor_b = self.db_b.cursor()
new_table_name = table_name + "_new"
try:
cursor_a.execute("CREATE TABLE IF NOT EXISTS " + new_table_name + " AS SELECT * FROM " + table_name)
for row in cursor_b.execute("SELECT * FROM " + table_name):
print(row)
cursor_a.execute("INSERT INTO " + new_table_name + " VALUES" + str(row) +";")
cursor_a.execute("DROP TABLE IF EXISTS " + table_name);
cursor_a.execute("ALTER TABLE " + new_table_name + " RENAME TO " + table_name);
self.db_a.commit()
print("\n\nMerge Successful!\n")
except sqlite3.OperationalError:
print("ERROR!: Merge Failed")
cursor_a.execute("DROP TABLE IF EXISTS " + new_table_name);
finally:
self.db_a.close()
self.db_b.close()
return
def main(self):
print("Please enter name of db file")
file_name_a = input("File Name A:")
file_name_b = input("File Name B:")
table_name = self.loadTables(file_name_a, file_name_b)
self.merge(table_name)
return
if __name__ == '__main__':
app = sqlMerge()
app.main()
SRC : Tool to merge identical SQLite3 databases

If you only need to do this merge operation once (to create a new bigger database), you could create a script/program that will loop all your sqlite databases and then insert the data into your main (big) database.

If you have reached the bottom of this feed and yet didn't find your solution, here is also a way to merge the tables of 2 or more sqlite databases.
First try to download and install DB browser for sqlite database. Then try to open your databases in 2 windows and try merging them by simply drag and drop tables from one to another. But the problem is that you can just drag and drop only one table at a time and therefore its not really a solution for this answer specifically but yet it can used to save some time from further searches if your database is small.

With no offense, just as one developer to another, I'm afraid that your idea seems terribly inefficient.
It seems to me that instead of uniting SQLite databases you should probably be storing several tables within the same Database file.
However if I'm mistaken I guess you could ATTACH the databases and then use a VIEW to simplify your queries. Or make an in-memory table and copy over all the data (but that's even worse performance wise, especially if you have large databases)

Related

How to use SQL Server Bulk Insert in Kedro Node?

I am managing a data pipeline using Kedro and at the last step I have a huge csv file stored in a S3 bucket and I need to load it back to SQL Server.
I'd normally go about that with a bulk insert, but not quite sure how to fit that into the kedro templates. This are the destination table and the S3 Bucket as configured in the catalog.yml
flp_test:
type: pandas.SQLTableDataSet
credentials: dw_dev_credentials
table_name: flp_tst
load_args:
schema: 'dwschema'
save_args:
schema: 'dwschema'
if_exists: 'replace'
bulk_insert_input:
type: pandas.CSVDataSet
filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
credentials: dev_s3
def insert_data(self, conn, csv_file_nm, db_table_nm):
qry = "BULK INSERT " + db_table_nm + " FROM '" + csv_file_nm + "' WITH (FORMAT = 'CSV')"
# Execute the query
cursor = conn.cursor()
success = cursor.execute(qry)
conn.commit()
cursor.close
How do I point csv_file_nm to my bulk_insert_input S3 catalog?
Is there a proper way to indirectly access dw_dev_credentials to do the insert?
Kedro's pandas.SQLTableDataSet.html uses the pandas.to_sql method as is. To use this as is you would need one pandas.CSVDataSet into a node which then writes to a target pandas.SQLDataTable dataset in order to write it to SQL. If you have Spark available this will be faster than Pandas.
In order to use the built in BULK INSERT query I think you will need to define a custom dataset.

SQL CLR Trigger - get source table

I am creating a DB synchronization engine using SQL CLR Triggers in Microsoft SQL Server 2012. These triggers do not call a stored procedure or function (and thereby have access to the INSERTED and DELETED pseudo-tables but do not have access to the ##procid).
Differences here, for reference.
This "sync engine" uses mapping tables to determine what the table and field maps are for this sync job. In order to determine the target table and fields (from my mapping table) I need to get the source table name from the trigger itself. I have come across many answers on Stack Overflow and other sites that say that this isn't possible. But, I've found one website that provides a clue:
Potential Solution:
using (SqlConnection lConnection = new SqlConnection(#"context connection=true")) {
SqlCommand cmd = new SqlCommand("SELECT object_name(resource_associated_entity_id) FROM sys.dm_tran_locks WHERE request_session_id = ##spid and resource_type = 'OBJECT'", lConnection);
cmd.CommandType = CommandType.Text;
var obj = cmd.ExecuteScalar();
}
This does in fact return the correct table name.
Question:
My question is, how reliable is this potential solution? Is the ##spid actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id? Will it stand up to multiple executions of the same and/or different triggers within the database?
From these sites, it seems the process Id is in fact limited to the open connection, which doesn't overlap: here, here, and here.
Will this be a safe method to get my source table?
Why?
As I've noticed similar questions, but all without a valid answer for my specific situation (except that one). Most of the comments on those sites ask "Why?", and in order to preempt that, here is why:
This synchronization engine operates on a single DB and can push changes to target tables, transforming the data with user-defined transformations, automatic source-to-target type casting and parsing and can even use the CSharpCodeProvider to execute methods also stored in those mapping tables for transforming data. It is already built, quite robust and has good performance metrics for what we are doing. I'm now trying to build it out to allow for 1:n table changes (including extension tables requiring the same Id as the 'master' table) and am trying to "genericise" the code. Previously each trigger had a "target table" definition hard coded in it and I was using my mapping tables to determine the source. Now I'd like to get the source table and use my mapping tables to determine all the target tables. This is used in a medium-load environment and pushes changes to a "Change Order Book" which a separate server process picks up to finish the CRUD operation.
Edit
As mentioned in the comments, the query listed above is quite "iffy". It will often (after a SQL Server restart, for example) return system objects like syscolpars or sysidxstats. But, it seems that in the dm_tran_locks table there's always an associated resource_type of 'RID' (Row ID) with the same object_name. My current query which works reliably so far is the following (will update if this changes or doesn't work under high load testing):
select t1.ObjectName FROM (
SELECT object_name(resource_associated_entity_id) as ObjectName
FROM sys.dm_tran_locks WHERE resource_type = 'OBJECT' and request_session_id = ##spid
) t1 inner join (
SELECT OBJECT_NAME(partitions.OBJECT_ID) as ObjectName
FROM sys.dm_tran_locks
INNER JOIN sys.partitions ON partitions.hobt_id = dm_tran_locks.resource_associated_entity_id
WHERE resource_type = 'RID'
) t2 on t1.ObjectName = t2.ObjectName
If this is always the case, I'll have to find that out during testing.
How reliable is this potential solution?
While I do not have time to set up a test case to show it not working, I find this approach (even taking into account the query in the Edit section) "iffy" (i.e. not guaranteed to always be reliable).
The main concerns are:
cascading (whether recursive or not) Trigger executions
User (i.e. Explicit / Implicit) transactions
Sub-processes (i.e. EXEC and sp_executesql)
These scenarios allow for multiple objects to be locked, all at the same time.
Is the ##SPID actually limited to this single trigger execution? Or is it possible that other simultaneous triggers will overlap within this process id?
and (from a comment on the question):
I think I can join my query up with the sys.partitions and get a dm_trans_lock that has a type of 'RID' with an object name that will match up to the one in my original query.
And here is why it shouldn't be entirely reliable: the Session ID (i.e. ##SPID) is constant for all of the requests on that Connection). So all sub-processes (i.e. EXEC calls, sp_executesql, Triggers, etc) will all be on the same ##SPID / session_id. So, between sub-processes and User Transactions, you can very easily get locks on multiple resources, all on the same Session ID.
The reason I say "resources" instead of "OBJECT" or even "RID" is that locks can occur on: rows, pages, keys, tables, schemas, stored procedures, the database itself, etc. More than one thing can be considered an "OBJECT", and it is possible that you will have page locks instead of row locks.
Will it stand up to multiple executions of the same and/or different triggers within the database?
As long as these executions occur in different Sessions, then they are a non-issue.
ALL THAT BEING SAID, I can see where simple testing would show that your current method is reliable. However, it should also be easy enough to add more detailed tests that include an explicit transaction that first does some DML on another table, or have a trigger on one table do some DML on one of these tables, etc.
Unfortunately, there is no built-in mechanism that provides the same functionality that ##PROCID does for T-SQL Triggers. I have come up with a scheme that should allow for getting the parent table for a SQLCLR Trigger (that takes into account these various issues), but haven't had a chance to test it out. It requires using a T-SQL trigger, set as the "first" trigger, to set info that can be discovered by the SQLCLR Trigger.
A simpler form can be constructed using CONTEXT_INFO, if you are not already using it for something else (and if you don't already have a "first" Trigger set). In this approach you would still create a T-SQL Trigger, and then set it as the "first" Trigger using sp_settriggerorder. In this Trigger you SET CONTEXT_INFO to the table name that is the parent of ##PROCID. You can then read CONTEXT_INFO() on a Context Connection in a SQLCLR Trigger. If there are multiple levels of Triggers then the value of CONTEXT INFO will get overwritten, so reading that value must be the first thing you do in each SQLCLR Trigger.
This is an old thread, but it is an FAQ and I think I have a better solution. Essentially it uses the schema of the inserted or deleted table to find the base table by doing a hash of the column names and comparing the hash with the hashes of tables with a CLR trigger on them.
Code snippet below - at some point I will probably put the whole solution on Git (it sends a message to Azure Service Bus when the trigger fires).
private const string colqry = "select top 1 * from inserted union all select top 1 * from deleted";
private const string hashqry = "WITH cols as ( "+
"select top 100000 c.object_id, column_id, c.[name] "+
"from sys.columns c "+
"JOIN sys.objects ot on (c.object_id= ot.parent_object_id and ot.type= 'TA') " +
"order by c.object_id, column_id ) "+
"SELECT s.[name] + '.' + o.[name] as 'TableName', CONVERT(NCHAR(32), HASHBYTES('MD5',STRING_AGG(CONVERT(NCHAR(32), HASHBYTES('MD5', cols.[name]), 2), '|')),2) as 'MD5Hash' " +
"FROM cols "+
"JOIN sys.objects o on (cols.object_id= o.object_id) "+
"JOIN sys.schemas s on (o.schema_id= s.schema_id) "+
"WHERE o.is_ms_shipped = 0 "+
"GROUP BY s.[name], o.[name]";
public static void trgSendSBMsg()
{
string table = "";
SqlCommand cmd;
SqlDataReader rdr;
SqlTriggerContext trigContxt = SqlContext.TriggerContext;
SqlPipe p = SqlContext.Pipe;
using (SqlConnection con = new SqlConnection("context connection=true"))
{
try
{
con.Open();
string tblhash = "";
using (cmd = new SqlCommand(colqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
if (rdr.Read())
{
MD5 hash = MD5.Create();
StringBuilder hashstr = new StringBuilder(250);
for (int i=0; i < rdr.FieldCount; i++)
{
if (i > 0) hashstr.Append("|");
hashstr.Append(GetMD5Hash(hash, rdr.GetName(i)));
}
tblhash = GetMD5Hash(hash, hashstr.ToString().ToUpper()).ToUpper();
}
rdr.Close();
}
}
using (cmd = new SqlCommand(hashqry, con))
{
using (rdr = cmd.ExecuteReader(CommandBehavior.SingleResult))
{
while (rdr.Read())
{
string hash = rdr.GetString(1).ToUpper();
if (hash == tblhash)
{
table = rdr.GetString(0);
break;
}
}
rdr.Close();
}
}
if (table.Length == 0)
{
p.Send("Error: Unable to find table that CLR trigger is on. Message not sent!");
return;
}
….
HTH

RODBC::sqlSave - problems creating/appending to a table

Related to several other questions on the RODBC package, I'm having problems using RODBC::sqlSave to write to a table on a SQL Server database. I'm using MS SQL Server 2008 and 64-bit R on a Windows RDP.
The solution in the 3rd link (questions) does work [sqlSave(ch, df)]. But in this case, it writes to the wrong data base. That is, my default DB is "C2G" but I want to write to "BI_Sandbox". And it doesn't allow for options such as rownames, etc. So there still seems to be a problem in the package.
Obviously, a possible solution would be to change my ODBC solution to the specified database, but it seems there should be a better method. And this wouldn't solve the problem of unusable parameters in the sqlSave command--such as rownames, varTypes, etc.
I have the following ODBC- System DSN connnection:
Microsoft SQL Server Native Client Version 11.00.3000
Data Source Name: c2g
Data Source Description: c2g
Server: DC01-WIN-SQLEDW\BISQL01,29537
Use Integrated Security: Yes
Database: C2G
Language: (Default)
Data Encryption: No
Trust Server Certificate: No
Multiple Active Result Sets(MARS): No
Mirror Server:
Translate Character Data: Yes
Log Long Running Queries: No
Log Driver Statistics: No
Use Regional Settings: No
Use ANSI Quoted Identifiers: Yes
Use ANSI Null, Paddings and Warnings: Yes
R code:
R> ch <- odbcConnect("c2g")
R> sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]",
append= FALSE, rownames= FALSE, colnames= FALSE)
Error in sqlColumns(channel, tablename) :
‘[bi_sandbox].[dbo].[table1]’: table not found on channel
# after error, try again:
R> sqlDrop(ch, "[bi_sandbox].[dbo].[table1]", errors = FALSE)
R> sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]",
append= FALSE, rownames= FALSE, colnames= FALSE)
Error in sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]", :
42S01 2714 [Microsoft][SQL Server Native Client 11.0][SQL Server]There is already an object named 'table1' in the database.
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE [bi_sandbox].[dbo].[table1] ("credibility_review" float, "creditbuilder" float, "no_product" float, "duns" varchar(255), "pos_credrev" varchar(5), "pos_credbuild" varchar(5))'
In the past, I've gotten around this by running the supremely inefficient sqlQuery with insert into row-by-row to get around this. But I tried this time and no data was written. Although the sqlQuery statement did not have an error or warning message.
temp <-"INSERT INTO [bi_sandbox].[dbo].[table1]
+ (credibility_review, creditbuilder, no_product, duns, pos_credrev, pos_credbuild) VALUES ("
>
> for(i in 1:nrow(zinq_scores)) {
+ sqlQuery(ch, paste(temp, "'", zinq_scores[i, 1], "'",",", " ",
+ "'", zinq_scores[i, 2], "'", ",",
+ "'", zinq_scores[i, 3], "'", ",",
+ "'", zinq_scores[i, 4], "'", ",",
+ "'", zinq_scores[i, 5], "'", ",",
+ "'", zinq_scores[i, 6], "'", ")"))
+ }
> str(sqlQuery(ch, "select * from [bi_sandbox].[dbo].[table1]"))
'data.frame': 0 obs. of 6 variables:
$ credibility_review: chr
$ creditbuilder : chr
$ no_product : chr
$ duns : chr
$ pos_credrev : chr
$ pos_credbuild : chr
Any help would be greatly appreciated.
Also, if there is any missing detail, please let me know and I'll edit the question.
My apologies up front. This is not exactly a "simple example." It's pretty trivial, but there are a lot of parts. And by the end, you'll probably think I'm crazy for doing it this way.
Starting in SQL Server Management Studio
First, I've created a database on SQL Server called mtcars with default schema dbo. I've also added myself as a user. Under my own user name, I am the database owner, so I can do anything I want to the database, but from R, I will connect using a generic account that only has EXECUTE privileges.
The predefined table in the database that we are going to write to is called mtcars. (So the full path to the table is mtcars.dbo.mtcars; it's lazy, I know). The code to define the table is
USE [mtcars]
GO
/****** Object: Table [dbo].[mtcars] Script Date: 2/22/2016 11:56:53 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[mtcars](
[OID] [int] IDENTITY(1,1) NOT NULL,
[mpg] [numeric](18, 0) NULL,
[cyl] [numeric](18, 0) NULL,
[disp] [numeric](18, 0) NULL,
[hp] [numeric](18, 0) NULL
) ON [PRIMARY]
GO
Stored Procedures
I'm going to use two stored procedures. The first is an "UPSERT" procedure, that will first try to update a row in a table. If that fails, it will insert the row into the table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE dbo.sample_procedure
#OID int = 0,
#mpg numeric(18,0) = 0,
#cyl numeric(18,0) = 0,
#disp numeric(18,0) = 0,
#hp numeric(18,0) = 0
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
-- TRANSACTION code borrowed from
-- http://stackoverflow.com/a/21209131/1017276
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
UPDATE dbo.mtcars
SET mpg = #mpg,
cyl = #cyl,
disp = #disp,
hp = #hp
WHERE OID = #OID;
IF ##ROWCOUNT = 0
BEGIN
INSERT dbo.mtcars (mpg, cyl, disp, hp)
VALUES (#mpg, #cyl, #disp, #hp)
END
COMMIT TRANSACTION;
END
GO
Another stored procedure I will use is just the equivalent of RODBC::sqlFetch. As far as I can tell, sqlFetch depends on SQL injection, and I'm not allowed to use it. Just to be on the safe side of our data security policies, I write little procedures like this (Data security is pretty tight here, you may or may not need this)
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE dbo.get_mtcars
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
SELECT * FROM dbo.mtcars
END
GO
Now, from R
I have a utility function I use to help me manage inputting data into the stored procedures. sqlSave would do a lot of this automatically, so I'm kind of reinventing the wheel. The gist of the utility function is to determine if the value I'm pushing to the database needs to be nested in quotes or not.
#* Utility function. This does a couple helpful things like
#* Convert NA and NULL into a SQL NULL
#* wrap character strings and dates in single quotes
sqlNullString <- function(value, numeric=FALSE)
{
if (is.null(value)) value <- "NULL"
if (is.na(value)) value <- "NULL"
if (inherits(value, "Date")) value <- format(x = value, format = "%Y-%m-%d")
if (value == "NULL") return(value)
else if (numeric) return(value)
else return(paste0("'", value, "'"))
}
This next step isn't strictly necessary, but I'm going to do it just so that my R table is similar to my SQL table. This is organizational strategy on my part.
mtcars$OID <- NA
Now let's establish our connection:
server <- "[server_name]"
uid <- "[generic_user_name]"
pwd <- "[password]"
library(RODBC)
channel <- odbcDriverConnect(paste0("driver=SQL Server;",
"server=", server, ";",
"database=mtcars;",
"uid=", uid, ";",
"pwd=", pwd))
Now this next part is pure laziness. I'm going to use a for loop to push each row of the data frame the to SQL table one at a time. As noted in the original question, this is kind of inefficient. I'm sure I could write a stored procedure to accept several vectors of data, compile them into a temporary table, and do the UPSERT in SQL, but I don't work with large data sets when I'm doing this, and so it hasn't yet been worth it to me to write such a procedure. Instead, I prefer to stick with the code that is a little easier for me to reason with on my limited SQL skills.
Here, we're just going to push the first 5 rows of mtcars
#* Insert the first 5 rows into the SQL Table
for (i in 1:5)
{
sqlQuery(channel = channel,
query = paste0("EXECUTE dbo.sample_procedure ",
"#OID = ", sqlNullString(mtcars$OID[i]), ", ",
"#mpg = ", mtcars$mpg[i], ", ",
"#cyl = ", mtcars$cyl[i], ", ",
"#disp = ", mtcars$disp[i], ", ",
"#hp = ", mtcars$hp[i]))
}
And now we'll take a look at the table from SQL
sqlQuery(channel = channel,
query = "EXECUTE dbo.get_mtcars")
This next line is just to match up the OIDs in R and SQL for illustration purposes. Normally, I would do this manually.
mtcars$OID[1:5] <- 1:5
This next for loop will UPSERT all 32 rows. We already have 5, we're UPSERTing 32, and the SQL table at the end should have 32 if we've done it correctly. (That is, SQL will recognize the 5 rows that already exist)
#* Update/Insert (UPSERT) the entire table
for (i in 1:nrow(mtcars))
{
sqlQuery(channel = channel,
query = paste0("EXECUTE dbo.sample_procedure ",
"#OID = ", sqlNullString(mtcars$OID[i]), ", ",
"#mpg = ", mtcars$mpg[i], ", ",
"#cyl = ", mtcars$cyl[i], ", ",
"#disp = ", mtcars$disp[i], ", ",
"#hp = ", mtcars$hp[i]))
}
#* Notice that the first 5 rows were unchanged (though they would have changed
#* if we had changed the data...the point being that the stored procedure
#* correctly identified that these records already existed)
sqlQuery(channel = channel,
query = "EXECUTE dbo.get_mtcars")
Recap
The stored procedure approach has a major disadvantage in that it is blatantly reinventing the wheel. It also requires that you learn SQL. SQL is pretty easy to learn for simple tasks, but some of the code I've written for more complex tasks is pretty difficult to interpret. Some of my procedures have taken me the better part of a day to get right. (once they are done, however, they work incredibly well)
The other big disadvantage to the stored procedure is, I've noticed, it does require a little bit more code work and organization. I'd say it's probably been about 10% more code work and documentation than if I were just using SQL Injection.
The chief advantages of the stored procedures approach are
you have massive flexibility for what you want to do
You can store your SQL code into the database and not pollute your R code with potentially huge strings of SQL code
Avoiding SQL injection (again, this is a data security thing, and may not be an issue depending on your employer's policies. I'm strictly forbidden from using SQL injection, so stored procedures are my only option)
It should also be noted that I've not yet explored using Table-Valued parameters in my stored procedures, which might simplify things for me a bit.
In the past, I've gotten around this by running the supremely inefficient sqlQuery with insert into row-by-row to get around this. But I tried this time and no data was written. Although the sqlQuery statement did not have an error or warning message.
Faced it yesterday: in my case the issue was in scheme. The table was actually created but in my user own scheme.
First time you can create it and than you have this error (that object already exists)
After the investigation I found that some packages does not work correctly with schemes.
In the end I used "insert by line" solution. The solution is available here and here

Accessing SQLite Database in R

I want to access and manipulate a large data set in R. Since it's a large CSV file (~ 0.5 GB), I plan to import it
to SQLite and then access it from R. I know the sqldf and RSQLite packages can do this but I went
over their manuals and they are not helpful. Being a newbie to SQL doesn't help either.
I want to know do I have to set the R directory to SQLite's and then go from there? How do I read in the database in R then?
Heck, if you know how to access the DB from R without using SQL, please tell me.
Thanks!
It really is rather easy -- the path and filename to the sqlite db file is passed as the 'database' parameter. Here is CRANberries does:
databasefile <- "/home/edd/cranberries/cranberries.sqlite"
## ...
## main worker function
dailyUpdate <- function() {
stopifnot(all.equal(system("fping cran.r-project.org", intern=TRUE),
"cran.r-project.org is alive"))
setwd("/home/edd/cranberries")
dbcon <- dbConnect(dbDriver("SQLite"), dbname = databasefile)
repos <- dbGetQuery(dbcon,
paste("select max(id) as id, desc, url ",
"from repos where desc!='omegahat' group by desc")
# ...
That's really all there is. Of course, there are other queries later on...
You easily test all SQL queries in the sqlite client before trying from R, or trying directly from R.
Edit: As the above was apparently too terse, here is an example straight from the documentation:
con <- dbConnect(SQLite(), ":memory:") ## in-memory, replace with file
data(USArrests)
dbWriteTable(con, "arrests", USArrests)
res <- dbSendQuery(con, "SELECT * from arrests")
data <- fetch(res, n = 2)
data
dbClearResult(res)
dbGetQuery(con, "SELECT * from arrests limit 3")

SQLite to Oracle

I have a SQLite database in one system, I need to extract the data stored in SQLite to Oracle database. How do I do this?
Oracle provides product called the Oracle Database Mobile Server (previously called Oracle Database Lite) which allows you to synchronize between a SQLite and an Oracle database. It provides scalable bi-directional sync, schema mapping, security, etc. The Mobile Server supports both synchronous and asynchronous data sync. If this is more than a one-time export and you need to keep your SQLite and Oracle Databases in sync, this is a great tool!
Disclaimer: I'm one of the Product Managers for Oracle Database Mobile Server, so I'm a bit biased. However, the Mobile Server really is a great tool to use for keeping your SQLite (or Berkeley DB) and Oracle Databases in sync.
You'll have to convert the SQLite to a text file (not certain of the format) and then use Oracle to load the database from text (source is http://www.orafaq.com/wiki/SQLite). You can use the .dump command from the SQLite interactive shell to dump to a text file (see the docs for syntax).
SQL Loader is a utility that will read a delimited text file and import it into an oracle database. You will need to map out how each column from your flat file out of sqlite matches to the corresponding one in the Oracle database. Here is a good FAQ that should help you get started.
If you are a developer, you could develop an application to perform the sync. You would do
SELECT name FROM sqlite_master WHERE type='table'
to get the table names, then you could re-create them in Oracle (you can do DROP TABLE tablename in Oracle first, to avoid a conflict, assuming SQLite will be authoritative) with CREATE TABLE commands. Getting the columns for each one takes
SELECT sql FROM sqlite_master WHERE type='table' and name='MyTable'
And then you have to parse the result:
string columnNames = sql.replace(/^[^\(]+\(([^\)]+)\)/g, '$1').replace(/ [^,]+/g, '').split(',');
string[] columnArray = columnNames.Split(',');
foreach (string s in columnArray)
{
// Add column to table using:
// ALTER TABLE MyTable ADD COLUMN s NVARCHAR(250)
}
A StringBuilder can be used to collect the table name with its columns to create your INSERT command. To add the values, it would just be a matter of doing SELECT * FROM MyTable for each of the tables during your loop through the table names you got back from the initial query. You would iterate the columns of the rows of the datatable you were returned and add the values to the StringBuilder:
INSERT INTO MyTable ( + columnA, columnB, etc. + ) VALUES ( datarow[0], datarow[1], etc. + ).
Not exactly like that, though - you fill in the data by appending the column name and its data as you run through the loops. You can get the column names by appending s in that foreach loop, above. Each column value is then set using a foreach loop that gives you each object obj in drData.ItemArray. If all you have are string fields, it's easy, you just add obj.ToString() to your StringBuilder for each column value in your query like I have below. Then you run the query after collecting all of the column values for each row. You use a new StringBuilder for each row - it needs to get reset to INSERT INTO MyTable ( + columnA, columnB, etc. + ) VALUES ( prior to each new row, so the new column values can be appended.
If you have mixed datatypes (i.e. DATE, BLOB, etc.), you'll need to determine the column types along the way, store it in a list or array, then use a counter to determine the index of that list/array slot and get the type, so you know how to translate your object into something Oracle can use - whether that means simply adding to_date() to the result, with formatting, for a date (since SQLite stores these as date strings with the format yyyy-MM-dd HH:mm:ss), or adding it to an OracleParameter for a BLOB and sending that along to a RunOracleCommand function. (I did not go into this, below.)
Putting all of this together yields this:
string[] columnArray = null;
DataTable dtTableNames = GetSQLiteTable("SELECT name FROM sqlite_master WHERE type='table'");
if (dtTableNames != null && dtTableNames.Rows != null)
{
if (dtTableNames.Rows.Count > 0)
{
// We have tables
foreach (DataRow dr in dtTableNames.Rows)
{
// Do everything about this table here
StringBuilder sb = new StringBuilder();
sb.Append("INSERT INTO " + tableName + " ("); // we will collect column names here
string tableName = dr["NAME"] != null ? dr["NAME"].ToString() : String.Empty;
if (!String.IsNullOrEmpty(tableName))
{
RunOracleCommand("DROP TABLE " + tableName);
RunOracleCommand("CREATE TABLE " + tableName);
}
DataTable dtColumnNames = GetSQLiteTable("SELECT sql FROM sqlite_master WHERE type='table' AND name='"+tableName+"'");
if (dtColumnNames != null && dtColumnNames.Rows != null)
{
if (dtColumnNames.Rows.Count > 0)
{
// We have columns
foreach (DataRow drCol in dtTableNames.Rows)
{
string sql = drCol["SQL"] != null ? drCol["SQL"].ToString() : String.Empty;
if (!String.IsNullOrEmpty(sql))
{
string columnNames = sql.replace(/^[^\(]+\(([^\)]+)\)/g, '$1').replace(/ [^,]+/g, '').split(',');
columnArray = columnNames.Split(',');
foreach (string s in columnArray)
{
// Add column to table using:
RunOracleCommand("ALTER TABLE " + tableName + " ADD COLUMN " + s + " NVARCHAR(250)"); // can hard-code like this or use logic to determine the datatype/column width
sb.Append("'" + s + "',");
}
sb.TrimEnd(",");
sb.Append(") VALUES (");
}
}
}
}
// Get SQLite Table data for insertion to Oracle
DataTable dtTableData = GetSQLiteTable("SELECT * FROM " + tableName);
if (dtTableData != null && dtTableData.Rows != null)
{
if (dtTableData.Rows.Count > 0)
{
// We have data
foreach (DataRow drData in dtTableData.Rows)
{
StringBuilder sbRow = sb; // resets to baseline for each row
foreach (object obj in drData.ItemArray)
{
// This is simplistic and assumes you have string data for an NVARCHAR field
sbRow.Append("'" + obj.ToString() + "',");
}
sbRow.TrimEnd(",");
sbRow.Append(")");
RunOracleCommand(sbRow.ToString());
}
}
}
}
}
}
All of this assumes you have a RunOracleCommand() void function that can take a SQL command and run it against an Oracle DB, and a GetSQLiteTable() function that can return a DataTable from your SQLite DB by passing it a SQL command.
Note that this code is untested, as I wrote it directly in this post, but it is based heavily on code I wrote to sync Oracle into SQLite, which has been tested and works.

Resources