Accessing SQLite Database in R

I want to access and manipulate a large data set in R. Since it's a large CSV file (~0.5 GB), I plan to import it
into SQLite and then access it from R. I know the sqldf and RSQLite packages can do this, but I went
over their manuals and they were not helpful. Being a newbie to SQL doesn't help either.
Do I have to set the R working directory to SQLite's directory and then go from there? And how do I then read the database into R?
Heck, if you know how to access the DB from R without using SQL, please tell me.
Thanks!

It really is rather easy -- the path and filename of the SQLite db file is passed as the 'database' parameter. Here is what CRANberries does:
databasefile <- "/home/edd/cranberries/cranberries.sqlite"

## ...

## main worker function
dailyUpdate <- function() {
    stopifnot(all.equal(system("fping cran.r-project.org", intern=TRUE),
                        "cran.r-project.org is alive"))
    setwd("/home/edd/cranberries")

    dbcon <- dbConnect(dbDriver("SQLite"), dbname = databasefile)

    repos <- dbGetQuery(dbcon,
                        paste("select max(id) as id, desc, url ",
                              "from repos where desc!='omegahat' group by desc"))
    # ...
That's really all there is. Of course, there are other queries later on...
You can easily test all SQL queries in the sqlite3 command-line client before trying them from R, or just experiment directly from R.
Edit: As the above was apparently too terse, here is an example straight from the documentation:
con <- dbConnect(SQLite(), ":memory:") ## in-memory, replace with file
data(USArrests)
dbWriteTable(con, "arrests", USArrests)
res <- dbSendQuery(con, "SELECT * from arrests")
data <- fetch(res, n = 2)
data
dbClearResult(res)
dbGetQuery(con, "SELECT * from arrests limit 3")
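And since the original question is really about getting a large CSV into SQLite first, here is a minimal sketch of that step using RSQLite; the file and table names are placeholders, not anything from the question:
library(RSQLite)

## Hypothetical paths; adjust to your setup.
csvfile <- "mydata.csv"
dbfile  <- "mydata.sqlite"

con <- dbConnect(SQLite(), dbname = dbfile)

## Read the CSV once and write it into a SQLite table on disk.
dat <- read.csv(csvfile, stringsAsFactors = FALSE)
dbWriteTable(con, "mydata", dat, overwrite = TRUE)

## From then on, query the table without reloading the CSV.
head(dbGetQuery(con, "SELECT * FROM mydata LIMIT 5"))
dbDisconnect(con)
After the table exists in the .sqlite file, every later session only needs dbConnect() and dbGetQuery().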

Related

R Converting SQL Server Query with Geometry Datatype to spatialpolygonsdataframe

I am trying to plot geometry (binary) polygon data from an SQL Server data source. What I want to do is use the Geometry Data Type from the SQL query for the polygons, and also the rest of the columns in the query as the #data attribute table within the SpatialPolygonsDataFrame class.
This is my code so far, to get the SQL query data into a simple data.frame and convert the binary datatype using wkb::readWKB().
From this stage, I do not know how to create the SpatialPolygonsDataFrame dataframe.
library(RODBC)
library(maptools)
library(rgdal)
library(ggplot2)

dbhandle <- odbcDriverConnect("connection string", rows_at_time = 1)
sqlStatement <- "SELECT ID
  , shape.STAsBinary() as shape
  , meshblock_number
  , areaunit_code
  , dpz_code
  , catchment_id
  FROM [primary_parcels] hp"
sqlStatement <- gsub("[\r\n]", "", sqlStatement)
parcelData <- sqlQuery(dbhandle, sqlStatement)
odbcClose(dbhandle)
parcelData$shape <- wkb::readWKB(parcelData$shape)
This might be too late; however, with the help of a friend and this I found a solution. It seems a bit funny, but I am working on a similar set of data and had the same problem. Please remember that it is difficult to replicate your approach exactly because you did not provide a reproducible example, and you will also need to adjust the projection. Note that it is easier to build the ODBC connection once and reuse it, rather than writing out the connection string each time.
library(rgeos)
library(mapview)
library(raster)
library(dplyr)
library(sp)

# You may ignore this
odbcCh <- odbcConnect("Rtest")
# Note: STAsText() here so that rgeos::readWKT() can parse the geometry as WKT strings
sqlStatement <- 'SELECT ID, shape.STAsText() as shape, meshblock_number, areaunit_code, dpz_code, catchment_id FROM [primary_parcels] hp'
parcelData <- sqlQuery(odbcCh, sqlStatement)
# until here

# Convert each WKT geometry to a SpatialPolygons object
things <- vector("list", length(parcelData$shape))
z <- 0
for (line in parcelData$shape) {
  things[[z + 1]] <- readWKT(line)
  z <- z + 1
}

Things <- do.call(bind, things)
Things.df <- SpatialPolygonsDataFrame(Things, data.frame(parcelData$ID, parcelData$catchment_id))
plot(Things.df)

# You may not need the rest
proj4string(Things.df) <- CRS("+proj=nzmg +lat_0=-41.0 +lon_0=173.0 +x_0=2510000.0 +y_0=6023150.0 +ellps=intl +units=m")
mapview(Things.df)

R RODBCext and Parameterizing IN statement?

I've been working to parameterize a SQL statement that uses the IN operator in the WHERE clause. I'm using the RODBCext library for parameterizing, but it seems to lack expansion of a list.
I was hoping to write code such as
sqlExecute("SELECT * FROM table WHERE name IN (?)", c("paul", "ringo", "john", "george"))
I'm using the following code but wondered if there's an easier way.
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
qmarks <- replicate(length(names), "?")
stringmarks <- paste(qmarks, collapse = ",")
sql <- paste("SELECT * FROM tableA WHERE name IN (", stringmarks, ")")
# expand to Columns - seems to be the magic step required
bindnames <- rbind(names)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, bindnames, fetch = TRUE)
RODBC::odbcClose(dbhandle)
It works, but I feel I'm using R to expand the strings in the wrong way (bit new to R - so many ways to do the same thing wrong). Somebody will probably say "that creates factors - never do that" :-)
I found this article which suggests I'm on the right track, but it doesn't discuss having to expand the "?" and turn the list into columns of a data.frame:
R RODBC putting list of numbers into an IN() statement
Thank you.
UPDATE: As Benjamin shows below, the sqlExecute function can handle a list() of inputs. However, upon inspection of the resulting SQL I discovered that it uses cursors to roll up the results. This significantly increases the CPU and I/O over the sample code I show above.
While the library can indeed solve this for you, for large results it may be too expensive. There are two answers below, and which one fits depends upon your needs.
Since the only parameter in your query is the collection for the IN clause, you could get away with
sqlExecute(dbhandle,
           "SELECT * FROM table WHERE name IN (?)",
           list(c("paul", "ringo", "john", "george")),
           fetch = TRUE)
sqlExecute will bind the values in the list to the question mark. Here, it will actually repeat the query four times, once for each value in the vector. It may seem kind of silly to do it this way, but when trying to pass strings, it's a lot safer in many ways to let the binding take care of setting up the appropriate quote structure rather than trying to paste it in yourself. You will generate fewer errors this way and avoid a lot of database security concerns.
What if you declare a table variable in a character object and then concatenate it with the query?
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
sql_top <- paste0( "SET NOCOUNT ON \r\n DECLARE #LST_NAMES TABLE (ID NVARCHAR(20)) \r\n INSERT INTO #LST_NAMES VALUES ('", paste(names, collapse = "'), ('" ) , "')")
sql_body <- paste("SELECT * FROM tableA WHERE name IN (SELECT id FROM #LST_NAMES)")
sql <- paste0(sql_top, "\r\n", sql_body)
# Execute SQL statement (no bind data needed -- the values are already embedded in the SQL)
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, fetch = TRUE)
RODBC::odbcClose(dbhandle)
The generated query will be (the SET NOCOUNT ON is important in order to retrieve the results):
SET NOCOUNT ON
DECLARE #LST_NAMES TABLE (ID NVARCHAR(20))
INSERT INTO #LST_NAMES VALUES ('paul'), ('ringo'), ('john'), ('george')
SELECT * FROM tableA WHERE name IN (SELECT id FROM #LST_NAMES)

For "large" data, is it better to use sql connection or import a csv file

So I'm trying to connect to a database using dplyr and execute commands on that data. However, the process is taking far too long (> 10 minutes). In SQL Server, it takes around 2 minutes, so I could just export it as a CSV and then import it into R or Python. As a general rule, do you suggest using SQL connections from R or Python, or exporting a CSV file directly from the SQL database?
Here's the R code I'm using:
library(dplyr)

aw <- RSQLServer::src_sqlserver("****", database = "****")

dept <- tbl(aw, sql("select work_dt, campaign, keyword,
                            impressions, clicks, cost
                     from abidwise_detail
                     where work_dt between '2014-01-01' and '2014-05-01'"))

(dept <- tbl(aw, sql("select work_dt, campaign, keyword,
                             sum(impressions) as impressions,
                             sum(clicks) as clicks,
                             sum(cost) as cost
                      from abidwise_detail
                      where work_dt between '2014-01-01' and '2014-02-01'
                      group by work_dt, campaign, keyword")))

rd <- dept %>%
  filter(campaign == "ask")

# Bring the full data set back to R
dat <- collect(rd)
What should I do? Both of these queries take too long. Should I just export a CSV file and read the files in from a directory?
Thanks!
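For what it's worth, a minimal sketch of the usual dplyr-on-database pattern: push the filtering and aggregation to the server and only collect() the reduced result. This assumes a working DBI/dbplyr connection `con` to the same abidwise_detail table, which is not shown in the question:
library(dplyr)
library(dbplyr)

## `con` is assumed to be an existing DBI connection to the SQL Server database.
dept <- tbl(con, "abidwise_detail") %>%
  filter(work_dt >= "2014-01-01", work_dt <= "2014-02-01",
         campaign == "ask") %>%
  group_by(work_dt, campaign, keyword) %>%
  summarise(impressions = sum(impressions, na.rm = TRUE),
            clicks      = sum(clicks, na.rm = TRUE),
            cost        = sum(cost, na.rm = TRUE))

## Nothing is executed until collect(); only the aggregated rows come back to R.
dat <- collect(dept)
Whether this beats exporting a CSV depends mostly on the driver and the network, so it is worth timing both approaches on your data.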

Bind variables in R DBI

In R's DBI package, I'm not finding a facility for using bound variables. I did find a document (the original vignette from 2002) that says about bound variables, "Perhaps the DBI could at some point in the future implement this feature", but it looks like so far that's left undone.
What do people in R use for a substitute? Just concatenate strings right into the SQL? That's got some obvious problems for safety & performance.
EDIT:
Here's an example of how placeholders could work:
query <- "SELECT numlegs FROM animals WHERE color=?"
result <- dbGetQuery(caseinfo, query, bind="green")
That's not a very well-thought-out interface, but the idea is that you can use a value for bind and the driver handles the details of escaping (if the underlying API doesn't handle bound variables natively) without the caller having to reimplement it [badly].
For anyone coming to this question like I just did after googling for RSQLite and dbGetPreparedQuery: it seems that in the latest version of RSQLite you can run a SELECT query with bind variables. I just ran the following:
query <- "SELECT probe_type,next_base,color_channel FROM probes WHERE probeid=?"
probe.types.df <- dbGetPreparedQuery(con, query, bind.data = data.frame(probeids = ids))
This was relatively fast (selecting 2,000 rows out of a 450,000 row table) and is incredibly useful.
FYI.
Below is a summary of what's currently supported in RSQLite for bound parameters. You are right that there is currently no support for SELECT, but there is no good reason for this and I would like to add support for it.
If you feel like hacking, you can get a read-only checkout of all of the DBI-related packages here (use --user=readonly --password=readonly):
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk/DBI
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk/SQLite/RSQLite
I like to receive patches, especially if they include tests and documentation. Unified diff, please. I actually do all my development using git, so the best case is to create a git clone of, say, RSQLite and then send me diffs as git format-patch -n git-svn..
Anyhow, here are some examples:
library("RSQLite")
make_data <- function(n)
{
alpha <- c(letters, as.character(0:9))
make_key <- function(n)
{
paste(sample(alpha, n, replace = TRUE), collapse = "")
}
keys <- sapply(sample(1:5, replace=TRUE), function(x) make_key(x))
counts <- sample(seq_len(1e4), n, replace = TRUE)
data.frame(key = keys, count = counts, stringsAsFactors = FALSE)
}
key_counts <- make_data(100)
db <- dbConnect(SQLite(), dbname = ":memory:")
sql <- "
create table keys (key text, count integer)
"
dbGetQuery(db, sql)
bulk_insert <- function(sql, key_counts)
{
dbBeginTransaction(db)
dbGetPreparedQuery(db, sql, bind.data = key_counts)
dbCommit(db)
dbGetQuery(db, "select count(*) from keys")[[1]]
}
## for all styles, you can have up to 999 parameters
## anonymous
sql <- "insert into keys values (?, ?)"
bulk_insert(sql, key_counts)
## named w/ :, $, #
## names are matched against column names of bind.data
sql <- "insert into keys values (:key, :count)"
bulk_insert(sql, key_counts[ , 2:1])
sql <- "insert into keys values ($key, $count)"
bulk_insert(sql, key_counts)
sql <- "insert into keys values (#key, #count)"
bulk_insert(sql, key_counts)
## indexed (NOT CURRENTLY SUPPORTED)
## sql <- "insert into keys values (?1, ?2)"
## bulk_insert(sql)
Hey hey - I just discovered that RSQLite, which is what I'm using in this case, does indeed have bound-variable support:
http://cran.r-project.org/web/packages/RSQLite/NEWS
See the entry about dbSendPreparedQuery() and dbGetPreparedQuery().
So in theory, that turns this nastiness:
df <- data.frame()
for (x in data$guid) {
query <- paste("SELECT uuid, cites, score FROM mytab WHERE uuid='",
x, "'", sep="")
df <- rbind(df, dbGetQuery(con, query))
}
into this:
df <- dbGetPreparedQuery(
con, "SELECT uuid, cites, score FROM mytab WHERE uuid=:guid", data)
Unfortunately, when I actually try it, it seems that it's only for INSERT statements and the like, not for SELECT statements, because I get an error: RS-DBI driver: (cannot have bound parameters on a SELECT statement).
Providing that capability would be fantastic.
The next step would be to hoist this up into DBI itself so that all DBs can take advantage of it, and provide a default implementation that just pastes it into the string like we're all doing ourselves now.
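For readers landing here years later: this has since happened. Current versions of DBI and RSQLite support parameterized SELECTs natively, via the params argument and dbBind(). A minimal sketch with an in-memory database (the mytab table and its columns mirror the example above):
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "mytab",
             data.frame(uuid = c("a", "b"), cites = 1:2, score = c(0.5, 0.9)))

## Positional placeholder, bound via params=
dbGetQuery(con, "SELECT uuid, cites, score FROM mytab WHERE uuid = ?",
           params = list("a"))

## Or explicitly with dbSendQuery()/dbBind()
res <- dbSendQuery(con, "SELECT * FROM mytab WHERE score > ?")
dbBind(res, list(0.6))
dbFetch(res)
dbClearResult(res)
dbDisconnect(con)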

How can I merge many SQLite databases?

If I have a large number of SQLite databases, all with the same schema, what is the best way to merge them together in order to perform a query on all databases?
I know it is possible to use ATTACH to do this but it has a limit of 32 and 64 databases depending on the memory system on the machine.
To summarize from the Nabble post in DavidM's answer:
attach 'c:\test\b.db3' as toMerge;
BEGIN;
insert into AuditRecords select * from toMerge.AuditRecords;
COMMIT;
detach toMerge;
Repeat as needed.
Note: added detach toMerge; as per mike's comment.
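If you are driving this from R anyway, the same ATTACH/INSERT/DETACH sequence can be scripted with RSQLite. A minimal sketch using the current DBI dbExecute() (with older versions, substitute dbGetQuery()); the file names a.db3 and b.db3 and the AuditRecords table are placeholders:
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "a.db3")   # target database (placeholder name)
dbExecute(con, "ATTACH 'b.db3' AS toMerge")    # database to merge in (placeholder name)
dbExecute(con, "BEGIN")
dbExecute(con, "INSERT INTO AuditRecords SELECT * FROM toMerge.AuditRecords")
dbExecute(con, "COMMIT")
dbExecute(con, "DETACH toMerge")
dbDisconnect(con)
Wrap the loop over your remaining files around the ATTACH/INSERT/DETACH lines and you have the whole merge in a few lines of R.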
Although this is a very old thread, it is still a relevant question for today's programming needs. I am posting this because none of the answers provided so far is concise, easy, and straight to the point. This is for the sake of Googlers who end up on this page. GUI we go:
Download Sqlitestudio
Add all your database files by using the Ctrl + O keyboard shortcut
Double-click each now-loaded db file to open/activate/expand them all
Fun part: simply right-click on each of the tables and click on Copy, and then go to the target database in the list of the loaded database files (or create new one if required) and right-click on the target db and click on Paste
I was wowed to realize that such a daunting task can be solved using the ancient programming skill called: copy-and-paste :)
Here is a simple Python script to either merge two database files, or scan a directory, find all database files, and merge them all together (by simply inserting all data from the other files into the first database file found). Note that this code assumes the attached databases share the same schema.
import sqlite3
import os

def merge_databases(db1, db2):
    con3 = sqlite3.connect(db1)
    con3.execute("ATTACH '" + db2 + "' as dba")
    con3.execute("BEGIN")
    for row in con3.execute("SELECT * FROM dba.sqlite_master WHERE type='table'"):
        combine = "INSERT OR IGNORE INTO " + row[1] + " SELECT * FROM dba." + row[1]
        print(combine)
        con3.execute(combine)
    con3.commit()
    con3.execute("detach database dba")

def read_files(directory):
    fname = []
    for root, d_names, f_names in os.walk(directory):
        for f in f_names:
            c_name = os.path.join(root, f)
            filename, file_extension = os.path.splitext(c_name)
            if file_extension == '.sqlitedb':
                fname.append(c_name)
    return fname

def batch_merge(directory):
    db_files = read_files(directory)
    for db_file in db_files[1:]:
        merge_databases(db_files[0], db_file)

if __name__ == '__main__':
    batch_merge('/directory/to/database/files')
Late answer, but you can use:
#!/usr/bin/python
import sys, sqlite3

class sqlMerge(object):
    """Basic python script to merge data of 2 !!!IDENTICAL!!!! SQL tables"""

    def __init__(self, parent=None):
        super(sqlMerge, self).__init__()
        self.db_a = None
        self.db_b = None

    def loadTables(self, file_a, file_b):
        self.db_a = sqlite3.connect(file_a)
        self.db_b = sqlite3.connect(file_b)

        cursor_a = self.db_a.cursor()
        cursor_a.execute("SELECT name FROM sqlite_master WHERE type='table';")

        table_counter = 0
        print("SQL Tables available: \n===================================================\n")
        for table_item in cursor_a.fetchall():
            current_table = table_item[0]
            table_counter += 1
            print("-> " + current_table)
        print("\n===================================================\n")

        if table_counter == 1:
            table_to_merge = current_table
        else:
            table_to_merge = input("Table to Merge: ")

        return table_to_merge

    def merge(self, table_name):
        cursor_a = self.db_a.cursor()
        cursor_b = self.db_b.cursor()

        new_table_name = table_name + "_new"

        try:
            cursor_a.execute("CREATE TABLE IF NOT EXISTS " + new_table_name + " AS SELECT * FROM " + table_name)
            for row in cursor_b.execute("SELECT * FROM " + table_name):
                print(row)
                cursor_a.execute("INSERT INTO " + new_table_name + " VALUES" + str(row) + ";")

            cursor_a.execute("DROP TABLE IF EXISTS " + table_name)
            cursor_a.execute("ALTER TABLE " + new_table_name + " RENAME TO " + table_name)
            self.db_a.commit()
            print("\n\nMerge Successful!\n")
        except sqlite3.OperationalError:
            print("ERROR!: Merge Failed")
            cursor_a.execute("DROP TABLE IF EXISTS " + new_table_name)
        finally:
            self.db_a.close()
            self.db_b.close()
        return

    def main(self):
        print("Please enter name of db file")
        file_name_a = input("File Name A:")
        file_name_b = input("File Name B:")
        table_name = self.loadTables(file_name_a, file_name_b)
        self.merge(table_name)
        return

if __name__ == '__main__':
    app = sqlMerge()
    app.main()
SRC : Tool to merge identical SQLite3 databases
If you only need to do this merge operation once (to create a new, bigger database), you could write a script/program that loops over all your SQLite databases and inserts their data into your main (big) database.
If you have reached the bottom of this thread and still haven't found your solution, here is another way to merge the tables of two or more SQLite databases.
Download and install DB Browser for SQLite, open your databases in two windows, and merge them by dragging and dropping tables from one window to the other. The catch is that you can only drag and drop one table at a time, so this isn't really a general solution, but it can save some time if your databases are small.
No offense, just as one developer to another, but I'm afraid your idea seems terribly inefficient.
It seems to me that instead of uniting SQLite databases you should probably be storing several tables within the same database file.
However, if I'm mistaken, I guess you could ATTACH the databases and then use a VIEW to simplify your queries. Or make an in-memory table and copy over all the data (but that's even worse performance-wise, especially if you have large databases).
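For completeness, a minimal sketch of the ATTACH-plus-VIEW idea from R; the file names and the AuditRecords table are placeholders borrowed from the earlier answer:
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "part1.db")   # placeholder file names
dbExecute(con, "ATTACH 'part2.db' AS p2")
dbExecute(con, "ATTACH 'part3.db' AS p3")

## A temporary view that unions the same table from every attached database,
## so queries can be written once against 'all_records'.
dbExecute(con, "
  CREATE TEMP VIEW all_records AS
    SELECT * FROM AuditRecords
    UNION ALL SELECT * FROM p2.AuditRecords
    UNION ALL SELECT * FROM p3.AuditRecords
")
dbGetQuery(con, "SELECT count(*) FROM all_records")
dbDisconnect(con)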
