How to write binary data into SQLite with R DBI's dbWriteTable()?

How to write binary data into SQLite with R DBI's dbWriteTable()? - database

For instance, how to execute the equivalent following SQL (which inserts into a BINARY(16) field)
INSERT INTO Table1 (MD5) VALUES (X'6717f2823d3202449201145073ab871A'),(X'6717f2823d3202449301145073ab371A')
using dbWriteTable()? Doing
dbWriteTable(db, "Table1", data.frame(MD5 = "X'6717f2823d3202449201145073ab871A'", ...), append = T, row.names = F)
doesn't seem to work - it writes the values as text.
In the end, I'm going to have a big data.frame of hashes that I want to write, and so perfect for using dbWriteTable. But I just can't figure out how to INSERT the data.frame into binary database fields.

So here are two possibilities that seem to work. The first uses dbSendQuery(...) in a loop (you've probably thought of this already...).
db.WriteTable = function(con,table,df) { # no error checking whatsoever...
require(DBI)
field <- colnames(df)[1]
for (i in 1:nrow(df)) {
query <- sprintf("INSERT INTO %s (%s) VALUES (X'%s')",table,field,df[i,1])
rs <- dbSendQuery(con,statement=query)
}
return(nrow(df))
}
library(DBI)
drv <- dbDriver("SQLite")
con <- dbConnect(drv)
rs <- dbSendQuery(con, statement="CREATE TABLE hash (MD5 BLOB)")
df <- data.frame(MD5=c("6717f2823d3202449201145073ab871A",
"6717f2823d3202449301145073ab371A"))
rs <- db.WriteTable(con,"hash",df)
result.1 <- dbReadTable(con,"hash")
result.1
# MD5
# 1 67, 17, f2, 82, 3d, 32, 02, 44, 92, 01, 14, 50, 73, ab, 87, 1a
# 2 67, 17, f2, 82, 3d, 32, 02, 44, 93, 01, 14, 50, 73, ab, 37, 1a
If your data frame of hashes is very large, then df.WriteFast(...) does the same thing as db.WriteTable(...) only it should be faster.
db.WriteFast = function(con.table,df) {
require(DBI)
field <- colnames(df)[1]
lapply(unlist(df[,1]),function(x){
dbSendQuery(con,
statement=sprintf("INSERT INTO %s (%s) VALUES (X'%s')",
table,field,x))})
}
Note that result.1 is a data frame, and if we use it in a call to dbWriteTable(...) we can successfully write the hashes to a BLOB. So it is possible.
str(result.1)
# 'data.frame': 2 obs. of 1 variable:
# $ MD5:List of 2
# ..$ : raw 67 17 f2 82 ...
# ..$ : raw 67 17 f2 82 ...
The second approach takes advantage of R's raw data type to create a data frame structured like result.1, and passes that to dbWriteTable(...). You'd think this would be easy, but no.
h2r = function(x) {
bytes <- substring(x, seq(1, nchar(x)-1, 2), seq(2, nchar(x), 2))
return(list(as.raw(as.hexmode(bytes))))
}
hash2raw = Vectorize(h2r)
df.raw=data.frame(MD5=list(1:nrow(df)))
colnames(df.raw)="MD5"
df.raw$MD5 = unname(hash2raw(as.character(df$MD5)))
dbWriteTable(con, "newHash",df.raw)
result.2 <- dbReadTable(con,"newHash")
result.2
all.equal(result.1$MD5,result.2$MD5)
# [1] TRUE
In this approach, we create a data frame df.raw which has one column, MD5, wherein each element is a list of raw bytes. The utility function h2r(...) takes a character representation of the hash, breaks it into a vector of char(2) (the bytes), then interprets each of those as hex (as.hexmode(...)), converts the result to raw (as.raw(...)), and finally returns the result as a list. Vectorize(...) is a wrapper that allows hash2raw(...) to take a vector as its argument.
Personally, I think you're better off using the first approach: it takes advantage of SQLite's internal mechanism for writing hex to BLOBs, and it's much easier to understand.

Related

pandas values change with numpy, but their memory locations are different

I created an array based on a dataframe. When I changed the value of the array the dataframe also changed, which means that both should be using the same address, but when I use id() to check it, it is different.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'column1': [11,22,33],
'column2': [44,55,66]
})
col1_arr = df['column1'].to_numpy()
col1_arr[0] = 100
col1_arr
array([100, 22, 33], dtype=int64)
df
index
column1
column2
0
100
44
1
22
55
2
33
66
When I changed the value of the array, the dataframe also changed to 100. If their values change to synchronize, it means they should be using the same memory address, but below shows that their addresses are different.
for i in df['column1']:
print(i)
print(hex(id(i)))
# 100
# 0x21c795a0d50
# 22
# 0x21c795a0390
# 33
# 0x21c795a04f0
for i in col1_arr:
print(i)
print(hex(id(i)))
# 100
# 0x21c00e36c70
# 22
# 0x21c00e36d10
# 33
# 0x21c00e36c70
Another strange thing is that the address of col1_arr[0] is equal to col1_arr[2].

One column of the frame is a Series:
In [675]: S = df['column1']
In [676]: type(S)
Out[676]: pandas.core.series.Series
While the storage details of a DataFrame, or Series, may vary, here it looks like the values are stored in a numpy array:
In [677]: S._values
Out[677]: array([11, 22, 33], dtype=int64)
In [678]: id(S._values)
Out[678]: 2737259230384
And that array is exactly the same one that you get with to_numpy():
In [679]: id(col1_arr)
Out[679]: 2737259230384
So when you change an element of col1_arr, you see that change in S, and df.
Data in an array is not stored by reference (not like list):
col1_arr[0] creates a numpy.int64 object that has the same value, but it is not in any way the "address" of the value. Note the different id values below:
In [683]: id(col1_arr[0])
Out[683]: 2737348615568
In [684]: id(col1_arr[0])
Out[684]: 2737348615280
To find "where" values are stored you have to look at something like:
In [686]: col1_arr.__array_interface__
Out[686]:
{'data': (2737080083824, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3,),
'version': 3}

Sort a Julia 1.1 matrix by one of its columns, that contains strings

As the title suggests, I need to sort the rows of a certain matrix by one of its columns, preferably in place if at all possible. Said column contains Strings (the array being of type Array{Union{Float64,String}}), and ideally the rows should end up in an alphabetial order, determined by this column. The line
sorted_rows = sort!(data, by = i -> data[i,2]),
where data is my matrix, produces the error ERROR: LoadError: UndefKeywordError: keyword argument dims not assigned. Specifying which part of the matrix I want sorted and adding the parameter dims=2 (which I assume is the dimension I want to sort along), namely
sorted_rows = sort!(data[2:end-1,:], by = i -> data[i,2],dims=2)
simply changes the error message to ERROR: LoadError: ArgumentError: invalid index: 01 Suurin yhteinen tekijä ja pienin yhteinen jaettava of type String. So the compiler is complainig about a string being an invalid index.
Any ideas on how this type of sorting cound be done? I should say that in this case the string in the column can be expected to start with a number, but I wouldn't mind finding a solution that works in the general case.
I'm using Julia 1.1.

You want sortslices, not sort — the latter just sorts all columns independently, whereas the former rearranges whole slices. Secondly, the by function doesn't take an index, it takes the value that is about to be compared (and allows you to transform it in some way). Thus:
julia> using Random
data = Union{Float64, String}[randn(100) [randstring(10) for _ in 1:100]]
100×2 Array{Union{Float64, String},2}:
0.211015 "6VPQbWU5f9"
-0.292298 "HgvHLkufqI"
1.74231 "zTCu1U5Vdl"
0.195822 "O3j43sbhKV"
⋮
-0.369007 "VzFH2OpWfU"
-1.30459 "6C68G64AWg"
-1.02434 "rldaQ3e0GE"
1.61653 "vjvn1SX3FW"
julia> sortslices(data, by=x->x[2], dims=1)
100×2 Array{Union{Float64, String},2}:
0.229143 "0syMQ7AFgQ"
-0.642065 "0wUew61bI5"
1.16888 "12PUn4V4gL"
-0.266574 "1Z2ONSBP04"
⋮
1.85761 "y2DDANcFCe"
1.53337 "yZju1uQqMM"
1.74231 "zTCu1U5Vdl"
0.974607 "zdiU0sVOZt"
Unfortunately we don't have an in-place sortslices! yet, but you can easily construct a sorted view with sortperm. This probably won't be as fast to use, but if you need the in-place-ness for semantic reasons it'll do just the trick.
julia> p = sortperm(data[:,2]);
julia> #view data[p, :]
100×2 view(::Array{Union{Float64, String},2}, [26, 45, 90, 87, 6, 96, 82, 75, 12, 27 … 53, 69, 100, 93, 36, 37, 39, 8, 3, 61], :) with eltype Union{Float64, String}:
0.229143 "0syMQ7AFgQ"
-0.642065 "0wUew61bI5"
1.16888 "12PUn4V4gL"
-0.266574 "1Z2ONSBP04"
⋮
1.85761 "y2DDANcFCe"
1.53337 "yZju1uQqMM"
1.74231 "zTCu1U5Vdl"
0.974607 "zdiU0sVOZt"
(If you want the in-place-ness for performance reasons, I'd recommend using a DataFrame or similar structure that holds its columns as independent homogenous vectors — a Union{Float64, String} will be slower than two separate well-typed vectors, and sort!ing a DataFrame works on whole rows like you want.)

you may want to look at SortingLab.jls fast string sort functions.
]add SortingLab
using SortingLab
idx = fsortperm(data[:,2])
new_data = data[idx]

How to search and find from one data-frame and substitute in another? [duplicate]

This question already has an answer here:
Subsetting a data frame based on contents of another data frame
(1 answer)
Closed 5 years ago.
I have two data frames as following:
df1 = data.frame(a = c(1, 2, 3, 4, 507, 505), b = c(10, 20, 30, 40, 50, 60))
df2 = data.frame(A = c(501, 502, 503, 504, 505, 506, 507),
B = c(601, 602, 603, 604, 605, 606, 607))
I want to find the values of df1$b for which df2$A is equal to df1$a. In this simple example I'm looking for 50 and 60 in which df1$a == df2$A.
I have tried the following:
df1$b[which(df2$a %in% df1$A)]
which doesn't always returns the values in order for bigger datasets. For example, when I use the following analogous with above with more sophisticated datasets i.e. FTN and top_list similar to df1 and df2 as:
top_list$performance <- FTN$YIELD_MEAN_NBR[which(as.character(FTN$PLOT_GRID_ID) %in% top_list$gridID)]
The following lines don't return the same values as I would expect.
ID = "62927530"
FTN$YIELD_MEAN_NBR[FTN$PLOT_GRID_ID == ID]
top_list$performance[top_list$gridID == ID]
This puzzles me!

You need to check df1$a in df2$A, not the other way around:
df1$b[df1$a %in% df2$A]
# [1] 50 60

Element by Element Comparison of Multiple Arrays in MATLAB

I have a multiple input arrays and I want to generate one output array where the value is 0 if all elements in a column are the same and the value is 1 if all elements in a column are different.
For example, if there are three arrays :
A = [28, 28, 43, 43]
B = [28, 43, 43, 28]
C = [28, 28, 43, 43]
Output = [0, 1, 0, 1]
The arrays can be of any size and any number, but the arrays are also the same size.

A none loopy way is to use diff and any to advantage:
A = [28, 28, 43,43];
B = [28, 43, 43,28];
C = [28, 28, 43,43];
D = any(diff([A;B;C])) %Combine all three (or all N) vectors into a matrix. Using the Diff to find the difference between each element from row to row. If any of them is non-zero, then return 1, else return 0.
D = 0 1 0 1

There are several easy ways to do it.
Let's start by putting the relevant vectors in a matrix:
M = [A; B; C];
Now we can do things like:
idx = min(M)==max(M);
or
idx = ~var(M);

No one seems to have addressed that you have a variable amount of arrays. In your case, you have three in your example but you said you could have a variable amount. I'd also like to take a stab at this using broadcasting.
You can create a function that will take a variable number of arrays, and the output will give you an array of an equal number of columns shared among all arrays that conform to the output you're speaking of.
First create a larger matrix that concatenates all of the arrays together, then use bsxfun to take advantage of broadcasting the first row and ensuring that you find columns that are all equal. You can use all to complete this step:
function out = array_compare(varargin)
matrix = vertcat(varargin{:});
out = ~all(bsxfun(#eq, matrix(1,:), matrix), 1);
end
This will take the first row of the stacked matrix and see if this row is the same among all of the rows in the stacked matrix for every column and returns a corresponding vector where 0 denotes each column being all equal and 1 otherwise.
Save this function in MATLAB and call it array_compare.m, then you can call it in MATLAB like so:
A = [28, 28, 43, 43];
B = [28, 43, 43, 28];
C = [28, 28, 43, 43];
Output = array_compare(A, B, C);
We get in MATLAB:
>> Output
Output =
0 1 0 1

Not fancy but will do the trick
Output=nan(length(A),1); %preallocation and check if an index isn't reached
for i=1:length(A)
Output(i)= ~isequal(A(i),B(i),C(i));
end
If someone has an answer without the loop take that, but i feel like performance is not an issue here.

Mathematica : Conditional Operations on Lists

I would like to average across "Rows" in a column. That is rows that have the same value in another column.
For example :
e= {{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2},
{69, 7, 30, 38, 16, 70, 97, 50, 97, 31, 81, 96, 60, 52, 35, 6,
24, 65, 76, 100}}
I would like to average all the Value in the second column that have the same value in the first one.
So Here : The Average for Col 1 = 1 & Col 1 = 2
And then create a third column with the result of this operation. So the values in that columns should be the same for the first 10 lines an next 10.
Many Thanks for any help you could provide !
LA
Output Ideal Format :

Interesting problem. This is the first thing that came into my mind:
e[[All, {1}]] /. Reap[Sow[#2, #] & ### e, _, # -> Mean##2 &][[2]];
ArrayFlatten[{{e, %}}] // TableForm
To get rounding you may simply add Round# before Mean in the code above: Round#Mean##2
Here is a slightly faster method, but I actually prefer the Sow/Reap one above:
#[[1, 1]] -> Round#Mean##[[All, 2]] & /# GatherBy[e, First];
ArrayFlatten[{{e, e[[All, {1}]] /. %}}] // TableForm
If you have many different elements in the first column, either of the solutions above can be made faster by applying Dispatch to the rule list that is produced, before the replacement (/.) is done. This command tells Mathematica to build and use an optimized internal format for the rules list.
Here is a variant that is slower, but I like it enough to share anyway:
Module[{q},
Reap[{#, Sow[#2,#], q##} & ### e, _, (q## = Mean##2) &][[1]]
]
Also, general tips, you can replace:
Table[RandomInteger[{1, 100}], {20}] with RandomInteger[{1, 100}, 20]
and Join[{c}, {d}] // Transpose with Transpose[{c, d}].

What the heck, I'll join the party. Here is my version:
Flatten/#Flatten[Thread/#Transpose#{#,Mean/##[[All,All,2]]}&#GatherBy[e,First],1]
Should be fast enough I guess.
EDIT
In response to the critique of #Mr.Wizard (my first solution was reordering the list), and to explore a bit the high-performance corner of the problem, here are 2 alternative solutions:
getMeans[e_] :=
Module[{temp = ConstantArray[0, Max[#[[All, 1, 1]]]]},
temp[[#[[All, 1, 1]]]] = Mean /# #[[All, All, 2]];
List /# temp[[e[[All, 1]]]]] &[GatherBy[e, First]];
getMeansSparse[e_] :=
Module[{temp = SparseArray[{Max[#[[All, 1, 1]]] -> 0}]},
temp[[#[[All, 1, 1]]]] = Mean /# #[[All, All, 2]];
List /# Normal#temp[[e[[All, 1]]]]] &[GatherBy[e, First]];
The first one is the fastest, trading memory for speed, and can be applied when keys are all integers, and your maximal "key" value (2 in your example) is not too large. The second solution is free from the latter limitation, but is slower. Here is a large list of pairs:
In[303]:=
tst = RandomSample[#, Length[#]] &#
Flatten[Map[Thread[{#, RandomInteger[{1, 100}, 300]}] &,
RandomSample[Range[1000], 500]], 1];
In[310]:= Length[tst]
Out[310]= 150000
In[311]:= tst[[;; 10]]
Out[311]= {{947, 52}, {597, 81}, {508, 20}, {891, 81}, {414, 47},
{849, 45}, {659, 69}, {841, 29}, {700, 98}, {858, 35}}
The keys can be from 1 to 1000 here, 500 of them, and there are 300 random numbers for each key. Now, some benchmarks:
In[314]:= (res0 = getMeans[tst]); // Timing
Out[314]= {0.109, Null}
In[317]:= (res1 = getMeansSparse[tst]); // Timing
Out[317]= {0.219, Null}
In[318]:= (res2 = tst[[All, {1}]] /.
Reap[Sow[#2, #] & ### tst, _, # -> Mean##2 &][[2]]); // Timing
Out[318]= {5.687, Null}
In[319]:= (res3 = tst[[All, {1}]] /.
Dispatch[
Reap[Sow[#2, #] & ### tst, _, # -> Mean##2 &][[2]]]); // Timing
Out[319]= {0.391, Null}
In[320]:= res0 === res1 === res2 === res3
Out[320]= True
We can see that the getMeans is the fastest here, getMeansSparse the second fastest, and the solution of #Mr.Wizard is somewhat slower, but only when we use Dispatch, otherwise it is much slower. Mine and #Mr.Wizard's solutions (with Dispatch) are similar in spirit, the speed difference is due to (sparse) array indexing being more efficient than hash look-up. Of course, all this matters only when your list is really large.
EDIT 2
Here is a version of getMeans which uses Compile with a C target and returns numerical values (rather than rationals). It is about twice faster than getMeans, and the fastest of my solutions.
getMeansComp =
Compile[{{e, _Integer, 2}},
Module[{keys = e[[All, 1]], values = e[[All, 2]], sums = {0.} ,
lengths = {0}, , i = 1, means = {0.} , max = 0, key = -1 ,
len = Length[e]},
max = Max[keys];
sums = Table[0., {max}];
lengths = Table[0, {max}];
means = sums;
Do[key = keys[[i]];
sums[[key]] += values[[i]];
lengths[[key]]++, {i, len}];
means = sums/(lengths + (1 - Unitize[lengths]));
means[[keys]]], CompilationTarget -> "C", RuntimeOptions -> "Speed"]
getMeansC[e_] := List /# getMeansComp[e];
The code 1 - Unitize[lengths] protects against division by zero for unused keys. We need every number in a separate sublist, so we should call getMeansC, not getMeansComp directly. Here are some measurements:
In[180]:= (res1 = getMeans[tst]); // Timing
Out[180]= {0.11, Null}
In[181]:= (res2 = getMeansC[tst]); // Timing
Out[181]= {0.062, Null}
In[182]:= N#res1 == res2
Out[182]= True
This can probably be considered a heavily optimized numerical solution. The fact that the fully general, brief and beautiful solution of #Mr.Wizard is only about 6-8 times slower speaks very well for the latter general concise solution, so, unless you want to squeeze every microsecond out of it, I'd stick with #Mr.Wizard's one (with Dispatch). But it's important to know how to optimize code, and also to what degree it can be optimized (what can you expect).

A naive approach could be:
Table[
Join[ i, {Select[Mean /# SplitBy[e, First], First## == First#i &][[1, 2]]}]
, {i, e}] // TableForm
(*
1 59 297/5
1 72 297/5
1 90 297/5
1 63 297/5
1 77 297/5
1 98 297/5
1 3 297/5
1 99 297/5
1 28 297/5
1 5 297/5
2 87 127/2
2 80 127/2
2 29 127/2
2 70 127/2
2 83 127/2
2 75 127/2
2 68 127/2
2 65 127/2
2 1 127/2
2 77 127/2
*)
You could also create your original list by using for example:
e = Array[{Ceiling[#/10], RandomInteger[{1, 100}]} &, {20}]
Edit
Answering #Mr.'s comments
If the list is not sorted by its first element, you can do:
Table[Join[
i, {Select[
Mean /# SplitBy[SortBy[e, First], First], First## == First#i &][[1,2]]}],
{i, e}] //TableForm
But this is not necessary in your example

Why not pile on?
I thought this was the most straightforward/easy-to-read answer, though not necessarily the fastest. But it's really amazing how many ways you can think of a problem like this in Mathematica.
Mr. Wizard's is obviously very cool as others have pointed out.
#Nasser, your solution doesn't generalize to n-classes, although it easily could be modified to do so.
meanbygroup[table_] := Join ## Table[
Module[
{sublistmean},
sublistmean = Mean[sublist[[All, 2]]];
Table[Append[item, sublistmean], {item, sublist}]
]
, {sublist, GatherBy[table, #[[1]] &]}
]
(* On this dataset: *)
meanbygroup[e]

Wow, the answers here are so advanced and cool looking, Need more time to learn them.
Here is my answer, I am still matrix/vector/Matlab'ish guy in recovery and transition, so my solution is not functional like the experts solution here, I look at data as matrices and vectors (easier for me than looking at them as lists of lists etc...) so here it is
sizeOfList=10; (*given from the problem, along with e vector*)
m1 = Mean[e[[1;;sizeOfList,2]]];
m2 = Mean[e[[sizeOfList+1;;2 sizeOfList,2]]];
r = {Flatten[{a,b}], d , Flatten[{Table[m1,{sizeOfList}],Table[m2,{sizeOfList}]}]} //Transpose;
MatrixForm[r]
Clearly not as a good a solution as the functional ones.
Ok, I will go now and hide away from the functional programmers :)
--Nasser

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

How to write binary data into SQLite with R DBI's dbWriteTable()? - database

Related

pandas values change with numpy, but their memory locations are different

Sort a Julia 1.1 matrix by one of its columns, that contains strings

How to search and find from one data-frame and substitute in another? [duplicate]

Element by Element Comparison of Multiple Arrays in MATLAB

Mathematica : Conditional Operations on Lists

Categories

Resources