Loading data into R with rsqlserver package - sql-server

I've just installed rsqlserver like so (no errors)
install_github('rsqlserver', 'agstudy',args = '--no-multiarch')
And created a connection to my database:
> library(rClr)
> library(rsqlserver)
Warning message:
multiple methods tables found for ‘dbCallProc’
> drv <- dbDriver("SqlServer")
> conn <- dbConnect(drv, url = "Server=MyServer;Database=MyDB;Trusted_Connection=True;")
>
Now when I try to get data using dbGetQuery, I get this error:
> df <- dbGetQuery(conn, "select top 100 * from public2013.dim_Date")
Error in clrCall(sqlDataHelper, "GetConnectionProperty", conn, prop) :
Type: System.MissingMethodException
Message: Method not found: 'System.Object System.Reflection.PropertyInfo.GetValue(System.Object)'.
Method: System.Object GetConnectionProperty(System.Data.SqlClient.SqlConnection, System.String)
Stack trace:
at rsqlserver.net.SqlDataHelper.GetConnectionProperty(SqlConnection _conn, String prop)
>
When I try to fetch results using dbSendQuery, I also get an error.
> res <- dbSendQuery(conn, "select top 100 * from public2013.dim_Date")
> df <- fetch(res, n = -1)
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
Strangely, the file c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs doesn't actually exist on my computer.
Am I doing something wrong?

I am agstudy, the creator of the rsqlserver package. Sorry for the late reply; I finally found some time to fix this bug (actually, it was a not-yet-implemented feature). Here I demonstrate how you can read and write a data.frame with missing values in SQL Server.
First I create a data.frame with missing values. It is important to distinguish between numeric and character variables.
library(rsqlserver)
url = "Server=localhost;Database=TEST_RSQLSERVER;Trusted_Connection=True;"
conn <- dbConnect('SqlServer',url=url)
## create a table with some missing value
dat <- data.frame(txt=c('a',NA,'b',NA),
value =c(1L,NA,NA,2))
My input looks like this:
#    txt value
# 1    a     1
# 2 <NA>    NA
# 3    b    NA
# 4 <NA>     2
I insert dat into my database with the handy function dbWriteTable:
dbWriteTable(conn,name='T_TABLE_WITH_MISSINGS',
dat,row.names=FALSE,overwrite=TRUE)
Then I read it back using two methods:
dbSendQuery:
res = dbSendQuery(conn,'SELECT *
FROM T_TABLE_WITH_MISSINGS')
fetch(res,n=-1)
dbDisconnect(conn)
txt value
1 a 1
2 <NA> NaN
3 b NaN
4 <NA> 2
dbReadTable:
rsqlserver is DBI-compliant and implements many convenience functions so that you have to deal with SQL as little as possible.
conn <- dbConnect('SqlServer',url=url)
dbReadTable(conn,name='T_TABLE_WITH_MISSINGS')
dbDisconnect(conn)
txt value
1 a 1
2 <NA> NaN
3 b NaN
4 <NA> 2

(EDIT: I had missed something in your post (call to fetch). I can now reproduce the issue too.)
Short story: do you have a NULL value in your database? This may be the cause.
Longer story, for a full repro:
I've used a sample DB reproducible by following the instructions at http://www.codeproject.com/Tips/326527/Create-a-Sample-SQL-Database-in-Less-Than-2-Minute
EDIT:
I can reproduce your issue with:
library(rClr)
library(rsqlserver)
drv <- dbDriver("SqlServer")
conn <- dbConnect(drv, url = "Server=Localhost\\somename;Database=Fabrics;Trusted_Connection=True;")
res <- dbSendQuery(conn, "SELECT TOP 100 * FROM [Fabrics].[dbo].[Client]")
str(res)
## Formal class 'SqlServerResult' [package "rsqlserver"] with 1 slots
##   ..@ Id:<externalptr>
> df <- fetch(res, n = -1)
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
The following suggests that other commands work as expected once the open result has been cleared:
> dbExistsTable(conn, name='Client')
Error in sqlServerExecScalar(conn, statement, ...) :
Message: There is already an open DataReader associated with this Command which must be closed first.
> dbClearResult(res)
[1] TRUE
> dbExistsTable(conn, name='Client')
[1] TRUE
> dbExistsTable(conn, name='SomeIncorrectColumn')
[1] FALSE
Note that I cannot reproduce the very odd MissingMethodException; for me, dbGetQuery fails with the same InvalidCastException:
df <- dbGetQuery(conn, "SELECT TOP 100 * FROM [Fabrics].[dbo].[Client]")
Error in clrCall(sqlDataHelper, "Fetch", stride) :
Type: System.InvalidCastException
Message: Object cannot be stored in an array of this type.
Method: Void InternalSetValue(Void*, System.Object)
Stack trace:
at System.Array.InternalSetValue(Void* target, Object value)
at System.Array.SetValue(Object value, Int32 index)
at rsqlserver.net.SqlDataHelper.Fetch(Int32 capacity) in c:\projects\R\rsqlserver\src\rsqlserver.net\src\SqlDataHelper.cs:line 116
Since the debug symbols seem to be present, I can debug it further in Visual Studio. It fails in SqlDataHelper.Fetch at
_resultSet[_cnames[i]].SetValue(_reader.GetValue(i), cnt);
and the variable watch gives me:
i                       11                                 int
_cnames[i]              "Street2"                          string
_reader.GetValue(i)     {}                                 object {System.DBNull}
_reader.GetValue(i-1)   "806 West Sir Francis Drake St"    object {string}
_reader.GetValue(i+1)   "Spokane"                          object {string}
The entry for Street2 is indeed a NULL:
ClientId FirstName MiddleName LastName Gender DateOfBirth CreditRating XCode OccupationId TelephoneNumber Street1 Street2 City ZipCode Longitude Latitude Notes
1 Nicholas Pat Kane M 1975-10-07 00:00:00.000 3 ZU8 5ML 4 (279) 459 - 2707 2870 North Cherry Blvd. NULL Carlsbad 64906 32.7608137325835 117.112738329071
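The failure above boils down to a database NULL being written into a typed array that has no slot for it. As a rough analogy only (this is Python, not the package's C# code, and both helper names are hypothetical), a typed numeric column needs an explicit sentinel such as NaN for NULLs, which is also what the NaN values in the fixed package's output suggest:

```python
import math
from array import array


def fill_numeric_column(values):
    """Copy database values into a typed double array, mapping NULL (None) to NaN."""
    column = array("d")
    for value in values:
        # A typed array cannot hold None, so substitute a NaN sentinel.
        column.append(math.nan if value is None else float(value))
    return column


def naive_fill(values):
    """Store values as-is; raises TypeError as soon as a NULL (None) shows up."""
    column = array("d")
    for value in values:
        column.append(value)
    return column


if __name__ == "__main__":
    print(fill_numeric_column([1, None, None, 2]))
```

Calling naive_fill([1, None]) fails in the same spirit as the InvalidCastException, whereas fill_numeric_column keeps the column typed and marks the missing entries.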
For information, sessionInfo() output includes:
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
other attached packages:
[1] rsqlserver_1.0 rClr_0.5-2
loaded via a namespace (and not attached):
[1] DBI_0.2-7 tools_3.0.2
Hope this helps.


ORM golang Relation with where clause

I am stuck trying to build a query using the golang ORM. I want to get all Messages with Name x or y in the Request with RequestID = request_id, but I also want to get the Requests with Method = z, even if the names of their related messages are not x or y. As an example of what I want, given this database:
request
id method created_at
0 z xxx
1 z xxx
2 m xxx
3 m xxx
trace
id request_id created_at
0 0 xxx
1 1 xxx
2 2 xxx
3 2 xxx
4 3 xxx
message
id trace_id name more_info
0 0 n name_n_0_method_z_0
1 0 x name_x_1_method_z_0
2 2 x name_x_2_method_m_2
3 2 n name_n_3_method_m_2
4 3 y name_y_4_method_m_3
So after my query, the messages I should have are:
name_n_0_method_z_0 (method is z)
name_x_1_method_z_0 (name is x and method is z)
name_x_2_method_m_2 (name is x)
name_y_4_method_m_3 (name is y)
and only one entry is not selected in this case:
name_n_3_method_m_2 (the name is not x or y, and the method is not z)
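The selection rule described above can be written as a single join with an OR condition. The sketch below is not go-pg code: it uses Python's sqlite3 with an in-memory database as a stand-in for the real one, omits the created_at columns and the RequestID filter, and select_messages is a hypothetical helper. It only checks that the rule yields the four expected rows for the sample data.

```python
import sqlite3


def select_messages(names, method):
    """Return more_info for messages whose name is in `names` OR whose request uses `method`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE request (id INTEGER PRIMARY KEY, method TEXT);
        CREATE TABLE trace (id INTEGER PRIMARY KEY, request_id INTEGER);
        CREATE TABLE message (id INTEGER PRIMARY KEY, trace_id INTEGER,
                              name TEXT, more_info TEXT);
        INSERT INTO request VALUES (0, 'z'), (1, 'z'), (2, 'm'), (3, 'm');
        INSERT INTO trace VALUES (0, 0), (1, 1), (2, 2), (3, 2), (4, 3);
        INSERT INTO message VALUES
            (0, 0, 'n', 'name_n_0_method_z_0'),
            (1, 0, 'x', 'name_x_1_method_z_0'),
            (2, 2, 'x', 'name_x_2_method_m_2'),
            (3, 2, 'n', 'name_n_3_method_m_2'),
            (4, 3, 'y', 'name_y_4_method_m_3');
    """)
    # One "?" per name so the IN list is fully parameterized.
    placeholders = ", ".join("?" for _ in names)
    sql = (
        "SELECT m.more_info FROM message AS m "
        "JOIN trace AS t ON t.id = m.trace_id "
        "JOIN request AS r ON r.id = t.request_id "
        "WHERE m.name IN ({}) OR r.method = ? "
        "ORDER BY m.id"
    ).format(placeholders)
    rows = conn.execute(sql, list(names) + [method]).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    for info in select_messages(["x", "y"], "z"):
        print(info)
```

The point of the sketch is that the OR has to be evaluated where both message.name and request.method are visible, that is, in one joined query rather than in a filter applied to a single relation.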
What I am doing is:
type Request struct {
    ID int64
    Method string
    RequestID string
    Traces []Trace
    CreatedAt time.Time `sql:",null"`
}
type Trace struct {
    ID int64
    RequestID int64
    Messages []*Message
    CreatedAt time.Time `sql:",null"`
}
type Message struct {
    ID int64 `sql:"id"`
    TraceID int64 `sql:"trace_id"`
    Name string `sql:"name"`
    MoreInfo string
}
func GetMessagesByNamesAndMethod(names []string, method, requestID string) {
    query := db.Model(&requests).
        Where("request.request_id = ?", requestID).
        Order("created_at ASC")
    query.
        Relation("Traces", func(q *orm.Query) (*orm.Query, error) {
            return q, nil
        }).
        Relation("Traces.Messages", func(q *orm.Query) (*orm.Query, error) {
            // 1-> q.Where("name in (?) OR request.method = ?", pg.In(names), method)
            q.Where("name in (?)", pg.In(names)) // <- 1.b
            return q.OrderExpr("message.created_at ASC"), nil
        })
    // 2-> query.WhereOr("request.method = ?", method)
}
The commented lines are my latest attempts:
1 -> In this case I remove the line 1.b. Doing this, I get an error.
2 -> There is no error in the execution, but I am not able to get the method = z entries.
I hope this is clear.
Thanks in advance.

evaluating test dataset using eval() in LightGBM

I have trained a ranking model with LightGBM with the objective 'lambdarank'.
I want to evaluate my model to get the nDCG score for my test dataset using the best iteration, but I have not been able to use either lightgbm.Booster.eval() or lightgbm.Booster.eval_train().
First, I have created 3 dataset instances, namely the train set, valid set and test set:
lgb_train = lgb.Dataset(x_train, y_train, group=query_train, free_raw_data=False)
lgb_valid = lgb.Dataset(x_valid, y_valid, reference=lgb_train, group=query_valid, free_raw_data=False)
lgb_test = lgb.Dataset(x_test, y_test, group=query_test)
I then train my model using lgb_train and lgb_valid:
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1500,
                categorical_feature=chosen_cate_features,
                valid_sets=[lgb_train, lgb_valid],
                evals_result=evals_result,
                early_stopping_rounds=150
                )
When I call eval() or eval_train() after training, each raises an error:
gbm.eval(data=lgb_test,name='test')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-122-7ff5ef5136b8> in <module>()
----> 1 gbm.eval(data=lgb_test,name='test')
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval(self, data, name, feval)
1925 raise TypeError("Can only eval for Dataset instance")
1926 data_idx = -1
-> 1927 if data is self.train_set:
1928 data_idx = 0
1929 else:
AttributeError: 'Booster' object has no attribute 'train_set'
gbm.eval_train()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-123-0ce5fa3139f5> in <module>()
----> 1 gbm.eval_train()
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in eval_train(self, feval)
1956 List with evaluation results.
1957 """
-> 1958 return self.__inner_eval(self.__train_data_name, 0, feval)
1959
1960 def eval_valid(self, feval=None):
/usr/local/lib/python3.6/dist-packages/lightgbm/basic.py in __inner_eval(self, data_name, data_idx, feval)
2352 """Evaluate training or validation data."""
2353 if data_idx >= self.__num_dataset:
-> 2354 raise ValueError("Data_idx should be smaller than number of dataset")
2355 self.__get_eval_info()
2356 ret = []
ValueError: Data_idx should be smaller than number of dataset
And when I call eval_valid(), it returns an empty list.
Can anyone tell me how to properly evaluate a LightGBM model and get the nDCG score on the test set? Thanks.
If you add keep_training_booster=True as an argument to lgb.train, the returned Booster object will be able to execute eval and eval_train (though eval_valid will still return an empty list for some reason, even when valid_sets is provided in lgb.train).
Documentation says:
keep_training_booster (bool, optional (default=False)) – Whether the returned Booster will be used to keep training. If False, the returned value will be converted into _InnerPredictor before returning.
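Independently of Booster.eval, the score can also be computed by hand from the model's predictions. The following is a minimal sketch and not LightGBM's own implementation: it assumes the common exponential gain 2^label - 1 with a log2 position discount, and that labels and scores hold the relevance labels and predicted scores of a single query group. The helper names dcg and ndcg are illustrative.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    ranked = relevances if k is None else relevances[:k]
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(ranked))


def ndcg(labels, scores, k=None):
    """nDCG for one query group: DCG of the predicted order over DCG of the ideal order."""
    # Order the true labels by descending predicted score.
    by_score = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    predicted_order = [label for _, label in by_score]
    ideal_order = sorted(labels, reverse=True)
    ideal = dcg(ideal_order, k)
    if ideal == 0:
        # No relevant item in the group; conventions differ, 0.0 is chosen here.
        return 0.0
    return dcg(predicted_order, k) / ideal


if __name__ == "__main__":
    print(ndcg([3, 2, 0, 1], [0.9, 0.2, 0.4, 0.8]))
```

For a whole test set, one would split the predictions by query group (the same group sizes passed to lgb.Dataset) and average the per-group values.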

Python3 - IndexError when trying to save a text file

I'm trying to follow this tutorial with my own local data files:
CNTK tutorial
I have the following function to save my data array into a text file that can be fed to CNTK:
# Save the data files into a format compatible with CNTK text reader
def savetxt(filename, ndarray):
    dir = os.path.dirname(filename)
    if not os.path.exists(dir):
        os.makedirs(dir)
    if not os.path.isfile(filename):
        print("Saving", filename )
        with open(filename, 'w') as f:
            labels = list(map(' '.join, np.eye(11, dtype=np.uint).astype(str)))
            for row in ndarray:
                row_str = row.astype(str)
                label_str = labels[row[-1]]
                feature_str = ' '.join(row_str[:-1])
                f.write('|labels {} |features {}\n'.format(label_str, feature_str))
    else:
        print("File already exists", filename)
I have 2 ndarrays of the following shape that I want to feed to the model:
train.shape
(1976L, 15104L)
test.shape
(1976L, 15104L)
Then I try to call the function like this:
# Save the train and test files (prefer our default path for the data)
data_dir = os.path.join("C:/Users", 'myself', "OneDrive", "IA Project", 'data', 'train')
if not os.path.exists(data_dir):
    data_dir = os.path.join("data", "IA Project")
print ('Writing train text file...')
savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
print ('Writing test text file...')
savetxt(os.path.join(data_dir, "Test-128x118_cntk_text.txt"), test)
print('Done')
And then I get the following error:
Writing train text file...
Saving C:/Users\A702628\OneDrive - Atos\Microsoft Capstone IA\Capstone data\train\Train-128x118_cntk_text.txt
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-24-b53d3c69b8d2> in <module>()
6
7 print ('Writing train text file...')
----> 8 savetxt(os.path.join(data_dir, "Train-128x118_cntk_text.txt"), train)
9
10 print ('Writing test text file...')
<ipython-input-23-610c077db694> in savetxt(filename, ndarray)
12 for row in ndarray:
13 row_str = row.astype(str)
---> 14 label_str = labels[row[-1]]
15 feature_str = ' '.join(row_str[:-1])
16 f.write('|labels {} |features {}\n'.format(label_str, feature_str))
IndexError: list index out of range
Can somebody please tell me what's going wrong with this part of the code, and how I could fix it? Thank you very much in advance.
Since you're using your own input data: are they labelled in the range 0 to 10? The labels list built from np.eye(11) only has 11 entries, so any label value in the last column outside that range would cause an out-of-range error.
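One way to surface this kind of problem early is to validate each label before using it as an index. The sketch below is pure Python rather than the NumPy code from the question; one_hot_label and format_row are hypothetical helpers, and num_classes is whatever number of classes the data actually has.

```python
def one_hot_label(label, num_classes):
    """Return a space-separated one-hot string, rejecting labels outside 0..num_classes-1."""
    index = int(label)
    if index != label or not 0 <= index < num_classes:
        raise ValueError(
            "label {!r} is not an integer in [0, {})".format(label, num_classes)
        )
    return " ".join("1" if i == index else "0" for i in range(num_classes))


def format_row(row, num_classes):
    """Format one row (features followed by the label) in the CNTK text layout."""
    *features, label = row
    feature_str = " ".join(str(value) for value in features)
    return "|labels {} |features {}".format(one_hot_label(label, num_classes), feature_str)


if __name__ == "__main__":
    print(format_row([5, 7, 1], 3))
```

With a check like this, a bad value in the last column is reported together with the offending label instead of a bare IndexError.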

Mongoose - update object from double nested array [duplicate]

This question is closely related to this one and I will consider the advice given with respect to schema design in a NoSQL context, yet I'm curious to understand this:
Actual questions
Suppose you have the following document:
_id : 2 abcd
name : 2 unittest.com
paths : 4
    0 : 3
        path : 2 home
        queries : 4
            0 : 3
                name : 2 query1
                url : 2 www.unittest.com/home?query1
                requests: 4
            1 : 3
                name : 2 query2
                url : 2 www.unittest.com/home?query2
                requests: 4
Basically, I'd like to know
if it is possible to use MongoDB's positional $ operator (details) multiple times, or put differently, in update scenarios that involve array/document structures with a "degree of nestedness" greater than 1:
{ <update operator>: { "paths.$.queries.$.requests" : value } } (doesn't work)
instead of "only" being able to use $ once for a top-level array and being bound to use explicit indexes for arrays on "higher levels":
{ <update operator>: { "paths.$.queries.0.requests" : value } } (works)
if possible at all, what the corresponding R syntax would look like.
Below you'll find a reproducible example. I tried to be as concise as possible.
Code example
Database connection
require("rmongodb")
db <- "__unittest"
ns <- paste(db, "hosts", sep=".")
# CONNECTION OBJECT
con <- mongo.create(db=db)
# ENSURE EMPTY DB
mongo.remove(mongo=con, ns=ns)
Example document
q <- list("_id"="abcd")
b <- list("_id"="abcd", name="unittest.com")
mongo.insert(mongo=con, ns=ns, b=b)
q <- list("_id"="abcd")
b <- list("$push"=list(paths=list(path="home")))
mongo.update(mongo=con, ns, criteria=q, objNew=b)
q <- list("_id"="abcd", paths.path="home")
b <- list("$push"=list("paths.$.queries"=list(
name="query1", url="www.unittest.com/home?query1")))
mongo.update(mongo=con, ns, criteria=q, objNew=b)
b <- list("$push"=list("paths.$.queries"=list(
name="query2", url="www.unittest.com/home?query2")))
mongo.update(mongo=con, ns, criteria=q, objNew=b)
Update of nested arrays with explicit position index (works)
This works, but it involves an explicit index for the second-level array queries (nested in a subdoc element of array paths):
q <- list("_id"="abcd", paths.path="home", paths.queries.name="query1")
b <- list("$push"=list("paths.$.queries.0.requests"=list(time="2013-02-13")))
> mongo.bson.from.list(b)
$push : 3
paths.$.queries.0.requests : 3
time : 2 2013-02-13
mongo.update(mongo=con, ns, criteria=q, objNew=b)
res <- mongo.find.one(mongo=con, ns=ns, query=q)
> res
_id : 2 abcd
name : 2 unittest.com
paths : 4
    0 : 3
        path : 2 home
        queries : 4
            0 : 3
                name : 2 query1
                requests : 4
                    0 : 3
                        time : 2 2013-02-13
                url : 2 www.unittest.com/home?query1
            1 : 3
                name : 2 query2
                url : 2 www.unittest.com/home?query2
Update of nested arrays with positional $ indexes (doesn't work)
Now, I'd like to substitute the explicit 0 with the positional $ operator just like I did in order to have the server find the desired subdoc element of array paths (paths.$.queries).
As far as I understand the documentation, this should work, as the crucial thing is to specify a "correct" query selector:
The positional $ operator, when used with the update() method, acts as a placeholder for the first match of the update query selector:
I think I specified a query selector that does find the correct nested element (due to the paths.queries.name="query1" part):
q <- list("_id"="abcd", paths.path="home", paths.queries.name="query1")
I guess translated to "plain MongoDB" syntax, the query selector looks somewhat like this
{ "_id": "abcd", "paths.path": "home", "paths.queries.name": "query1" }
which seems like a valid query selector to me. In fact it does match the desired element/doc:
> !is.null(mongo.find.one(mongo=con, ns=ns, query=q))
[1] TRUE
My thought was that if it works on the top-level, why shouldn't it work for higher levels as well (as long as the query selector points to the right nested components)?
However, the server doesn't seem to like a nested or multiple use of $:
b <- list("$push"=list("paths.$.queries.$.requests"=list(time="2013-02-14")))
> mongo.bson.from.list(b)
$push : 3
paths.$.queries.$.requests : 3
time : 2 2013-02-14
> mongo.update(mongo=con, ns, criteria=q, objNew=b)
[1] FALSE
I'm not sure if it doesn't work because MongoDB doesn't support this or if I didn't get the R syntax right.
The positional operator only supports one level deep and only the first matching element.
There is a JIRA ticket tracking the sort of behaviour you want here: https://jira.mongodb.org/browse/SERVER-831
I am unsure if it will allow for more than one match but I believe it will due to the dynamics of how it will need to work.
In case you can execute your query from the MongoDB shell you can bypass this limitation by taking advantage of MongoDB cursor's forEach function (http://docs.mongodb.org/manual/reference/method/cursor.forEach/)
Here is an example with 3 nested arrays:
var collectionNameCursor = db.collection_name.find({...});
collectionNameCursor.forEach(function(collectionDocument) {
    var firstArray = collectionDocument.firstArray;
    for (var i = 0; i < firstArray.length; i++) {
        var secondArray = firstArray[i].secondArray;
        for (var j = 0; j < secondArray.length; j++) {
            var thirdArray = secondArray[j].thirdArray;
            for (var k = 0; k < thirdArray.length; k++) {
                // ... do some logic here with thirdArray's elements
            }
        }
    }
    // save once per document, after all nested elements have been processed
    db.collection_name.save(collectionDocument);
});
Note that this is more of a one-time solution than production code, but it will do the job if you have to write a fix-up script.
As @FooBar mentioned in the comments of the accepted answer, this feature was implemented in 2017 with MongoDB 3.6.
To use it, you must combine the filtered positional operator $[<identifier>] with arrayFilters conditions.
Applied to your example:
db.hosts.updateOne(
    { "paths.path": "home" },
    { $push: {
        "paths.$.queries.$[q].requests": { time: "2022-11-15" }
      }
    },
    { arrayFilters: [{ "q.name": "query1" }] }
)
The positional operator $ refers to the filter { "paths.path": "home" }. The filtered positional operator $[q] then refers to the arrayFilters condition { "q.name": "query1" }.
Using this method, you can add as many positional filters as needed, as long as you put the corresponding conditions in arrayFilters.
However, looking through the documentation of rmongodb, using arrayFilters does not seem to be possible at the moment. Alternatively, you could use another R package that has this feature implemented, such as mongolite.
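To make the semantics concrete, the effect of such an update can be mimicked on a plain Python dict. This is only a simplified model of what the server does, not driver code: push_request is a hypothetical helper, $ is modelled as "the first element of paths that matches", and $[q] as "every element of queries that satisfies the filter".

```python
import copy


def push_request(document, path_name, query_name, request):
    """Return a copy of `document` with `request` pushed onto the matching queries."""
    updated = copy.deepcopy(document)
    # "$": the first element of `paths` matched by the query selector.
    path = next(
        (p for p in updated.get("paths", []) if p.get("path") == path_name), None
    )
    if path is None:
        return updated
    # "$[q]": every element of `queries` that satisfies the arrayFilters condition.
    for query in path.get("queries", []):
        if query.get("name") == query_name:
            query.setdefault("requests", []).append(request)
    return updated


host = {
    "_id": "abcd",
    "name": "unittest.com",
    "paths": [
        {
            "path": "home",
            "queries": [
                {"name": "query1", "url": "www.unittest.com/home?query1"},
                {"name": "query2", "url": "www.unittest.com/home?query2"},
            ],
        }
    ],
}

result = push_request(host, "home", "query1", {"time": "2022-11-15"})
```

Here only the query1 entry gains a requests array; query2 and the original host dict are left untouched.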

Django/pyodbc error: not enough arguments for format string

I have a Dictionary model defined in Django (1.6.5). One method (called get_topentities) returns the top names in my dictionary (entity names are defined by Entity model):
def get_topentities(self,n):
    entities = self.entity_set.select_related().filter(in_dico=True,table_type=0).order_by("rank")[0:n]
    return entities
When I call the function (say with n=2), it returns the top 2 elements, but I cannot access the second one because of this "not enough arguments for format string" error:
In [5]: d = Dictionary.objects.get(code='USA')
In [6]: top2 = d.get_topentities(2)
In [7]: top2
Out[7]: [<Entity: BARACK OBAMA>, <Entity: GOVERNMENT>]
In [8]: top2[0]
Out[8]: <Entity: BARACK OBAMA>
In [9]: top2[1]
.
.
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in as_sql(self, with_limits, with_col_aliases)
172 # Lop off ORDER... and the initial "SELECT"
173 inner_select = _remove_order_limit_offset(raw_sql)
--> 174 outer_fields, inner_select = self._alias_columns(inner_select)
175
176 order = _get_order_limit_offset(raw_sql)[0]
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in _alias_columns(self, sql)
339
340 # store the expanded paren string
--> 341 parens[key] = buf% parens
342 #cannot use {} because IBM's DB2 uses {} as quotes
343 paren_buf[paren_depth] += '(%(' + key + ')s)'
TypeError: not enough arguments for format string
In [10]:
My server backend is MSSQL and I'm using pyodbc as the database driver. If I try the same on a PC with engine sqlserver_ado, it works. Can someone help?
Regards,
Patrick
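For context on the message itself: the traceback ends in a %-style formatting call with a dict on the right-hand side (buf % parens). The sketch below only shows how Python raises this TypeError in general when such a template also contains an unescaped positional placeholder like %s. Whether that is what happens inside django_pyodbc here is an assumption, and format_with_mapping is a hypothetical helper.

```python
def format_with_mapping(template, mapping):
    """Apply %-formatting with a dict and report a TypeError instead of raising."""
    try:
        return template % mapping
    except TypeError as exc:
        return "TypeError: {}".format(exc)


params = {"p0": "rank"}

# Only named placeholders: formats cleanly.
ok = format_with_mapping("ORDER BY (%(p0)s)", params)

# A bare %s after a named placeholder has no argument left to consume.
broken = format_with_mapping("(%(p0)s) LIKE %s", params)

# Doubling the percent sign keeps it as a literal character.
escaped = format_with_mapping("(%(p0)s) LIKE %%s", params)

if __name__ == "__main__":
    print(ok)
    print(broken)
    print(escaped)
```

If the SQL being rewritten contains a literal percent sign, escaping it before the formatting step is the usual remedy in plain Python; the sqlserver_ado backend mentioned above may simply build its SQL differently.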
