I am setting up a SQL Azure database. I need to write data into the database on daily basis. I am using 64-bit R version 3.3.3 on Windows10. Some of the columns contain text (more than 4000 characters). Initially, I have imported some data from a csv into the SQL Azure database using Microsoft SQL Server Management Studios. I set up the text columns as ntext format, because when I tried using nvarchar the max was 4000 and some of the values got truncated even though they were about 1100 characters long.
In order to append to the database I am first saving the records in a temp table when I have predefined the varTypes:
varTypesNewFile <- c("Numeric", rep("NTEXT", ncol(newFileToAppend) - 1))
names(varTypesNewFile) <- names(newFileToAppend)
sqlSave(dbhandle, newFileToAppend, "newFileToAppendTmp", rownames = F, varTypes = varTypesNewFile, safer = F)
and then append them by using:
insert into mainTable select * from newFileToAppendTmp
If the text is not too long, the above does work. However, sometimes I get the following error during the sqlSave command:
Error in odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
'Calloc' could not allocate memory (1073741824 of 1 bytes)
My questions are:
How can I counter this issue?
Is this the format I should be using?
Additionally, even when the above works, it takes about an hour to upload about 5k of records. Is it not too long? Is this the normal amount of time it should take? If not, what could I do better.
RODBC is very old, and can be a bit flaky with NVARCHAR columns. Try using the RSQLServer package instead, which offers an alternative means to connect to SQL Server (and also provides a dplyr backend).
Related
I transferred an Oracle database to SQL Server and all seems to have went well. The various ID columns are large numbers so I had to use Decimal as they were too large for BigInt.
I am now trying to read the data using pandas.read_sql using pyodbc connection with ODBC Driver 17 for SQL Server. df = pandas.read_sql("SELECT * FROM table1"),con)
The numbers are coming out as float64 and when I try to print them our use them in SQL statements they come out in scientific notation and when I try to use '{:.0f}'.format(df.loc[i,'Id']) It turns several numbers into the same number such as 90300111000003078520832. It is like precision is lost when it goes to scientific notation.
I also tried pd.options.display.float_format = '{:.0f}'.format before the read_sql but this did not help.
Clearly I must be doing something wrong as the Ids in the database are correct.
Any help is appreciated Thanks
pandas' read_sql method has an option named coerce_float which defaults to True and it …
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
However, in your case it is not useful, so simply specify coerce_float=False.
I've had this problem too, especially working with long ids: read_sql works fine for the primary key, but not for other columns (like the retweeted_status_id from Twitter API calls). Setting coerce_float to false does nothing for me, so instead I cast retweeted_status_id to a character format in my sql query.
Using psql, I do:
df = pandas.read_sql("SELECT *, Id::text FROM table1"),con)
But in SQL server it'd be something like
df = pandas.read_sql("SELECT *, CONVERT(text, Id) FROM table1"),con)
or
df = pandas.read_sql("SELECT *, CAST(Id AS varchar) FROM table1"),con)
Obviously there's a cost here if you're asking to cast many rows, and a more efficient option might be to pull from SQL server without using pandas (as a nested list or JSON or something else) which will also preserve your long integer formats.
I'm having an issue with inserting rows into a database. Just wondering if anyone has any ideas why this is happening? It works when I avoid using fast_executemany but then inserts become very slow.
driver = 'ODBC Driver 17 for SQL Server'
conn = pyodbc.connect('DRIVER=' + driver + ';SERVER=' + server+ \
';UID=' + user+ ';PWD=' + password)
cursor = conn.cursor()
cursor.fast_executemany = True
insert_sql = """
INSERT INTO table (a, b, c)
VALUES (?, ?, ?)
"""
cursor.executemany(insert_sql, insert_params)
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-12-e7e82e4d8c2d> in <module>
2 start_time = time.time()
3
----> 4 cursor.executemany(insert_sql, insert_params)
MemoryError:
There is a known issue with fast_executemany when working with TEXT or NTEXT columns, as described on GitHub here.
The problem is that when pyodbc queries the database metadata to determine the maximum size of the column the driver returns 2 GB (instead of 0, as would be returned for a [n]varchar(max) column).
pyodbc allocates 2 GB of memory for each [N]TEXT element in the parameter array, and the Python app quickly runs out of memory.
The workaround is to use cursor.setinputsizes([(pyodbc.SQL_WVARCHAR, 0, 0)]) (as described here) to coax pyodbc into treating [N]TEXT columns like [n]varchar(max) columns.
(Given that [N]TEXT is a deprecated column type for SQL Server it is unlikely that there will be a formal fix for this issue.)
While this issue was solved for the OP by Gord Thompson's answer, I wanted to note that the question as written applies to other cases where a MemoryError may occur, and fast_executemany actually can throw that in other circumstances beyond just usage of [N]TEXT columns.
In my case a MemoryError was thrown during an attempt to INSERT several million records at once, and as noted here, "parameter values are held in memory, so very large numbers of records (tens of millions or more) may cause memory issues". It doesn't necessarily require tens of millions to trigger, so YMMV.
An easy solution to this is to identify a sane number of records to batch per each execute. Here's an example if using a Pandas dataframe as a source (establish your insert_query as usual):
batch_size = 5000 # Set to a desirable batch size
with connection.cursor() as cursor:
try:
cursor.fast_executemany = True
# Iterate each batch chunk using numpy's split
for chunk in np.array_split(df, batch_size):
cursor.executemany(insert_query,
chunk.values.tolist())
# Run a single commit at the end of the transaction
connection.commit()
except Exception as e:
# Rollback on any exception
connection.rollback()
raise e
Hope this helps anyone who hits this issue and doesn't have any [N]TEXT columns on their target!
In my case, the MemoryError was because I was using a very old driver 'SQL Server'. Switched to the newer driver ('ODBC Driver 17 for SQL Server') as described in the link below and it worked:
link
I'm trying to query a table of my MS SQL Server from R and work with the data. Somewhere along the way some of my characters are apparently lost or transformed. What am I doing wrong?
R code for querying the data:
library("RODBC")
dbhandle <- odbcConnect("Local MSSQL db", DBMSencoding= "windows-1252")
response <- sqlQuery(dbhandle, "select NEM from databasename.dbo.tablename")
I tried omitting the DBMSencoding parameter, as well as setting it to utf-8, windows-1250 and windows-1251 with no success.
When I write and view the result in a csv (without any transformation afterwards) it looks like this:
No accents
(I am aware that RStudio has limited capability in displaying Unicode characters, so I'm verifying the success of the query by writing the data to a csv)
First off this is my setup:
Windows 7
MS SQL Server 2008
Python 3.6 Anaconda Distribution
I am working in a Jupyter notebook and trying to import a column of data from a MS SQL Server database using SQLAlchemy. The column in question contains cells which store long strings of text (datatype is nvarchar(max)). This is my code:
engine = create_engine('mssql+pyodbc://user:password#server:port/db_name?driver=SQL+Server+Native+Client+11.0'
stmt = 'SELECT componenttext FROM TranscriptComponent WHERE transcriptId=1265293'
connection = engine.connect()
results = connection.execute(stmt).fetchall()
This executes fine, and imports a list of strings. However when I examine the strings they are truncated, and in the middle of the strings the following message seems to have been inserted:
... (8326 characters truncated) ...
With the number of characters varying from string to string. I did a check on how long the strings that got imported are, and the ones that have been truncated are all limited at either 339 or 340 characters.
Is this a limitation in SQLAlchemy, Python or something else entirely?
Any help appreciated!
Same problem here!
Set up :
Windows Server 2012
MS SQL Server 2016/PostgreSQL 10.1
Python 3.6 Anaconda Distribution
I've tested everything I could, but can't overpass this 33x limitation in field length. Either varchar/text seems to be affected and the DBMS/driver doesn't seem to have any influence.
EDIT:
Found the source of the "problem": https://bitbucket.org/zzzeek/sqlalchemy/issues/2837
Seems like fetchall() is affected by this feature.
The only workaround i found was :
empty_list=[]
connection = engine.connect()
results = connection.execute(stmt)
for row in results:
empty_list.append(row['componenttext'])
This way i haven't found any truncation in my long string field(>3000 ch).
I have a big problem when I try to save an object that's bigger than 400KB in a varbinary(max) column, calling ODBC from C++.
Here's my basic workflow of calling SqlPrepare, SQLBindParameter, SQLExecute, SQLPutData (the last one various times):
SqlPrepare:
StatementHandle 0x019141f0
StatementText "UPDATE DT460 SET DI024543 = ?, DI024541 = ?, DI024542 = ? WHERE DI006397 = ? AND DI008098 = ?"
TextLength 93
Binding of first parameter (BLOB field):
SQLBindParameter:
StatementHandle 0x019141f0
ParameterNumber 1
InputOutputType 1
ValueType -2 (SQL_C_BINARY)
ParameterType -4 (SQL_LONGVARBINARY)
ColumnSize 427078
DecimalDigits 0
ParameterValPtr 1
BufferLength 4
StrLenOrIndPtr -427178 (result of SQL_LEN_DATA_AT_EXEC(427078))
SQLExecute:
StatementHandle 0x019141f0
Attempt to save blob in chunks of 32K by calling SQLPutData a number of times:
SQLPutData:
StatementHandle 0x019141f0
DataPtr address of a std::vector with 32768 chars
StrLen_or_Ind 32768
During the very first SQLPutData-operation with the first 32KB of data, I get the following SQL Server error:
[HY000][Microsoft][ODBC SQL Server Driver]Warning: Partial insert/update. The insert/update of a text or image column(s) did not succeed.
This happens always when I try to save an object with a size of more than 400KB. Saving something that's smaller than 400KB works just fine.
I found out the critical parameter is ColumSize of SQLBindParemter. The parameter StrLenOrIndPtr during SQLBindParameter can have lower values (like 32K),
it still results in the same error.
But according to SQL Server API, I don't see why this should be problematic as long as I call SQLPutData with chunks of data that are smaller than 32KB.
Does anyone have an idea what the problem could be?
Any help would be greatly appreciated.
Ok, I just found out this was actually an sql driver problem!
After installing the newest version of Microsoft® SQL Server® 2012 Native Client (from http://www.microsoft.com/de-de/download/details.aspx?id=29065), saving bigger BLOBs works with exactly these parameters from above.