Pandas read_sql changing large number IDs when reading - sql-server

I transferred an Oracle database to SQL Server and all seems to have gone well. The various ID columns are large numbers, so I had to use Decimal as they were too large for BigInt.
I am now trying to read the data using pandas.read_sql with a pyodbc connection and ODBC Driver 17 for SQL Server: df = pandas.read_sql("SELECT * FROM table1", con)
The numbers come out as float64, and when I try to print them or use them in SQL statements they come out in scientific notation. When I try to use '{:.0f}'.format(df.loc[i,'Id']), it turns several different numbers into the same number, such as 90300111000003078520832. It is as if precision is lost when the value goes to scientific notation.
I also tried pd.options.display.float_format = '{:.0f}'.format before the read_sql but this did not help.
Clearly I must be doing something wrong as the Ids in the database are correct.
Any help is appreciated. Thanks.

pandas' read_sql method has an option named coerce_float which defaults to True and it …
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
However, in your case it is not useful, so simply specify coerce_float=False.
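A minimal, standard-library-only sketch of why the float64 conversion collapses neighbouring IDs. The two ID values are made up for illustration, chosen close to the number quoted in the question:

```python
from decimal import Decimal

# Two distinct, hypothetical IDs of the same magnitude as the one in the question.
id_a = Decimal("90300111000003078520833")
id_b = Decimal("90300111000003078520834")

# A float64 has a 53-bit significand, so at this magnitude (about 9e22)
# adjacent representable values are 2**24 apart and nearby IDs collide.
as_float_a = float(id_a)
as_float_b = float(id_b)

print(id_a == id_b)                 # compares the exact Decimal values
print(as_float_a == as_float_b)     # compares the rounded float64 values
print("{:.0f}".format(as_float_a))  # what the float actually stores
print(str(id_a))                    # Decimal keeps every digit
```

With coerce_float=False the column should stay as decimal.Decimal objects (dtype object) instead of being rounded to float64, so the digits survive.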

I've had this problem too, especially when working with long ids: read_sql works fine for the primary key, but not for other columns (like the retweeted_status_id from Twitter API calls). Setting coerce_float to False does nothing for me, so instead I cast retweeted_status_id to a character format in my SQL query.
Using psql, I do:
df = pandas.read_sql("SELECT *, Id::text FROM table1", con)
But in SQL Server it'd be something like
df = pandas.read_sql("SELECT *, CONVERT(text, Id) FROM table1", con)
or
df = pandas.read_sql("SELECT *, CAST(Id AS varchar) FROM table1", con)
Obviously there's a cost here if you're asking to cast many rows, and a more efficient option might be to pull from SQL server without using pandas (as a nested list or JSON or something else) which will also preserve your long integer formats.

Related

Convert DT_BYTES to Integer

I need to read a Micro Focus COBOL data file (on PC) containing COMP fields. FYI, a COMP field stores an integer in binary format.
If I transfer the raw binary into SQL Server, I can convert it to a BigInt using
CONVERT(bigint,compField,1).
That way, CONVERT(bigint,0x0000002B17,1) will become 11031.
I also need to deal with negative values. In T-SQL it looks like this:
CONVERT(bigint,0xFFFFFFD4E9,1) - CONVERT(bigint,0xFFFFFFFFFF,1)-0x0000000001
will give -11031.
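The subtraction above is a manual two's-complement decode. As a sketch of the same arithmetic outside T-SQL (assuming, as the two examples suggest, that the COMP bytes are big-endian two's complement; comp_to_int is a hypothetical helper name):

```python
def comp_to_int(raw: bytes) -> int:
    """Interpret big-endian bytes as a signed two's-complement integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)

# The two 5-byte values from the examples above.
positive = comp_to_int(bytes.fromhex("0000002B17"))
negative = comp_to_int(bytes.fromhex("FFFFFFD4E9"))
print(positive, negative)
```

The sign is taken from the top bit of the field's own width (5 bytes here), which is why the T-SQL version needs the extra subtraction once the value has been widened to bigint.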
Is there a way to do this directly in the data flow? I'm sure the info is out there somewhere, but I'm too dumb to find it.
I'm working with SSIS 2019 btw.
Thank you!
Simon.

RODBC ERROR: 'Calloc' could not allocate memory

I am setting up a SQL Azure database. I need to write data into the database on a daily basis. I am using 64-bit R version 3.3.3 on Windows 10. Some of the columns contain text (more than 4000 characters). Initially, I imported some data from a csv into the SQL Azure database using Microsoft SQL Server Management Studio. I set up the text columns as ntext format, because when I tried using nvarchar the max was 4000 and some of the values got truncated even though they were about 1100 characters long.
In order to append to the database, I first save the records in a temp table where I have predefined the varTypes:
varTypesNewFile <- c("Numeric", rep("NTEXT", ncol(newFileToAppend) - 1))
names(varTypesNewFile) <- names(newFileToAppend)
sqlSave(dbhandle, newFileToAppend, "newFileToAppendTmp", rownames = F, varTypes = varTypesNewFile, safer = F)
and then append them by using:
insert into mainTable select * from newFileToAppendTmp
If the text is not too long, the above does work. However, sometimes I get the following error during the sqlSave command:
Error in odbcUpdate(channel, query, mydata, coldata[m, ], test = test, :
'Calloc' could not allocate memory (1073741824 of 1 bytes)
My questions are:
How can I counter this issue?
Is this the format I should be using?
Additionally, even when the above works, it takes about an hour to upload about 5k records. Isn't that too long? Is this the normal amount of time it should take? If not, what could I do better?
RODBC is very old, and can be a bit flaky with NVARCHAR columns. Try using the RSQLServer package instead, which offers an alternative means to connect to SQL Server (and also provides a dplyr backend).

How is the format of the geography data type built in SQL Server?

I am not able to understand how the geography data type is laid out in SQL Server...
For example I have the following data:
0xE6100000010CCEAACFD556484340B2F336363BCA21C0
what I know:
0x is prefix for hexadecimal
last 16 numbers are longitude: B2F336363BCA21C0 (double of decimal format)
16 numbers before the last 16 are latitude: CEAACFD556484340 (double of decimal format)
4 first numbers are SRID: E610 (hexadecimal for WGS84)
what I don't understand:
numbers from 5 to 12 : 0000010C
what is this?
From what I read, this seems linked to WKB (Well-Known Binary) or EWKB (Extended Well-Known Binary), but I was not able to find a definition of EWKB...
And in WKB this is supposed to be the geometry type (a 4-byte integer), but the value doesn't match the geometry type codes (this example is a single point coordinate).
Can you help me understand this format?
The spatial types (geometry and geography) in SQL Server are implemented as CLR data types. As with any such data type, you get a binary representation when you query the value directly. Unfortunately, it's not (as far as I know) WKB but rather whatever format Microsoft decided was best for their implementation. As users, we should work with the interface of methods published by MS (for instance, the geography method reference). That is, you should only try to decipher the MS binary representation if you're curious, not for actually working with it.
That said, if you need/want to work with WKB, you can! For example, you can use the STGeomFromWKB() static method to create a geography instance from WKB that you provide and STAsBinary() can be called on a geography instance to return WKB to you.
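For the curious, the example value from the question can be unpacked with a short sketch. The layout used here is an assumption based on my reading of Microsoft's published CLR-types serialization format for a single point: a little-endian 4-byte SRID, one version byte, one flags byte, then two IEEE 754 doubles. The variable names are mine.

```python
import struct

raw = bytes.fromhex("E6100000010CCEAACFD556484340B2F336363BCA21C0")

# Assumed single-point layout: uint32 SRID, version byte, flags byte,
# then two doubles, all little-endian with no padding.
srid, version, flags, first, second = struct.unpack("<IBBdd", raw)

print(srid)                 # SRID read as a 4-byte little-endian integer
print(version, hex(flags))  # the two bytes the question asks about (01 and 0C)
print(first, second)        # the two coordinates
```

Under that assumption, the SRID occupies the first eight hex digits (E6100000) rather than four, and the 01 and 0C that follow are a version number and a set of property flags, not a WKB geometry type code. The two doubles are the values the question identifies as latitude and longitude.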
The Format spec can be found here:
https://msdn.microsoft.com/en-us/library/ee320529(v=sql.105).aspx
As that page shows, it used to change very frequently, but has slowed down significantly over the past 2 years.
I currently need to dig into the spec to serialize from JVM code into a bcp file so that I can use SQLServerBulkCopy rather than plain JDBC to upload data into tables (it is about 7x faster to write a bcp file than to use JDBC), but this is proving to be more complicated than I originally anticipated.
After testing with bcp: you can upload geographies by specifying an off-row format ( varchar(max) ) and storing the well-known text. SQL Server will see this and assume you wanted a geography based on the WKT it sees.
In my case converting to nvarchar resolved the issue.

Error converting Varchar to Decimal SQL Server

I have a staging table loaded with data from a SAS dataset containing 5M records. All the columns are varchar. I am trying to convert a couple of columns to decimal(32,10). But it generates an error. I tried cast, I tried convert and even splitting the data up before and after decimal - same result.
I looked at the IsNumeric flag of the column and there are 0 records <> 1 meaning the data is numeric.
case
when wtd_count = '.' THEN NULL
when wtd_count = '' THEN NULL
else convert(decimal(32, 10), wtd_count)
end
Error:
Msg 8114, Level 16, State 5, Line 99
Error converting data type varchar to numeric.
So I'm wondering what else I can do to convert the data to decimal? Any idea?
Any help will greatly be appreciated.
If you are on SQL Server 2012 or above, try TRY_PARSE or TRY_CONVERT.
ISNUMERIC is not reliable for what you're doing. It will flag values with things like monetary symbols and commas in them as valid.
It seems quite likely that there is some non-numeric data present. TRY_CONVERT or TRY_PARSE are your friend here. As an FYI, SQL Server version 11.0.x is SQL Server 2012, so you should be able to use these.
I also find it hard to believe that converting to numeric works, but not decimal. I can find no information that suggests the actual implementation of these two data types is different, and as such neither should work.
I would do some more in depth analysis of your data to make sure it looks like you're expecting it to.
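As one way to do that analysis outside the database, here is a rough sketch that flags values which are not plain decimal literals. The pattern is my approximation of what a varchar-to-decimal conversion accepts, not SQL Server's exact rules, and the sample values are hypothetical:

```python
import re

# Approximation of a plain decimal literal: optional sign, digits, optional fraction.
PLAIN_DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

def find_suspect_values(values):
    """Return the values that do not look like plain decimal literals."""
    return [v for v in values if not PLAIN_DECIMAL.fullmatch(v.strip())]

# Hypothetical sample, including the '.' and '' placeholders from the question.
sample = ["12.5", "-3", "1,234", "$5", "1e5", ".", ""]
suspects = find_suspect_values(sample)
print(suspects)
```

Running the same kind of check over an export of the staging column should surface the rows that ISNUMERIC lets through but the conversion rejects.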
After reading your CASE statement, I suppose that your values use a comma as the decimal separator. I'm pretty sure that you should use: CONVERT(DECIMAL(32,10),REPLACE(wtd_count,',','.'))

What datatype should I bind as query parameter to use with NUMBER(15) column in Oracle ODBC?

I have just been bitten by the issue described in the SO question Binding int64 (SQL_BIGINT) as query parameter causes error during execution in Oracle 10g ODBC.
I'm porting a C/C++ application using ODBC 2 from SQL Server to Oracle. For numeric fields exceeding NUMBER(9) it uses the __int64 datatype, which is bound to queries as SQL_C_SBIGINT. Apparently such binding is not supported by Oracle ODBC. I must now do an application-wide conversion to another method. Since I don't have much time (it's an unexpected issue), I would rather use a proven solution, not trial and error.
What datatype should be used to bind as e.g. NUMBER(15) in Oracle? Is there documented recommended solution? What are you using? Any suggestions?
I'm especially interested in solutions that do not require any additional conversions. I can easily provide and consume numbers in form of __int64 or char* (normal non-exponential form without thousands separator or decimal point). Any other format requires additional conversion on my part.
What I have tried so far:
SQL_C_CHAR
Looks like it's going to work for me. I was worried about variability of number format. But in my use case it doesn't seem to matter. Apparently only fraction point character changes with system language settings.
And I don't see why I should use an explicit cast (e.g. TO_NUMBER) in the SQL INSERT or UPDATE command. Everything works fine when I bind the parameter with SQL_C_CHAR as the C type and SQL_NUMERIC (with proper precision and scale) as the SQL type. I couldn't reproduce any data corruption effect.
SQL_NUMERIC_STRUCT
I've noticed SQL_NUMERIC_STRUCT added with ODBC 3.0 and decided to give it a try. I am disappointed.
In my situation it is enough, as the application doesn't really use fractional numbers. But as a general solution... Simply, I don't get it. I mean, I finally understood how it is supposed to be used. What I don't get is: why anyone would introduce new struct of this kind and then make it work this way.
SQL_NUMERIC_STRUCT has all the fields needed to represent any NUMERIC (or NUMBER, or DECIMAL) value with its precision and scale. Only they are not used.
When reading, ODBC sets precision of the number (based on precision of the column; except that Oracle returns bigger precision, e.g. 20 for NUMBER(15)). But if your column has fractional part (scale > 0) it is by default truncated. To read number with proper scale you need to set precision and scale yourself with SQLSetDescField call before fetching data.
When writing, Oracle thankfully respects scale contained in SQL_NUMERIC_STRUCT. But ODBC spec doesn't mandate it and MS SQL Server ignores this value. So, back to SQLSetDescField again.
See HOWTO: Retrieving Numeric Data with SQL_NUMERIC_STRUCT and INF: How to Use SQL_C_NUMERIC Data Type with Numeric Data for more information.
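To illustrate why the scale matters, here is a small sketch of how the fields of a SQL_NUMERIC_STRUCT combine into a value. It assumes the layout from the ODBC headers (sign is 1 for positive and 0 for negative, and val is a 16-byte little-endian unscaled integer); numeric_struct_to_decimal is a hypothetical helper, not part of any ODBC binding:

```python
from decimal import Decimal

def numeric_struct_to_decimal(scale: int, sign: int, val: bytes) -> Decimal:
    """Rebuild a value from SQL_NUMERIC_STRUCT-style fields.

    val holds the unscaled magnitude as a little-endian integer;
    sign is 1 for positive and 0 for negative.
    """
    magnitude = int.from_bytes(val, byteorder="little")
    digits = tuple(int(ch) for ch in str(magnitude))
    # Decimal's tuple form uses 0 for positive and 1 for negative.
    return Decimal((0 if sign == 1 else 1, digits, -scale))

# The same 16 bytes, interpreted with two different scales.
val = (1234567).to_bytes(16, byteorder="little")
with_scale = numeric_struct_to_decimal(2, 1, val)
without_scale = numeric_struct_to_decimal(0, 1, val)
print(with_scale, without_scale)
```

The bytes alone do not say where the decimal point goes, so a driver and an application that disagree on the scale will read different numbers from the same struct.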
Why ODBC doesn't fully use its own SQL_NUMERIC_STRUCT? I don't know. It looks like it works but I think it's just too much work.
I guess I'll use SQL_C_CHAR.
My personal preference is to make the bind variables character strings (VARCHAR2), and let Oracle do the conversion from character to its own internal storage format. It's easy enough (in C) to get data values represented as null-terminated strings, in an acceptable format.
So, instead of writing SQL like this:
SET MY_NUMBER_COL = :b1
, MY_DATE_COL = :b2
I write the SQL like this:
SET MY_NUMBER_COL = TO_NUMBER( :b1 )
, MY_DATE_COL = TO_DATE( :b2 , 'YYYY-MM-DD HH24:MI:SS')
and supply character strings as the bind variables.
There are a couple of advantages to this approach.
One is that it works around the issues and bugs one encounters with binding other data types.
Another advantage is that bind values are easier to decipher on an Oracle event 10046 trace.
Also, an EXPLAIN PLAN (I believe) expects all bind variables to be VARCHAR2, so that means the statement being explained is slightly different than the actual statement being executed (due to the implicit data conversions when the datatypes of the bind arguments in the actual statement are not VARCHAR2.)
And (less important) when I'm testing the statement in TOAD, it's easier just to be able to type strings into the input boxes, and not have to muck with changing the datatype in a dropdown list box.
I also let the built-in TO_NUMBER and TO_DATE functions validate the data. (In earlier versions of Oracle at least, I encountered issues with binding a DATE value directly: it bypassed at least some of the validity checking and allowed invalid date values to be stored in the database.)
This is just a personal preference, based on past experience. I use this same approach with Perl DBD.
I wonder what Tom Kyte (asktom.oracle.com) has to say about this topic?
