DbUnit: Setting tolerance value - compare SQL Server vs SAP HANA - sql-server

Most important
DbUnit reports a difference for a double value in row 78:
Exception in thread "main" junit.framework.ComparisonFailure: value (table=dataset, row=78, col=DirtyValue) expected:<4901232.27291950[7]> but was:<4901232.27291950[6]>
So I assume that SQL Server returns 4901232.272919507 while HANA returns 4901232.272919506
(Based on the answer to JUnit assertEquals Changes String)
Then I tried to set the tolerated delta according to the FAQ entry "Is there an equivalent to JUnit's assertEquals(double expected, double actual, double delta) to define a tolerance level when comparing numeric values?"
But I still get the same error. Any ideas?
Additional information
Maybe this is the reason?
[main] WARN org.dbunit.dataset.AbstractTableMetaData - Potential problem found: The configured data type factory 'class org.dbunit.dataset.datatype.DefaultDataTypeFactory' might cause problems with the current database 'Microsoft SQL Server' (e.g. some datatypes may not be supported properly). In rare cases you might see this message because the list of supported database products is incomplete (list=[derby]). If so please request a java-class update via the forums.If you are using your own IDataTypeFactory extending DefaultDataTypeFactory, ensure that you override getValidDbProducts() to specify the supported database products.
[main] WARN org.dbunit.dataset.AbstractTableMetaData - Potential problem found: The configured data type factory 'class org.dbunit.dataset.datatype.DefaultDataTypeFactory' might cause problems with the current database 'HDB' (e.g. some datatypes may not be supported properly). In rare cases you might see this message because the list of supported database products is incomplete (list=[derby]). If so please request a java-class update via the forums.If you are using your own IDataTypeFactory extending DefaultDataTypeFactory, ensure that you override getValidDbProducts() to specify the supported database products.
DbUnit Version 2.5.4
DirtyValue is calculated from 3 double values in both systems
SQL Server
SELECT TypeOfGroup, Segment, Portfolio, UniqueID, JobId, DirtyValue, PosUnits, FX_RATE, THEO_Value
FROM DATASET_PL
order by JobId, TypeOfGroup, Segment, Portfolio, UniqueID COLLATE Latin1_General_bin
HANA
SELECT "TypeOfGroup", "Segment", "Portfolio", "UniqueID", "JobId", "DirtyValue", Pos_Units as "PosUnits", FX_RATE, THEO_Value as "THEO_Value"
FROM "_SYS_BIC"."meag.app.h4q.metadata.dataset.pnl/06_COMPARE_CUBES_AND_CALC_ATTR"
order by "JobId", "TypeOfGroup", "Segment", "Portfolio", "UniqueID"

Work-around
Use a DiffCollectingFailureHandler and handle the differences there:
import java.util.List;
import org.dbunit.Assertion;
import org.dbunit.assertion.DiffCollectingFailureHandler;
import org.dbunit.assertion.Difference;
DiffCollectingFailureHandler diffHandler = new DiffCollectingFailureHandler();
// Pass the handler so that differences are collected instead of thrown.
Assertion.assertEquals(expectedTable, actualTable, diffHandler);
List<Difference> diffList = diffHandler.getDiffList();
for (Difference diff : diffList) {
    if (diff.getColumnName().equals("DirtyValue")) {
        double actual = ((Number) diff.getActualValue()).doubleValue();
        double expected = ((Number) diff.getExpectedValue()).doubleValue();
        // Compare the signed difference; abs(abs(a) - abs(b)) would treat 5 and -5 as equal.
        if (Math.abs(actual - expected) > 0.00001) {
            logDiff(diff);
        } else {
            logDebugDiff(diff);
        }
    } else {
        logDiff(diff);
    }
}
// Real differences are logged as errors.
private void logDiff(Difference diff) {
    logger.error(String.format("Diff found in row:%s, col:%s expected:%s, actual:%s", diff.getRowIndex(), diff.getColumnName(), diff.getExpectedValue(), diff.getActualValue()));
}
// Differences within the tolerance are only logged at debug level.
private void logDebugDiff(Difference diff) {
    logger.debug(String.format("Diff found in row:%s, col:%s expected:%s, actual:%s", diff.getRowIndex(), diff.getColumnName(), diff.getExpectedValue(), diff.getActualValue()));
}
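The tolerance test at the heart of this work-around can be tried on its own. The following is a minimal sketch in plain Java, independent of DbUnit. The class name ToleranceCheck and the method withinTolerance are hypothetical names used only for illustration; the delta of 0.00001 and the two sample values are taken from the question.

```java
// Hypothetical helper, not part of DbUnit: checks two doubles against an absolute delta.
final class ToleranceCheck {

    static boolean withinTolerance(double expected, double actual, double delta) {
        // Compare the signed difference so that 5.0 and -5.0 are not treated as equal.
        return Math.abs(expected - actual) <= delta;
    }

    public static void main(String[] args) {
        double sqlServer = 4901232.272919507; // value the question assumes for SQL Server
        double hana = 4901232.272919506;      // value the question assumes for HANA
        System.out.println(withinTolerance(sqlServer, hana, 0.00001));
        System.out.println(withinTolerance(5.0, -5.0, 0.00001));
    }
}
```

The first call prints true and the second prints false, which is why the comparison should be made on the signed difference rather than on the absolute values.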

The question was "Any ideas?", so maybe it helps to understand why the difference occurs.
HANA truncates when needed; see the numeric types section of the "HANA SQL and System Views Reference". In HANA, the following statement returns 123.45:
select cast( '123.456' as decimal(6,2)) from dummy;
SQL Server rounds when needed, at least if the target data type is numeric; see e.g. the "Truncating and rounding results" section of the documentation.
The same statement (without the "from dummy" clause, which is specific to HANA) returns 123.46 in SQL Server.
The SQL standard seems to leave open whether to round or to truncate; see the related answer on SO.
I am not aware of any setting that changes the rounding behavior in HANA, but there may be one.
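The truncate-versus-round difference can be reproduced outside either database. The sketch below uses Java's BigDecimal; RoundingMode.DOWN stands in for HANA's truncation and RoundingMode.HALF_UP is an assumption standing in for SQL Server's rounding. The class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo class: reduces a decimal literal to a fixed scale in two ways.
final class RoundingDemo {

    // Drops the extra digits, as HANA is described to do.
    static BigDecimal truncate(String value, int scale) {
        return new BigDecimal(value).setScale(scale, RoundingMode.DOWN);
    }

    // Rounds to the nearest value at the given scale (assumed model for SQL Server).
    static BigDecimal roundHalfUp(String value, int scale) {
        return new BigDecimal(value).setScale(scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println(truncate("123.456", 2));    // prints 123.45
        System.out.println(roundHalfUp("123.456", 2)); // prints 123.46
    }
}
```

When a value is computed independently in two systems that differ in this way, a last-digit difference like the one in row 78 is plausible, so comparing with a tolerance is a reasonable approach.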

Related

Fetching ElasticSearch Results into SQL Server by calling Web Service using SQL CLR

Code Migration due to Performance Issues :-
SQL Server LIKE Condition ( BEFORE )
SQL Server Full Text Search --> CONTAINS ( BEFORE )
Elastic Search ( CURRENTLY )
Achieved So Far :-
We have a web page created in ASP.NET Core with an autocomplete drop-down over 2.5+ million companies indexed in Elasticsearch: https://www.99corporates.com/
Due to performance issues, we have successfully moved our code from SQL Server Full-Text Search to Elasticsearch, using NEST v7.2.1 and Elasticsearch.Net v7.2.1 in our .NET code.
Still looking for a solution :-
If the user does not select a company from the autocomplete list and simply enters a few characters and clicks Go, a list of matches should be displayed, which we previously produced with SQL Server Full-Text Search --> CONTAINS.
Can we call the ASP.NET web service we have created from SQL CLR, with code like SELECT * FROM dbo.Table WHERE Name IN ( dbo.SQLWebRequest('') )?
[System.Web.Script.Services.ScriptMethod()]
[System.Web.Services.WebMethod]
public static List<string> SearchCompany(string prefixText, int count)
{
}
Is there any better or alternative option?
While that solution (i.e. the SQL-APIConsumer SQLCLR project) "works", it is not scalable. It also requires setting the database to TRUSTWORTHY ON (a security risk), and it loads a few assemblies as UNSAFE, such as Json.NET. That is risky if any of them use static variables for caching on the assumption that each caller is isolated in its own App Domain: SQLCLR is a single, shared App Domain, so static variables are shared across all callers, and multiple concurrent threads can cause race conditions. This is not to say that this is definitely happening, since I haven't seen the code. But if you haven't either reviewed the code or tested it with multiple concurrent threads to make sure it doesn't pose a problem, then it's definitely a gamble with regard to stability and predictable, expected behavior.
To a slight degree I am biased given that I do sell a SQLCLR library, SQL#, in which the Full version contains a stored procedure that also does this but a) handles security properly via signatures (it does not enable TRUSTWORTHY), b) allows for handling scalability, c) does not require any UNSAFE assemblies, and d) handles more scenarios (better header handling, etc). It doesn't handle any JSON, it just returns the web service response and you can unpack that using OPENJSON or something else if you prefer. (yes, there is a Free version of SQL#, but it does not contain INET_GetWebPages).
HOWEVER, I don't think SQLCLR is a good fit for this scenario in the first place. In your first two versions of this project (using LIKE and then CONTAINS) it made sense to send the user input directly into the query. But now that you are using a web service to get a list of matching values from that user input, you are no longer confined to that approach. You can, and should, handle the web service / Elastic Search portion of this separately, in the app layer.
Rather than passing the user input into the query, only to have the query pause to get that list of 0 or more matching values, you should do the following:
1. Before executing any query, get the list of matching values directly in the app layer.
2. If no matching values are returned, you can skip the database call entirely, as you already have your answer, and respond immediately to the user (much faster response time when no matches return).
3. If there are matches, then execute the search stored procedure, sending that list of matches as-is via a Table-Valued Parameter (TVP), which becomes a table variable in the stored procedure. Use that table variable to INNER JOIN against the table rather than doing an IN list, since IN lists do not scale well. Also, be sure to send the TVP values to SQL Server using the IEnumerable<SqlDataRecord> method, not the DataTable approach, as that merely wastes CPU / time and memory.
For example code on how to accomplish this correctly, please see my answer to Pass Dictionary to Stored Procedure T-SQL
In C#-style pseudo-code, this would be something along the lines of the following:
List<string> companies;
companies = SearchCompany(PrefixText, Count);
if (companies.Count == 0)
{
    // No matches: skip the database call entirely.
    Response.Write("Nope");
}
else
{
    using (SqlConnection db = new SqlConnection(connectionString))
    {
        using (SqlCommand batch = db.CreateCommand())
        {
            batch.CommandType = CommandType.StoredProcedure;
            batch.CommandText = "ProcName";
            // Send the matches as a Table-Valued Parameter.
            SqlParameter tvp = new SqlParameter("ParamName", SqlDbType.Structured);
            tvp.Value = MethodThatYieldReturnsList(companies);
            batch.Parameters.Add(tvp);
            db.Open();
            using (SqlDataReader results = batch.ExecuteReader())
            {
                if (results.HasRows)
                {
                    // deal with results
                    Response.Write(results....);
                }
            }
        }
    }
}
Done. Got the solution.
Used SQL CLR https://github.com/geral2/SQL-APIConsumer
exec [dbo].[APICaller_POST]
@URL = 'https://www.-----/SearchCompany'
,@JsonBody = '{"searchText":"GOOG","count":10}'
Let me know if there are any other / better options to achieve this.

"Value already exists as a correcting value" error when cleansing data with DQS

In SQL Server 2012 Data Quality Services, I need to clean the data in Term Based Relation as follows:
String     Replace to
Wal        walmart
Wlr        walmart
Wlt        walmart
Walmart    (empty)
That is, the words "wal", "wlr", and "wlt" have to be replaced with "walmart", and finally "walmart" is replaced with an empty space.
It shows the following error:
SQL Server Data Quality Services
--------------------------------------------------------------------------------
2/1/2013 2:48:37 PM
Message Id: DataValueServiceTermBasedRelationCorrectedValueAlreadyCorrectingValue
Term Based Relation (walmart, ) cannot be added for domain 'keywordphrase' because 'walmart' value already exists as a correcting value.
--------------------------------------------------------------------------------
Microsoft.Ssdqs.DataValueService.Service.DataValueServiceException: Term Based Relation (walmart, ) cannot be added for domain 'keywordphrase' because 'walmart' value already exists as a correcting value.
at Microsoft.Ssdqs.DataValueService.Managers.DomainTermBasedRelationManager.PreapareAndValidateRelation(DomainTermBasedRelation relation, IMasterContext context)
at Microsoft.Ssdqs.DataValueService.Managers.DomainTermBasedRelationManager.Add(IMasterContext context, ServiceDefinitionBase data)
at Microsoft.Ssdqs.DataValueService.Service.DataValueServiceConcrete.Add(IMasterContext context, ReadOnlyCollection`1 data)
Any suggestions for a solution?
Thanks,
It is my understanding that DQS does not support multi-level replacements (i.e., a->b then b->c). Why not go straight to blanks for the first three terms?
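If the rules are prepared outside DQS before being entered, chained replacements can be collapsed so that every term maps directly to its final value. The following is a minimal sketch in plain Java; it is not a DQS API, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collapses chained replacement rules (a->b, b->c) into direct ones (a->c).
final class ReplacementFlattener {

    static Map<String, String> flatten(Map<String, String> rules) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String target = rule.getValue();
            int hops = 0;
            // Follow the chain until the target is no longer a term that gets replaced.
            // The hop limit guards against cycles such as a->b, b->a.
            while (rules.containsKey(target) && hops < rules.size()) {
                target = rules.get(target);
                hops++;
            }
            flat.put(rule.getKey(), target);
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("wal", "walmart");
        rules.put("wlr", "walmart");
        rules.put("wlt", "walmart");
        rules.put("walmart", ""); // final replacement with a blank
        System.out.println(flatten(rules));
    }
}
```

After flattening, "wal", "wlr", "wlt" and "walmart" all map directly to the blank value, so no term appears both as a corrected value and as a correcting value.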

Analysis Services stored procedure performance

I'm writing a stored procedure in .NET to do some complex calculations that can't be written easily in pure MDX. The first problem I'm having is how to retrieve a set of data in a tabular form to pass to my calculation.
My code so far is written below. I would have thought that after we retrieve our value at position ***1, we would have all the data in memory to interact with. However, it seems that at position ***2, a Query Subcube is issued to the storage engine for each and every day in our range. This is devastating to performance.
Is there something I'm doing wrong? Is there another method I can call to evaluate the set all at once?
// First get the date range that we'd like to calculate over.
// (These values are constant here for example only)
DateTime date = new DateTime(2012, 4, 1);
int dateFrom = KeyFromDate(date.AddDays(-360));
int dateTo = KeyFromDate(date);
string dateRange = string.Format(
"[Date].[Date].&[{0}]:[Date].[Date].&[{1}]",
dateFrom,
dateTo
);
Expression expression = new Expression(dateRange + "*[Measures].[My Measure]");
MDXValue value = expression.CalculateMdxObject(null); // ***1
foreach (var tuple in value.ToSet().Tuples)
{
    int tupleValue = MDXValue.FromTuple(tuple).ToInt32(); // ***2
}
Run SQL Profiler, connect to Analysis Services, and on the "event selection" tab check "show all events" and select "get data from aggregations", "get data from cache", "query subcube" and "query subcube verbose".
First read this document http://www.microsoft.com/en-us/download/details.aspx?id=17303 (see page 18) in order to understand how "query subcube verbose" works.
Then in Visual Studio (where you're debugging your procedure), step over line ***1 in debug mode
and see in SQL Profiler what is queried in the verbose events: which measure group and which attributes.
Then step over ***2 and check the verbose events in SQL Profiler again.
I believe that the set of attributes is different, so it may happen that at ***1 it uses an aggregation, while at ***2, when "value" is present in the tuple, there is no aggregation for that set of attributes. So instead of doing "read from aggregations" once, it does "read from measure group cache" several times.
I can't be more precise because I don't have your cube. Try to find this out from the "query subcube verbose" events, and try using BIDS Helper to create the necessary aggregations manually (with the specific set of attributes); it may help.

SOQL - Convert Date To Owner Locale

We use the DBAmp for integrating Salesforce.com with SQL Server (which basically adds a linked server), and are running queries against our SF data using OPENQUERY.
I'm trying to do some reporting against opportunities and want to return the created date of the opportunity in the opportunity owners local date time (i.e. the date time the user will see in salesforce).
Our dbamp configuration forces the dates to be UTC.
I stumbled across a date function in the Salesforce documentation that I thought might help, but I get an error when I try to use it, so I can't verify it. Below is the example usage for the convertTimezone function:
SELECT HOUR_IN_DAY(convertTimezone(CreatedDate)), SUM(Amount)
FROM Opportunity
GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))
Below is the error returned:
OLE DB provider "DBAmp.DBAmp" for linked server "SALESFORCE" returned message "Error 13005 : Error translating SQL statement: line 1:37: expecting "from", found '('".
Msg 7350, Level 16, State 2, Line 1
Cannot get the column information from OLE DB provider "DBAmp.DBAmp" for linked server "SALESFORCE".
Can you not use SOQL functions in OPENQUERY as below?
SELECT
*
FROM
OPENQUERY(SALESFORCE,'
SELECT HOUR_IN_DAY(convertTimezone(CreatedDate)), SUM(Amount)
FROM Opportunity
GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))')
UPDATE:
I've just had some correspondence with Bill Emerson (I believe he is the creator of the DBAmp Integration Tool):
You should be able to use SOQL functions so I am not sure why you are
getting the parsing failure. I'll setup a test case and report back.
I'll update the post again when I hear back. Thanks
A new version of DBAmp (2.14.4) has just been released that fixes the issue with using ConvertTimezone in openquery.
Version 2.14.4
Code modified for better memory utilization
Added support for API 24.0 (SPRING 12)
Fixed issue with embedded question marks in string literals
Fixed issue with using ConvertTimezone in openquery
Fixed issue with "Invalid Numeric" when using aggregate functions in openquery
I'm fairly sure that because DBAmp uses SQL and not SOQL, SOQL functions would not be available, sorry.
You would need to expose this data some other way. Perhaps it's possible with a Salesforce report, web-service, or compiling the data through the program you are using to access the (DBAmp) SQL Server.
If you were to create a Salesforce web service, the following example might be helpful.
global class MyWebService
{
    webservice static List<AggregateResult> MyWebServiceMethod()
    {
        // A grouped aggregate query returns one row per group, hence a list.
        List<AggregateResult> ar = [
            SELECT
                HOUR_IN_DAY(convertTimezone(CreatedDate)) Hour,
                SUM(Amount) Amount
            FROM Opportunity
            GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))];
        system.debug(ar);
        return ar;
    }
}

Implementing check constraints with SQL CLR integration

I'm implementing 'check' constraints that simply call a CLR function for each constrained column.
Each CLR function is one or two lines of code that attempts to construct an instance of the user-defined C# data class associated with that column. For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range).
First, what do you think of that approach? For me, it centralizes my data types in C#, making them available throughout my application, while also enforcing the same constraints within the database, so it prevents invalid manual edits in management studio that non-programmers may try to make. It's working well so far, although updating the assembly causes constraints to be disabled, requiring a recheck of all constraints (which is perfectly reasonable). I use DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS to make sure the data in all tables is still valid for enabled and disabled constraints, making corrections as necessary, until there are no errors. Then I re-enable the constraints on all the tables via ALTER TABLE [tablename] WITH CHECK CHECK CONSTRAINT ALL. Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
Finally, for the CLR functions used in the check constraints, I can either:
1. Include a try/catch in each function to catch data construction errors, returning false on error, and true on success, so that the CLR doesn't raise an error in the database engine, or...
2. Leave out the try/catch, just construct the instance and return true, allowing that aforementioned 'meaningful' error message to be raised in the database engine.
I prefer 2, because my functions are simpler without the error code, and when someone using management studio makes an invalid column edit, they'll get the meaningful message from the CLR like "Value for type X didn't match regular expression '^p[1-9]\d?$'" instead of some generic SQL error like "constraint violated". Are there any severe negative consequences of allowing CLR errors through to SQL Server, or is it just like any other insert/update failure resulting from a constraint violation?
For example, a "Score" class has a constructor which throws a meaningful error message when construction fails (i.e. when the score is outside a valid range). First, what do you think of that approach?
It worries me a bit, because calling a ctor requires memory allocation, which is relatively expensive. For each row inserted, you're calling a ctor -- and only for its side-effects.
Exceptions are also expensive. They're great when you need them, but this is a case where you would use them in a ctor context, but not in a check context.
A refactoring could reduce both costs, by having the check exist as a class static or free function, then both the check constraint and the ctor could call that:
class Score {
private:
    int score;
public:
    // Pure check: no construction, no exception.
    static bool valid( int score ) {
        return score > 0;
    }
    Score( int s ) {
        if( ! valid( s ) ) {
            throw InvalidParameter();
        }
        score = s;
    }
};
Check constraint calls Score::valid(), no construction or exception needed.
Of course, you still have the overhead, for each row, of a CLR call. Whether that's acceptable is something you'll have to decide.
Is there a T-SQL statement to re-enable with check all check constraints on ALL tables, or do I have to re-enable them table by table?
No, but you can generate the commands like this:
select 'ALTER TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name) + ' WITH CHECK CHECK CONSTRAINT ALL;'
from sys.tables;
and then run the result set against the database.
Comments from the OP:
I use base classes called ConstrainedNumber and RegexConstrainedString for all my data types. I could easily move those two classes' simple constructor code to a separate public boolean IsValueValid method as you suggested, and probably will.
The CLR overhead (and memory allocation) would only occur for inserts and updates. Given the simplicity of the methods, and the rate at which table updates will occur, I don't think the performance impact will be anything to worry about for my system.
I still really want to raise exceptions for the information they'll provide to management studio users. I like the IsValueValid method, because it gives me the 'option' of not throwing errors. Within applications using my data types, I could still get the exception by constructing an instance :)
I'm not sure I agree with the exception throwing, but again, the "take-home message" is that by decomposing the problem into parts, you can select what parts you're willing to pay for, without paying for parts you don't use. The ctor you don't use, because you were only calling it to get the side-effect. So we decomposed creation and checking. We can further decompose throwing:
class Score {
private:
    int score;
public:
    // Check only: returns the validity without throwing.
    static bool isValid( int s ) {
        return s > 0;
    }
    // Check and throw: no construction needed.
    static void checkValid( int s ) {
        if( ! isValid( s ) ) {
            throw InvalidParameter();
        }
    }
    Score( int s ) {
        checkValid( s );
        score = s;
    }
};
Now a user can call the ctor, and get the check and possible exception and construction, call checkValid and get the check and exception, or isValid to just get the validity, paying the runtime cost for only what he needs.
Some clarification. These data classes sit one level above the primitive types, constraining data to make it meaningful.
Actually, they sit just above the RegexConstrainedString and ConstrainedNumber<T> classes, which is where we're talking about refactoring the constructor's validation code into a separate method.
The problem with refactoring the validation code is that the Regex necessary for validation exists only in the subclasses of RegexConstrainedString, since each subclass has a different Regex. This means that the validation data is only available to the RegexConstrainedString's constructor, not any of its methods. So, if I factor out the validation code, callers would need access to the Regex.
public class Password: RegexConstrainedString
{
    internal static readonly Regex regex = CreateRegex_CS_SL_EC( @"^[\w!""#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\\]\^_`{}~]{3,20}$" );
    public Password( string value ): base( value.TrimEnd(), regex ) {} //length enforced by regex, so no min/max values specified
    public Password( Password original ): base( original ) {}
    public static explicit operator Password( string value ) {return new Password( value );}
}
So, when reading a value from the database or reading user input, the Password constructor forwards the Regex to the base class to handle the validation. Another trick is that it trims the end characters automatically, in case the database type is char rather than varchar, so I don't have to remember to do it. Anyway, here is what the main constructor for RegexConstrainedString looks like:
protected RegexConstrainedString( string value, Regex subclasses_static_regex, int? min_length, int? max_length )
{
    _value = (value ?? String.Empty);
    if (min_length != null)
        if (_value.Length < min_length)
            throw new Exception( "Value doesn't meet minimum length of " + min_length + " characters." );
    if (max_length != null)
        if (_value.Length > max_length)
            throw new Exception( "Value exceeds maximum length of " + max_length + " characters." );
    value_match = subclasses_static_regex.Match( _value ); //Match.Synchronized( subclasses_static_regex.Match( _value ) );
    if (!value_match.Success)
        throw new Exception( "Invalid value specified (" + _value + "). \nValue must match regex:" + subclasses_static_regex.ToString() );
}
Since callers would need access to the subclass's Regex, I think my best bet is to implement a IsValueValid method in the subclass, which forwards the data to the IsValueValid method in the RegexConstrainedString base class. In other words, I would add this line to the Password class:
public static bool IsValueValid( string value ) {return IsValueValid( value.TrimEnd(), regex, min_length, max_length );}
I don't like this, however, because I'm replicating the subclass's constructor code, having to remember to trim the string again and pass the same min/max lengths when necessary. This requirement would be forced upon all subclasses of RegexConstrainedString, and it's not something I want to do. Data classes like Password are so simple because RegexConstrainedString handles most of the work, implementing operators, comparisons, cloning, etc.
Furthermore, there are other complications with factoring out the code. The validation involves running and storing a Regex match in the instance, since some data types may have properties that report on specific elements of the string. For example, my SessionID class contains properties like TimeStamp, which return a matched group from the Match stored in the data class instance. The bottom line is that this static method is an entirely different context. Since it's essentially incompatible with the constructor context, the constructor cannot use it, so I would end up replicating code once again.
So... I could factor out the validation code by replicating it and tweaking it for a static context and imposing requirements on subclasses, or I could keep things much simpler and just perform the object construction. The relative extra memory allocated would be minimal, as only a string and Match reference is stored in the instance. Everything else, such as the Match and the string itself would still be generated by the validation anyway, so there's no way around that. I could worry about the performance all day, but my experience has been that correctness is more important, because correctness often leads to numerous other optimizations. For example, I don't ever have to worry about improperly formatted or sized data flowing through my application, because only meaningful data types are used, which forces validation to the point-of-entry into the application from other tiers, be it database or UI. 99% of my validation code was removed as a no-longer-necessary artifact, and I find myself only checking for nulls nowadays. Incidentally, having reached this point, I now understand why including nulls was the billion dollar mistake. Seems to be the only thing I have to check for anymore, even though they are essentially non-existent in my system. Complex objects having these data types as fields cannot be constructed with nulls, but I have to enforce that in the property setters, which is irritating, because they otherwise would never need validation code... only code that runs in response to changes in values.
UPDATE:
I simulated the CLR function calls both ways and found that when all data is valid, the performance difference is only fractions of a millisecond per thousand calls, which is negligible. However, when roughly half the passwords are invalid, throwing exceptions in the "instantiation" version, it's three orders of magnitude slower, which equates to about 1 extra second per 1000 calls. The difference will of course multiply as multiple CLR calls are made for multiple columns in the table, but that's a factor of 3 to 5 for my project. So, is an extra 3 to 5 seconds per 1000 updates acceptable to me as a trade-off for keeping my code very simple and clean? Well, that depends on the update rate. If my application were getting 1000 updates per second, a 3 to 5 second delay would be devastating. If, on the other hand, I were getting 1000 updates a minute or an hour, it may be perfectly acceptable. In my situation, I can tell you now that it's quite acceptable, so I think I'll just go with the instantiation method and allow the errors through. Of course, in this test I handled the errors in the CLR instead of letting SQL Server handle them. Marshalling the error info to SQL Server, and then possibly back to the application, could definitely slow things down much more. I guess I will have to fully implement this to get a real test, but from this preliminary test, I'm pretty sure what the results will be.

Resources