My table in SQL Server has entries like the ones shown below.
2934046 Kellogg’s Share Your Breakfast 74672 2407522 Kellogg?s Share Your Breakfast ACTIVE 2015-09-01 9999-12-31
2934046 Kellogg?s Share Your Breakfast 74672 2407522 Kellogg?s Share Your Breakfast ACTIVE 2015-09-01 9999-12-31
Another example could be
2939508 UOL Ação Social 81534 1527484 UOL Ac?o Social ACTIVE 2015-09-01 9999-12-31
2939508 UOL Ac?o Social 81534 1527484 UOL Ac?o Social ACTIVE 2015-09-01 9999-12-31
As can be seen, both entries are the same except for the question mark characters in the second one. Even if I do something like
SELECT DISTINCT * from my_table
it does not help. I have to figure out a way to remove these kinds of duplicate entries based on the special characters. My manager says that the entries with question marks are basically bad data and that I should remove them. Does anyone have an idea how to do so?
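If the rule really is just "drop the ?-variants whenever a clean row exists", a direct T-SQL pass could look like this (a sketch only; the real table has more columns, and the id and brand_name names are assumptions):

-- Sketch: delete a row containing the '?' replacement character only when
-- a clean row with the same key exists, so no record is lost outright.
-- Table and column names (my_table, id, brand_name) are assumptions.
DELETE bad
FROM my_table AS bad
WHERE bad.brand_name LIKE '%?%'
  AND EXISTS (SELECT 1
              FROM my_table AS clean
              WHERE clean.id = bad.id
                AND clean.brand_name NOT LIKE '%?%');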
You can implement the Damerau-Levenshtein algorithm, which evaluates how similar two strings are, in a CLR project and use it from T-SQL.
You can experiment with your data to find the proper threshold value in order to accept two strings as duplicates.
A C# example implementation of the algorithm can be found here:
Damerau - Levenshtein Distance, adding a threshold
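Once such a function is compiled and registered (the name dbo.DamerauLevenshtein below is an assumption, as are the table and column names), spotting near-duplicate pairs from T-SQL could look like:

-- Sketch: flag pairs of rows with the same key whose names differ by at
-- most 2 edits; the threshold is something to tune against your data.
SELECT a.id, a.brand_name AS name_a, b.brand_name AS name_b
FROM my_table AS a
JOIN my_table AS b
  ON a.id = b.id
 AND a.brand_name < b.brand_name  -- avoid self-pairs and mirrored pairs
WHERE dbo.DamerauLevenshtein(a.brand_name, b.brand_name) <= 2;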
I faced the same problem this year during a data load activity, and my manager gave me the hint to use the SSIS Fuzzy Grouping transformation to find near-identical records. Create a small SSIS package and add a Data Flow Task. Inside the Data Flow Task, add a source, a Fuzzy Grouping transformation, and a destination.
Visit:
Adding Fuzzy Group Transform to Identify Duplicates
https://msdn.microsoft.com/en-us/library/jj819770(v=sql.120).aspx
I am having a problem understanding how to accomplish this. I want to use one set of slicers in my Excel spreadsheet to drill down to specific information. The problem is that I have duplicated Model Names under the "Intel" worksheet, because a Model Name can have one or two controllers. I have created all the queries, Power Pivots, and relationships. The link to the file is available here (this is all public data) if someone is willing to take a look and provide guidance.
PROBLEM:
Due to the Model Name duplication under the "Intel" worksheet, I have created a "DUP" column to mark duplicates in my data with an "X". I thought that if I made a column "RELATED - Divide by 2" in the "Intel" Power Pivot with the formula =IF([DUP]="X", [RELATED - 12 Month Volume]/2, [RELATED - 12 Month Volume]), I would be able to show the correct 12 Month Volume based on the Volume worksheet. This is partially true. I came to understand that I need to use both "RELATED - 12 Month Volume" and "RELATED - Divide by 2", depending on which slicer I am filtering with:
If filtered by Form Factor or Vendor, I can use "RELATED - Divide by 2" (orange in the screenshot).
Now, if I filter the above with Controller (like X710-TM4), this is not good. For the Controller filter, I would need to use "RELATED - 12 Month Volume" (blue in the screenshot), which is not suitable for the cases above.
How do I accomplish this with one set of slicers: being able to drill down and show the correct value based on the slicer used?
Never mind... I figured it out with a CROSSFILTER measure.
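For anyone landing here later, such a measure generally follows this pattern (a sketch with hypothetical table and column names, since the actual model isn't shown):

// Hypothetical names: Intel and Volume tables related on [Model Name].
// CROSSFILTER makes the relationship filter in both directions for this
// measure only, so slicers on Intel columns reach the Volume rows.
Adjusted 12M Volume :=
CALCULATE (
    SUM ( Volume[12 Month Volume] ),
    CROSSFILTER ( Intel[Model Name], Volume[Model Name], BOTH )
)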
I have a dataset with 3 columns:
ID transac (The unique ID of the transaction - Dimension)
Source (The source of the transaction - Dimension)
Amount € (The amount of the transaction - Stat)
screenshot of my dataset
To count the number of transactions (for one or more sources), I use the COUNT_DISTINCT function.
I want to sum the transaction amounts (for one or more sources), but I don't want to add up the amounts of transactions that share the same ID!
Is there a way to do this calculation with a Data Studio function?
Thanks for your answers. :-)
EDIT: I saw here that we could do this type of calculation via SQL, and I would like to do it in Data Studio (so that I don't have to pre-calculate the amounts per source).
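For reference, the SQL version boils down to something like this (a sketch; the table and column names are placeholders):

-- Keep one row per transaction ID so each amount is summed only once
-- (placeholder table/column names).
SELECT SUM(amount) AS total_amount
FROM (SELECT DISTINCT id_transac, amount
      FROM transactions) AS t;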
IMO, your dataset contains wrong data. Each value should be relative only to its own line, but this is not the case: if the total is 20, each line should describe that line's contribution to the total. With 4 sources, each line should be 5, or some other values that sum to 20.
To solve it in Data Studio, you would need something like the CALCULATE function in Power BI, but Data Studio does not currently support this feature.
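For comparison, a Power BI measure for this de-duplicated sum could be written along these lines (a sketch with placeholder names):

// Placeholder names. VALUES yields one row per transaction ID, and
// CALCULATE ( MAX ( ... ) ) picks up that ID's amount exactly once.
Total Amount Dedup :=
SUMX (
    VALUES ( Transactions[ID transac] ),
    CALCULATE ( MAX ( Transactions[Amount] ) )
)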
But there are some options to consider to repair your data:
If you're sure there are always 4 sources, just create a new calculated field with the expression Amount/4 and SUM it. It is not an elegant solution, but it works.
If your data source is Google Sheets, you can easily repair the data using formulas, like in this example:
Link to spreadsheet
For this spreadsheet, I used this formula in the adjusted_amount column: =C2/COUNTIF(A:A,A2). With this column in Data Studio, just use the usual SUM aggregation function to summarize it correctly.
We have been using Power BI for a few months now, and we (the user group) have encountered an issue that is not really clear to us.
We use Power BI with a remote SQL Server data source, which we access through DirectQuery.
Let's say we have 2 tables as below:
Table name: Issue
Column:
ResolutionTime(Date/Time)
IssueID(Unique Numbers)
Table Name: WorkItem
Column:
start (Date/Time)
end (Date/Time)
IssueID (Unique Numbers, Foreign Key to "Issue" table)
Table WorkItem also contains a calculated column "WorkTime", which uses this DAX expression:
WorkTime = WorkItem[end] - WorkItem[start]
The two tables are configured in Power BI with a two-way 1:n relationship that can be queried to collect all WorkItem(s) assigned to an Issue entry, using IssueID as the correlation column.
To compute the aggregated work time per Issue, we use a new calculated table with the following DAX expression, which aggregates the total time invested across the WorkItem rows of each Issue:
SumWork =
SUMMARIZE(
WorkItem, WorkItem[IssueID], "All work per item", SUM(WorkItem[WorkTime])
)
The above table computes the total invested work-time for a particular issue, grouping/summarizing results based on the "IssueID" foreign key. This new calculated table is also configured to have a relationship with the "Issue" table, this time a "1:1" relationship, using the IssueID as correlation column.
Now the time that the issue was worked on plus the resolution time should be combined in a calculated column inside "Issue", but this does not work:
ResolutionAndWorkTime = Issue[ResolutionTime] + SumWork["All work per item"]
But the above DAX expression fails, as it always reports that it returns "more than one result", i.e. not a singular value. That is surprising, as the two tables ("Issue" and "SumWork") are related to each other with a "1:1" relationship.
Tables:
Issues
IssueID | ResolutionTime | ResolutionAndWorkTime
1       | 03:20:20       | ???
2       | 01:20:20       | ???
3       | 00:20:20       | ???
WorkItem
IssueID | start            | end              | WorkTime
1       | 1-2-2020 3:20:20 | 1-2-2020 3:25:20 | 00:05:00
1       | 2-2-2020 6:20:20 | 2-2-2020 7:20:20 | 01:00:00
3       | 1-3-2020 3:20:20 | 1-3-2020 3:29:20 | 00:09:00
Any ideas what to look for? Data types? Table definitions? Table relationships? We checked other Stack Overflow questions/answers but have found no good ideas so far.
NOTE that a lot of the join/merge features of Power BI are not available when DirectQuery is used, so joining the tables is not really an option (we think).
You need the following code for your new calculated column.
Visit HERE to learn more about RELATED.
ResolutionAndWorkTime = Issues[ResolutionTime] + RELATED(SumWork[All work per item])
Based on the input provided by "mkRabbani" (see the other answer), we investigated why "RELATED" does not function as expected. The problem originates in the access to the database. As suspected earlier, the function delivers the expected results once the database access is switched to "import" instead of "direct query".
As a workaround, we now join the data inside the SQL Server by using traditional database views, as sketched below. Of course this only works for scenarios where the database is under the control of the data analytics team.
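For illustration, the view is along these lines (a sketch; the names and the seconds-based arithmetic are assumptions, not our exact schema):

-- Aggregate the work time per issue inside the database so Power BI
-- (DirectQuery) reads a single flat row per issue. Names/types assumed.
CREATE VIEW dbo.IssueResolutionAndWork AS
SELECT i.IssueID,
       DATEDIFF(SECOND, '00:00:00', i.ResolutionTime)
         + ISNULL(SUM(DATEDIFF(SECOND, w.[start], w.[end])), 0)
         AS ResolutionAndWorkTimeSeconds
FROM dbo.Issue AS i
LEFT JOIN dbo.WorkItem AS w
       ON w.IssueID = i.IssueID
GROUP BY i.IssueID, i.ResolutionTime;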
I have a database containing names of certain blacklisted companies and individuals.
The details of all created transactions need to be scanned against these blacklisted names. The created transactions may have incorrectly spelled names; for example, "Wilson" might be written as "Vilson" or "Veelson". The fuzzy search logic or utility should match these variants against the name "Wilson" in the blacklist database and, based on the correctness/accuracy percentage set by the user, show the matching names within that percentage.
The transactions will be sent in batches or in real time to be checked against the blacklisted names.
I would appreciate it if users who have had a similar requirement and implemented it could share their views and implementations.
T-SQL leaves a lot to be desired in the realm of fuzzy search. Your best options are third-party libraries, but if you don't want to mess with that, your best bet is the DIFFERENCE function built into SQL Server. For example:
SELECT * FROM tblUsers U WHERE DIFFERENCE(U.Name, @nameEntered) >= 3
A higher return value from DIFFERENCE indicates a closer match. A drawback is that the underlying algorithm favors words that sound alike, which may not be the characteristic you want.
This next example shows how to get the best match out of a table:
DECLARE @users TABLE (Name VARCHAR(255))
INSERT INTO @users VALUES ('Dylan'), ('Bob'), ('Tester'), ('Dude')
SELECT Name, MAX(DIFFERENCE(Name, 'Dillon')) AS Score FROM @users GROUP BY Name ORDER BY Score DESC
It returns:
Name   | Score
Dylan  | 4
Dude   | 3
Bob    | 2
Tester | 0
I have an application where the database back-end has around 15 lookup tables. For instance there is a table for Counties like this:
CountyID(PK) County
49001 Beaver
49005 Cache
49007 Carbon
49009 Daggett
49011 Davis
49015 Emery
49029 Morgan
49031 Piute
49033 Rich
49035 Salt Lake
49037 San Juan
49041 Sevier
49043 Summit
49045 Tooele
49049 Utah
49051 Wasatch
49057 Weber
The UI for this app has a number of combo boxes in various places for these lookup tables, and my client has asked that, in this case, the boxes list the entries in this order:
CountyID(PK) County
49035 Salt Lake
49049 Utah
49011 Davis
49057 Weber
49045 Tooele
(the rest alphabetically)
The best plan I have for accomplishing this is to add a SortOrder (numeric) column to each lookup table. I had a colleague tell me he thought that would cause the tables to violate third normal form, but I think the sort order still depends on the key and only the key (even though the rest of the list is alphabetical).
Is adding the SortOrder column the best way to do this, or is there a better way I am just not seeing?
I agree with @cletus that a sort order column is a good way to go, and it does not violate 3NF (because, as you said, the sort order column entries are functionally dependent on the candidate keys of the table).
I'm not sure I agree that alphanumeric is better than numeric. In the specific case of counties, new ones are seldom created. And there is no requirement that the assigned numbers be sequential; you can allocate them in multiples of a hundred, for example, leaving ample room for insertions.
Yes, I agree a sort order column is the best solution when the requirements call for a custom sort order like the one you cite. I wouldn't go with a numeric column, however. If the data is alphanumeric, the sort order should be alphanumeric. That way you can seed the value with whatever is in the county field.
If you use a numeric field, you'll potentially have to resequence the entire table whenever you add a new entry. So:
Columns: ID, County, SortOrder
Seed:
UPDATE County SET SortOrder = CONCAT('M-', County)
and for the special cases:
UPDATE County
SET SortOrder = CONCAT('E-', County)
WHERE County IN ('Salt Lake', 'Utah', 'Davis', 'Weber', 'Tooele')
Arguably you may want to put another marker column in to indicate those entries are special.
I went with numeric and large multiples.
Even with the CONCAT('E-', ...) example, I don't get the required sort order. That would give me Davis, Salt Lake, Tooele..., and Salt Lake needs to be first.
I ended up using multiples of 10 and assigned the non-special-sort entries a value like 10000. That way the view for each lookup can have
ORDER BY SortOrder ASC, OtherField ASC
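Concretely, the scheme looks like this (column names from the example above; the seed values are only illustrative):

-- Give everything a large default, then small multiples of 10 for the
-- special entries, so new ones can be slotted in without resequencing.
UPDATE County SET SortOrder = 10000;
UPDATE County SET SortOrder = 10 WHERE County = 'Salt Lake';
UPDATE County SET SortOrder = 20 WHERE County = 'Utah';
UPDATE County SET SortOrder = 30 WHERE County = 'Davis';
UPDATE County SET SortOrder = 40 WHERE County = 'Weber';
UPDATE County SET SortOrder = 50 WHERE County = 'Tooele';

-- The lookup view then sorts the specials first, the rest alphabetically.
SELECT CountyID, County
FROM County
ORDER BY SortOrder ASC, County ASC;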
Another programmer suggested using DECODE in Oracle or CASE statements in SQL Server, but this is a more general solution. YMMV.