Database design for storing LCS information? - sql-server

I have a table that contains 2 columns, one is a id, and other is column containing long strings
eg.
Id strings
1 AGTTAGGACCTTACTCTATATCTGTTCTGTTGGTATGGAG
2 GTACTTGTATTCTGATATCTAGGGTTTTCTAATTACTTCTG
3 GTATTCTCTTTCTAGCTGATCGTAATTAAATCTTATCTAA
when the user is performing a search, I would find the longest common subsequence in the search string and all the data in the table.
Eg. the search sequence is
TCTGTTCTG
1. Its a 100% match, with the whole match found.
2. The LCS is TCTGTTCTG, but with some gaps.
3. The LCS is TCTGTTCT, with some gaps in BTW.
Is there a way to store the information about the match that where exactly it started finding the match and then upto where it found the match, and then where it started again and so on?
So, that I can represent data in somewhat this format
First one =>
AGTTAGGACCTTACTCTATATCTGTTCTGTTGGTATGGAG
|||||||||
TCTGTTCTG
Second one =>
GTACTTGTATTCTGATATCTAGGGTTTTCTAATTACTTCTG
| || | |||||
T CT G TTCTG
Basically somehow I could store this, start and end position for each sequence for each subsequence found, so that when I show this page again in future, I don't have to compute
this match again and can somehow pick out this data about start and end from database and just show this in the format shown? I know the question might be a little hazy, but please
let me know how else I can elaborate if you have any doubts?

The first case is easy enough using PATINDEX.
Case 1:
select Id, PATINDEX('%TCTGTTCTG%', strings) FROM table
That should return the Id of all 'Full' matches and the starting position of the match.
Case 2:
select id, PATINDEX('%T%C%T%G%T%T%C%T%G%', strings) FROM table
This one appears to return a value for partial match, does not select 'Best' partial match)
Will get back to it when I can, lots of edge cases from what I see.
(Edge cases: What if there are multiple full matches, do you need to return a match with the least amount of gap in it or just a match with gaps? Same goes for partial matches)
That should give you a start, while I think about the rest of it.

Related

How to split a search string into parts, then check parts against a database

Here's what I'm dealing with:
We have a database of machines and their part lists are specified using strings. For example, one machine might be specified with the string &XXX&YYY-ZZZ, meaning the machine contains parts XXX and YYY and not ZZZ.
We use &XXX to specify that a part exists in a machine, and -XXX to specify that a part does not exist in a machine.
It's also possible that a part is not listed (i.e. not specified whether or not it exists in the machine). For example I might only have &XXX&YYY (ZZZ is not specified).
Additionally, the codes can be in any order, for example I might have &XXX&YYY-ZZZ or &XXX-ZZZ&YYY.
In order to search for machines, I get a string like this: &XXX-YYY/&YYY&ZZZ (/ is an OR operator), meaning "I want to find all machines that either a) contain XXX and do not contain YYY, or b) contain both YYY and ZZZ.
I'm having trouble parsing the string based on the variable ordering, possibility that parts may not be shown, and handling of the / operator. Note, we use Microsoft 365.
Looking for some suggestions!
When I search for &XXX-YYY/&YYY&ZZZ, I should return the following machines:
Machine
Result
&XXX-YYY&ZZZ
TRUE (because XXX exists and YYY does not exist)
&XXX-YYY-ZZZ
TRUE (because XXX exists and YYY does not exist)
&XXX&YYY&ZZZ
TRUE (because YYY exists and ZZZ exists)
&XXX&ZZZ
FALSE (because YYY is specified in the search, but this machine doesn't specify it)
&ZZZ&YYY
TRUE (showing that parts can be in any order)
You can try it in cell C2 with the following formula:
=LET(query, A2, queries, TEXTSPLIT(query,, "/"), input, B2:B7,
qryNum, ROWS(queries),
SPLIT, LAMBDA(txt,LET(str, SUBSTITUTE(SUBSTITUTE(txt, "&",";1_"),
"-",";0_"), TEXTSPLIT(str,,";",TRUE))),
lkUps, DROP(REDUCE("", queries, LAMBDA(acc,qry, HSTACK(acc, SPLIT(qry)))),,1),
MAP(input, LAMBDA(txt, LET(str, SPLIT(txt),
out, REDUCE("", SEQUENCE(qryNum, 1), LAMBDA(acc,idx,
LET(cols, INDEX(lkUps,,idx), qry, FILTER(cols, cols<>""),
matches, SUM(N(ISNUMBER(XMATCH(str, qry)))),
result, IF(ROWS(qry)=matches,1,0),IF(acc="", result, MAX(acc, result))
))), IF(out=1, TRUE, FALSE)
)))
)
and here is the corresponding output:
Assumptions:
String values (operation and part) should be unique, i.e. the case &XXX-YYY&XXX is not considered, because &XXX is duplicated.
Explanation
The main idea is to transform the input information in a way we can do comparisons at array level via XMATCH. The first thing to do is to identify each OR condition in the search string because we need to test each one of them against the Input column. The name queries is an array with all the OR conditions.
We can transform the string inputs in a way we can split the string into an array. SPLIT is a user LAMBDA function that does that:
SUBSTITUTE(SUBSTITUTE(txt, "&",";1_"),"-",";0_"), TEXTSPLIT(str,,";",TRUE)))
What it does is convert for example the input: &XXX-YYY&ZZZ into the following array:
1_XXX
0_YYY
1_ZZZ
We change the original operations &,- into 1,0 just for convenience, but you can keep the original operation value, it is irrelevant for the calculation. It is important to set the fourth TEXTSPLIT input argument to TRUE to ensure no empty rows are generated.
The name lkUps is an array with all the OR conditions organized by column for query. In the format we want, for example:
1_XXX 1_YYY
0_YYY 1_ZZZ
Note: For creating lkUps we use the pattern: DROP/REDUCE/HSTACK, for more information about it, check the answer to the question: how to transform a table in Excel from vertical to horizontal but with different length provided by #DavidLeal.
Now we have all the elements we need to build the recurrence. We use MAP to iterate over all Input column values. For each element (txt) we transform it to the format of our convenience via SPLIT user LAMBDA function and name it str.
We use REDUCE function inside MAP to iterate over all columns of lkUps to check against str. We use SEQUENCE(qryNum, 1) as input of REDUCE to be able to iterate over each lkUps column (qry).
Now we are going to use the above variables in XMATCH and name the variable matches as follows:
SUM(N(ISNUMBER(XMATCH(str, qry))))
If all values from qry were found in str then we have a match. If that is the case the item of the SUM will be 1, otherwise 0. Therefore the SUM for the match case should be of the same size as qry.
Because we include in the XMATCH both parts and operations (1,0), we ensure that not just the same parts are found, but also their corresponding operations are the same. The order of the parts is not relevant, XMATCH ensures it.
The REDUCE recurrence keeps the maximum value from the previous iteration (previous OR condition). We just need at least one match among all OR conditions. Therefore once we finish all the recurrence, if the result value of REDUCE is 1 at least one match was found. Finally, we transform the result into a TRUE/FALSE.
Note: For a large list of operations instead of using the above approach of two SUBSTITUTE calls. The SPLIT function can be defined as follow:
LAMBDA(txt,tks, LET(seq, SEQUENCE(COLUMNS(tks),1),
out, REDUCE("", seq, LAMBDA(acc,idx, LET(str, IF(acc="", txt, acc),
SUBSTITUTE(str, INDEX(tks,1,idx), INDEX(tks,2,idx))))),
TEXTSPLIT(out,,";",TRUE)))
and the input tks (tokens) can be defined as follow: {"&","-";"1_", "0_"}, i.e. in the first row old values and in the second row the new values.

SQL Server validating postcodes

I have a table containing postcodes but there is no validation built in to the entry form so there is no consistency in the way they are stored in the database, sample below:
ID Postcode
001742 B5
001745
001746
001748 DY3
001750
001751
001768 B276LL
001774 B339HY
001776 B339QY
001780 WR51DD
I want to use these postcode to map the distance from a central point but before I can do that I need to put them into a valid format and filter out any blanks or incomplete postcodes.
I had considered using
left(postcode,3) + ' ' + right(postcode,3)
To correct the formatting but this wouldn't work for postcodes like 'M6 8HD'
My aim is to get the list of postcodes in a valid format but I don't know how to account for different lengths of postcode. Is this there a way to do this in SQL Server?
As discussed in the comments, sometimes looking at a problem the other way around presents a far simpler solution.
You have a list of arbitrary input provided by users, which frequently doesn't contain the correct spacing. You also have a list of valid postcodes which are correctly spaced.
You're trying to solve the problem of finding the correct place to insert spaces into your arbitrary inputs to make them match the list of valid codes, and this is extremely difficult to do in practice.
However, performing the opposite task - removing the spaces from the valid postcodes - is remarkably easy to do. So that is what I'd suggest doing.
In our most recent round of data modelling, we have modelled addresses with two postcode columns - PostCode containing the postcode as provided from whatever sources, and PostCodeNoSpace, a computed column which strips whitespace characters from PostCode. We use the latter column for e.g. searches based on user input. You may want to do something similar with your list of Valid postcodes, if you're keeping it around permanently - so that you can perform easy matches/lookups and then translate those matches back into a version that has spaces - which is actually a solution to the original question posed!

SQL Server - Percent based Full Text Search

I want to conduct search on a particular column of a table in such a way that returning result set should satify following 2 conditions:
Returning result set should have records whose 90% of the characters matches with the given search text.
Returning result set should have records whose 70% of the consecutive characters matches with the given search text.
It implies that when 10 character word Sukhminder is searched, then:
it should return records like Sukhmindes, ukhminder, Sukhmindzr, because it fulfils both of the above mentioned conditions.
But it should not return records like Sukhmixder because it does not fulfil the second condition. Likewise, It should not return record Sukhminzzz because it does not fulfil the first condition.
I am trying to use Full Text Search feature of SQL Server. But, could not formulate the required query yet. Kindly reply ASAP.
You could try using a combination of the SOUNDEX command and DIFFERENCE command with full text searching.
Check out this Google book online which talks about it
Do you mean 70% of the original word? I think the only way you could do this exactly as stated would be to work out all possible string permutations that could match the 70% criteria and bring back records matching any of those
Col LIKE '%min%' AND (
Col LIKE '%Sukhmin%' OR Col LIKE '%ukhmind%'
OR Col LIKE '%khminde%' OR Col LIKE '%hminder%' )
then do further processing to see if the 90% criteria is met.
Edit: Actually you might find this link on Fuzzy Searching to be of interest http://anastasiosyal.com/archive/2009/01/11/18.aspx

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an error in the application where I select from a table where the ID is generated in sequence, but the ID is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above is 3 IDs. I had code where these would be the results of a
select #p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember Sybase saves the last returned record into a variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above in an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or something technical magic stuff, the order could be messed up. The original error was when the procedure would return say 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easy as plonking a order by id desc would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOPn or similar. The problem comes when someone tries to understand your code, because some databases when they see multiple hits take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
Is this by any chance related to 1156837? :)

Finding alphabetical position in a large list

I have an as400 table containing roughly 1 million rows of full names / company names which I would like to convert to use another datastore while still matching the speed of the original.
Currently, a user enters the search and almost instantaneously gets the alphabetical position of the search term in the table and and a page of matches. The user can then paginate either up or down through the records very quickly.
There is almost no updating of the data and approximately 50 inserts per week. I'm thinking that any database can maintain an alphabetical index of the names, but I'm unsure of how to quickly find the position of the search within the dataset. Any suggestions are greatly appreciated.
This sounds just like a regular pagination of results, except that instead of going to a specific page based on a page number or offset being requested, it goes to a specific page based on where the user's search fits in the results alphabetically.
Let's say you want to fetch 10 rows after this position, and 10 rows before.
If the user searches for 'Smith', you could do two selects such that:
SELECT
name
FROM
companies
WHERE
name < 'Smith'
ORDER BY
name DESC
LIMIT 10
and then
SELECT
name
FROM
companies
WHERE
name >= 'Smith'
ORDER BY
name
LIMIT 10
You could do a UNION to fetch that in one query, the above is just simplified.
The term the user searched for would fit half way through these results. If there are any exact matches, then the first exact match will be positioned such that it is eleventh.
Note that if the user searches for 'aaaaaaaa' then they'll probably just get the 10 first results with nothing before it, and for 'zzzzzzzz' they may get just the 10 last results.
I'm assuming that the SQL engine in question allows >= and < comparisons between strings (and can optimise that in indexes), but I haven't tested this, maybe you can't do this. If, like MySQL, it supports internationalized collations then you could even have the ordering done correctly for non-ascii characters.
If by "the position of the search" you mean the number of the record if they were enumerated alphabetically, you may want to try something like:
select count(*) from companies where name < 'Smith'
Most databases ought to optimize that reasonably well (but try it--theories you read on the web don't trump empirical data).
Just to add to the ordering suggestions:
Add an index to the name if this is your standard means of data retrieval.
You can paginate efficiently by combining LIMIT and OFFSET.

Resources