Grouping by the shortest common suffix in data - sql-server

I have a table with a list of FQDNs, for example:
www.bbc.co.uk
bbc.co.uk
bbc.com
www.bbc.com
www.live.bbc.co.uk
www.live.bbc.com
I'd like to group these by the domain name; not the exact full domain name, but the shortest matching domain name that exists in the data. For instance, in the example above, I'd like to group
www.bbc.co.uk
bbc.co.uk
www.live.bbc.co.uk
together, as they have the common "suffix" of bbc.co.uk.
The fact that these are domain names is probably irrelevant, but might also play a part in the solution - can anyone suggest a way of GROUPing data together by the shortest common suffix?
EDIT: as requested, as an output I'd ideally like something like:
Domain Count
bbc.co.uk 3
bbc.com 3

If you do not know how many suffix levels to include in your grouping, it will be hard.
Maybe you can try to group by the last suffix (the part after the last dot).
If that gives a result, add the next suffix level and group again.
If that gives a result, add another one, and so on.

You can get the same number of dots if you first convert the domain name to an IP address using nslookup.
Alternatively, there exist entire databases with lists of known domain names.

I've managed to bodge my way around the problem: I've introduced a temporary "MasterDomainName" field to the database, and I've updated it with:
UPDATE r1
SET r1.MasterDomainName= r2.domainname
FROM #results r1
LEFT JOIN #results r2
ON r2.domainname = right(r1.domainname,len(r2.domainname))
It's not perfect, but it gets me close to where I need to be. Thanks for everyone's input.
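The grouping asked for in the question can also be sketched outside SQL. The Python below is a minimal illustration, not the poster's method: it assumes suffixes are compared only at dot boundaries (a refinement the RIGHT/LEN join does not make), and the function name shortest_known_suffix is made up for this example.

```python
from collections import Counter

def shortest_known_suffix(name, known):
    """Return the shortest dot-aligned suffix of `name` that is itself in `known`."""
    labels = name.split(".")
    # Try suffixes from the shortest (last label only) to the longest (full name).
    for start in range(len(labels) - 1, -1, -1):
        candidate = ".".join(labels[start:])
        if candidate in known:
            return candidate
    return name  # only reached when `name` itself is not in `known`

domains = [
    "www.bbc.co.uk", "bbc.co.uk", "bbc.com",
    "www.bbc.com", "www.live.bbc.co.uk", "www.live.bbc.com",
]
known = set(domains)

# Count how many names fall under each shortest suffix found in the data.
counts = Counter(shortest_known_suffix(d, known) for d in domains)
for domain, count in sorted(counts.items()):
    print(domain, count)
```

Because every name is a suffix of itself, each row always lands in some group; a name with no shorter match in the data simply forms its own group.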

Related

How to take apart information between hyphens in SQL Server

How would I take apart a column that contains string:
92873-987dsfkj80-2002-04-11
20392-208kj48384-2008-01-04
Data would look like this:
Filename Yes/No Key
Abidabo Yes 92873-987dsfkj80-2002-04-11
Bibiboo No 20392-208kj48384-2008-01-04
Want it to look like this:
Filename Yes/No Key
Abidabo Yes 92873-987dsfkj80-20020411
Bibiboo No 20392-208kj48384-20080104
whereby I would like to concat the dates in the end as 20020411 and 20080104. From the right side, the information is the same always. From the left it is not, otherwise I could have concatenated it. It is not an import issue.
As mentioned in the comments already, storing data like this is a bad idea. However, you can obtain the dates from those strings by using a RIGHT function like so:
SELECT RIGHT('20392-208kj48384-2008-01-04', 10)
Output:
2008-01-04
Depending on the SQL Server version you are using, you can use STRING_SPLIT, which requires COMPATIBILITY_LEVEL 130. You can also build your own user-defined function to split the contents of a field and manipulate it as you need; you can find some useful examples of SPLIT functions in this thread:
Split function equivalent in T-SQL?
Assuming I'm correct and the date part is always on the right side of the string, you can simply use RIGHT and CAST to get the date (assuming, again, that the date is represented as yyyy-mm-dd):
SELECT CAST(RIGHT(YourColumn, 10) As Date)
FROM YourTable
However, Panagiotis is correct in his comment - You shouldn't store data like that. Each column in the database should hold only a single point of data, be it string, number or date.
Update following your comment and the updated question:
SELECT LEFT(YourColumn, LEN(YourColumn) - 10) + REPLACE(RIGHT(YourColumn, 10), '-', '')
FROM YourTable
will return the desired results.
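To make the string logic of that last query explicit, here is a small Python sketch of the same idea: keep everything except the last 10 characters, then append those 10 characters with the hyphens removed. It assumes, as the answer does, that the date always occupies the final 10 characters in yyyy-mm-dd form; the function name is hypothetical.

```python
def compact_trailing_date(key):
    """Remove the hyphens from the last 10 characters (yyyy-mm-dd) of `key`."""
    head, tail = key[:-10], key[-10:]
    return head + tail.replace("-", "")

for key in ("92873-987dsfkj80-2002-04-11", "20392-208kj48384-2008-01-04"):
    print(compact_trailing_date(key))
```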

SQL Server validating postcodes

I have a table containing postcodes but there is no validation built in to the entry form so there is no consistency in the way they are stored in the database, sample below:
ID Postcode
001742 B5
001745
001746
001748 DY3
001750
001751
001768 B276LL
001774 B339HY
001776 B339QY
001780 WR51DD
I want to use these postcodes to map the distance from a central point, but before I can do that I need to put them into a valid format and filter out any blanks or incomplete postcodes.
I had considered using
left(postcode,3) + ' ' + right(postcode,3)
To correct the formatting but this wouldn't work for postcodes like 'M6 8HD'
My aim is to get the list of postcodes in a valid format, but I don't know how to account for different lengths of postcode. Is there a way to do this in SQL Server?
As discussed in the comments, sometimes looking at a problem the other way around presents a far simpler solution.
You have a list of arbitrary input provided by users, which frequently doesn't contain the correct spacing. You also have a list of valid postcodes which are correctly spaced.
You're trying to solve the problem of finding the correct place to insert spaces into your arbitrary inputs to make them match the list of valid codes, and this is extremely difficult to do in practice.
However, performing the opposite task - removing the spaces from the valid postcodes - is remarkably easy to do. So that is what I'd suggest doing.
In our most recent round of data modelling, we have modelled addresses with two postcode columns - PostCode containing the postcode as provided from whatever sources, and PostCodeNoSpace, a computed column which strips whitespace characters from PostCode. We use the latter column for e.g. searches based on user input. You may want to do something similar with your list of Valid postcodes, if you're keeping it around permanently - so that you can perform easy matches/lookups and then translate those matches back into a version that has spaces - which is actually a solution to the original question posed!
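The "strip the spaces from the valid list, then match" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the valid_postcodes list below is a hypothetical stand-in for a real reference list, and the helper names are invented for the example.

```python
def strip_spaces(postcode):
    """Remove all whitespace, mirroring the PostCodeNoSpace computed column."""
    return "".join(postcode.split())

# Hypothetical list of valid, correctly spaced postcodes.
valid_postcodes = ["B27 6LL", "B33 9HY", "M6 8HD", "WR5 1DD"]

# Map each space-free form back to its correctly spaced version.
lookup = {strip_spaces(p): p for p in valid_postcodes}

def normalise(raw):
    """Return the correctly spaced postcode, or None for blanks and partial codes."""
    return lookup.get(strip_spaces(raw))

for raw in ["B276LL", "WR51DD", "B5", ""]:
    print(repr(raw), "->", normalise(raw))
```

Blank and incomplete entries such as 'B5' fall out naturally, because they never match a full valid postcode.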

Find columns that match in two tables

I need to query two tables of companies: the first table holds the full company names, and the second holds names that are incomplete. The idea is to find the fields that are similar. I put pictures of the reference and SQL code I'm using.
The result I want is like this
The closest way I found to do so:
SELECT DISTINCT
RTRIM(a.NombreEmpresaBD_A) as NombreReal,
b.EmpresaDB_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE 'VoIP%' AND b.EmpresaDB_B LIKE 'VoIP%'
The problem with the above code is that it only returns the record specified in the WHERE and if I put this LIKE '%' it returns the Cartesian product of two tables. The RDBMS is Microsoft SQL Server. I would greatly appreciate if you help me with any proposed solution.
Use the short name plus appended '%' as argument in the LIKE expression:
Edit with info that we deal with SQL Server:
SELECT a.NombreEmpresaBD_A as NombreReal
,b.NombreEmpresaBD_B as NombreIncompleto
FROM EmpresaDB_A a, EmpresaDB_B b
WHERE a.NombreEmpresaBD_A LIKE (b.NombreEmpresaBD_B + '%');
According to your screenshot you had the column name wrong!
String concatenation in T-SQL with + operator.
Above query finds a case where
'Computex S.A' LIKE 'Computex%'
but not:
'Voip Service Mexico' LIKE 'VoipService%'
For that you would have to strip blanks first or use more powerful pattern matching functions.
I have created a demo for you on data.SE.
Look up pattern matching or the LIKE operator in the manual.
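The "strip blanks first" suggestion can be illustrated with a short Python sketch of the matching rule. It is not the query from the answer: it assumes that ignoring spaces and letter case is acceptable for these names, and the function name prefix_match is made up.

```python
def prefix_match(full_name, short_name):
    """True if `short_name` is a prefix of `full_name`, ignoring blanks and case."""
    def squash(text):
        return text.replace(" ", "").lower()
    return squash(full_name).startswith(squash(short_name))

full_names = ["Computex S.A", "Voip Service Mexico"]
short_names = ["Computex", "VoipService"]

# Pair every full name with every incomplete name it starts with.
pairs = [(f, s) for f in full_names for s in short_names if prefix_match(f, s)]
for full_name, short_name in pairs:
    print(full_name, "<-", short_name)
```

With the blanks removed, 'Voip Service Mexico' now matches 'VoipService' as well as the simpler 'Computex' case.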
I would suggest adding a foreign key between the tables linking the data. Then you can just search for the one table and join the second to get the other results.

MySQL Query Nightmare with RETs data

Those of you who have actually dealt with RETS may be able to give me a hand here. The problem occurs when multiple properties are tied into the RETS data even though the property is sold. Basically what I need is to be able to check the database with the SELECT statement against three fields. The fields in question would be C_StreetName, C_StreetNumber, and C_PostalCode.
To make this clear, what I want is some way to check for duplicates while gathering the dataset; this can't be done in PHP because of how the data is returned through the application. So if it finds another record with the same C_StreetName, C_StreetNumber, and C_PostalCode it will remove them from the dataset. Ideally it would be nice if it could also check the Status of the two to find out if one is Expired or Sold before removing them from the data.
I'm not familiar with complex SQL functions, I was looking at the IF statement until I found that can only be used while storing data not the other way around. And the CASE statement but it just doesn't seem like that would work.
If you guys have any suggestions on what I should use I'd appreciate it. Hopefully there is a way to do this and keep in mind this is only one table I am accessing I don't have any Joins.
Thanks in advance.
Here's something to get you going in the right direction. I haven't tested this, and am not sure you can nest a case expression inside max() in mysql.
What this accomplishes is to output one row per unique combination of street name, number and postcode, with a status of 'Expired' or 'Sold' taking precedence over other values. That is, if there's a row with 'Expired' it will be output in preference to non-expired and non-sold, and a row with 'Sold' will be output if it exists, regardless of what other rows exist for that property. The case statement just converts the status codes into something orderable.
select
    C_StreetName,
    C_StreetNumber,
    C_PostalCode,
    max(
        case status
            when 'Expired' then 1
            when 'Sold' then 2
            else 0
        end) as status
from your_table -- placeholder: the original post does not name the table
group by
    C_StreetName,
    C_StreetNumber,
    C_PostalCode;
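The precedence logic of that query (one row per address, with 'Sold' beating 'Expired' beating anything else) can be sketched in Python as follows. The sample rows and the 'Active' status are invented for illustration; only the three column names and the 'Expired'/'Sold' ranking come from the thread.

```python
# Rank statuses the same way the CASE expression does; anything else ranks 0.
RANK = {"Expired": 1, "Sold": 2}

def dedupe(rows):
    """Keep one row per (street name, number, postcode), preferring the highest rank."""
    best = {}
    for row in rows:
        key = (row["C_StreetName"], row["C_StreetNumber"], row["C_PostalCode"])
        rank = RANK.get(row["status"], 0)
        if key not in best or rank > best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]

# Hypothetical sample data.
rows = [
    {"C_StreetName": "Main St", "C_StreetNumber": "10", "C_PostalCode": "A1A", "status": "Active"},
    {"C_StreetName": "Main St", "C_StreetNumber": "10", "C_PostalCode": "A1A", "status": "Sold"},
    {"C_StreetName": "Oak Ave", "C_StreetNumber": "5", "C_PostalCode": "B2B", "status": "Expired"},
]
result = dedupe(rows)
for row in result:
    print(row["C_StreetName"], row["C_StreetNumber"], row["status"])
```

Unlike the max() in the query, which returns the numeric rank, this sketch keeps the whole winning row, which may be closer to what the question wants.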

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an error in the application where I select from a table where the ID is generated in sequence, but the ID is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above are 3 IDs. I had code where these would be the results of a
select @p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember Sybase saves the last returned record into a variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above in an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or something technical magic stuff, the order could be messed up. The original error was when the procedure would return say 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easily as plonking in an order by id desc would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOP n or similar. The problem comes when someone tries to understand your code, because some databases, when they see multiple hits, take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
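Why the result depends on scan order can be shown with a toy Python model. It assumes the behaviour described in the question (the variable ends up holding the last row returned) and says nothing about how Sybase actually chooses a scan order; the function name is hypothetical.

```python
def select_into_variable(rows):
    """Model a multi-row variable assignment: each row overwrites the previous value."""
    p_id = None
    for row_id, _status in rows:
        p_id = row_id
    return p_id

ascending = [(1234, "C"), (1235, "C"), (1236, "O")]
other_order = [(1236, "O"), (1234, "C"), (1235, "C")]

print(select_into_variable(ascending))    # the open record happens to come last
print(select_into_variable(other_order))  # a closed record comes last instead
```

The same three rows give a different answer as soon as they arrive in a different order, which is the failure the test team wants to see.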
Is this by any chance related to 1156837? :)
