I'm sure this has been asked before, but I don't know how to search for it.
First off, I do not want to implement full-text searching. The database contains multiple languages including Chinese and Japanese which pose a huge problem for Full-text indexes.
I have a table like the following:
Table Comment:
UserID int
CommentText nvarchar(400)
I want to do a search against this table and find anything matching multiple words. Normally I would just do something like
select *
from Comment
where CommentText like '%potato%' and CommentText like '%badger%'
But if the two words are in different rows I need to do something like
select
UserID, count(UserID )
from
Comment
where
CommentText like '%potato%' or CommentText like '%badger%'
group by
UserID
having
count(UserID ) > 1
But then if the words are sometimes in the same row and sometimes spread across multiple rows, how do I determine if both words matched?
Cases:
Both words are in a single row.
One word is in row 1 and the other word is in row 2 for the same UserID
One word is in multiple rows for the same UserID (so it returns multiple matches even if it's the same word several times)
My question is: for multiple words, how do I conduct a wildcard search and make sure all the words match at least once for a given UserID?
Thanks in advance
I'm thinking of a CTE to grab all rows that contain matches and concat them for a given userid, but I don't know if I can find something more efficient.
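Roughly what I have in mind, as a sketch (STRING_AGG needs SQL Server 2017 or later; on older versions it would have to be the FOR XML PATH trick):
;WITH Matches AS
(
    -- keep only the rows that contain at least one of the words
    SELECT UserID, CommentText
    FROM Comment
    WHERE CommentText LIKE '%potato%'
       OR CommentText LIKE '%badger%'
), Concatenated AS
(
    -- glue each user's matching comments into one string
    SELECT UserID,
           STRING_AGG(CAST(CommentText AS nvarchar(max)), ' ') AS AllText
    FROM Matches
    GROUP BY UserID
)
SELECT UserID
FROM Concatenated
WHERE AllText LIKE '%potato%'
  AND AllText LIKE '%badger%';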
A simple way to approach this would use conditional aggregation:
SELECT UserID
FROM Comment
GROUP BY UserID
HAVING
SUM(CASE WHEN CommentText LIKE '%java%' THEN 1 ELSE 0 END) > 0 AND
SUM(CASE WHEN CommentText LIKE '%python%' THEN 1 ELSE 0 END) > 0;
Each sum in the HAVING clause keeps track of one of the words you want to match. A user only turns up in the result set when at least one of their comments matches each word.
Note that if you plan to continue down this road, you should look into SQL Server's full text capabilities.
https://learn.microsoft.com/en-us/sql/relational-databases/search/full-text-search
Just posting the question caused me to think of the answer.
select distinct UserID from (
select UserID FROM Comment where CommentText like '%java%'
INTERSECT
select UserID FROM Comment where CommentText like '%python%'
) as a
Related
Actually I am building a Skype-like tool wherein I have to show the last 10 distinct users who have logged in to my web application.
I maintain a table in SQL Server with a field called last_active_time. My requirement is to sort the table by last_active_time and show all the columns for the last 10 distinct users.
There is another field called WWID which uniquely identifies a user.
I am able to find the distinct WWID but not able to select all the columns of those rows.
I am using the query below to find the distinct wwid:
select distinct(wwid) from(select top 100 * from dbo.rvpvisitors where last_active_time!='' order by last_active_time DESC) as newView;
But how do I find those distinct rows? I want to show how long they have been away from the web app, using the difference between the current time and last_active_time.
I am new to SQL; maybe the question is naive, but I am struggling to get it right.
If you are using proper data types for your columns you won't need a subquery to get that result; the following query should do the trick:
SELECT TOP 10
[wwid]
,MAX([last_active_time]) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE
[last_active_time] != ''
GROUP BY
[wwid]
ORDER BY
[last_active_time] DESC
If the column [last_active_time] is of type varchar/nvarchar (which is probably the case since you check for empty strings in the WHERE clause) you might need to use CAST or CONVERT to treat it as an actual date and be able to use functions like MIN/MAX on it.
In general I would suggest using proper data types for your columns: if you store dates or timestamps, use the "date" or "datetime2" data types.
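For example, if [last_active_time] is stored as varchar in a format like '2021-01-30 13:45:00' (an assumption on my part; TRY_CONVERT also needs SQL Server 2012 or later), the aggregation and ordering could be done on the converted value:
SELECT TOP 10
     [wwid]
    ,MAX(TRY_CONVERT(datetime2, [last_active_time], 120)) AS [last_active_time]
FROM [dbo].[rvpvisitors]
WHERE
    [last_active_time] != ''
GROUP BY
    [wwid]
ORDER BY
    MAX(TRY_CONVERT(datetime2, [last_active_time], 120)) DESC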
Edit:
The query aggregates the data based on the column [wwid], and for each returns the maximum [last_active_time].
The result is then sorted and filtered.
In order to add more columns "as-is" (without aggregating them) just add them in the SELECT and GROUP BY sections.
If you need more aggregated columns add them in the SELECT with the appropriate aggregation function (MIN/MAX/SUM/etc)
I suggest you have a look at GROUP BY on W3
To know more about the "execution order" of the instruction you can have a look here
You can solve problems like this by rank-ordering the rows within each key and then taking the latest x of them; this removes duplicates while preserving the key order.
;WITH RankOrdered AS
(
SELECT
*,
wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC )
FROM
dbo.rvpvisitors
where
last_active_time!=''
)
SELECT TOP(10) * FROM RankOrdered WHERE wwidRank = 1 ORDER BY last_active_time DESC
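If you also want to show how long each user has been away, as asked in the question, you could extend the final SELECT with DATEDIFF. A sketch, assuming last_active_time is (or has been converted to) a datetime:
;WITH RankOrdered AS
(
    SELECT
        *,
        wwidRank = ROW_NUMBER() OVER (PARTITION BY wwid ORDER BY last_active_time DESC)
    FROM
        dbo.rvpvisitors
    WHERE
        last_active_time != ''
)
SELECT TOP(10)
    *,
    -- minutes since the user was last active
    DATEDIFF(MINUTE, last_active_time, GETDATE()) AS minutes_away
FROM RankOrdered
WHERE wwidRank = 1
ORDER BY last_active_time DESC;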
If my understanding is right, below query will give the desired output.
You can have conditions according to your need.
select top 10 wwid from dbo.rvpvisitors group by wwid order by max(last_active_time) desc
I am looking for a way to highlight duplicates in a NetSuite saved search. The duplicates are in a column called "ACCOUNT" populated with text values.
NetSuite permits adding fields (columns) to the search using a stripped down version of SQL Server. It also permits conditional highlighting of entire rows using the same code. However I don't see an obvious way to compare values between rows of data.
Although duplicates can be grouped together in a summary report and identified by a count of 2 or more, I want to show duplicate lines separately and highlight each.
The closest thing I found was a clever formula that calculates a running total here:
sum/* comment */({amount})
OVER(PARTITION BY {name}
ORDER BY {internalid}
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
I wonder if it's possible to sort results by the field being checked for duplicates and adapt this code to identify changes in the "ACCOUNT" field between a row and the previous row.
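In plain SQL (outside NetSuite) the idea I'm picturing would look something like the sketch below, using LAG to compare each row's ACCOUNT with the previous row when sorted by that column. I don't know whether NetSuite's stripped-down formula syntax accepts LAG at all, and this would only flag the second and later occurrences rather than every duplicate line; the table and column names here are made up.
-- plain SQL illustration only; table/column names are hypothetical
SELECT
    account,
    CASE WHEN account = LAG(account) OVER (ORDER BY account, internalid)
         THEN 1 ELSE 0 END AS matches_previous_row
FROM transaction_lines;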
Any ideas? Thanks!
This post has been edited. I have left the progression as a learning experience about NetSuite.
Original - plain SQL way - not suitable for NetSuite
Does something like this meet your needs? The test data assumes looking for duplicates on id1 and id2. Note: This does not work in NetSuite as it supports limited SQL functions. See comments for links.
declare @table table (id1 int, id2 int, value int);
insert @table values
(1,1,11),
(1,2,12),
(1,3,13),
(2,1,21),
(2,2,22),
(2,3,23),
(1,3,1313);
--select * from @table order by id1, id2;
select t.*,
case when dups.id1 is not null then 1 else 0 end is_dup --identify dups when there is a matching dup record
from @table t
left join ( --subquery to find duplicates
select id1, id2
from @table
group by id1, id2
having count(1) > 1
) dups
on dups.id1 = t.id1
and dups.id2 = t.id2
order by t.id1, t.id2;
First Edit - NetSuite target but in SQL.
This was a SQL test based on the example syntax available in the question, since I do not have NetSuite to test against. This will give you a value greater than 1 on each duplicate row using a similar syntax. Note: This will give the appropriate answer, but not in NetSuite.
select t.*,
sum(1) over (partition by id1, id2)
from @table t
order by t.id1, t.id2;
Second Edit - Working NetSuite version
After some back and forth here is a version that works in NetSuite:
sum/* comment */(1) OVER(PARTITION BY {name})
This will also give a value greater than 1 on any row that is a duplicate.
Explanation
This works by summing the value 1 on each row included in the partition. The partition column(s) should be what you consider a duplicate. If only one column makes a duplicate (e.g. user ID) then use as above. If multiple columns make a duplicate (e.g. first name, last name, city) then use a comma-separated list in the partition. SQL will basically group the rows by the partition and add up the 1s in the sum/* comment */(1). The example provided in the question sums an actual column. By summing 1 instead we will get the value 1 when there is only 1 ID in the partition. Anything higher is a duplicate. I guess you could call this field duplicate count.
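For example, if a duplicate is defined by first name, last name and city, the formula might look like this (the field IDs in braces are just placeholders for your actual column IDs):
sum/* comment */(1) OVER(PARTITION BY {firstname}, {lastname}, {city})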
Hope this will be simple for some of database gurus.
I’ll try to be as short and concise as possible.
I’m not too good with SQL. I need a query (using SQL Server 2008) that will do the following:
Let’s say there is a Products table with columns like this (simplified):
ID, Title, Description, BrandID, TypeID, Price
Tasks:
Return all rows matching BrandID
Return all rows matching TypeID
Return all rows where Price is greater than or equal to some value
Implement SQL paging, e.g. with BETWEEN, passing parameters for start row, end row and page size.
Results will sometimes be sorted by Price (ASC or DESC) and sometimes by Title (also ASC or DESC)
So far nothing too complicated, and I already have a solution.
For example:
SELECT T.*
FROM (
SELECT COUNT(1) OVER() AS TotalRecords,
ROW_NUMBER() OVER(ORDER BY Products.Price ASC) AS RowNumber,
Products.*
FROM Products
WHERE Products.BrandID = @BrandID
AND Products.TypeID = @TypeID
AND Products.Price >= @Price
) AS T
WHERE T.RowNumber BETWEEN @StartRowNumber AND @EndRowNumber
Now the fun starts: I need to add one more search criterion, keywords. The keywords will be multiple words separated by spaces and should be matched against the Title and Description columns. In this case the ORDER BY will be by relevance, where relevance is the number of keywords matched within the Title column, so the results whose Title matches the most keywords are returned first. The Description column is ignored when counting relevance; it just needs to match at least one keyword via LIKE '%' + @Word + '%'.
Of course, the keywords will be split into @Word parameters, but that is not a problem.
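To illustrate the relevance part, with just two hard-coded words (@Word1 and @Word2 are only illustrative; the real query would be built for however many words the user typed), I am imagining something like this:
SELECT T.*
FROM (
    SELECT COUNT(1) OVER() AS TotalRecords,
           ROW_NUMBER() OVER(ORDER BY
               -- relevance = number of keywords found in the Title
               CASE WHEN Products.Title LIKE '%' + @Word1 + '%' THEN 1 ELSE 0 END
             + CASE WHEN Products.Title LIKE '%' + @Word2 + '%' THEN 1 ELSE 0 END DESC) AS RowNumber,
           Products.*
    FROM Products
    WHERE Products.BrandID = @BrandID
      AND Products.TypeID = @TypeID
      AND Products.Price >= @Price
      -- at least one keyword must appear somewhere in Title or Description
      AND (   Products.Title LIKE '%' + @Word1 + '%' OR Products.Description LIKE '%' + @Word1 + '%'
           OR Products.Title LIKE '%' + @Word2 + '%' OR Products.Description LIKE '%' + @Word2 + '%')
) AS T
WHERE T.RowNumber BETWEEN @StartRowNumber AND @EndRowNumber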
One important note: using full text search will be most appropriate solution but in this case I can’t use it!
Any help is appreciated.
Thanks in advance!
Aim:
Count (column 3) and sum(values of column 4) based on a unique combination of column 1 and column 2;
Is this possible in one SQL statement? If yes, what is the precise syntax?
Explanation: Consider a table with 4 columns: genus, species, Id_Indiv and weight.
The task is to count all Id_Indiv and to sum up the weight values for each of the 4 existing unique combinations of genus & species.
The desired output is one row per genus & species combination, with the number of individuals and the summed weight.
If it is possible to create the desired output in one SQL statement, it might read something like:
SELECT genus, species, count(ID_Indiv) as NoOfIndiv, sum(weight)
FROM (
SELECT DISTINCT
genus,
species
FROM TableName
)W
group by genus, species;
Note: there might be more than 50 possible genus & species combinations, and we do not know if/when new combinations will come up. Thus I cannot predefine the combinations; the statement has to identify them.
Please, could somebody help me with the exact SQL statement?
Thanks a lot in advance!
Why do you use a nested SELECT statement? You don't have to do that!
SELECT genus, species, count(ID_Indiv) as NoOfIndiv, sum(weight)
FROM TableName
GROUP BY genus, species;
GROUP BY will take care of making {genus, species} pair unique within each group.
It seems like your attempt was very close... so I'm not sure if I'm oversimplifying the issue... but shouldn't just a simple GROUP BY do what you are looking for?
SELECT
genus,
species,
COUNT(*) AS NoOfIndiv,
SUM(weight) as TotalWeight
FROM YourTable
GROUP BY
genus,
species
I have a requirement to produce a list of possible duplicates before a user saves an entity to the database and warn them of the possible duplicates.
There are 7 criteria on which we should check the for duplicates and if at least 3 match we should flag this up to the user.
The criteria will all match on ID, so there is no fuzzy string matching needed, but my problem comes from the fact that there are many possible ways (99, if I've done my sums correctly) for at least 3 items to match from the list of 7 possibles.
I don't want to have to do 99 separate db queries to find my search results and nor do I want to bring the whole lot back from the db and filter on the client side. We're probably only talking of a few tens of thousands of records at present, but this will grow into the millions as the system matures.
Anyone got any thoughts on a nice efficient way to do this?
I was considering a simple OR query to get the records where at least one field matches, and then doing some processing on the client to filter it further, but a few of the fields have very low cardinality and won't actually reduce the numbers by a huge amount.
Thanks
Jon
OR and CASE summing will work but are quite inefficient, since they don't use indexes.
You need to use UNION for the indexes to be usable.
If a user enters a name, phone, email and address into the database, and you want to find all records that match at least 3 of these fields, you issue:
SELECT i.*
FROM (
SELECT id, COUNT(*)
FROM (
SELECT id
FROM t_info t
WHERE name = 'Eve Chianese'
UNION ALL
SELECT id
FROM t_info t
WHERE phone = '+15558000042'
UNION ALL
SELECT id
FROM t_info t
WHERE email = '42@example.com'
UNION ALL
SELECT id
FROM t_info t
WHERE address = '42 North Lane'
) q
GROUP BY
id
HAVING COUNT(*) >= 3
) dq
JOIN t_info i
ON i.id = dq.id
This will use indexes on these fields and the query will be fast.
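For this approach to pay off, each of the four columns needs its own index; something along these lines (index names here are just illustrative):
CREATE INDEX ix_info_name ON t_info (name);
CREATE INDEX ix_info_phone ON t_info (phone);
CREATE INDEX ix_info_email ON t_info (email);
CREATE INDEX ix_info_address ON t_info (address);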
See this article in my blog for details:
Matching 3 of 4: how to match a record which matches at least 3 of 4 possible conditions
Also see this question the article is based upon.
If you want a list of the records that have possible duplicates already in the existing data, you just wrap this query into a subquery:
SELECT i1.*
FROM t_info i1
WHERE EXISTS
(
SELECT 1
FROM (
SELECT id
FROM t_info t
WHERE name = i1.name
UNION ALL
SELECT id
FROM t_info t
WHERE phone = i1.phone
UNION ALL
SELECT id
FROM t_info t
WHERE email = i1.email
UNION ALL
SELECT id
FROM t_info t
WHERE address = i1.address
) q
WHERE q.id <> i1.id
GROUP BY
id
HAVING COUNT(*) >= 3
)
Note that this duplicate matching is not transitive: if A matches B and B matches C, this does not mean that A matches C.
You might want something like the following:
SELECT id
FROM
(select id, CASE fld1 WHEN input1 THEN 1 ELSE 0 END AS rule1,
CASE fld2 WHEN input2 THEN 1 ELSE 0 END AS rule2,
...,
CASE fld7 WHEN input7 THEN 1 ELSE 0 END AS rule7
FROM table) t
WHERE rule1 + rule2 + rule3 + ... + rule7 >= 3
This isn't tested, but it shows a way to tackle this.
What DBMS are you using? Some support such checks using server-side code.
Have you considered using a stored procedure with a cursor? You could then do your OR query and then step through the records one-by-one looking for matches. Using a stored procedure would allow you to do all the checking on the server.
However, I think a table scan with millions of records is always going to be slow. I think you should work out which of the 7 fields are most likely to match and make sure these are indexed.
I'm assuming your system is trying to match tag IDs of a certain post, or something similar. This is a many-to-many relationship and you should have three tables to handle it: one for the posts, one for the tags, and one for the post-to-tag relationships.
If my assumptions are correct then the best way to handle this is:
SELECT postid, count(tagid) as common_tag_count
FROM posts_to_tags
WHERE tagid IN (tag1, tag2, tag3, ...)
GROUP BY postid
HAVING count(tagid) >= 3;