Check if String starts with column value with best performance


I currently have this table (myTable) in my database:
user   start
-----  --------
Adam   12345
Alex   123
Benny  2345
In my program, I accept a string from user, eg: 12345678
My objective is to select out the row where user input starts with myTable.Start
-- For example, it would be great to have something like:
select * from myTable where "12345678".startsWith(start)
-- and returns me Adam, 12345 & Alex, 123
As of now I'm using
select user, start
from myTable where charindex(start, '12345678') = 1
order by start desc
which does the job, but with absolutely terrible performance. myTable has close to a million rows, and I'm not sure whether indexing start would help, since I'm not doing a direct comparison in this case.
Does anyone know a better way to accomplish this?

Try this,
select user, start
from myTable
where '12345678' like start+'%'
order by start desc
Also add an index to the column start.

You should use the pattern-matching operator LIKE and add an index on the column, e.g.:
select user, start
from myTable where start LIKE '12345678%'
order by start desc
Without the index, performance will be as bad as before: the server has to check every row for a match.
With an index, the operation becomes a range search: rows where start is greater than or equal to 12345678 but less than the next higher string lexically (12345679). Range searches can use indexes and read only the matching rows.
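The range equivalence described above can be sketched in a few lines of Python. This is only an illustration: a sorted list stands in for the index, the helper names `prefix_range` and `starts_with_lookup` are made up for this example, and Python compares strings by code point, which may differ from the collation a real SQL Server column uses.

```python
import bisect

def prefix_range(prefix):
    """Return (low, high) such that low <= s < high holds for strings starting with prefix."""
    # Bump the last character to get the first string that no longer has the prefix.
    # Assumes a non-empty prefix whose last character is not the largest code point.
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high

def starts_with_lookup(sorted_values, prefix):
    """Find all values with the given prefix using two binary searches (an index-style range scan)."""
    low, high = prefix_range(prefix)
    lo_idx = bisect.bisect_left(sorted_values, low)
    hi_idx = bisect.bisect_left(sorted_values, high)
    return sorted_values[lo_idx:hi_idx]

# A sorted list plays the role of the index on the start column (hypothetical sample values).
index = sorted(["123", "12345", "12345678", "123456789", "12345679", "2345"])

print(prefix_range("12345678"))               # ('12345678', '12345679')
print(starts_with_lookup(index, "12345678"))  # ['12345678', '123456789']
```

Only the slice between the two bounds is touched, which is why a prefix LIKE on an indexed column does not need to visit every row.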
EDIT
Oops, I missed that the query tries to do the opposite: find rows whose value is a prefix of a given string. This can't be accelerated with indexes because it's equivalent to
'someconstantvalue' LIKE start + '%'
This has to build a new pattern for each row before matching, so it can't use any indexes.
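One possible workaround, which is not proposed anywhere in this thread, is to turn the problem around: the input has only a handful of prefixes, so they can be listed and matched with a plain equality test, which an index on start should be able to serve. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server; the function name `find_prefix_rows` and the index name are assumptions made for the example, and it assumes start is stored as text.

```python
import sqlite3

def find_prefix_rows(conn, user_input):
    """Return rows whose start value is a prefix of user_input, longest first."""
    # Every prefix of the input: '1', '12', '123', ...
    prefixes = [user_input[:i] for i in range(1, len(user_input) + 1)]
    placeholders = ",".join("?" for _ in prefixes)
    sql = (
        'SELECT "user", "start" FROM myTable '
        f'WHERE "start" IN ({placeholders}) '
        'ORDER BY "start" DESC'
    )
    return conn.execute(sql, prefixes).fetchall()

# In-memory table mirroring the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE myTable ("user" TEXT, "start" TEXT)')
conn.execute('CREATE INDEX ix_myTable_start ON myTable ("start")')
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [("Adam", "12345"), ("Alex", "123"), ("Benny", "2345")],
)

rows = find_prefix_rows(conn, "12345678")
print(rows)  # [('Adam', '12345'), ('Alex', '123')]
```

The number of lookups grows with the length of the input rather than with the size of the table, so it is worth timing against the LIKE version on real data.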

Related

How to get rid of duplicates with T-SQL

Hi, I have a login table that has some duplicated usernames.
Yes I know I should have put a constraint on it, but it's a bit too late for that now!
So essentially what I want to do is to first identify the duplicates. I can't just delete them since I can't be too sure which account is the correct one. The accounts have the same username and both of them have roughly the same information with a few small variances.
Is there any way to efficiently script it so that I can add "_duplicate" to only one of the accounts per duplicate?

You can use ROW_NUMBER with a PARTITION BY in the OVER() clause to find the duplicates and an updateable CTE to change the values accordingly:
-- Table variable with sample data; Peter and Jane appear more than once
DECLARE @dummyTable TABLE(ID INT IDENTITY, UserName VARCHAR(100));
INSERT INTO @dummyTable VALUES('Peter'),('Tom'),('Jane'),('Victoria')
,('Peter') ,('Jane')
,('Peter');
-- Number the rows per UserName; every row after the first gets the suffix
WITH UpdateableCTE AS
(
SELECT t.UserName AS OldValue
,t.UserName + CASE WHEN ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY ID)=1 THEN '' ELSE '_duplicate' END AS NewValue
FROM @dummyTable AS t
)
-- Updating the CTE writes through to the underlying table
UPDATE UpdateableCTE SET OldValue = NewValue;
SELECT * FROM @dummyTable;
The result
ID UserName
1 Peter
2 Tom
3 Jane
4 Victoria
5 Peter_duplicate
6 Jane_duplicate
7 Peter_duplicate
You might include ROW_NUMBER() as another column to see each duplicate's ordinal. If you've got a sort clause that numbers the earliest (or most current) row with 1, it should be easy to find and correct the duplicates.
Once you've cleaned up this mess, you should make sure you don't get new dups. But you know this already :-D
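For readers who want to try the idea outside SQL Server, here is a rough Python sketch using the standard sqlite3 module. It does not use ROW_NUMBER; instead it keeps the lowest id per username and marks the rest, which gives the same result for this sample data. The table name `login` and its columns are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE login (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO login (username) VALUES (?)",
    [("Peter",), ("Tom",), ("Jane",), ("Victoria",), ("Peter",), ("Jane",), ("Peter",)],
)

# Step 1: the first row (lowest id) of every username is treated as the original.
keep_ids = [row[0] for row in conn.execute("SELECT MIN(id) FROM login GROUP BY username")]

# Step 2: every other row gets the '_duplicate' suffix.
placeholders = ",".join("?" for _ in keep_ids)
conn.execute(
    f"UPDATE login SET username = username || '_duplicate' WHERE id NOT IN ({placeholders})",
    keep_ids,
)

rows = conn.execute("SELECT id, username FROM login ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Rows 5, 6 and 7 end up with the suffix, matching the result shown above.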

There is no easy way to get rid of this nightmare. Some manual work is required.
First identify duplicates.
select * from dbo.users
where username in
(select username from dbo.users
group by username
having count(*) > 1)
Next, identify "useless" users (for example, those who registered but never placed an order).
Rerun the query above. From this list, find duplicates that are really the same person (by email, for example) and combine them into a single record. If they did something useful previously (for example, placed orders), first reassign those orders to the user that survives, then remove the others.
Continue with other criteria until you get rid of the duplicates.
Then set a unique constraint on the username field. It is also a good idea to set a unique constraint on the email field.
Again, it is not easy and not automatic.

In this case, where the duplicates and the original rows differ slightly, it is practically impossible to select only the non-duplicate rows, since you don't know which row is real and which is the duplicate.
I think the best thing to do is to correct your data and then fix the source that is producing these slightly different duplicates.

selecting random rows based on weight on another row

I need to select random rows from a table based on a weight stored in another column. For example, if the user enters 50, I need to select 50 random rows from the table, with the higher-weight rows being returned more often. I have seen NEWID() used to select n random rows, and this link
Random Weighted Choice in T-SQL
where one row is selected based on the weight in another column, but I need to select several rows based on the number the user enters. Would the best way be to use the suggested answer from the above link and loop over it n times (though I think it would return the same row each time), or is there an easier solution?
My table looks like this:
ID Name Freq
1 aaa 50
2 bbb 30
3 ccc 10
So when the user enters 50, I need to return 50 random names, with more aaa and bbb than ccc. It might be something like 25 aaa, 15 bbb and 10 ccc. Anything close to this will work too. I saw this answer, but when I execute it against my DB it has been running for 5 minutes with no results yet.
SQL : select one row randomly, but taking into account a weight

I think the difficult part here is getting any individual row to potentially appear more than once. I'd look into doing something like the following:
1) Build a temp table, duplicating records according to their frequency (I'm sure there's a better way of doing this, but the first answer that came to my mind was a simple while loop... This particular one really only works if the frequency values are integers)
create table #dup
(
id int,
nm varchar(10)
)
declare @curr int, @maxFreq int
select @curr=0, @maxFreq=max(freq)
from tbl
while @curr < @maxFreq
begin
insert into #dup
select id, nm
from tbl
where freq > @curr
set @curr = @curr+1
end
end
2) Select your top records, ordered by a random value
select top 10 *
from #dup
order by newID()
3) Cleanup
drop table #dup
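The three steps above can be mirrored in a short Python sketch. It is only an illustration of the same expand-then-sample idea, assuming integer frequencies; the function name `weighted_pick` and the optional `seed` parameter are made up for the example.

```python
import random

def weighted_pick(rows, n, seed=None):
    """Pick n names at random, where each row is represented freq times in the pool."""
    rng = random.Random(seed)
    # Step 1: duplicate every name according to its frequency (the role of #dup).
    pool = [name for _id, name, freq in rows for _ in range(freq)]
    # Step 2: draw n entries from the pool without replacement,
    # like TOP n ... ORDER BY NEWID(). Raises ValueError if n exceeds the pool size.
    return rng.sample(pool, n)

rows = [(1, "aaa", 50), (2, "bbb", 30), (3, "ccc", 10)]
picked = weighted_pick(rows, 50, seed=1)

# Count how often each name was drawn; the exact split varies with the seed.
counts = {name: picked.count(name) for _id, name, _freq in rows}
print(counts)
```

Because the pool holds 90 entries here, a name can appear at most as many times as its frequency, which matches the behaviour of the temp-table version.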

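Maybe you could try something like the following:
ORDER BY Freq * rand()
in your SQL? Rows with a higher Freq value should in theory get returned more often than those with a lower Freq value. It seems a bit hackish but it might work! One caveat: in SQL Server, RAND() without a seed is evaluated once per query rather than once per row, so you would need a per-row random value (for example one derived from NEWID()) for this to have any effect.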

SQL Server 2005 SELECT TOP 1 from VIEW returns LAST row

I have a view that may contain more than one row, looking like this:
[rate] | [vendorID]
8374 1234
6523 4321
5234 9374
In a SPROC, I need to set a param equal to the value of the first column from the first row of the view. something like this:
DECLARE @rate int;
SET @rate = (select top 1 rate from vendor_view where vendorID = 123)
SELECT @rate
But this ALWAYS returns the LAST row of the view.
In fact, if I simply run the subselect by itself, I only get the last row.
With 3 rows in the view, TOP 2 returns the FIRST and THIRD rows in order. With 4 rows, it's returning the top 3 in order. Yet still top 1 is returning the last.
DERP?!?
This works..
DECLARE @rate int;
CREATE TABLE #temp (vRate int)
INSERT INTO #temp (vRate) (select rate from vendor_view where vendorID = 123)
SET @rate = (select top 1 vRate from #temp)
SELECT @rate
DROP TABLE #temp
.. but can someone tell me why the first behaves so fudgely and how to do what I want? As explained in the comments, there is no meaningful column by which I can do an order by. Can I force the order in which rows are inserted to be the order in which they are returned?
[EDIT] I've also noticed that: select top 1 rate from ([view definition select]) also returns the correct values time and again.[/EDIT]

That is by design.
If you don't specify how the query should be sorted, the database is free to return the records in any order that is convenient. There is no natural order for a table that is used as default sort order.
What the order will actually be depends on how the query is planned, so you can't even rely on the same query giving a consistent result over time, as the database will gather statistics about the data and may change how the query is planned based on that.
To get the record that you expect, you simply have to specify how you want them sorted, for example:
select top 1 rate
from vendor_view
where vendorID = 123
order by rate
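As a small illustration of the point, the sketch below uses Python's sqlite3 module (where LIMIT 1 plays the role of TOP 1) with an ordinary table standing in for the view. The sample rows all share a hypothetical vendorID of 123, which is an assumption made so that the filter returns more than one candidate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A plain table stands in for vendor_view; the rows are made-up sample data.
conn.execute("CREATE TABLE vendor_view (rate INTEGER, vendorID INTEGER)")
conn.executemany(
    "INSERT INTO vendor_view VALUES (?, ?)",
    [(8374, 123), (6523, 123), (5234, 123)],
)

# With an explicit ORDER BY, "the first row" is well defined:
# it is the row with the lowest rate, whatever plan the engine picks.
row = conn.execute(
    "SELECT rate FROM vendor_view WHERE vendorID = ? ORDER BY rate LIMIT 1",
    (123,),
).fetchone()
rate = row[0]
print(rate)  # 5234
```

Without the ORDER BY, any of the three rows would be a legitimate answer.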

I ran into this problem on a query that had worked for years. We upgraded SQL Server and all of a sudden, an unordered select top 1 was not returning the final record in a table. We simply added an order by to the select.
My understanding is that, if no ORDER BY is provided, SQL Server will generally return results in the order of the clustered index or of whatever index the engine picks. But this is not a guarantee of any particular order.
If you don't have something to order off of, you need to add it. Either add a date inserted column and default it to GETDATE() or add an identity column. It won't help you historically, but it addresses the issue going forward.

While it doesn't necessarily make sense that the results of the query should be consistent, in this particular instance they are so we decided to leave it 'as is'. Ultimately it would be best to add a column, but this was not an option. The application this belongs to is slated to be discontinued sometime soon and the database server will not be upgraded from SQL 2005. I don't necessarily like this outcome, but it is what it is: until it breaks it shall not be fixed. :-x

How would I determine if a varchar field in SQL contains any numeric characters?

I'm working on a project where we have to figure out if a given field is potentially a company name versus an address.
In taking a very broad swipe at it, we are going under the assumption that if this field contains no numbers, odds are it is a name vs. a street address (we're aiming for the 80% case, knowing some will have to be done manually).
So now to the question at hand. Given a table with, for the sake of simplicity, a single varchar(100) column, how could I find those records that have no numeric characters at any position within the field?
For example:
"Main Street, Suite 10A" --Do not return this.
"A++ Billing" --Should be returned
"XYZ Corporation" --Should be returned
"100 First Ave, Apt 20" --Should not be returned
Thanks in advance!

SQL Server allows a regex-like syntax for a range [0-9] or a set [0123456789] to be specified in a LIKE pattern, which can be combined with the any-string wildcard (%). For example:
select * from Address where StreetAddress not like '%[0-9]%';
The wildcard % at the start of the like will obviously hurt performance (Scans are likely), but in your case this seems inevitable.
Another MSDN Reference.
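The same test is easy to prototype outside the database. The Python sketch below applies the equivalent check to the four examples from the question; the function name `has_no_digits` is an assumption for this illustration, and the pattern deliberately matches only the ASCII digits 0-9, like the LIKE pattern above.

```python
import re

# Same character range as the LIKE pattern '%[0-9]%'.
DIGIT = re.compile(r"[0-9]")

def has_no_digits(value):
    """True when the string contains no digit 0-9 at any position."""
    return DIGIT.search(value) is None

samples = [
    "Main Street, Suite 10A",
    "A++ Billing",
    "XYZ Corporation",
    "100 First Ave, Apt 20",
]

likely_names = [s for s in samples if has_no_digits(s)]
print(likely_names)  # ['A++ Billing', 'XYZ Corporation']
```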

select * from table where column not like '%[0-9]%'
This query returns you all rows from table where column does not contain any of the digits from 0 to 9.

I like the simple regex approach, but for the sake of discussion will mention this alternative which uses PATINDEX.
SELECT InvoiceNumber from Invoices WHERE PATINDEX('%[0-9]%', InvoiceNumber) = 0

This worked for me.
select total_employee_count from company_table where total_employee_count like '%[^0-9]%'
This returned all rows that contain non-numeric characters, including values such as 2-3.

This query lists the tables whose names contain numeric characters:
select * from SYSOBJECTS where xtype='u' and name like '%[0-9]%'

MS Access row number, specify an index

Is there a way in MS access to return a dataset between a specific index?
So lets say my dataset is:
rank | first_name | age
1 Max 23
2 Bob 40
3 Sid 25
4 Billy 18
5 Sally 19
But I only want to return those records between 'rank' 2 and 4, so my result set is Bob, Sid and Billy. However, rank is not part of the table; it should be generated when the query is run. Why don't I use an autogenerated number? Because if a record is deleted, the numbering becomes inconsistent, and what if I wanted the results in reverse?
This is probably very simple. The reason I ask is that I am working on a product catalogue and am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, that is obviously going to be quicker than returning a complete set of 3000 records and then having to subselect from that set!
Thanks R.

Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after comment, that rank is not existing in structure:
Select top 100 * from table;
And if you want to fetch subsequent results, you can take the ID of the last record from the first query, say it was ID 100, and use a WHERE clause to get the next 100:
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.

How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.

I think you need to use a correlated subquery to calculate the rank on the fly e.g. I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
The bad news is that the Access data engine is poorly optimized for this kind of query; in my experience, performance will start to noticeably degrade beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows to an ADO recordset then finding the page (then perhaps fabricate smaller ADO recordset consisting of just that page's worth of rows) than it is to perform a correlated subquery to only fetch the number of rows for the page.
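To make the correlated-subquery idea concrete, here is a rough sketch using Python's sqlite3 module rather than Access, so timings will not carry over. It ranks by first_name (the same guess as above), then keeps ranks 2 to 4; the alias `rnk` and the wrapping derived table are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (first_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("Max", 23), ("Bob", 40), ("Sid", 25), ("Billy", 18), ("Sally", 19)],
)

# The inner query computes the rank on the fly: one plus the number of names
# that sort before the current one. The outer query keeps one page of ranks.
page = conn.execute(
    """
    SELECT ranked.first_name, ranked.age, ranked.rnk
    FROM (
        SELECT T1.first_name, T1.age,
               (SELECT COUNT(*) + 1
                FROM MyTable AS T2
                WHERE T1.first_name > T2.first_name) AS rnk
        FROM MyTable AS T1
    ) AS ranked
    WHERE ranked.rnk BETWEEN ? AND ?
    ORDER BY ranked.rnk
    """,
    (2, 4),
).fetchall()

print(page)  # [('Bob', 40, 2), ('Max', 23, 3), ('Sally', 19, 4)]
```

Because the rank follows alphabetical order here, the page differs from the Bob, Sid, Billy example in the question, whose ordering rule is not stated.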

Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC

If I understand you correctly, there are only first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query, unless you do something like
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table, and make it an auto-incrementing integer (see the Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had 2 people named Max, and they were both 23, how would you delete one row without deleting the other? If you had an auto-incrementing unique column in there, you could specify the unique ID in your query to delete only one.
[ADDITION]
Upon reading your comment: if you add an autoincrement field and want to read 3 rows, and you know the ID of the first row you want to read, then you can use TOP to read 3 rows.
Assuming your data looks like this
ID | first_name | age
1 Max 23
2 Bob 40
6 Sid 25
8 Billy 18
15 Sally 19
You can query Bob, Sid and Billy with the following query.
SELECT TOP 3 FirstName, Age
From Table
WHERE ID >= 2
ORDER BY ID
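The same paging pattern can be tried quickly with Python's sqlite3 module, where LIMIT takes the place of TOP. This is a sketch under assumptions: the table name `people` and the helper `fetch_page` are invented for the example, and the IDs have gaps on purpose to show that paging by "ID >= first ID of the page" does not depend on consecutive numbering.

```python
import sqlite3

def fetch_page(conn, first_id, page_size):
    """Return page_size rows starting at first_id, in ID order."""
    return conn.execute(
        "SELECT first_name, age FROM people WHERE id >= ? ORDER BY id LIMIT ?",
        (first_id, page_size),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [(1, "Max", 23), (2, "Bob", 40), (6, "Sid", 25), (8, "Billy", 18), (15, "Sally", 19)],
)

page = fetch_page(conn, 2, 3)
print(page)  # [('Bob', 40), ('Sid', 25), ('Billy', 18)]
```

To move to the next page, the caller would pass an ID just past the last one returned (here anything above 8), rather than counting rows.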
