selecting random rows based on weight on another row - sql-server

I need to select random rows from a table, weighted by the value in another column. For example, if the user enters 50, I need to return 50 random rows, with higher-weight rows returned more often. I have seen NEWID() used to select n random rows, and this link:
Random Weighted Choice in T-SQL
which selects one row based on the weight in another column. But I need to select several rows, where the count comes from user input. Would the best approach be to take the suggested answer from that link and loop over it n times (though I think it would return the same row each time), or is there an easier solution?
My table looks like this:
ID Name Freq
1 aaa 50
2 bbb 30
3 ccc 10
So when the user enters 50, I need to return 50 random names, with more aaa and bbb than ccc, for example 25 aaa, 15 bbb and 10 ccc. Anything close to this will work too. I also saw this answer, but when I ran it against my database it was still running after 5 minutes with no results:
SQL : select one row randomly, but taking into account a weight

I think the difficult part here is getting any individual row to potentially appear more than once. I'd look into doing something like the following:
1) Build a temp table, duplicating records according to their frequency (I'm sure there's a better way of doing this, but the first answer that came to my mind was a simple while loop... This particular one really only works if the frequency values are integers)
create table #dup
(
id int,
nm varchar(10)
)
declare @curr int, @maxFreq int
select @curr=0, @maxFreq=max(freq)
from tbl
while @curr < @maxFreq
begin
    insert into #dup
    select id, nm
    from tbl
    where freq > @curr
    set @curr = @curr+1
end
2) Select your top records, ordered by a random value
select top 10 *
from #dup
order by newID()
3) Cleanup
drop table #dup
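As an alternative to duplicating rows, a draw with replacement can map each random number onto a cumulative weight range. The following is only a minimal sketch, not the method from the answer above. It assumes SQL Server 2012 or later (for the windowed running SUM), positive integer Freq values, and a hypothetical table variable @tbl holding the question's sample data. sys.all_objects is borrowed purely as a row source, so @n cannot exceed the number of rows in that view.

```sql
-- Hypothetical copy of the question's sample data.
DECLARE @tbl TABLE (ID int PRIMARY KEY, Name varchar(10), Freq int);
INSERT INTO @tbl VALUES (1, 'aaa', 50), (2, 'bbb', 30), (3, 'ccc', 10);

DECLARE @n int = 50;                                -- rows requested by the user
DECLARE @total int = (SELECT SUM(Freq) FROM @tbl);  -- total weight

-- Materialise the draws first so each random value is computed exactly once.
DECLARE @draws TABLE (r int NOT NULL);
INSERT INTO @draws (r)
SELECT TOP (@n) ABS(CHECKSUM(NEWID()) % @total)     -- integer in [0, @total)
FROM sys.all_objects;

WITH ranges AS (
    -- Each row owns the half-open interval [lo, hi), whose width is its Freq.
    SELECT ID, Name,
           SUM(Freq) OVER (ORDER BY ID ROWS UNBOUNDED PRECEDING) - Freq AS lo,
           SUM(Freq) OVER (ORDER BY ID ROWS UNBOUNDED PRECEDING)        AS hi
    FROM @tbl
)
SELECT rg.ID, rg.Name
FROM @draws AS d
JOIN ranges AS rg
  ON d.r >= rg.lo AND d.r < rg.hi;
```

Each draw lands in exactly one interval, so the query returns @n rows and a row can appear many times. The counts vary from run to run, but each name's share of the result follows its share of the total Freq.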

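Maybe you could try something like the following:
ORDER BY Freq * rand()
in your SQL? Rows with a higher Freq value should, in theory, get returned more often than those with a lower Freq value. It seems a bit hackish but it might work! Note that in SQL Server, RAND() without a seed is evaluated once per query rather than once per row, so a per-row random value (for example one derived from NEWID()) would be needed for the ordering to vary.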

Related

How to configure a table column in TSQL that works as a sequence depending on the values of another two columns?

I have a table that looks like this:
ID A B Count
-----------------
1 abc 0 1
2 abc 0 2
3 abc 1 1
4 xyz 1 1
5 xyz 1 2
6 xyz 1 3
7 abc 1 2
8 abc 0 3
The "Count" column is incremented by one on the next insertion, depending on the values of fields "A" and "B". So for example, if the next record I want to insert is:
ID A B Count
-----------------
abc 0
The value of count will be 4.
I have been trying to find documentation about this, but I'm still quite lost in the MS SQL world! There must be a way to configure the "Count" column as a sequence dependent on the other two columns. My alternative would be to select all the records with A=abc and B=0, get the maximum "Count", and do +1 in the latest one, but I suspect there must be another way related to properly defining the Count column when creating the table.
The first question is: Why do you need this?
There is ROW_NUMBER() which will - provided the correct PARTITION BY in the OVER() clause - do this for you:
DECLARE @tbl TABLE(ID INT,A VARCHAR(10),B INT);
INSERT INTO @tbl VALUES
 (1,'abc',0)
,(2,'abc',0)
,(3,'abc',1)
,(4,'xyz',1)
,(5,'xyz',1)
,(6,'xyz',1)
,(7,'abc',1)
,(8,'abc',0);
SELECT *
      ,ROW_NUMBER() OVER(PARTITION BY A,B ORDER BY ID) AS [Count]
FROM @tbl
ORDER BY ID;
The problem is: What happens if a row is changed or deleted?
If you write these values into a persistent column and one row is physically removed, you'll have a gap. Okay, one can live with this... But if a value in A is changed from abc to xyz (the same applies to B, of course), the whole approach breaks.
If you still want to write this into a column you can use the ROW_NUMBER() from above to fill these values initially and a TRIGGER to set the next value with your SELECT MAX()+1 approach for new rows.
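A trigger along those lines might look like the sketch below. The table dbo.Items, its identity column ID and the trigger name are assumptions made for illustration, not taken from the question. The sketch handles multi-row inserts but does nothing to serialise concurrent inserts, so two sessions inserting the same A/B combination at the same moment could still receive the same value.

```sql
-- Hypothetical table; all names are assumptions for illustration.
CREATE TABLE dbo.Items
(
    ID      int IDENTITY(1,1) PRIMARY KEY,
    A       varchar(10) NOT NULL,
    B       int         NOT NULL,
    [Count] int         NULL
);
GO
CREATE TRIGGER dbo.trg_Items_SetCount ON dbo.Items
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE t
    SET [Count] = ISNULL(prev.MaxCount, 0) + ins.rn
    FROM dbo.Items AS t
    JOIN (
        -- Number the new rows within each (A, B) group of this insert.
        SELECT ID, A, B,
               ROW_NUMBER() OVER (PARTITION BY A, B ORDER BY ID) AS rn
        FROM inserted
    ) AS ins ON ins.ID = t.ID
    OUTER APPLY (
        -- Highest existing Count for the same (A, B), ignoring the new rows.
        SELECT MAX(x.[Count]) AS MaxCount
        FROM dbo.Items AS x
        WHERE x.A = ins.A
          AND x.B = ins.B
          AND x.ID NOT IN (SELECT ID FROM inserted)
    ) AS prev;
END;
```

GO is the batch separator used by SSMS and sqlcmd; it is needed here because CREATE TRIGGER must be the first statement in its batch.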
If the set of combinations is limited you might create a SEQUENCE (needs v2012+) for each.
But - to be honest - the whole issue smells a bit.

SQL Server Insert Random

I have this table:
Create Table Person
(
Consecutive Integer Identity(1,1),
Identification Varchar(15) Primary Key,
)
The Identification column can contain letters and numbers and is optional: the customer can enter it or not, and if not, a number is generated automatically. How can I insert a random number that does not already exist, preferably a low one?
An example could be:
Select Random From Person Where Random Not Exists In Identification
This is my code:
Select Min(Convert(Integer,Identification)) - 1
From Person
Where IsNumeric(Identification) = 1
Or
Select Max(Convert(Integer,Identification)) + 1
From Person
Where IsNumeric(Identification) = 1
This works well, but if the customer enters a high number, for example 1000 or higher, then numbering will continue from there and could eventually cause an overflow error.
And if there is no number below the lowest Identification that is greater than 0, the values will be -1, -2, -3, etc.
Thanks in advance.
I agree with what M.Ali said. You can use the code below, but I still don't recommend anything beyond what M.Ali said.
The loop will continue until it generates a random number that is not in your table. You can change the precision to 5 digits by changing 1000 to 10000, and so on.
DECLARE @I INT = 0
DECLARE @RANDOM INT;
WHILE(@I=0)
BEGIN
    SELECT @RANDOM = 1000 + (CONVERT(INT, CRYPT_GEN_RANDOM(3)) % 1000);
    IF NOT EXISTS(SELECT Identification FROM YOURTABLE WHERE Identification = CAST(@RANDOM AS VARCHAR(4)))
    BEGIN
        -- Do your stuff here
        BREAK;
    END
    ELSE
    BEGIN
        -- The ELSE part
    END
END
Maintaining a random value in a VARCHAR(15) column that depends on the end user's input can be very expensive when you also want it to be unique.
Imagine a scenario where this table has a decent number of rows, say 10,000, and a user tries to insert a random number. Chances are the user may have to try 5, 10 or even 15 times to get a unique value.
On each failed attempt a call is made to the server and the table is searched (the more rows, the more expensive this query becomes), and the more failed attempts a user makes, the more disappointing the application experience will be.
Would you ever go back to an application (web or Windows) where you had to struggle this many times just to register? Obviously not.
The moral of the story: if you are asking a user to enter some random value, do not expect users to maintain your database integrity and keep that column unique. Take control and pair that value with another column which will definitely be unique; in your case it can be the Identity column. Alternatively, you can generate that value for the user yourself, using a GUID.
select count(*) +1 from Person
This generates a logical ID for Identification that sets the ID to what it 'would have been' with a simple incrementor.
However, you then cannot delete records; instead you must deactivate them, or clear the row.
Alternately, have a separate (hidden) column that only auto-increments, and if Identification is left empty, use the value from the hidden column. Same result, but less risk if deletion is relevant.
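One way to sketch the hidden-column idea is with a computed column that falls back to the identity value. This is a hypothetical redesign, not the asker's schema: it moves the primary key to Consecutive so that Identification can be NULL, and the names DisplayId and UX_Person_Identification are made up for illustration.

```sql
-- Hypothetical redesign: the identity column becomes the primary key,
-- and Identification may be NULL when the customer skips it.
CREATE TABLE Person
(
    Consecutive    int IDENTITY(1,1) PRIMARY KEY,
    Identification varchar(15) NULL,
    -- Value shown to the application: the customer's entry, else the identity.
    DisplayId AS COALESCE(Identification, CONVERT(varchar(15), Consecutive))
);

-- Keep customer-entered values unique while permitting many NULLs
-- (filtered indexes need SQL Server 2008 or later).
CREATE UNIQUE INDEX UX_Person_Identification
    ON Person (Identification)
    WHERE Identification IS NOT NULL;

INSERT INTO Person (Identification) VALUES ('ABC123'), (NULL);

SELECT Consecutive, Identification, DisplayId
FROM Person;
```

A customer-entered numeric value can still coincide with a generated one (for example, someone typing 2 when row 2 has no Identification), so a real design would need a rule for that case, such as a prefix on generated values.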

How can I assign a number to each row in a table representing the record number?

How can I show the number of rows in a table in a way that when a new record is added the number representing the row goes higher and when a record is deleted the number gets updated accordingly?
To be clearer, suppose I have a simple table like this:
ID int (primary key) Name varchar(5)
The ID is set to get incremented by itself (using identity specification) so it can't represent the number of row(record) since if I have for example 3 records as:
ID NAME
1 Alex
2 Scott
3 Sara
and I delete Alex and Scott and add a new record it will be:
3 Sara
4 Mina
So basically I'm looking for a sql-side solution for doing this so that I don't change anything else in the source code in multiple places.
I tried to write something to get the job done but it fails. Here it is:
SELECT COUNT(*) AS [row number],Name
FROM dbo.Test
GROUP BY ID, Name
HAVING (ID = ID)
This shows as:
row number Name
1 Alex
1 Scott
1 Sara
while I want it to get shown as:
row number Name
1 Alex
2 Scott
3 Sara
If you just want the number against the rows while selecting the data and not in the database then you can use this
select row_number() over(order by id) from dbo.Test
This will give the row number n for nth row.
Try
SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS RowNumber
FROM MyTable
What you want is called an auto increment.
For SQL-Server this is achieved by adding the IDENTITY(1,1) attribute to the table definition.
Other RDBMS use a different syntax. Firebird for example has generators, which do the counting. In a BEFORE-INSERT trigger you would assign the ID-field to the current value of the generator (which will be increased automatically).
I had this exact problem a while ago, but I was using SQL Server 2000. ROW_NUMBER() is the best solution, but it isn't available in SQL Server 2000. A workaround is to create a temporary table, insert all the values so that an auto-increment column numbers them, and replace the current table with the new table in T-SQL.
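The numbering part of that workaround could be sketched as follows. This assumes the dbo.Test table from the question; the temp table name #numbered is made up for illustration, and the final step of swapping tables is left out.

```sql
-- Temporary table whose identity column supplies the row number.
CREATE TABLE #numbered
(
    RowNumber int IDENTITY(1,1),
    Name      varchar(5)
);

-- With INSERT ... SELECT ... ORDER BY, identity values follow the ORDER BY.
INSERT INTO #numbered (Name)
SELECT Name
FROM dbo.Test
ORDER BY ID;

SELECT RowNumber AS [row number], Name
FROM #numbered
ORDER BY RowNumber;

DROP TABLE #numbered;
```

Because the numbers are assigned at insert time, the temp table has to be rebuilt whenever rows are added to or deleted from dbo.Test.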

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except that it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.
SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)
In general it works like this:
1) All rows from mytable are scanned
2) NEWID() is executed for each row
3) The rows are sorted according to the random value from NEWID()
4) The first 100 rows are selected
As MSDN says:
NewID() creates a unique value of type uniqueidentifier.
and your table will be sorted by these random values.
Run select top 100 randid = newid(), * from mytable order by randid
and it should become clear.
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;
/* Get the first row around that ID (ORDER BY makes TOP 1 deterministic) */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand
ORDER BY u.Id;
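The "generate N random numbers and use in an IN clause" variant could be sketched like this. It is only an illustration under stated assumptions: dbo.Users has positive integer Ids, @n is a hypothetical sample size, and sys.all_objects is used purely as a row source, so @n cannot exceed its row count.

```sql
DECLARE @n int = 10;                                  -- hypothetical sample size
DECLARE @maxid int = (SELECT MAX(Id) FROM dbo.Users);

-- Materialise the candidate IDs so each NEWID() call is evaluated once.
DECLARE @ids TABLE (Id int NOT NULL);
INSERT INTO @ids (Id)
SELECT TOP (@n) 1 + ABS(CHECKSUM(NEWID()) % @maxid)   -- integer in [1, @maxid]
FROM sys.all_objects;

SELECT u.*
FROM dbo.Users AS u
WHERE u.Id IN (SELECT Id FROM @ids);
```

This can return fewer than @n rows, because duplicate random numbers collapse into one match and numbers that fall into gaps in the Id sequence match nothing.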

MS Access row number, specify an index

Is there a way in MS Access to return only the rows of a dataset that fall between two specific positions?
So let's say my dataset is:
rank | first_name | age
1 Max 23
2 Bob 40
3 Sid 25
4 Billy 18
5 Sally 19
But I only want to return the records between 'rank' 2 and 4, so that my result set is Bob, Sid and Billy. However, rank is not part of the table; it should be generated when the query is run. I can't use an autogenerated number, because it becomes inconsistent if a record is deleted, and it doesn't help if I want the results in reverse order.
This is probably very simple. The reason I ask is that I am working on a product catalogue and I am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, that should be quicker than returning the complete set of 3000 records and then having to subselect from that set.
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after comment, that rank is not existing in structure:
Select top 100 * from table;
And if you want to fetch subsequent results, you can take the ID of the last record from the first query, say it was ID 100, and use a WHERE clause to get the next 100:
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly e.g. I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
The bad news is that the Access data engine is poorly optimized for this kind of query; in my experience, performance will start to degrade noticeably beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows to an ADO recordset then finding the page (then perhaps fabricate smaller ADO recordset consisting of just that page's worth of rows) than it is to perform a correlated subquery to only fetch the number of rows for the page.
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC
If I understand you correctly, there are only first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query, unless you do something like
SELECT * FROM Table
WHERE first_name = 'Bob'
OR first_name = 'Sid'
OR first_name = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee about the order in which data comes out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table, and make it an auto-incrementing integer (see the Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had 2 people named Max, and they were both 23, how would you delete one row without deleting the other? If you had an auto-incrementing unique column in there, you could specify the unique ID in your query to delete only one.
[ADDITION]
Upon reading your comment: if you add an autoincrement field and want to read 3 rows, and you know the ID of the first row you want to read, then you can use TOP to read 3 rows.
Assuming your data looks like this
ID | first_name | age
1 Max 23
2 Bob 40
6 Sid 25
8 Billy 18
15 Sally 19
You can query Bob, Sid and Billy with the following query.
SELECT TOP 3 first_name, age
FROM Table
WHERE ID >= 2
ORDER BY ID
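When the ID of the first row on the page is not known in advance, a nested TOP query is a common workaround for the missing LIMIT keyword. The sketch below fetches positions 2 to 4 for the sample data above. It assumes the table is called MyTable and that ID is unique; the aliases first_rows and page_rows are made up for illustration. The block carries no comments because Access SQL does not accept them: the innermost query takes the first 4 rows, the middle query keeps the last 3 of those by reversing the order, and the outer query restores ascending order.

```sql
SELECT page_rows.ID, page_rows.first_name, page_rows.age
FROM (
    SELECT TOP 3 first_rows.ID, first_rows.first_name, first_rows.age
    FROM (
        SELECT TOP 4 ID, first_name, age
        FROM MyTable
        ORDER BY ID ASC
    ) AS first_rows
    ORDER BY first_rows.ID DESC
) AS page_rows
ORDER BY page_rows.ID ASC;
```

For page p with page size s, the inner TOP would be p * s and the middle TOP would be s. The last page needs care: if fewer than p * s rows exist, the middle TOP must be reduced to the number of rows actually remaining.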
