Compare rows without nested for-loops - sql-server

I have a table in MSSQL, and since I am kinda new to SQL I have run into a problem. The table consists of the following data.
ID | Long | Lat | TimeStamp
-----------+--------------+--------------+------------------
123 | 54 | 18 | 2012-12-02...
143 | 31 | 35 | 2011-09-14...
322 | 53 | 19 | 2012-11-29...
And so on...
I have written a boolean function which checks a condition for a pair of long/lat values. I have also written a function which gives the distance between a pair of long/lat values. What I want to do is add a column with the distance to the row which is closest to the current row, also passes the boolean function, and is sufficiently close in time. The table consists of several million rows, so I want to refrain from using nested for-loops. How would you guys tackle this large dataset? Does MSSQL have some smart way of doing this?
All help is welcome <3

You basically need to compare 1 million records to 1 million other records in your function, giving you 1 trillion comparisons to work with. In my opinion, SQL is not really going to help you manage that many records, so you'll have to do some work yourself and only do bits of work at a time. There may be a better way, but I would do the following:
Add the following columns to the table: calculated (flag), destination, distance.
Run an algorithm periodically or whenever a new record is added. This algorithm only needs to run for records where calculated is false.
The algorithm can work something like below:
With the current record, loop through each record that has calculated set to true and calculate the distance to that location. If the distance is the shortest so far, set a variable to temporarily store the destination and distance.
After the loop has finished, you have the shortest destination and distance for the current record. Update that in the database and also set the calculated flag to true.
Now compare this distance to the distance in the record for the destination. If it is shorter, also update that record.
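A rough Python sketch of that per-record loop, assuming the rows are loaded into application code; the condition, distance, and time-window helpers below are placeholders for the functions the asker already has, not real implementations:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import hypot
from typing import Optional

@dataclass
class Row:
    id: int
    long: float
    lat: float
    timestamp: datetime
    calculated: bool = False
    destination: Optional[int] = None
    distance: Optional[float] = None

# Placeholders for the asker's existing boolean check and distance function.
def passes_condition(a: Row, b: Row) -> bool:
    return True

def pair_distance(a: Row, b: Row) -> float:
    return hypot(a.long - b.long, a.lat - b.lat)

def close_in_time(a: Row, b: Row, window: timedelta = timedelta(hours=1)) -> bool:
    return abs(a.timestamp - b.timestamp) <= window

def process(rows: list[Row]) -> None:
    done = [r for r in rows if r.calculated]
    for cur in (r for r in rows if not r.calculated):
        best, best_d = None, float("inf")
        for other in done:  # only compare against already-calculated rows
            if close_in_time(cur, other) and passes_condition(cur, other):
                d = pair_distance(cur, other)
                if d < best_d:
                    best, best_d = other, d
        # Persist the nearest match (if any) and mark the row as done.
        cur.destination = best.id if best else None
        cur.distance = best_d if best else None
        cur.calculated = True
        # If the new pair is also shorter for the destination row, update it too.
        if best and (best.distance is None or best_d < best.distance):
            best.destination, best.distance = cur.id, best_d
        done.append(cur)
```

This mirrors the answer's idea of doing bits of work at a time: on each run, only the rows with calculated still set to false are processed.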

Related

Automatically shift down or insert cells so I don't get the "Array not expanded, would overwrite data" error

I'm making a Google Sheet for the roleplaying game Pathfinder in which I want to be able to copy and paste different monster stat blocks from another site to column A and have each section put into individual cells in column B and other columns using SPLIT, etc. Then eventually I want to be able to add different templates onto that, such as the skeleton template, which automatically alter some of the stats.
But I'm having trouble because the stat blocks can vary a lot between monsters: Spell-Like Abilities (SLAs) can be blank for some monsters and very numerous for others. I have an array formula to get all the data between "Spell-Like Abilities" and the next section down, "Statistics," and SPLIT each SLA out in column B. But if there are too many SLAs it gets the overwrite error, because the result would run over the sections/formulas below it in column B.
I could just move those lower column B sections down to leave enough room, but because I don't know how many SLAs any monster might have, I don't know how many cells of space I would need to leave open. I could leave as many as, say, 25, but what if a monster I don't know has 26? Plus I don't want to leave a lot of blank space in column B like that, as it would make the sheet harder to read or deal with, and it would be blank and useless most of the time.
So I have been trying to find a way to have the right number of blank cells added automatically for the SLA part in column B for every monster I paste in column A, shifting the cells below it down accordingly. However, I have not been able to find a formula or script that can do this. I could add a part to the formula to count how many cells are between "Spell-Like Abilities" and "Statistics" and use that to know how much space I need, but I can't find a formula that adds cells or shifts cells down.
Edit: The cell is B21. Right now I have a lot of blank cells below that, with STATISTICS in cell B41 and other cells with formulas below that. If I move "STATISTICS" up too high I get the error, depending on how many SLAs or spells the monster has. Ideally I want B21 to have no blank cells below it by default, so I can have STATISTICS in cell B22 or B23. Then, whatever stat block I paste in column A, I want STATISTICS and all cells below it pushed down the necessary number of cells so that the overwrite error doesn't happen. If the monster has no SLAs it won't move. If it has 15 rows of SLAs, then STATISTICS and all cells below it will move down 14, so there are 15 cells (counting B21 itself) to display those SLAs in column B.
Here is my formula in that cell. Edit: updated:
=ARRAYFORMULA(SPLIT(TRANSPOSE(SPLIT(JOIN("#", (REGEXREPLACE(REGEXREPLACE((INDIRECT("A"&(ARRAYFORMULA(MIN(IF(REGEXMATCH(A1:A40, "Spell-Like|Spells Known"), ROW(A1:A40), )))+2))):(INDIRECT("A"&(ARRAYFORMULA(MIN(IF(REGEXMATCH(A:A, "TACTICS|STATISTICS"), ROW(A:A), )))-2))),"Known","Known:;"),"Prepared","Prepared:;"))), "#")), ",""—"";"))
The +2 and -2 are there because the actual SLAs have a blank row above and below them in the section.
Here is an example of what it comes out to with Statistics below.
| Constant | protection from good |
| 3/day | detect thoughts (DC 13) | dream (DC 16) | nightmare (DC 16) | suggestion (DC 14) |
| 1/day | shadow walk |
STATISTICS
Here is the sheet:
https://docs.google.com/spreadsheets/d/1R8LYW2HGhpM3TS9m4ZBaNstwzNtsyll9fZnQ8WaA3Nc/edit#gid=0
The original formulas are complicated, so I will answer with a simple example. Please modify it yourself.
=ArrayFormula(SPLIT({ INDIRECT("A"&MATCH("Section A",A:A,0)&":A"&MATCH("Section B",A:A,0)-2); "(blank)"; INDIRECT("A"&MATCH("Section B",A:A,0)&":A"&MATCH("Section C",A:A,0)-2); "(blank)"; INDIRECT("A"&MATCH("Section C",A:A,0)&":A"&MATCH("Section C",A:A,0)+2) },","))
Perform a different split on each array:
=ARRAYFORMULA(SPLIT({A76;A77},IF(REGEXMATCH({A76;A77},";"),";",",")))

Optimizing ID generation in a particular format

I am looking to generate IDs in a particular format. The format is this:
X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}
The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is an alphabet from the set {A-Z} and it is not related to the input data in anyway (can be randomly assigned), YY are the last 2 digits of the current year and the last part is a 5 digit number from 00000 to 99999.
I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string. The steps for generating each part:
This is fixed as "X"
This part will be one of "N", "S", "E", "W" based on the input data
Generate a random alphabet from {A-Z}
Last 2 digits of current year
Generate 5 random digits
This format gives me 26 × 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
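A minimal Python sketch of that construction (the function and variable names are illustrative, not an existing API):

```python
import random
import string
from datetime import date

def generate_id(zone: str) -> str:
    """Build an ID of the form X | zone | letter | YY | 5 digits."""
    assert zone in ("N", "S", "E", "W")                # part 2: from the signup form
    letter = random.choice(string.ascii_uppercase)     # part 3: random A-Z
    year = f"{date.today().year % 100:02d}"            # part 4: last two digits of the year
    serial = f"{random.randint(0, 99999):05d}"         # part 5: 00000-99999
    return f"X{zone}{letter}{year}{serial}"

# e.g. generate_id("N") might return "XNQ2504712"
```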
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
Is this strategy good or should I use something else? When the DB has a lot of entries of a particular zone in a particular year, what would be the approximate probability of collision or expected number of DB calls?
Should I instead use sequential IDs like this:
Start from "A" in part 3 and "00000" in part 5
Increment part 3 to "B", when "99999" has been used in part 5
If I do use this strategy, is there a way I can implement this without looking into the DB to first find the last inserted ID?
Or is there some other way to generate the IDs in this format? My main concern is that the process should be fast (not too many DB calls).
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
What if you end up making 10 such DB calls because of this? The problem with randomness is that collisions will occur even though the probability is low. In a production system with high load, doing an existence check with random data is dangerous.
This format gives me 26 × 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
Your range is small, no doubt. But you need to see that with only 26 × 10^5 possible IDs per zone and year, the probability of a random ID colliding grows as the table fills up, which is not that great.
So, if the hash size is not a concern, read about UUID, Twitter snowflake etc.
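As a rough back-of-the-envelope check (assuming IDs are drawn uniformly and independently): if n of the N = 2,600,000 combinations for a zone and year are already taken, each random attempt collides with probability n/N, so you need about N/(N − n) attempts (i.e. DB round trips) on average before finding a free one:

```python
N = 26 * 10**5   # possible IDs per zone per year

def expected_attempts(n_taken: int) -> float:
    """Expected number of random draws until an unused ID turns up."""
    return N / (N - n_taken)

print(expected_attempts(1_300_000))   # 50% full -> 2.0 attempts on average
print(expected_attempts(2_340_000))   # 90% full -> 10.0 attempts on average
```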
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
Using a cache is a good idea. Again, the problem here is persistence: if you are looking for consistency, keep in mind that Redis may evict keys (e.g. under an LRU policy), so keys can get lost over time.
Here's how I would solve this issue:
So, I would first write a mapper range for characters.
Ex: N goes from A to F, S from G to M, etc.
This ensures that there is some consistency among the zones.
After this, we can do the randomized approach itself but with indexing.
So, suppose there is still a chance of collision; we can significantly reduce it.
Make the unique hash column in your table indexed.
This means that your search is much faster.
When you want to insert, generate 2 random hashes and do a single IN query, something like "select hash from table where hash in (hash1, hash2)". If both already exist, next time generate 4 random hashes and do the same query. If one is free, use that hash. Keep doubling the number of candidates like this to avoid collisions.
Again, this is speculative; better approaches may exist.
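A hedged sketch of that doubling strategy: `generate` is any function that produces a random candidate ID (for example the sketch shown earlier in the question), and `fetch_existing` stands in for the single IN query ("select hash from table where hash in (...)"); neither is a real API.

```python
def allocate_id(generate, fetch_existing, max_rounds: int = 5) -> str:
    """Find a free ID using one IN query per round, doubling the batch each time."""
    batch = 2                                    # start with 2 candidates
    for _ in range(max_rounds):
        candidates = {generate() for _ in range(batch)}
        taken = set(fetch_existing(candidates))  # one round trip to the DB
        free = candidates - taken
        if free:
            return free.pop()                    # any unused candidate will do
        batch *= 2                               # all taken: double and retry
    raise RuntimeError("no free ID found after several rounds")
```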

How to get the next available number in sequence? [duplicate]

This question already has answers here:
How do I find a "gap" in running counter with SQL?
(22 answers)
Closed 3 years ago.
Table has a field that is a counter.
When the record is deleted, the counter number becomes available again.
New records must use the lowest available "slot" for counter.
Example:
Material | Counter
00AF10 | 02
00AF11 | 03
00AF12 | 04
In this case, a new inserted record will take the counter number "01" and a new record after that, the counter number "05".
I've tried doing a SELECT MAX(counter) + 1 for the new record, but that of course fails the requirement of reusing available "slots" in the counter sequence. I'm pulling an all-nighter and my brain is fried. Can you help?
You shouldn't do this; it's a bad idea, not just technically but also business-wise it makes little sense and offers no real benefit.
However, if you really have to do it (and getting a different job is not an option):
I'd make a new table with two columns: one column for the contiguous counter values, and another with a foreign key to your other data, which may get deleted.
Then you can handle deletions and inserts with hideous transactional triggers, but at least you'll be able to index your new table appropriately.
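Purely as an illustration of the gap-finding logic itself (the linked duplicate covers doing it in SQL), here is how the lowest free counter could be computed in application code; it assumes inserts are serialized, e.g. by a transaction or lock on the mapping table suggested above:

```python
def next_counter(used: set[int], start: int = 1) -> int:
    """Return the lowest counter value not already in use."""
    n = start
    while n in used:
        n += 1
    return n

print(next_counter({2, 3, 4}))      # -> 1  (fills the gap from the example)
print(next_counter({1, 2, 3, 4}))   # -> 5  (no gap left, so append)
```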

Allocate places according to distance

As an over-simplified example I have a list of events that have a maximum attendance:
event | places
===================
event_A | 1
event_B | 2
event_C | 1
And a list of attendees with the distance to the events:
attendee | event_A dist | event_B dist | event_C dist
==========================================================
attendee_1 | 12 | 15 | 12
attendee_2 | 11 | 15 | 11
attendee_3 | 10 | 11 | 12
Can anyone suggest a simple method to produce a set of options providing the best case allocations based on shortest total distance and on shortest mean distance?
I currently have the data held in Oracle Spatial database, but I'm open to suggestions.
I currently understand your problem as follows:
Each attendee should be assigned to exactly one event
Each event has a limit as to how many attendees are assigned to it
Underfull or even empty events are no problem
Each assignment between an event and an attendee corresponds to a given distance
You want to minimize the overall distance for all assignments
You might want to print the result not using sums but using means
Based on this interpretation, I suggest the following algorithm:
Create a complete bipartite graph, with nodes in partition A for attendees and nodes in partition B for places in events. So every attendee corresponds to one node, and every event corresponds to as many nodes as it has places. All attendees are connected to all event nodes, with the distance as the edge cost.
At this point your problem corresponds to a general assignment problem, with “agents” corresponding to your event places and “tasks” corresponding to your attendees. Every attendee must be covered, but not every event place must be used.
Add dummy attendees to allow a perfect matching. Simply sum up all the places and subtract the number of actual attendees from that. For the difference, create that many dummy attendee nodes, with distance zero to all the event nodes.
By making both partitions equal in size, you are now in the domain of the more common linear assignment problem.
Use the Hungarian algorithm to compute a minimal-cost assignment. Perhaps you can think of some simplifications which make use of the fact that you have many equivalent nodes, i.e. places for the same event and all those dummy attendees.
All of this should probably be done in application code, not in the database. So I'd rather tag this algorithm. You'll need to pull the full cost matrix from the database to provide costs for your edges.
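A small sketch of those steps in Python, using SciPy's Hungarian-algorithm implementation (scipy.optimize.linear_sum_assignment) on the question's example data; the variable names are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Events with capacities and the distance matrix from the question
# (rows = attendees, columns = events).
capacities = {"event_A": 1, "event_B": 2, "event_C": 1}
distances = np.array([
    [12, 15, 12],   # attendee_1
    [11, 15, 11],   # attendee_2
    [10, 11, 12],   # attendee_3
])

# Step 1: expand each event into one column per place.
place_event = []    # which event each place column belongs to
place_cols = []     # which distance column to copy for that place
for col, (event, cap) in enumerate(capacities.items()):
    for _ in range(cap):
        place_event.append(event)
        place_cols.append(col)
cost = distances[:, place_cols]

# Step 2: pad with dummy attendees at zero cost so the matrix is square.
n_attendees, n_places = cost.shape
if n_places > n_attendees:
    cost = np.vstack([cost, np.zeros((n_places - n_attendees, n_places))])

# Step 3: minimum-cost perfect matching (Hungarian algorithm).
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    if r < n_attendees:   # ignore the dummy attendees
        print(f"attendee_{r + 1} -> {place_event[c]} (distance {distances[r, place_cols[c]]})")
```

Recent SciPy versions also accept rectangular cost matrices directly, so the dummy-attendee padding is mainly there to mirror the construction described above.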

Model for a Scheduling System

We have built a scheduling system to control our client's appointments. This system is similar to the one used by Prometric to schedule an exam. The main concerns are: guarantee that there is no overscheduling, support at least one hundred thousand appointments per month, and allow the testing center capacity to be increased or decreased easily.
We devised the following design based on Capacity Vectors. We assumed that each appointment requires at least five minutes. A vector is composed of 288 slots (24 hours * 12 slots per hour), each one represented by one byte. The vector therefore allows the system to represent up to 255 appointments per five-minute slot. The information used is composed of two vectors: one to store the testing center capacity (NOMINAL CAPACITY) and another to store the used capacity (USED CAPACITY). To recover the current capacity (CURRENT CAPACITY), the system takes the testing center's NOMINAL CAPACITY and subtracts the USED CAPACITY.
The database has the following structure:
NOMINAL CAPACITY
The nominal capacity represents the capacity for work days (Mon-Fri).
| TEST_CENTER_ID | NOMINAL_CAPACITY
| 1 | 0000001212121212....0000
| 2 | 0000005555555555....0000
...
| N | 0000000000010101....0000
USED CAPACITY
This table stores the used capacity for each day / testing center.
| TEST_CENTER_ID | DATE | USED_CAPACITY
| 1 | 01/01/2010 | 0000001010101010...0000
| 1 | 01/02/2010 | 0000000202020202...0000
| 1 | 01/03/2010 | 0000001010101010...0000
...
| 2 | 01/01/2010 | 0000001010101010...0000
...
| N | 01/01/2010 | 0000000000000000...0000
After the client chooses the testing center and a date, the system presents the available slots by doing the following calculation. For example:
TEST_CENTER_ID 1
DATE 01/01/2010
NOMINAL_CAPACITY 0000001212121212...0000
USED_CAPACITY 0000001010101010...0000
AVAILABLE_CAPAC 0000000202020202...0000
If the client decides to schedule an appointment, the system locks the chosen day (a row in the USED CAPACITY table) and increases the corresponding byte.
The system is working well now, but we foresee contention problems if the
number of appointments increases too much.
Does anyone have a better or alternative model for this problem, or suggestions to improve it?
We have thought about avoiding the contention by subdividing the representation of a vector by hour and changing to an optimistic locking strategy. For example:
| TEST_CENTER_ID | DATE | USED_CAPACITY_0 | USED_CAPACITY_1 | ... | USED_CAPACITY_23
| 1 | 01/01/2010 | 000000101010 | 1010... | ... | ...0000
This way we don't need to lock a whole row, and we reduce collision events.
Here's one idea:
You can use optimistic locking with your current design. Instead of using a separate version number or timestamp to check whether the entire row has been modified, save a memento of the used_capacity array when the row is read. You only lock on update, at which time you compare just the modified slot byte to see if it's been updated. If not, you can embed the new value into that one element without modifying the others, thus retaining modifications to other slots performed by other processes since your initial read.
This should work just as well on a set of adjacent bytes for appointments that are longer than 5 minutes.
If you know the slot(s) in question when you initially read, then you can just save the beginning array index and salient values instead of the entire array.
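A rough sketch of that per-slot optimistic update. The data-access layer here is hypothetical: read_nominal_capacity, read_used_capacity, and compare_and_swap_slot are placeholders, and in the real system the compare-and-swap would be an UPDATE whose WHERE clause checks only that slot's byte.

```python
SLOTS_PER_DAY = 288   # 24 hours * 12 five-minute slots

def book_slot(db, center_id, date, slot, retries=3):
    """Try to increment USED_CAPACITY for one 5-minute slot, optimistically."""
    nominal = db.read_nominal_capacity(center_id)        # 288 slot values
    for _ in range(retries):
        used = db.read_used_capacity(center_id, date)    # memento of the row
        if used[slot] >= nominal[slot]:
            return False                                  # no capacity left in this slot
        # Optimistic write: succeeds only if this slot's byte is still unchanged,
        # leaving concurrent updates to other slots untouched.
        if db.compare_and_swap_slot(center_id, date, slot,
                                    expected=used[slot], new=used[slot] + 1):
            return True
    return False                                          # too much contention; retry later
```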
