Table has a field that is a counter.
When the record is deleted, the counter number becomes available again.
New records must use the lowest available "slot" for counter.
Example:
Material | Counter
00AF10 | 02
00AF11 | 03
00AF12 | 04
In this case, a newly inserted record will take counter number "01", and the next record after that, counter number "05".
I've tried doing a SELECT MAX(counter) + 1 for the new record, but that of course fails the requirement of reusing available "slots" in the counter sequence. I'm pulling an all-nighter and my brain is fried. Can you help?
You shouldn't do this. It's a bad idea, not just technically; business-wise it also makes little sense and offers no real benefit.
However, if you really have to do it, and getting a different job is not an option:
I'd make a new table with two columns: one column for the contiguous counter values, and another holding a foreign key to your other data, which may get deleted.
Then you can handle deletions and inserts with hideous transactional triggers, but at least you'll be able to index your new table appropriately.
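A minimal sketch of that auxiliary table, with names and types that are only assumptions and the trigger logic left out:

-- counter_slot holds one row per counter value ever allocated;
-- material_id is NULL while the slot is free.
CREATE TABLE counter_slot (
    counter     INT PRIMARY KEY,
    material_id INT NULL   -- foreign key to the row that may get deleted
);

-- On insert: claim the lowest free slot inside the same transaction.
-- If this returns NULL, append a new slot at MAX(counter) + 1.
SELECT MIN(counter) AS next_counter
FROM counter_slot
WHERE material_id IS NULL;

-- On delete: free the slot again (42 is just an example row id).
UPDATE counter_slot SET material_id = NULL WHERE material_id = 42;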
I've finished my first semester in a college-level SQL course where we used "SQL queries for Mere Mortals" 3rd edition.
Long term I want to work in data governance or as a data scientist, so digging deeper is needed, and I found the Stanford SQL course. Taking the first mini quiz today, I got the answers right, but on these two I'm not understanding WHY I got them right.
My 'SQL for Mere Mortals' book doesn't even cover hash or tree-based indexes so I've been searching online for them.
I mostly guessed based on what she said but it feels more like luck than "I solidly understand why". So I've ordered "Introduction to Algorithms" 3rd edition by Thomas Cormen and it arrived last week but it will take me a while to read through all 1,229 pages.
Found that book via this other Stack Overflow link: https://stackoverflow.com/questions/66515417/why-is-hash-function-fast
Stanford Course => https://www.edx.org/course/databases-5-sql
I thought a hash index on College.enrollment would not speed things up because the condition limits it to "less than a number" rather than an exact number? I'm also guessing, per this link, Better to use "less than equal" or "in" in sql query, that the query would be faster if we used "<=" rather than "<"?
This one was just a process of elimination, as the answer references the first item after the WHERE clause, but it then got confusing because it also references the last part, Apply.cName = College.cName.
My questions:
I'm guessing that, similar to how algebra has numerators, denominators, quotients, and many other terms that specifically describe parts of an equation, there are technical terms for these query parts too. How would you use technical terms to describe why these answers are correct?
On the second question, why are the first part of the second line and the last part of the same line referenced as the answers? Why didn't they pick the first part of each, or the last part of each?
For context, most of my SQL queries are now written for PostgreSQL within PyCharm in Python, but I do a lot of practice using the pgAdmin 4 and MySQL Workbench desktop tools.
I welcome any recommendations you have for paper books or PDFs with step-by-step tutorials, as many, many websites have holes or reference technical details in confusing ways.
Thanks
1. A hash index is only useful for equality matches, whereas a tree index can be used for inequality (< or >= etc).
With this in mind, College.enrollment < 5000 cannot use a hash index, as it is an inequality. All other options are exact equality matches.
This is why most RDBMSs only let you create tree-based indexes.
2. This one is pretty much up in the air.
"the first item after the WHERE clause" is not relevant. Most RDBMSs will reorder the joins and filters as they see fit in order to match indexes and table statistics.
I note that the query as given is poorly written. It should use proper JOIN syntax, which is much clearer, and has been in use for 30 years already.
SELECT * -- you should really specify exact columns
FROM Student AS s -- use aliases
JOIN [Apply] AS a ON a.sID = s.sID -- Apply is a reserved keyword in many RDBMS
JOIN College AS c ON c.cName = a.cName
WHERE s.GPA > 1.5 AND c.cName < 'Cornell';
Now it's hard to say what a compiler would do here. A lot depends on the cardinalities (size of tables) in absolute terms and relative to each other, as well as the data skew in s.GPA and c.cName.
It also depends on whether secondary key (or indeed INCLUDE) columns are added, which is clearly not being considered here.
Given the options for indexes you have above, and no other indexes (not realistic obviously), we could guesstimate:
Student.sID, College.cName
This may result in an efficient backwards scan on College starting from 'Cornell', but Apply would need to be joined with a hash or a naive nested loop (scanning the index each time).
The index on Student would mean an efficient nested loop with an index seek.
Student.sID, Student.GPA
Is this one index or two? If it's two separate indexes, the second will be used, and the first is obviously going to be useless. Apply and College will still need heavy joins.
Apply.cName, College.cName
This would probably get you a merge-join on those two columns, but Student would need a big join.
Apply.sID, Student.GPA
Student could be efficiently scanned from 1.5, and Apply could be seeked, but College requires a big join.
Of these options, the first or the last is probably better, but it's very hard to say without further info.
In a real system, I would have indexes on all tables, and use INCLUDE columns wisely in order to avoid key-lookups. You would want to try to get a better feel for which tables are the ones that need to be filtered early etc.
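For illustration only, covering indexes along those lines might look like this (INCLUDE syntax as in SQL Server or PostgreSQL 11+; the column choices are assumptions, not a recommendation for any particular schema):

-- Serve the GPA range filter and the join column from the index alone,
-- avoiding key lookups back into Student.
CREATE INDEX IX_Student_GPA ON Student (GPA) INCLUDE (sID);

-- Tree index on College.cName for the < 'Cornell' range scan.
CREATE INDEX IX_College_cName ON College (cName);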
First question
A hash index is not linearly searchable (see Slide 7); that is, you cannot perform range comparisons with a hash index. This is because (in general terms) hash functions are one-way: given the output of a hash function you cannot determine the input, and the output will be in apparently random order (having a random order is good for ensuring an even load over the set of hashtable bins).
Now, for a contrived and oversimplified example:
Supposing you have these rows:
PK | Enrollment
----------------
1 | 1
2 | 10
3 | 100
4 | 1000
5 | 10000
A perfect hash index of this table would look something like this:
Assuming that the hash of 1 is 0xF822AA896F34253E and the hash of 10 is 0xB383A8BBDAA41F98, and so on...
EnrollmentHash | PhysicalRowPointer
---------------------------------------
0xF822AA896F34253E | 1
0xB383A8BBDAA41F98 | 2
0xA60DCD4E78869C9C | 3
0x49B0AF769E6B1EB3 | 4
0x724FD1728666B90B | 5
So given this hashtable index, looking at the hashes you cannot determine which hash represents larger enrollment values vs. smaller values. But a hashtable index does give you O(1) lookup for single specific values, which is why it works best for discrete, non-continuous, data values, especially columns used in JOIN criteria.
A tree index, on the other hand, does preserve relative ordering information about values, but with O(log n) lookup time.
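To make that concrete (PostgreSQL syntax, since the course doesn't prescribe an engine), the index type determines which predicates it can serve:

-- A hash index can only answer equality predicates such as enrollment = 5000.
CREATE INDEX idx_college_enrollment_hash ON College USING hash (enrollment);

-- A B-tree index keeps values in order, so it can also answer
-- range predicates such as enrollment < 5000.
CREATE INDEX idx_college_enrollment_btree ON College USING btree (enrollment);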
Second question
First, I need to rewrite the query to use modern JOIN syntax. The old style (using commas) has been obsolete since SQL-92 in 1992; that's almost 30 years ago.
SELECT
*
FROM
Apply
INNER JOIN Student ON Student.sID = Apply.sID
INNER JOIN College ON Apply.cName = College.cName
WHERE
Student.GPA > 1.5
AND
College.cName < 'Cornell'
Now, generally speaking the best way to answer this kind of question would be to know what the STATISTICS (cardinality, value distribution, etc) of the tables are. But without that I can still make some guesses.
I assume that College is the smallest table (~500 rows?), Student will have maybe 1-2m rows, and assuming every Student makes 4-5 applications then the Apply table will have ~5m rows.
...armed with that inference, we can deduce:
Student.sID = Apply.sID is an ID match - so a hash-index would be better in most cases (excepting if the PK clustering matters, but I won't digress).
Student.GPA > 1.5 - this is a range search so having a tree-based index here helps.
College.cName < 'Cornell' - again, this is a range comparison so a tree-based index here helps too.
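If the index choice were unconstrained, tree indexes on those two range predicates would be the natural picks; a sketch only, not one of the quiz options:

CREATE INDEX idx_student_gpa ON Student (GPA);     -- serves Student.GPA > 1.5
CREATE INDEX idx_college_cname ON College (cName); -- serves College.cName < 'Cornell'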
So the best indexes would be Student.GPA and College.cName, but that isn't an option - so let's see what the benefits of each option are...
(As I was writing this, I saw that #charlieface posted their answer which already covers this, so I'll just link to theirs to save my time: https://stackoverflow.com/a/67829326/159145 )
I am looking to generate IDs in a particular format. The format is this:
X | {N, S, E, W} | {A-Z} | {YY} | {0-9} {0-9} {0-9} {0-9} {0-9}
The part with "X" is a fixed character, the second part can be any of the 4 values N, S, E, W (North, South, East, West zones) based on the signup form data, the third part is an alphabet from the set {A-Z} and it is not related to the input data in anyway (can be randomly assigned), YY are the last 2 digits of the current year and the last part is a 5 digit number from 00000 to 99999.
I am planning to construct this ID by generating all 5 parts and concatenating the results into a final string. The steps for generating each part:
This is fixed as "X"
This part will be one of "N", "S", "E", "W" based on the input data
Generate a random letter from {A-Z}
Last 2 digits of current year
Generate 5 random digits
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
Is this strategy good or should I use something else? When the DB has a lot of entries of a particular zone in a particular year, what would be the approximate probability of collision or expected number of DB calls?
Should I instead use sequential IDs like this:
Start from "A" in part 3 and "00000" in part 5
Increment part 3 to "B" when "99999" has been used in part 5
If I use this strategy, is there a way I can implement it without first looking into the DB to find the last inserted ID?
Or is there some other way to generate IDs in this format? My main concern is that the process should be fast (not too many DB calls).
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
For handling collisions, I plan to query the database and generate a new ID if the ID already exists in the DB. This will continue until I generate an ID which doesn't exist in the DB.
What if you end up making 10 such DB calls because of this? The problem with randomness is that collisions will occur even though the probability is low. In a production system under high load, doing an existence check with random data is dangerous.
This format gives me 26 x 10^5 = 2,600,000 unique IDs each year for a particular zone, which is enough for my use case.
Your range is small, no doubt. But you need to see that once n IDs are already taken, the probability of a collision on each attempt is n / (26 x 10^5), which is not that great once the DB starts filling up!
So, if the ID size is not a concern, read about UUIDs, Twitter Snowflake, etc.
If there's no way around DB calls, should I use a cache like Redis for making this a little faster? How exactly will this work?
Using a cache is a good idea. Again, the problem here is persistence. If you are looking for consistency, note that Redis can evict keys (for example under an LRU policy), so keys could get lost over time.
Here's how I would solve this issue:
So, I would first write a mapping of character ranges.
Ex: N goes from A to F, S goes from G to M, etc.
This ensures that there is some consistency among the zones.
After this, we can use the randomized approach itself, but with indexing.
So, suppose there is still a chance of collision. We can significantly reduce it.
Make the unique hash column in your table indexed.
This means that your existence check is much faster.
When you want to insert, generate 2 random hashes and do a single IN query, something like "select hash from table where hash in (hash1, hash2)". If both are already taken, next time generate 4 random hashes and do the same query. If one is free, use that hash. Keep increasing the count exponentially to avoid collisions.
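A sketch of that lookup, with made-up table and column names; the random candidate values come from application code and are bound into the query:

-- The primary key both enforces uniqueness and backs the IN lookup with an index.
CREATE TABLE item_ids (
    hash CHAR(10) PRIMARY KEY   -- 10 characters per the ID format in the question
);

-- Round 1: check two random candidates in a single query;
-- any candidate NOT returned here is free to use.
SELECT hash FROM item_ids WHERE hash IN ('XNA2112345', 'XSB2167890');

-- If both were taken, retry with 4 candidates, then 8, and so on.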
Again, this is speculative; better approaches may exist.
So, I am working on a feature in a web application. The problem is like this:
I have four different entities, say Item1, Item2, Item3 and Item4. The feature has two phases. The first phase is "Choose entities": the user can choose multiple items for each entity, and for every combination resulting from those choices I need to do some calculation. Then in the second phase (say the "Relocate" phase), based on the calculation done in the first phase, for each combination the user has to choose another combination, and the value of the first combination gets moved to the row of the second combination.
Here's the data model for further clarification -
EntityCombinationTable
(
Id
Item1_Id,
Item2_Id,
Item3_Id,
Item4_Id
)
ValuesTable
(
Combination_Id,
Value
)
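Written out as concrete DDL, the model above would be something like this (types and constraints are assumptions):

CREATE TABLE EntityCombinationTable (
    Id       INT PRIMARY KEY,
    Item1_Id INT NOT NULL,
    Item2_Id INT NOT NULL,
    Item3_Id INT NOT NULL,
    Item4_Id INT NOT NULL
);

CREATE TABLE ValuesTable (
    Combination_Id INT NOT NULL REFERENCES EntityCombinationTable (Id),
    Value          INT NOT NULL
);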
So suppose I have the following values in both tables:
EntityCombinationTable
Id -- Item1_Id -- Item2_Id -- Item3_Id -- Item4_Id
1 1 1 1 1
2 1 2 1 1
3 2 1 1 1
4 2 2 1 1
ValuesTable
Combination_Id -- Value
1 10
2 0
3 0
4 20
So if in the first phase I choose (1,2) for Item1, (1,2) for Item2, and 1 for both Item3 and Item4, then the total number of combinations would be 2*2*1*1 = 4.
Then in the second phase, for each combination that has a value greater than zero, I would have to let the user choose a different combination to which the value gets relocated.
For example, as only the combinations with Id 1 and 4 have values greater than zero, only two relocation combinations would need to be shown in the second dialog. So if the user chooses (3,3,3,3) and (4,4,4,4) as the relocation combinations in the second phase, then new rows will need to be inserted into EntityCombinationTable for (3,3,3,3) and (4,4,4,4), and the values of (1,1,1,1) and (2,2,1,1) will be relocated to the rows corresponding to (3,3,3,3) and (4,4,4,4) in the ValuesTable, respectively.
So the problem is: each entity can have up to 100 items or even more, so in the worst case the total number of combinations can be 10^8, which would put a very heavy load on the database (inserting and updating a huge number of rows), and generating all the combinations at the code level would also take substantial time.
I have thought about an alternative approach: not keeping the items as combinations, but rather keeping a separate table for each entity and building the combinations at runtime. That would also cause performance issues, since there are quite a few other stages where I might need the combinations, so I would have to regenerate them every time.
I have also thought about creating a key-value style table, where I would keep the combination as a string. But with this approach I am not actually reducing the number of rows to be inserted; only the number of columns gets reduced.
So my question is: is there any better approach for this kind of situation, where I can keep track of the combinations and manipulate them in an optimized way?
Note: I am not sure if this helps, but a lot of the rows in the values table will probably have zero as their value, so in the second phase we would need to show far fewer rows than the actual number of possible combinations.
In my program you can book an item. This item has an id with 6 characters from 32 possible characters.
So my possibilities are 32^6. Every id must be unique.
func tryToAddItem() {
    let id = generateId()
    if !db.contains(id) {
        addItem(id)
    } else {
        tryToAddItem()
    }
}
For example, 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
That is quite high. That means 5 database queries over a lot of data.
When the probability is so high, I want to implement a prefix "A-xxxxxx".
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine that this is not best practice.
For example, 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
= (1-p) + (1-p)*p + (1-p)*p^2 +... + (1-p)*p^(k-1)
= (1-p)*(1 + p + p^2 + .. + p^(k-1))
If we expand this out all but two terms cancel and we get:
= 1- p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of 5 or more calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all make sure there is an index on the id columns you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than 70% of ids should be taken.
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I searched for alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can refill it every minute (or whatever interval you need). That way the tryToAddItem action will always get a valid itemId (see the SQL sketch after this list of possibilities).
Second possibility
Is remapping all ids to a larger space one time only a possibility?
In my case yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId, and when there is a collision, try again.
Possible collision handling: do some tests beforehand. Measure the time to generate itemIds when there are already 1,000, 10,000, 100,000, 1,000,000, etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever threshold you prefer), increase the length from 6 to 7, 8 or 9 characters.
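A SQL sketch of the first possibility (PostgreSQL flavour; the table and column names are made up):

-- Pool of pre-validated, unused ids, refilled by the background task.
CREATE TABLE upcoming_item_id (
    item_id CHAR(6) PRIMARY KEY
);

-- tryToAddItem claims one id atomically; SKIP LOCKED keeps
-- concurrent requests from grabbing the same row.
DELETE FROM upcoming_item_id
WHERE item_id IN (
    SELECT item_id
    FROM upcoming_item_id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING item_id;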
Some thoughts
every request must be atomic
create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
- cannot be memorized and easily communicated by humans
- harder to use in debugging and logging analysis
- less convenient for consumer facing usage
- quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
- not ordered along their creation history and no indication of used id volume
- may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.
I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each played second in its own row is unrealistic (it would result in 1+ billion rows) as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked to another source and I only need to update the raw graph data every 15 min or so. Hence, fast reads are the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
Problem with the blob storage is you need to update the entire blob for all of the changes. This is not necessarily a bad thing. Using your format: (100000, 90000,...), 7 * 3600 * 3 = ~75K bytes. But that means you're updating that 75K blob for every play for every second.
And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes), second offset (2 bytes unsigned allows podcasts up to 18 hrs long), play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3-hr song is 3600 * 3 * 10 = 108K bytes per song.
If you stored it as a blob, vs text (block of longs), 4 * 3600 * 3 = 43K.
So, the second/row structure is "only" twice the size (in a perfect world, consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
The only downside of the second-per-row approach is if you need to do a lot of updates (several seconds at once for one song); that's a lot of UPDATE traffic to the DB, whereas with the blob method it's likely a single update.
Your traffic patterns will influence that more than anything.
Would it be problematic to store each second, and how many plays it has, on a per-second basis?
That means ~10K rows per podcast, which isn't bad, and you just have to INSERT a row every second with the current data.
EDIT: I would say that this solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And some hack pseudo-SQL to start a second:
insert into song (name, second, plays) values ('xyz', 'abc', 0);
and another to update the second:
update song set plays = plays + 1 where name = 'xyz' and second = 'abc';
A 3-hour podcast would have 11K rows.
It really depends on what is generating the data.
As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.
What are the pieces in the event, unit of work, or transaction you are loading?
Can I assume you have a play event with the podcast name and the start and stop times,
and you want to load that into the map for analysis and presentation?
If that's the case you can have a table
podcastId
secondOffset
playCount
Each event would do an update of the rows between the start and ending positions:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
and then follow it with an insert to add those rows between the start and stop that don't exist yet, with a playCount of 1, unless you preload the table with zeros.
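If the database supports an upsert, that update-plus-insert pair can be collapsed into one statement; a PostgreSQL-flavoured sketch with a made-up table name and example values for one event:

-- The primary key on (podcastId, secondOffset) is what ON CONFLICT matches on.
CREATE TABLE podcast_second_plays (
    podcastId    INT NOT NULL,
    secondOffset INT NOT NULL,
    playCount    INT NOT NULL,
    PRIMARY KEY (podcastId, secondOffset)
);

-- One play event for podcast 42 covering seconds 120..180:
-- missing rows are created with a count of 1, existing rows are bumped by 1.
INSERT INTO podcast_second_plays (podcastId, secondOffset, playCount)
SELECT 42, s, 1
FROM generate_series(120, 180) AS s
ON CONFLICT (podcastId, secondOffset)
DO UPDATE SET playCount = podcast_second_plays.playCount + 1;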
Depending on the DB, you may be able to set up a sparse table where empty columns are not stored, making it more efficient.