When I generate sequence values from the sequence trial_seq created below, it gives 1, 2, 21, 41, 4, ...
CREATE SEQUENCE trial_seq
MINVALUE 1
MAXVALUE 999999999999999999999999999
START WITH 1
INCREMENT BY 1
CACHE 20;
I am confused about how CACHE works. Which values from the sequence are stored in the cache when NEXTVAL is called for the first time (just after the sequence is created)? Are they 1 to 20 (both inclusive), or just some random 20 numbers between MINVALUE and MAXVALUE?
Now, if the cache stores 20 random numbers within the sequence range, that would explain it; but if the cache stores 1 to 20, why is it giving 21 and subsequently 41? It should be giving values within the range 1 to 20 until all of them are exhausted. I specifically want to understand this WITHOUT using NOCACHE and/or ORDER.
Also, I am just learning; I am not using RAC.
You describe getting the sequence values in the order 1, 2, 21, 41, 4. Are you executing the NEXTVAL query from within APEX? If so, that might indeed be correct.
APEX uses connection pools, and a sequence cache cannot be shared across connections, so each connection will have its own sequence cache.
Let's imagine you have a connection pool of 5 connections. Each connection will populate its own set of cached sequence values the moment you execute sequence.NEXTVAL within that connection.
Connection 1 - Sequence cache 1 - 20
Connection 2 - Sequence cache 21 - 40
Connection 3 - Sequence cache 41 - 60
Connection 4 - Sequence cache 61 - 80
Connection 5 - Sequence cache 81 - 100
You can imagine that, since APEX controls which connection from the pool it is going to use, it is impossible to predict the order of the values beforehand.
In short:
The sequence generates numbers in ascending order, e.g. 1, 2, 3, 4, 5, etc.; not random.
There can be gaps within that sequence, e.g. 1, 2, 4, 5, 8, 9, 10. Here 3, 6, 7 are skipped and will never be used again.
If you run it through APEX, multiple connections can and will be used, and so multiple cached sequence ranges will be used. The order you see could be anything like 1, 21, 41, 22, 42, 2, 3, 43. The jumps from 1 to 21 to 41 and back to 22 are not gaps; they happen because a different connection, and thus a different cached sequence range, is being used. I used 1, 21, 41 so you can see the CACHE 20 behaviour in the example.
You could try executing this as a script, to see the correct behaviour:
DROP SEQUENCE trial_seq
/
CREATE SEQUENCE trial_seq
MINVALUE 1
MAXVALUE 999999999999999999999999999
START WITH 1
INCREMENT BY 1
CACHE 20
/
create table trial_test( id number)
/
insert into trial_test
select trial_seq.nextval
from dual
connect by rownum <= 10
/
select *
from trial_test
/
What is the outcome? It should be 1 - 10, or at least somewhere near that outcome.
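If you also want to see the cache reservation itself, here is a small sketch (run in the schema that owns trial_seq; treat it as an assumption-laden check rather than gospel): the LAST_NUMBER column of USER_SEQUENCES reflects the value that follows the currently cached range, so it typically jumps ahead by the cache size as soon as the first NEXTVAL is taken.
SELECT trial_seq.NEXTVAL FROM dual;        -- returns 1
SELECT sequence_name, last_number
FROM user_sequences
WHERE sequence_name = 'TRIAL_SEQ';         -- last_number typically shows 21: values 1-20 sit in the cache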
Let me know if this answers your question.
Related
I'm using COLUMNS_UPDATED() in a trigger to identify those columns whose values should be written to an audit table. The trigger / auditing had been working fine for multiple years. I noticed yesterday that the auditing is no longer working consistently.
I've listed the first forty columns of the table in question at the bottom for reference, along with the ORDINAL_POSITION from INFORMATION_SCHEMA.COLUMNS. The table has a total of 109 columns.
I added print COLUMNS_UPDATED() to my trigger to get some debug info.
When I update CurrentOnFleaTick, the 9th column, I see this printed:
0x0001000000000000000000000000
This is expected - the 9th column should be represented as the least significant bit of the second byte. Similarly, if I update HasAttackedAnotherAnimalExplanation I see this:
0x0000010000000000000000000000
Again, expected - the 17th column should be represented as the least significant bit of the third byte.
But... when I update HouseholdIncludesCats, I see this:
0x0000000200000000000000000000
Not expected! Where you see the 2 there should be a 1, as HouseholdIncludesCats ordinal position is 25, making it the first column represented in the fourth byte, which should be represented in the least significant bit of that byte.
I narrowed things down by updating every column between HasAttackedAnotherAnimalExplanation and HouseholdIncludesCats and found that the 'off by one' problem I'm having starts with HouseTrainedId, ordinal position 24. When updating HouseTrainedId I'm expecting
0x0000800000000000000000000000
but instead I get
0x0000000100000000000000000000
which I believe is wrong, and it is what I expect to be getting for updates to the HouseholdIncludesCats column.
I do not believe the mask should skip ahead. The mask is currently not using the most significant bit of the 3rd byte.
I did recently drop a column, but I don't have a record of its ordinal position. Based on the original code that would have created the table, I believe the ordinal position of the column that was dropped was NOT 24. (I think it was 7... It had been defined after the BreedIds.)
I'm not necessarily looking for a deep root cause determination. If there was something I could do to reset whatever internal data SQL Server uses that'd be fine. Sort of like a rebuild index idea for table metadata? Is there something like that that might fix this?
Thanks in advance for helpful answers! :)
COLUMN_NAME ORDINAL_POSITION
PetId 1
AdopterUserId 2
AdoptionDeadline 3
AgeMonths 4
AgeYears 5
BreedIds 6
Color 7
CreatedOn 8
CurrentOnFleaTick 9
CurrentOnHeartworm 10
CurrentOnVaccinations 11
FoodTypeId 12
GenderId 13
GuardianForMonths 14
GuardianForYears 15
HairCoatLength 16
HasAttackedAnotherAnimalExplanation 17
HasAttackedAnotherAnimalId 18
HasBeenReferredByShelter 19
HasHadTraining 20
HasMedicalConditions 21
HasRecentlyBittenExplanation 22
HasRecentlyBittenId 23
HouseTrainedId 24
HouseholdIncludesCats 25
HouseholdIncludesChildren5to10 26
HouseholdIncludesChildrenUnder5 27
HouseholdIncludesDogs 28
HouseholdIncludesOlderChildren 29
HouseholdIncludesOtherPets 30
HouseholdOtherPets 31
KnowsCommandDown 32
KnowsCommandPaw 33
KnowsCommandSit 34
KnowsCommandStay 35
KnowsOtherCommands 36
LastUpdatedOn 37
LastVisitedVetOn 38
ListingCodeId 39
LitterTypeClumping 40
So... I thought I had googled enough before posting this, but I guess I hadn't. I found this:
https://www.sqlservercentral.com/forums/topic/columns_updated-and-phantom-fields
Using COLUMNPROPERTY() to get the ColumnID is definitely the way to go.
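For reference, a minimal sketch of that approach (the table name dbo.Pets is made up; the column name comes from the question). COLUMNS_UPDATED() is keyed on the column id, which keeps counting past dropped columns, not on INFORMATION_SCHEMA's ORDINAL_POSITION:
DECLARE @col_id int =
    COLUMNPROPERTY(OBJECT_ID('dbo.Pets'), 'HouseholdIncludesCats', 'ColumnID');
-- Inside the trigger, test the bit for that column id:
IF SUBSTRING(COLUMNS_UPDATED(), 1 + (@col_id - 1) / 8, 1) & POWER(2, (@col_id - 1) % 8) <> 0
    PRINT 'HouseholdIncludesCats was updated';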
I'm reading this:
https://instagram-engineering.com/sharding-ids-at-instagram-1cf5a71e5a5c
In the last section, "Solution", they are generating a globally unique ID based on the DB's auto-increment feature + milliseconds since the epoch + shard ID.
Why do we need to append shard ID to it?
Specifically, it says
Next, we take the shard ID for this particular piece of data we’re
trying to insert. Let’s say we’re sharding by user ID, and there are
2000 logical shards; if our user ID is 31341, then the shard ID is
31341 % 2000 -> 1341. We fill the next 13 bits with this value
This doesn't make sense: if you are already modding the user ID by the number of shards (31341 % 2000), that means 1) you already have the user ID! 2) you already know the shard it belongs to via the mod function!
What am I misunderstanding here?
Maybe I can break it down for you a bit better; it's not just because the user ID won't fit.
They're using a Twitter Snowflake ID. This was designed to generate a unique ID across multiple servers, across multiple data centers, in parallel. For instance, two "items" in two "places" may each need a guaranteed unique ID at the same instant, less than a millisecond apart, maybe even at the same nanosecond. This unique ID needs to be extremely fast to produce, efficient, built in a logical way that can be parsed efficiently, able to fit within 64 bits, and the method of generating it needs to handle a HUGE number of IDs over many people's lifetimes. This means they cannot do DB lookups to get a unique ID that's not already taken, they can't verify after the fact that the generated ID is unique, and they couldn't use existing methods that could possibly generate duplicates, even if very rarely, like UUIDs. So they devised a way...
They set a custom common epoch, such as today, as a long-integer base point. With this they have a 42-bit long integer that starts at 0 plus the time since that epoch.
Then they also added a sequence as a 12-bit long integer, in case a single process on a single machine has to generate 2 or more IDs in the same millisecond. Now they have 42+12=54 bits in use, and when you consider that multiple processes on multiple machines (normally only one machine per data center providing IDs, but there could be more, and normally only one worker/process per machine) may be generating IDs, you realize that you need more than just 42+12...
So they also have to encode a data center ID and a "worker" (process) ID. This covers multiple data centers with multiple workers in each data center. These two IDs are both 5-bit long integers. All these integers are unsigned, so each 5-bit integer can go up to 31, which gives each of these partial IDs 32 possibilities including 0. So: 32 data centers, with up to 32 workers in each data center. Now we're at 42+12+5+5=64 bits, with up to 32x32=1024 workers producing these IDs in a distributed fashion.
So, with a lifetime of up to 139 years fitting in the 42-bit portion, 10 bits for a node ID (or data center + worker IDs), and a 12-bit sequence (4096 IDs per millisecond per worker), you end up with a 64-bit, guaranteed-unique ID system/formula that scales amazingly well over those 139 years, doesn't rely on a database in any way, and can still be efficiently produced and stored in a database.
So this ID system works out to 42+12+10, and you can divide those 10 bits up, or not, however you like, without ever going beyond a 64-bit unsigned long integer anywhere. Very flexible, and it works great.
Again, it's called a Snowflake ID and Twitter came up with it. Those 10 bits can be called a shard ID, a node ID, or a combination of data center ID and worker ID; it really depends on your needs. But by tying that shard/node ID to processes rather than to a user, and being able to use that ID across multiple "things", you won't have to worry about a lot of issues, and you can span multiple databases full of multiple things.
The one thing that does matter is that the shard/node ID can only hold 1024 different values, and no user ID or other unique ID they already have is simply going to run from 0 to 1023 unless they assign it themselves.
So you see, those 10 bits have to be something that's static, assignable, and easily parseable for them regardless.
Here's a simple Python function that will generate a Snowflake ID:
import time

def genSnowflakeId(worker_id, data_center_id, ids_generated):
    """Return a snowflake ID: a unique ID that fits in a 64-bit unsigned number and
    scales for multiple workers running in multiple data centers. You must manage the
    timestamp and sequence sanity yourself via ids_generated (i.e. increment it if IDs
    are less than 1 millisecond apart, and roll over to 0 above 4095). This lets you
    efficiently generate unique IDs across multiple locations for 139 years that fit in
    a bigint(20) database field and can be parsed back into the creation timestamp,
    worker ID and data center ID.
    See https://github.com/twitter-archive/snowflake/tree/snowflake-2010"""
    # Custom epoch: Mon Jul 8 05:07:56 EDT 2019, in milliseconds
    twepoch = 1562576876131

    # Bit layout (64 bits total): 42 timestamp | 5 data center | 5 worker | 12 sequence
    worker_id_bits = 5
    data_center_id_bits = 5
    sequence_bits = 12

    max_worker_id = -1 ^ (-1 << worker_id_bits)
    max_data_center_id = -1 ^ (-1 << data_center_id_bits)
    max_ids_generated = -1 ^ (-1 << sequence_bits)

    worker_id_shift = sequence_bits
    data_center_id_shift = sequence_bits + worker_id_bits
    timestamp_left_shift = sequence_bits + worker_id_bits + data_center_id_bits
    sequence_mask = -1 ^ (-1 << sequence_bits)

    # Sanity checks for input
    if worker_id > max_worker_id or worker_id < 0:
        raise ValueError("worker id can't be greater than %i or less than 0" % max_worker_id)
    if data_center_id > max_data_center_id or data_center_id < 0:
        raise ValueError("data center id can't be greater than %i or less than 0" % max_data_center_id)
    if ids_generated > max_ids_generated or ids_generated < 0:
        raise ValueError("ids generated can't be greater than %i or less than 0" % max_ids_generated)

    # Current time in milliseconds
    timestamp = int(time.time() * 1000)

    # Pack the four parts into a single 64-bit integer
    new_id = (
        ((timestamp - twepoch) << timestamp_left_shift)
        | (data_center_id << data_center_id_shift)
        | (worker_id << worker_id_shift)
        | (ids_generated & sequence_mask)
    )
    return new_id

# Example: genSnowflakeId(worker_id=1, data_center_id=1, ids_generated=0)
Hope this answer satisfies ya :)
They need an image ID that is 64 bits long.
41 bits for milliseconds since the epoch + 13 bits for the shard ID + 10 bits for the auto-increment value.
They took the shard ID instead of the user ID simply because only the shard ID fits in 13 bits, whereas the user ID would require more bits.
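To make the layout concrete, here is a small SQL sketch using the article's own example numbers (1387263000 milliseconds since their custom epoch, user ID 31341, table auto-increment 5001); the multiplications stand in for the left shifts:
SELECT (CAST(1387263000 AS BIGINT) * 8388608)   -- 41 bits: ms since the custom epoch, shifted left 23
     + (1341 * 1024)                            -- 13 bits: shard id = 31341 % 2000, shifted left 10
     + (5001 % 1024) AS generated_id;           -- 10 bits: the table's auto-increment value, mod 1024
The shard ID goes in so that two rows inserted on different shards in the same millisecond with the same sequence value still get different IDs, and so that the ID itself tells you which shard the row lives on.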
a. I am building a DW using SQL Server. What options do I have to create an ID from 2 int values (high and low), and how can I reference this unique value from the incoming value?
e.g. currently I have to check whether 10 is between 9 and 11 in order to look up the range name for the value 10.
b. I will update a table from three others daily. How can I check for any data loss? I am thinking about comparing row counts.
You can use either BETWEEN or the < and > operators to check if a number is between a high and low value.
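For example, a sketch of the range-name lookup for part a (Ranges and Incoming are made-up names, with columns LowValue, HighValue, RangeName and IncomingValue):
SELECT i.IncomingValue, r.RangeName
FROM Incoming AS i
JOIN Ranges AS r
  ON i.IncomingValue BETWEEN r.LowValue AND r.HighValue;   -- or: r.LowValue <= i.IncomingValue AND i.IncomingValue <= r.HighValue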
I have SQL Server 2012 and I want to know what a sequence is used for. I am looking for a sample that explains the usage of a sequence.
EDIT
I know how to create and use a sequence in the database. I want to know what a practical scenario for using a sequence is.
CREATE SEQUENCE dbo.OrderIDs
AS INT
MINVALUE 1
NO MAXVALUE
START WITH 1;
SELECT NextOrderID = NEXT VALUE FOR dbo.OrderIDs
UNION ALL SELECT NEXT VALUE FOR dbo.OrderIDs
UNION ALL SELECT NEXT VALUE FOR dbo.OrderIDs;
Results:
NextOrderID
-----------
1
2
3
See here for the original source and more examples. The page refers to SQL Server Denali, which was the beta of SQL Server 2012, but the syntax is still the same.
One of the ways I leverage the SEQUENCE command is for reference numbers in an ASP/C# DetailsView page (as an example). I use the DetailsView to enter requests into a database, and the SEQUENCE command serves as the request/ticket number for each request. I set the initial sequence to start with a specific number and increment by 1 for each request.
If I present these requests in a GridView, I make the SEQUENCE reference numbers visible but not editable. It's great as a reference number when records are similar in their other fields in the database. It's also perfect for customers when they have questions about a specific entry in a given database. This way I have a unique number per entry, regardless of whether the rest of the information is identical or not.
Here's how I generally leverage the SEQUENCE command:
CREATE SEQUENCE blah.deblah
START WITH 1
INCREMENT BY 1
NO CYCLE
NO CACHE
In short, I start my sequence at #1 (you can choose any number you want to start with) and it counts upwards in increments of 1. I don't cycle the sequence numbers when they reach the system max number.
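As an illustration of the ticket-number scenario, here is a sketch of wiring that sequence up as a column default (the Requests table is made up; blah.deblah is the sequence created above, which defaults to bigint because no AS clause was given):
CREATE TABLE blah.Requests
(
    RequestNumber BIGINT NOT NULL DEFAULT (NEXT VALUE FOR blah.deblah),
    Description   NVARCHAR(200) NOT NULL
);
INSERT INTO blah.Requests (Description) VALUES (N'First request');   -- gets RequestNumber 1
INSERT INTO blah.Requests (Description) VALUES (N'Second request');  -- gets RequestNumber 2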
Assume there is a popular web server; the number of visits to this web server can be tens of thousands in an hour. In order to analyse the statistical properties of these visits, we want to know the number of requests in a specific time range and IP range.
For example, we have 10^12 requests in the following format:
(IP address , visiting time)
Suppose we want to know how many visits came from the IP range [10.12.72.0, 10.12.72.255] between 2 p.m. and 4 p.m.
The only candidate ideas I can think of are:
(1) Use a B-tree to index this large data set on one dimension, for instance build a B-tree on the IP address. Using this B-tree we can quickly get the number of requests coming from any specific IP range, but how can we know how many of those visits fell between 2 p.m. and 4 p.m.?
(2) Use a bitmap, but similar to the B-tree, due to space requirements the bitmap can only be built on one dimension, for instance the IP address, so we don't know how many of those requests were issued between 2 p.m. and 4 p.m.
Is there any efficient algorithm? Thanks. The number of queries can be quite large.
Your first step is to figure out the precision you need...
TIME:
Do you need timestamps down to the millisecond, or is to the hour good enough?
The number of hours since 1970 fits in under a million: 3 bytes, roughly an integer
The number of milliseconds needs 8 bytes: a long
IP:
Are all your IPs v4 (4 bytes) or v6 (16 bytes)?
Will you ever search by a specific IP, or will you only use IP ranges?
If the latter, you could just use the class C for each IP, 123.123.123.X (3 bytes)
Assuming:
1 hour time precision is good enough
3 byte IP class C is good enough
Re-organizing your data (2 possible structures, pick one):
Database:
You can use a relational database
Table: Hits
IPClassC INT NON-CLUSTERED INDEX
TimeHrsUnix INT NON-CLUSTERED INDEX
Count BIGINT DEFAULT (1)
Flat Files:
You can use many flat files
Have 1 flat file for each class C IP that appears in your logs (max 2^24)
Each file is 8 B (big int) * ~1M (hours from 1970 to 2070) ≈ 8 MB in size
How to load your new data structure:
Database:
Parse your logs (read in memory one line at a time)
Convert record to 3 byte IP and 3 byte Time
Convert your IP class C to an integer and your Time hrs to an integer
IF EXISTS(SELECT * FROM Hits WHERE IPClassC = #IP AND TimeHrsUnix = #Time)
UPDATE Hits SET Count = Count + 1 WHERE IPClassC = #IP AND TimeHrsUnix = #Time
Else
INSERT INTO Hits VALUES(#IP, #Time)
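On SQL Server, that exists/update/insert step could also be written as a single statement; a sketch, assuming the Hits table outlined above and with @IP and @Time standing in for the parsed values:
MERGE Hits AS t
USING (SELECT @IP AS IPClassC, @Time AS TimeHrsUnix) AS s
    ON t.IPClassC = s.IPClassC AND t.TimeHrsUnix = s.TimeHrsUnix
WHEN MATCHED THEN
    UPDATE SET t.[Count] = t.[Count] + 1
WHEN NOT MATCHED THEN
    INSERT (IPClassC, TimeHrsUnix, [Count]) VALUES (s.IPClassC, s.TimeHrsUnix, 1);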
Flat Files:
Parse your logs (read in memory one line at a time)
Convert record to 3 byte IP and 3 byte Time
Convert your IP to a string and your time to an integer
if File.Exist(IP) = False
File.Create(IP)
File.SetSize(IP, 8 * 1000000)
CountBytes = File.Read(IP, 8 * Time, 8)
NewCount = Convert.ToLong(CountBytes) + 1
CountBytes = Convert.ToBytes(NewCount)
File.Write(IP, CountBytes, 8 * Time, 8)
Querying your new data structures:
Database:
SELECT SUM(Count) FROM Hits WHERE IPClassC BETWEEN #IPFrom AND #IPTo AND TimeHrsUnix BETWEEN #TimeFrom AND #TimeTo
Flat File:
Total = 0
Offset = 8 * TimeFrom
Len = (8 * TimeTo) - Offset
For IP = IPFrom To IPTo
If File.Exist(IP.ToString())
CountBytes = File.Read(IP.ToString(), Offset, Len)
LongArray = Convert.ToLongArray(CountBytes)
Total = Total + Math.Sum(LongArray)
Next IP
Some extra tips:
If you go the database route, you're likely going to have to use multiple partitions for the database file.
If you go the flat-file route, you may want to break your query into threads (assuming your SAS will handle the bandwidth). Each thread would handle a subset of the IPs/files in the range. Once all threads have completed, the totals from each would be summed.
You want a data structure that supports orthogonal range counting.
10^12 is a large number (TERA) - certainly too large for in-memory processing.
I would store this in a relational database with a star schema, use a time dimension, and pre-aggregate by time of day (e.g. hour bands), IP subnets, and other criteria that you are interested in.
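A sketch of what that pre-aggregation could look like (all names are made up; @HourFrom and @HourTo stand for the requested window). One fact row per hour per /24 subnet reduces 10^12 raw hits to something a single range scan can handle:
CREATE TABLE FactHitsHourly
(
    HourKey   INT    NOT NULL,  -- hours since 1970-01-01
    SubnetKey INT    NOT NULL,  -- first three octets packed into an int: a*65536 + b*256 + c
    HitCount  BIGINT NOT NULL,
    PRIMARY KEY (HourKey, SubnetKey)
);
-- Visits from 10.12.72.0/24 between two hour keys (e.g. 2 p.m. to 4 p.m. on the day in question):
SELECT SUM(HitCount)
FROM FactHitsHourly
WHERE SubnetKey = 10 * 65536 + 12 * 256 + 72
  AND HourKey BETWEEN @HourFrom AND @HourTo;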