SQL Server NEWSEQUENTIALID() - clarification for super fast .NET Core implementation

I'm trying to reimplement SQL Server's NEWSEQUENTIALID() in .NET Core 2.2. It should run very fast and allocate as little memory as possible, but I need clarification on how to calculate the UUID version and where it goes (which byte holds it, or what bit shift is needed). So far I generate a timestamp, retrieve the MAC address, and copy bytes 8 and 9 from a randomly generated base GUID, but I'm surely missing something, because the results don't match the output of the original algorithm.
var guidArray = new byte[16];
// mac
guidArray[15] = macBytes[5];
guidArray[14] = macBytes[4];
guidArray[13] = macBytes[3];
guidArray[12] = macBytes[2];
guidArray[11] = macBytes[1];
guidArray[10] = macBytes[0];
// base guid
guidArray[9] = baseGuidBytes[9];
guidArray[8] = baseGuidBytes[8];
// time
guidArray[7] = ticksDiffBytes[0];
guidArray[6] = ticksDiffBytes[1];
guidArray[5] = ticksDiffBytes[2];
guidArray[4] = ticksDiffBytes[3];
guidArray[3] = ticksDiffBytes[4];
guidArray[2] = ticksDiffBytes[5];
guidArray[1] = ticksDiffBytes[6];
guidArray[0] = ticksDiffBytes[7];
var guid = new Guid(guidArray);
Current benchmark results:
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------------- |----------:|---------:|---------:|------:|--------:|-------:|------:|------:|----------:|
| SqlServerNewSequentialGuid | 37.31 ns | 0.680 ns | 0.636 ns | 1.00 | 0.00 | 0.0127 | - | - | 80 B |
| Guid_Standard | 63.29 ns | 0.435 ns | 0.386 ns | 1.70 | 0.03 | - | - | - | - |
| Guid_Comb | 299.57 ns | 2.902 ns | 2.715 ns | 8.03 | 0.13 | 0.0162 | - | - | 104 B |
| Guid_Comb_New | 266.92 ns | 3.173 ns | 2.813 ns | 7.16 | 0.11 | 0.0162 | - | - | 104 B |
| MyFastGuid | 70.08 ns | 1.011 ns | 0.946 ns | 1.88 | 0.05 | 0.0050 | - | - | 32 B |
Update:
Here are the latest results of benchmarking common ID generators written in .NET Core.
As you can see, my implementation NewSequentialGuid_PureNetCore performs at most 2x worse than the wrapper around rpcrt4.dll (which was my baseline), but my implementation uses less memory (30 B).
Here is a sample sequence of the first 10 GUIDs:
492bea01-456f-3166-0001-e0d55e8cb96a
492bea01-456f-37a5-0002-e0d55e8cb96a
492bea01-456f-aca5-0003-e0d55e8cb96a
492bea01-456f-bba5-0004-e0d55e8cb96a
492bea01-456f-c5a5-0005-e0d55e8cb96a
492bea01-456f-cea5-0006-e0d55e8cb96a
492bea01-456f-d7a5-0007-e0d55e8cb96a
492bea01-456f-dfa5-0008-e0d55e8cb96a
492bea01-456f-e8a5-0009-e0d55e8cb96a
492bea01-456f-f1a5-000a-e0d55e8cb96a
If you want the code, just let me know ;)

The official documentation states it quite clearly:
NEWSEQUENTIALID is a wrapper over the Windows UuidCreateSequential
function, with some byte shuffling applied.
There are also links in the quoted paragraph which might be of interest for you. However, considering that the original code is written in C / C++, I somehow doubt that .NET can outperform it, so reusing the same approach might be a more prudent choice (even though it would involve unmanaged calls).
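On the question of where the version goes: assuming UuidCreateSequential emits a standard RFC 4122 version-1 UUID (an assumption here, not something the quoted documentation spells out), the version is the top nibble of the time_hi_and_version field, and the variant bits sit at the top of the first clock-sequence byte. The sketch below, in Python for brevity, packs a timestamp, clock sequence and node into that layout; the function name uuid1_from_parts is hypothetical, and the extra byte shuffling that SQL Server applies afterwards is deliberately not reproduced.

```python
import uuid


def uuid1_from_parts(timestamp_100ns: int, clock_seq: int, node: int) -> uuid.UUID:
    """Pack a 60-bit timestamp, 14-bit clock sequence and 48-bit node
    into the RFC 4122 version-1 field layout (hypothetical helper)."""
    time_low = timestamp_100ns & 0xFFFFFFFF          # low 32 bits of the timestamp
    time_mid = (timestamp_100ns >> 32) & 0xFFFF      # middle 16 bits
    time_hi = (timestamp_100ns >> 48) & 0x0FFF       # top 12 bits
    time_hi_version = time_hi | (1 << 12)            # version 1 goes in the top nibble
    clock_seq_low = clock_seq & 0xFF
    # Variant "10xx xxxx" goes in the top two bits of the high clock-seq byte.
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Example with made-up inputs; the node reuses the MAC-looking suffix
# from the sample GUIDs in the question.
example = uuid1_from_parts(0x1EB0A1B2C3D4E5F, 0x0001, 0xE0D55E8CB96A)
print(example, example.version)
```

In the big-endian byte order used by the textual form, the version nibble is the high nibble of byte 6 and the variant bits are the top bits of byte 8, which is presumably why copying bytes 8 and 9 from a random GUID does not reproduce the original output.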
Having said that, I sincerely hope that you have researched the behaviour of this function and considered all its side effects before deciding to pursue this approach. And I certainly hope you aren't going to use this output as a clustered index for your table(s). The reason for this is also mentioned in the docs (as a warning, no less):
The UuidCreateSequential function has hardware dependencies. On SQL
Server, clusters of sequential values can develop when databases (such
as contained databases) are moved to other computers. When using
Always On and on SQL Database, clusters of sequential values can
develop if the database fails over to a different computer.
Basically, the function generates a monotonic sequence only while the database stays in the same hosting environment. When:
a network card gets changed on the bare metal (or whatever else the function depends upon), or
a backup is restored someplace else (think Prod-to-Dev refresh, or simply prod migration / upgrade), or
a failover happens, whether in a cluster or in an AlwaysOn configuration,
the new SQL Server instance will have its own range of generated values, which is supposed not to overlap the ranges of other instances on other machines. If that new range comes "before" the existing values, you'll end up with fragmentation issues for absolutely no good reason. Oh, and top (1) to get the latest value won't work anymore.
Indeed, if all you need is a non-exhaustible monotonic sequence, follow Greg Low's advice and just stick to bigint. It's half as wide, and no, you can't possibly exhaust it.
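To put a number on "can't possibly exhaust it", here is a back-of-the-envelope calculation in Python. The insert rate of one million rows per second is an arbitrary assumption chosen to be generous, and the helper name is hypothetical.

```python
BIGINT_MAX = 2**63 - 1                  # largest value of a signed 64-bit bigint
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # average year length in seconds


def years_to_exhaust(ids_per_second: float, start: int = 1) -> float:
    """Years until a bigint sequence starting at `start` runs out
    when it is consumed at a constant rate."""
    remaining = BIGINT_MAX - start + 1
    return remaining / ids_per_second / SECONDS_PER_YEAR


# Assumed rate: one million new ids every second, non-stop.
print(f"{years_to_exhaust(1_000_000):,.0f} years")
```

Even at that sustained rate the sequence lasts for hundreds of thousands of years.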

Need help creating a simple form for reviewing a (very) large number of diagnosis codes

OK, been lurking here for a long time, but never asked a question before. Apologies for the long and complicated question. So I have a very large Excel sheet with nearly 40,000 unique codes from the ICD-10 classification system, which classifies essentially all known diseases. This is a hierarchical classification system where codes are organized in 20-something chapters and gradually more specific codes, with 3 or more positions. For example, the code A22 is anthrax, with a number of sub-codes A22.0=Cutaneous anthrax, A22.1=Pulmonary anthrax, etc. However, for some diseases, there are no 4-digit codes under the 3-digit codes (e.g. C01, below) or only one 4-digit code that is meaningful for us to recognize (e.g. C00, below). For other diseases, we want full precision (e.g. G23).
Example table
| 3-digit code | Specific code | Description |
| -------- | -------- |-------- |
| C00 | C00.0 | External upper lip |
| C00 | C00.1 | External lower lip |
| C00 | C00.2 | External lip, unspecified |
| C00 | C00.3 | Upper lip, inner aspect |
| C01 | C01 | Malignant neoplasm of base of tongue |
| G23 | G23 | Other degenerative diseases of basal ganglia |
| G23 | G23.0 | Hallervorden-Spatz disease |
| G23 | G23.1 | Progressive supranuclear ophthalmoplegia [Steele-Richardson-Olszewski] |
| G23 | G23.2 | Multiple system atrophy, parkinsonian type [MSA-P] |
| G23 | G23.3 | Multiple system atrophy, cerebellar type [MSA-C] |
The issue at hand is that I'm conducting a large-scale research study based on a health register where diagnoses are coded using this system. Due to a policy of information minimization/data privacy, we need to decide for which of these 40,000 codes we need full precision (i.e. at the 4-digit level) and for which the 3-digit code is sufficient. This is a very tedious task and I need to make it as efficient as possible. My idea is to create a simple form that links to my large table (which has the exact format as above, only longer) and presents each 3-digit code one by one, with a simple checkbox or something that allows me to select or not select whether this group should have full precision. I'm envisioning something simple like this:
[Image: mock-up of the proposed form]
Sorry for the stupidly long prelude, but my question is much simpler: what would be a simple way to achieve this? I don't "know" any graphical programming languages, but have used SAS, R and statistical programming systems for about 20 years, so I really just need a push in the right direction. Could it, for example, be done using an Access form? Any help would be much appreciated!
Thanks,
Gustaf
So, I haven't really tried anything yet as I don't even know where to start.

ad-hoc slowly-changing dimensions materialization from external table of timestamped csvs in a data lake

Question
main question
How can I ephemerally materialize a slowly changing dimension type 2 from a folder of daily extracts, where each csv is one full extract of a table from a source system?
rationale
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because:
we couldn't reliably extract just the changeset (for reasons out of our control), and
we'd like to maintain a data lake with the "rawest" possible data.
challenge question
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
existential question
Am I thinking about this completely backwards and there's a much easier way to do this?
Possible Approaches
custom dbt materialization
There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for. But I'm confused, because what I need is essentially dbt snapshot, except that it would:
run dbt snapshot for each file incrementally, all at once; and,
be built directly off of an external table?
Delta Lake
I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
Fix the extraction job
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
Example
Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv).
I added the Extracted column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
End Result
Below is SCD-II for the three files as of 9/16. SCD-II as of 9/15 would be the same, except that OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
Interesting concept, and of course it would take a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed and you only needed to spin up these datamarts very infrequently.
My concerns would be:
If your source or target requirements change, how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes and then test them. If you do this as/when the changes are known then it's a lot of effort for a datamart that's not being used, especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.
I'm not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data, and then how is this different from traditional DW change management?
What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.
(Only relevant if you have a significant volume of data) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day, so you have the data from when the datamart was last used ready to go at all times and you just add the records created/updated since the last load.
I don't know whether this is the best or not, but I've seen it done. When you build your initial SCD-II table, add a column that is a stored HASH() value of all of the values of the record (you can exclude the primary key). Then, you can create an External Table over your incoming full data set each day, which includes the same HASH() function. Now, you can execute a MERGE or INSERT/UPDATE against your SCD-II based on primary key and whether the HASH value has changed.
Your main advantage doing things this way is you avoid loading all of the data into Snowflake each day to do the comparison, but it will be slower to execute this way. You could also load to a temp table with the HASH() function included in your COPY INTO statement and then update your SCD-II and then drop the temp table, which could actually be faster.
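The hash-compare idea can be sketched outside the warehouse as well. The Python below is a minimal illustration, not Snowflake code: it replays the three example extracts in date order, hashes the tracked columns of each record, and opens a new version only when the hash for a key changes. The function names, the choice of SHA-256, and the as_of parameter are assumptions made for this sketch; the Extracted column is left out of the hash because it changes every day, and rows that disappear from a later extract are not handled.

```python
import hashlib


def row_hash(row, columns):
    """Stable hash over the tracked columns of one record."""
    payload = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_scd2(extracts, key, tracked, as_of=None):
    """Replay full extracts in date order and return SCD-II versions.

    extracts: list of (iso_date, rows); rows are dicts.
    as_of:    optional ISO date; later extracts are ignored.
    """
    history = []        # every version, in the order it was opened
    open_version = {}   # key value -> (index in history, hash)
    for extract_date, rows in sorted(extracts, key=lambda e: e[0]):
        if as_of is not None and extract_date > as_of:
            break
        for row in rows:
            new_hash = row_hash(row, tracked)
            previous = open_version.get(row[key])
            if previous is not None:
                index, old_hash = previous
                if old_hash == new_hash:
                    continue                      # unchanged, keep open version
                history[index]["valid_to"] = extract_date   # close old version
            version = {key: row[key]}
            version.update({c: row[c] for c in tracked})
            version.update(valid_from=extract_date, valid_to=None)
            open_version[row[key]] = (len(history), new_hash)
            history.append(version)
    return history


TRACKED = ["CustId", "Stage", "Won", "LastModified"]


def opp(opp_id, cust, stage, won, modified):
    return {"OppId": opp_id, "CustId": cust, "Stage": stage,
            "Won": won, "LastModified": modified}


extracts = [
    ("2020-09-14", [opp(1, "A", "2 - Qualify", "", "9/1"),
                    opp(2, "B", "3 - Propose", "", "9/12")]),
    ("2020-09-15", [opp(1, "A", "2 - Qualify", "", "9/1"),
                    opp(2, "B", "4 - Closed", "Y", "9/14"),
                    opp(3, "C", "1 - Lead", "", "9/14")]),
    ("2020-09-16", [opp(1, "A", "2 - Qualify", "", "9/1"),
                    opp(2, "B", "4 - Closed", "Y", "9/14"),
                    opp(3, "C", "2 - Qualify", "", "9/15")]),
]

scd2 = build_scd2(extracts, key="OppId", tracked=TRACKED)
scd2_as_of_15 = build_scd2(extracts, key="OppId", tracked=TRACKED,
                           as_of="2020-09-15")
for version in scd2:
    print(version)
```

Passing as_of simply stops the replay early, which is one way to get the state as of a specific date rather than only the newest state.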

PostgreSQL + pgpool replication with unbalanced load distribution

I have a PostgreSQL master-slave (M-S) replication setup with pgpool as a load balancer on the master server only. Replication is working fine and there is no delay. The problem is that the master server receives more requests than the slave, even when I configure a balance other than 50% for each server.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(2):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.333333 | primary | 56348331 | false | 0
1 | slave-ip | 9999 | up | 0.666667 | standby | 3691734 | true | 0
As you can see, the master server receives more than 10x as many requests as the slave.
This is the pgpool show_pool_nodes output with backend weights M(1)-S(5):
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+-------------+------+--------+-----------+---------+------------+-------------------+-------------------
0 | master-ip | 9999 | up | 0.166667 | primary | 10542201 | false | 0
1 | slave-ip | 9999 | up | 0.833333 | standby | 849494 | true | 0
The behavior is quite similar when I assign M(1)-S(1).
Now I wonder whether I have misunderstood how pgpool works:
Pgpool only balances read queries (as write queries are always sent
to the master).
The backend weight parameter is used to calculate the distribution
only in balancing mode. The greater the value, the more likely the
server is to be chosen by pgpool, so a server with a greater
lb_weight would be selected more often than others with lower values.
If I'm right, why is this happening?
Is there a way to configure the balancing of select_cnt queries properly? My intention is to load the slave with read queries and leave only a "few" read queries to the master, since it handles all the writes.
You are right about pgpool load balancing. There could be several reasons why it doesn't seem to work. For a start, notice that you have the same port number for both backends. Try configuring your backend connection settings as shown in the sample pgpool.conf: https://github.com/pgpool/pgpool2/blob/master/src/sample/pgpool.conf.sample (lines 66-87), where you also set the weights to your needs, and assign different port numbers to each backend.
Also check (assuming your running mode is master/slave):
load_balance_mode = on
master_slave_mode = on
-- changes require restart
There is a relevant FAQ entry, "It seems my pgpool-II does not do load balancing. Why?", here: https://www.pgpool.net/mediawiki/index.php/FAQ (if you are on pgpool version 4.1, also consider statement_level_load_balance). So far, I have assumed that the general conditions for load balancing (https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html) are met.
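As a quick sanity check of the numbers in the question, the lb_weight column is just the configured backend weights normalized so that they sum to 1. The Python sketch below (helper names are hypothetical) reproduces that normalization and compares it with the share of SELECTs each node actually received according to select_cnt.

```python
def normalize_weights(weights):
    """Scale backend weights so they sum to 1, as shown in lb_weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one backend weight must be positive")
    return [w / total for w in weights]


def observed_share(select_counts):
    """Fraction of SELECTs each node actually served."""
    total = sum(select_counts)
    return [count / total for count in select_counts]


# Configured weights M(1)-S(2) and the select_cnt values from the question.
expected = normalize_weights([1, 2])
actual = observed_share([56348331, 3691734])
for name, exp, act in zip(["primary", "standby"], expected, actual):
    print(f"{name}: expected {exp:.3f}, observed {act:.3f}")
```

The gap between the two columns is what the configuration checks above are meant to explain.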
You can try adjusting the following setting in the pgpool.conf file:
1. WAL lag delay size
delay_threshold = 10000000
This tells pgpool when the standby's WAL is lagging too far behind to be used. Increasing it lets more queries go to the slave; decreasing it sends more queries to the master.
Besides, the pgbench test parameters also matter. With the -C parameter, pgbench opens a new connection for each transaction; otherwise it uses one connection per session.
pgpool's load-balancing decision depends on a combination of parameters, not on a single parameter.
Here is a reference:
https://www.pgpool.net/docs/latest/en/html/runtime-config-load-balancing.html#GUC-LOAD-BALANCE-MODE

Best database storage for matching products from offers

I have the following problem. I have products, offers and their parameters (about 300 000 000 rows in MySQL). Based on offer parameters and their rating (parameters are dynamic and every parameter type has a different rating) I must join offers to products. Of course there will be a lot of updates, deletes or inserts (for example around 5000 req/s).
The second piece of functionality will be sending this connected information via an API. Anyone have any recommendations what NoSQL, relational database or something similar to use for storage?
Edit
I'll show my example on a small sample of data in MySQL:
Offer
+----------+-----------------+
| offer_id | name |
+----------+-----------------+
| 1 | iphone_se_black |
| 2 | iphone_se_red |
| 3 | iphone_se_white |
+----------+-----------------+
Parameter_rating
+--------------+----------------+--------+
| parameter_id | parameter_name | rating |
+--------------+----------------+--------+
| 1 | os | 10 |
| 2 | processor | 10 |
| 3 | ram | 10 |
| 4 | color | 1 |
+--------------+----------------+--------+
Parameter value
+----+--------------+----------------+
| id | parameter_id | value |
+----+--------------+----------------+
| 1 | 1 | iOS |
| 2 | 2 | some_processor |
| 3 | 3 | 2GB |
| 4 | 4 | black |
| 5 | 4 | red |
| 6 | 4 | white |
+----+--------------+----------------+
Parameter_to_value
+----------+--------------------+
| offer_id | parameter_value_id |
+----------+--------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 5 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 6 |
+----------+--------------------+
Based on this data I must return that offers 1, 2 and 3 are one product.
The biggest problem is that data often changes. For example, changing prices, removing offers, etc. Therefore, I do not think that MySQL is the most suitable technology and I try to choose another.
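The matching rule is not spelled out in the question, so the following Python sketch rests on an assumption: two offers belong to the same product when they agree on every parameter whose rating reaches some threshold, while low-rated parameters such as color are ignored. The function name and the min_rating threshold are hypothetical; the data is the sample from the tables above.

```python
from collections import defaultdict


def group_offers(parameter_rating, parameter_value, parameter_to_value, min_rating):
    """Group offers that share all values of parameters rated >= min_rating.

    parameter_rating:   {parameter_id: rating}
    parameter_value:    {value_id: (parameter_id, value)}
    parameter_to_value: iterable of (offer_id, value_id)
    Offers without any parameter at or above the threshold are skipped.
    """
    signatures = defaultdict(set)
    for offer_id, value_id in parameter_to_value:
        parameter_id, value = parameter_value[value_id]
        if parameter_rating[parameter_id] >= min_rating:
            signatures[offer_id].add((parameter_id, value))

    products = defaultdict(list)
    for offer_id, signature in signatures.items():
        products[frozenset(signature)].append(offer_id)
    return sorted(sorted(ids) for ids in products.values())


parameter_rating = {1: 10, 2: 10, 3: 10, 4: 1}
parameter_value = {
    1: (1, "iOS"), 2: (2, "some_processor"), 3: (3, "2GB"),
    4: (4, "black"), 5: (4, "red"), 6: (4, "white"),
}
parameter_to_value = [
    (1, 1), (1, 2), (1, 3), (1, 4),
    (2, 1), (2, 2), (2, 3), (2, 5),
    (3, 1), (3, 2), (3, 3), (3, 6),
]

print(group_offers(parameter_rating, parameter_value, parameter_to_value, min_rating=10))
```

With the threshold at 10, color no longer separates the three offers; lowering it to 1 makes every parameter significant and splits them again.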
Platform
any recommendations what NoSQL, relational database or something similar to use for storage?
Therefore, I do not think that MySQL is the most suitable technology and I try to choose another.
All that is ordinary fare for a Relational database. Tens of thousands of banks run trading and pricing systems that are extremely active from hundreds of thousands of users, on such systems. Every day. The changes you allude to are normal on such systems (eg. pricing and pricing basis, change all the time, in response to Buys & Sells).
But they use genuine SQL platforms. Freeware/shareware/vapourware/nowhere suites such as MySQL and PostgreSQL are neither SQL-compliant, nor viable platforms for high-throughput OLTP systems (no server architecture; no ACID Transactions; etc). They are still implementing the basics that SQL platforms have had since 1984, which is very difficult (impossible!) because they do not have a server architecture.
Therefore MySQL and PostgreSQL are not suitable for the reason of abject performance; zero concurrency; etc, and not for any database design concerns.
For an appreciation of the value of a genuine OLTP Server Architecture, refer to Oracle vs Sybase ASE. Although the article deals with Oracle explicitly, it applies to all freeware because all freeware has the same non-architecture that Oracle has. Actually, even less than Oracle. You get what you pay for.
Data Analysis
This answer is limited to Relational databases; SQL, its designated data sublanguage; and a genuine, commercially viable, SQL platform.
It appears the system supports an auction of some kind, which means you have to maintain an inventory of available/sold items. The database design that is required is quite ordinary.
However, your question is not clear enough to be answered. You are making many assumptions, that we are not party to. Allow me to ask some leading questions, which you need to consider and answer (update your Question):
what are the fundamental things that the system transacts operations against ?
(products such as phones ?)
how are those things identified ?
(Not the ID but how do humans identify each thing)
what are the properties of those things ?
(please, not "parameter" ... maybe OS; RAM; Processor; Colour) ?
Then property values can be understood
(You can't mess with the attributes of a thing unless you hold and maintain the thing)
what are the operations or transactions against those things
(a) internal or admin transactions
(eg. AddProperty; AddPropertyValue; AddProduct; etc)
(b) external or online user transactions
(eg. BidProduct [offer to buy]; CloseBid; etc)
who are the operators, to which those transactions are permitted ?
(eg. Admins; product suppliers; online bidders; etc)
I can't make any sense of your Parameter_to_value, please explain
What is rating ? Some kind of weighting for the property vs the other properties, or something the bidders declare ?
Database Design • Tentative
This might take a few iterations.
Don't worry about ID fields on each and every file: first we have to understand the data, how it relates to other data, and how it is identified. We can add ID fields at the end.
Note
All my data models are rendered in IDEF1X, the Standard for modelling Relational databases since 1993
My IDEF1X Introduction is essential reading for beginners.
The IDEF1X Anatomy is a refresher for those who have lapsed.
If you have trouble reading the Predicates from the Data Model, let me know and I will produce them in text form.

Key length impact on queries performance?

I've noticed that my meta_keys are getting pretty long, e.g. user_event_first_impression_ratings, and I retrieve most of the data with WordPress functions, e.g. get_post_meta($post_id, $meta_key);
I've thought about this often. There's no way to make the names shorter, because I've got a lot of different things going on, and shorter names would defeat their purpose, which is to understand quickly, in phpMyAdmin and in code, what is going on and where.
I've thought of making a table (in Excel, for example) where I assign a very short, 2-3 digit numeric code to every meta_key, replace the keys with those codes, and then use the table to navigate the database and code. I'm sure I would know all these codes by heart pretty soon.
Does meta_key length have any impact on the performance of queries and get_meta calls?
String vs integer?
Let's leave query quality out of this and pretend that query is well written.
If you are not familiar with the WordPress database, here's an example:
--------------------------------------------------------------------------
| meta_id (unique row nr) | post id | meta_key | meta_value |
--------------------------------------------------------------------------
| 1 | 343 | my_event_color | red |
| 2 | 623 | my_event_id | 235 |
| 3 | 423 | my_event_lenght | 537644 |
| 4 | 243 | my_event_name | tortilla |
| 5 | 732 | my_event_is_xxx | 1 |
| ... | ... | ... | ... |
And so on for many, many, many rows; meta_id is the only unique number here.
To your first question, no. Or the difference in performance between a long key and a short key is so tiny as to not make it worth thinking about. So don't worry about your excel reference table.
See the following:
https://dba.stackexchange.com/questions/91057/does-the-length-of-the-index-name-have-any-performance-impact
Table name or column name length affect performance?
To your second question I don't really understand what you're asking.
