With an ~ 18 years old application users file "cases" and each case creates a row in a "journal" table a data base (on SQL 2000). This cases can be tagged with "descriptors" where a somewhere hard coded limit of 50 is set. The descriptors/tags are stored in a lookup table and the key for the descriptors is a number from the power of two sequence (2^n).
This table looks like this:
key
descriptor
1
D 1
2
D 2
4
D 3
8
D 4
16
D 5
There are 50 rows, which means the biggest key is 562.949.953.421.312. Each case can have up to 8 descriptors, which are unfortunately stored in a single column in the case journal table. They keys are stored as a summary of all descriptors on that case.
A case with the descriptor D2 has 2 in the journal
A case with the descriptors D2 and D4 has 10
A case with the descriptors D1, D3 and D5 has 21
The Journal has 100 million records. Now the first time since years there is the requirement to analyze the journal by descriptors.
What would be a smart (mathematical) way to query the journal and get the results for one descriptor?
Edit:
in answer to the comment of #Squirrel:
jkey
jvalue
descriptors
1
V 1
0
2
V 2
24
3
V 3
3
4
V 4
12
5
V 5
6
You need to use bitwise operators.
Assuming the column is bigint then
where yourcolumn & 16 > 0
will find the ones matching D5 for example
If you are trying this query for literal values larger than fit into a signed 32 bit int make sure you cast them to BIGINT by the way as they will be interpreted as numeric datatype by default which cannot be used with bitwise operators.
WHERE yourcolumn & CAST(562949953421312 AS BIGINT) > 0
You may also similarly need to cast yourcolumn if it is in fact numeric rather than bigint
Related
I'm working on a local project and I would like to setup a column in a SQL Server table that I called serial. This value should have the next format/validation [A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]. Four characters that will have numbers or capital letters.
For example, the first record would be AAA0, and then start incrementing from right to left, but the interesting part is that once it reaches AAA9, I want to continue but with letters, so once 9 is reached, continue with A: AAAA. My last combination would be AAAZ, and then continue with the second position and so on, until I could complete ZZZZ.
Serial
------
AAA0
…
AAAZ
AAB0
…
AABZ
AAC0
…
AACZ
…
ZZZZ
I've been able to do something similar, but for example only incrementing numbers 0000 until 9999 or with just letters AAAA to ZZZZ (see below dummy data).
Please let me know if you have any comments or if you have seen this scenario before. I will appreciate any help.
I've reviewed some data on the internet and did this small test, but once again, this is changing letters.
CREATE TABLE #MyTable
(
MyHeadID INT IDENTITY(0,1) NOT NULL,
Consecutive AS
CHAR(MyHeadID/17576%26+65) + --26^3
CHAR(MyHeadID/676%26+65) + --26^2
CHAR(MyHeadID/26%26+65) + --26^1
CHAR(MyHeadID%26+65) --26^0
PERSISTED NOT NULL,
UniqueID VARCHAR(36) NOT NULL,
CreatedDate datetime DEFAULT GETDATE()
)
INSERT INTO #MyTable (UniqueID)
(SELECT NEWID())
SELECT Consecutive FROM #MyTable
Let's not talk about the UniqueID and CreatedDate columns, those are just for testing purposes. I created an identity column and then the code for the autoincrement on letters.
I also found this reply here but it's not my case since in that reply they are dividing 4 letters and 4 numbers with a substring it can be resolved. In my case, I need to intercalculate numbers and letters from right to left.
Here's a possible solution, counting in base 36 AAA0 starts at decimial 479879.
Ideally you'd have an identity incrementing number on your table starting at 479879. Simulating that with a numbers table generated using a CTE, the following gives base36 counting:
with numbers as (
select 479879 + Row_Number() over (order by a.object_id)n
from sys.all_objects a cross join sys.all_objects b
)
select top 10000
Concat(Char(((n/36/36/36) % 36) + case when (n/36/36/36) % 36 between 0 and 9 then 48 else 55 end),
Char(((n/36/36) % 36) + case when (n/36/36) % 36 between 0 and 9 then 48 else 55 end),
Char(((n/36) % 36) + case when (n/36) % 36 between 0 and 9 then 48 else 55 end),
Char((n % 36) + case when n % 36 between 0 and 9 then 48 else 55 end)) Base36
from numbers
order by n
See Fiddle
The previous answer is good but you can do this and cover all numbers with simple math. Also, let's assume your IDENTITY starts at one. There are 36 values for each digit, so those are base 36 encoded as pointed out. The encoding goes 0 - 9 then A - Z as you mentioned in your post. The get the individual digits of some number n from right to left, the algorithm goes:
n mod 36 is the right most digit.
n / 36 mod 36 gives the second-to-right digit.
n / (36 * 36) mod 36 gives the second-to-left digit.
n / (36 * 36 * 36) mod 36 gives left digit.
To test this logic, we can write a function:
CREATE FUNCTION CustomNumber(#id INT)
RETURNS CHAR(4)
AS BEGIN
RETURN CHAR((#id-1)/ POWER(36, 3)% 36+CASE WHEN (#id-1)/ POWER(36, 3)% 36 BETWEEN 0 AND 9 THEN 48 ELSE 55 END)
+CHAR((#id-1)/ POWER(36, 2)% 36+CASE WHEN (#id-1)/ POWER(36, 2)% 36 BETWEEN 0 AND 9 THEN 48 ELSE 55 END)
+CHAR((#id-1)/ 36% 36+CASE WHEN (#id-1)/ 36% 36 BETWEEN 0 AND 9 THEN 48 ELSE 55 END)
+CHAR((#id-1)% 36+CASE WHEN (#id-1)% 36 BETWEEN 0 AND 9 THEN 48 ELSE 55 END);
END;
We can then test this function by calling it to make sure it works. To call it, lets create some numbers. This is complete overkill, but let's pass it 1 through 1,000,000 so we can see it in action:
;WITH digits (I)
AS (
SELECT I
FROM (VALUES (0), (1),(2),(3),(4),(5),(6),(7),(8),(9)) AS digits (I) ),
integers (I)
AS (SELECT D1.I + (10 * D2.I) + (100 * D3.I) + (1000 * D4.I) + (10000*D5.I) + (100000*D6.I)
FROM digits AS D1
CROSS JOIN digits AS D2
CROSS JOIN digits AS D3
CROSS JOIN digits AS D4
CROSS JOIN digits AS D5 CROSS JOIN digits AS D6
)
SELECT I, dbo.CustomNumber(I)
FROM integers
WHERE I > 0
ORDER BY I;
If you run that and patiently wait (it takes about 20 seconds on my weak laptop, but you don't have to use a million numbers if you don't like) you will see that it does produce the result that you want. At this point we know the formula is correct, so you have options to add it to your table.
One option is to use the formula as a PERSISTED column as you did. Another option is to use a trigger. You can leave the formula as a function or you can just put the code in directly. If you need more characters than 4, you can easily add another by following the pattern (just change the POWER to which you raise).
As I mentioned, the previous answer is good, I just wanted to show another method and its derivation. I have had to implement custom sequences several times with varying formats and you can use this general technique.
I want to apply feature selection on a dataset (lung.mat)
After loading the data, I computed the mean of distances between each feature with others by Jaccard measure. Then I sorted the distances descendingly in B1. And then I selected for example 25 number of all the features and saved the matrix in databs1.
I want to select the features that have distance values greater than the mean of the array (B1).
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
for i=1:m-1
for j=i+1:m
t1(i,j)=fjaccard(data(:,i),data(:,j));
b1=sum(t1)/(m-1);
end
end
[B1,indB1]=sort(b1,'descend');
databs1=data(:,indB1(1:25));
databs1=[databs1,data(:,m)]; %jaccard
save('databs1.mat');
I’ll be grateful to have your opinions about how to define this in B1, selecting values of B1 which are greater than the mean of the array B1, It means cutting the rest of smaller values than the mean of B1.
I used this line,
B1(B1>mean(B1(:)))
after running, B1 still has the full number of features(column) equal to the full dataset, for example, lung.mat has 57 features and B1 by this line still has 57 columns,
I considered that by this line B1 will be cut to the number of features that are greater than the mean of B1.
the general answer to your question is here (this seems clear to you based on your code):
a=randi(10,1,10) %example data
a>mean(a) %get binary matrix of which elements are larger than mean
a(a>mean(a)) %select elements from a that are larger than mean
a =
1 9 10 7 8 8 4 7 2 8
ans =
1×10 logical array
0 1 1 1 1 1 0 1 0 1
ans =
9 10 7 8 8 7 8
This is my table (copied from the similar question Finding minimum value in index(match) array [EXCEL])
A B C D
tasmania 10 3 10
queensland 22 8 10
new south wales 10 12 12
northern territory 8 4 15
south australia 12 2 8
western australia 32 4 15
tasmania 72 6 16
I have criteria for B and C, and I want to retrieve the A with the lowest corresponding value D. Values in B, C and D can be duplicates, values in A can not.
Example:
B >= 8
C >= 4
Should result in "queensland" (lowest matching value is 10), but not "tasmania" (has the same cost)
I am currently trying this array formula:
{ =MIN(IF(B:B>=8;IF(C:C>=4;D;""));1) }
Which returns the correct lowest D, but since I am losing the informaiton about A, I can not retrieve the value for A
This as an array formula should work for you:
=INDEX($A$1:$A$7,MATCH(MIN(IF($B$1:$B$7>=8,IF($C$1:$C$7>=4,$D$1:$D$7))),IF($B$1:$B$7>=8,IF($C$1:$C$7>=4,$D$1:$D$7)),0))
It should be noted that if you have Excel 2016 or Office365, you'll have access to the MINIFS function which is probably better suited for this task (i don't actually have the newest version, so am unable to test)
I have three nested ROW groups:-
The first one is a depended on a wether a field is true or false in the dataset, for each case. This is the where the error is worst. The second is nested on the first and is based on a group variable in the cases (1 to many), the third is the ref number of the cases.
The sums don't work for a cloumn that is produced by a join, depending on the ID of the second group. It seems to pull the right value, but multiplies by the number of cases. I can divide by the case numbers here, inside the last nested group(ref#) to get the right value. Tried using "Count" , Blank, Add total after..
If I try to sum the column with "=Sum(ReportItems!Textbox231.Value)" Produces:-
The Value expression for the textrun 'Textbox232.Paragraphs[0].TextRuns[0]' uses an aggregate function on a report item. Aggregate functions can be used only on report items contained in page headers and footers.
The sums work fine for the non joined values..in all three nested row groups. But for the joined values they are out by an order of magnitude. Why is this?
SUM not working for 3rd column
SUM yields wired results
SELECT DISTINCT
Here is a common reason why this kind of problem happen.
The likely reason for the SUM being wrong is the fact that the DISTINCT in your select hides duplicates in the underlying query. Since the SUM is executed before the distinct, it sum the results that you don't see after they're filtered out by the DISTINCT.
Instead of DISTINCT use a GROUP BY query, then you can either make a base query that do not have duplicates (which you don't have to hide with a DISTINCT) or if you can't get rid of the duplicates, aggregate your column before displaying it by doing a MIN, a MAX or an AVG.
I'd be happy to help more but there's not enough information in your question to reproduce the problem on my computer.
There are other reasons why a SUM can return unexpected results: typically implicit cast (SQL server decides on an unexpected datatype and rounds the numbers), and in some situations a CASE clause which is executed either before or after a WHERE condition. But these don't seem to be the problem here.
Example
DECLARE #T TABLE (ID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, NumVal INT)
DECLARE #i INT
SET #i = 1
WHILE #i < 1000
BEGIN
INSERT INTO #T (NumVal) VALUES (#i)
IF RIGHT (CAST (#i AS VARCHAR(12)),1) = 7
BEGIN INSERT INTO #T (NumVal) VALUES (#i) END
SET #i = #i +1
END
SELECT DISTINCT NumVal, SUM (NumVal) FROM #T GROUP BY NumVal
In the example above, I have inserted 999 distinct entries in a table, but duplicated any number which ends with 7. The select distinct give the impression that there are only 999 entries, while a sum adds the numbers ending with 7. Your situation is probably more complicated, but what I want to show here is that duplicates in the underlying becomes invisible with a DISTINCT and reappear with a SUM:
NumVal Sum
1 1
2 2
3 3
4 4
5 5
6 6
7 14
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 34
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
I have data in column A, and would like to put the averages in column B like this:
a b
1 10 10
2 7 8.5
3 8 8.333
4 19 11
5 13 11.5
where b1 =average(a1), b2 =average(a1:a2), b3 =average(a1:a3)....
Using average() is alright for small amounts of data, but I have over 1500 data entries. I would like to find a more efficient way of doing this.
Make your initial range reference absolute, while the other is relative, i.e.:
b4 = average($a$1:a4)
You can paste that 1500 times an it will always increment the end of the range while keeping the beginning pinned to A1 due to the dollar signs in that reference.