I've been reading about data demographics in Teradata and came across these two terms. It is mentioned that the two go hand in hand in making a good index choice, but I can't seem to understand exactly what the difference between the two values is.
Can anyone explain the exact difference between the two? Examples of how the values are derived would be really helpful.
I'm thinking both values will come from this query:
sel <columnname>, count(*)
from <tablename>
group by <columnname>;
Here are the definitions of the two terms, btw.
Maximum Rows/Value – No. of rows for the most-often-occurring value in the column.
Typical Rows/Value – No. of rows for a typical value in the column.
Any inputs will be much appreciated.
Thank you.
Here is my understanding of Maximum Rows/Value vs Typical Rows/Value.
Suppose (SQL Fiddle Link: http://sqlfiddle.com/#!4/27641/13/0)
SELECT MAX (COUNT ("sometext")) max_row_per_value
FROM table1
GROUP BY id
And here is the result
MAX_ROW_PER_VALUE
7
In this case, when you look at id=1, there are 7 records for that value, which is the maximum rows/value.
The typical rows/value is what I consider the AVG(), like this:
SELECT AVG (COUNT ("sometext")) typical_row_per_value
FROM table1
GROUP BY id
Result
TYPICAL_ROW_PER_VALUE
4.5
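For what it's worth, both numbers can be derived from the same grouped counts in one pass. Here is a sketch in Teradata-style SQL (placeholders carried over from the question, untested):
-- A sketch only; the CAST keeps AVG from truncating to an integer
SELECT MAX(cnt) AS max_rows_per_value,
       AVG(CAST(cnt AS DECIMAL(18,2))) AS typical_rows_per_value
FROM (SELECT <columnname>, COUNT(*) AS cnt
      FROM <tablename>
      GROUP BY <columnname>) AS t;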
Related
I am trying to get a total summation of both the ItemDetail.Quantity column and the ItemDetail.NetPrice column. For the sake of example, let's say the quantity listed for each individual item is 5, 2, and 4 respectively. I am wondering if there is a way to display the quantity as 11 for one single ItemGroup.ItemGroupName.
The query I am using is listed below
select Location.LocationName, ItemDetail.DOB, SUM (ItemDetail.Quantity) as "Quantity",
ItemGroup.ItemGroupName, SUM (ItemDetail.NetPrice)
from ItemDetail
Join ItemGroupMember
on ItemDetail.ItemID = ItemGroupMember.ItemID
Join ItemGroup
on ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
Join Location
on ItemDetail.LocationID = Location.LocationID
Inner Join Item
on ItemDetail.ItemID = Item.ItemID
where ItemGroup.ItemGroupID = '78' and DOB = '11/20/2019'
GROUP BY Location.LocationName, ItemDetail.DOB, Item.ItemName,
ItemDetail.NetPrice, ItemGroup.ItemGroupName
If you are using SQL Server 2012, you can use a windowed sum (summation over a partition) to display the details and aggregates in the same query:
SUM(SalesYTD) OVER (ORDER BY DATEPART(yy, ModifiedDate))
Link: https://learn.microsoft.com/en-us/sql/t-sql/functions/sum-transact-sql?view=sql-server-ver15
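Applied to the tables in the question, a rough sketch might look like this (untested, and the PARTITION BY choice is my assumption about which total you actually want):
SELECT Location.LocationName,
       ItemDetail.DOB,
       ItemGroup.ItemGroupName,
       ItemDetail.Quantity,
       ItemDetail.NetPrice,
       -- the group total repeated on every detail row
       SUM(ItemDetail.Quantity) OVER (PARTITION BY ItemGroup.ItemGroupName) AS TotalQuantity,
       SUM(ItemDetail.NetPrice) OVER (PARTITION BY ItemGroup.ItemGroupName) AS TotalNetPrice
FROM ItemDetail
JOIN ItemGroupMember ON ItemDetail.ItemID = ItemGroupMember.ItemID
JOIN ItemGroup ON ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
JOIN Location ON ItemDetail.LocationID = Location.LocationID
WHERE ItemGroup.ItemGroupID = '78' AND ItemDetail.DOB = '11/20/2019';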
We can't be certain without seeing sample data. But I suspect you need to remove some fields from your GROUP BY clause -- probably Item.ItemName and ItemDetail.NetPrice.
Generally, you won't GROUP BY a column that you are applying an aggregate function to in the SELECT -- as in SUM(ItemDetail.NetPrice). And it is not very common, in my experience, to GROUP BY columns that aren't included in the SELECT list - as you are doing with Item.ItemName.
I think you need to go back to basics and read about what GROUP BY does.
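To illustrate the suggestion, a sketch only (I'm guessing you want one summed row per location, date and item group):
SELECT Location.LocationName, ItemDetail.DOB, ItemGroup.ItemGroupName,
       SUM(ItemDetail.Quantity) AS "Quantity",
       SUM(ItemDetail.NetPrice) AS "NetPrice"
FROM ItemDetail
JOIN ItemGroupMember ON ItemDetail.ItemID = ItemGroupMember.ItemID
JOIN ItemGroup ON ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
JOIN Location ON ItemDetail.LocationID = Location.LocationID
WHERE ItemGroup.ItemGroupID = '78' AND ItemDetail.DOB = '11/20/2019'
GROUP BY Location.LocationName, ItemDetail.DOB, ItemGroup.ItemGroupName;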
First of all welcome to the overflow...
Second: The answer is going to be "It depends"
Any time you aggregate data you will need to GROUP BY the other, non-aggregated fields in the query, and you have that in the query. The gotcha is what happens when data is spread across multiple locations.
My suggestion is to rethink your problem and see if you really need these other fields in the query. This will depend on what the person using the data really wants to know.
Do they need to know how many of item X there are, or do they really need to know that item X is spread out over three sites?
You might find you are better off with two smaller queries.
I have a report which has a transaction type as a row group. There are two different types. I want to get the percentage of type 2 compared to type 1.
I am not sure how to do this; I assume I need to use an expression that references the name of one transaction type and then makes a calculation based on the other type.
So instead of a total for July being 300, I would like the percentage of SOP+ compared to SOP-, in this case 1.96%. For clarity, the figures in SOP+ are not treated as negative.
When you design a query to be used in a report, it is generally easier to work with different types of values being in separate columns. You can let the report do most of the grouping and aggregation for you. In that case, the expression would be something like this:
=Fields!SOP_PLUS.Value / Fields!SOP_MINUS.Value
Since they are both in rows in the same column, you have to use some logic to separate them out into columns and then do the operation.
You'll need to add two calculated fields to your dataset. Use an expression like this to get the values:
=IIf(Fields!TYPE_CODE.Value = "SOP+", Fields!SOP.Value, Nothing)
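Presumably the second calculated field is the same expression filtered on the other type, something like:
=IIf(Fields!TYPE_CODE.Value = "SOP-", Fields!SOP.Value, Nothing)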
In other words, you will have new columns that have just the plus and minus values with blanks in the other rows. Now you can use a similar expression to earlier to compare them.
=Max(Fields!SOP_PLUS.Value) / Max(Fields!SOP_MINUS.Value)
Keep in mind that the Max function applies to the current group scope. When you add multiple row and column groups to the mix, this can get more complicated. If that becomes an issue, I would suggest looking at rewriting the query to provide these values in separate columns to make the report design easier.
WITH table1([sop-], [sop+]) AS (
SELECT 306, -6
UNION ALL
SELECT 606, -14)
SELECT (CAST([sop+] AS DECIMAL(5, 2)) / CAST([sop-] AS DECIMAL(5, 2))) * 100.0 FROM table1;
Returns:
-1.960784000
-2.310231000
For some reason I need to insert an artificial (dummy) column into an MDX expression (the reason is that I need to obtain a query with a specific number of columns).
To illustrate, this is my sample query:
SELECT {[Measures].[AFR],[Measures].[IB],[Measures].[IC All],[Measures].[IC_without_material],[Measures].[Nonconformance_PO],[Measures].[Nonconformance_GPT],[Measures].[PM_GPT_Weighted_Targets],[Measures].[PM_PO_Weighted_Targets], [Measures].[AVG_LC_Costs],[Measures].[AVG_MC_Costs]} ON COLUMNS,
([dim_ProductModel].[PLA].&[SME])
* ORDER( {([dim_ProductModel].[Warranty Group].children)} , ([Measures].[Nonconformance_GPT],[Dim_Date].[Date Full].&[2014-01-01]) ,desc)
* ([dim_ProductModel].[PLA Text].members - [dim_ProductModel].[PLA Text].[All])
* {[Dim_Date].[Date Full].&[2013-01-01]:[Dim_Date].[Date Full].&[2014-01-01]} ON ROWS
FROM [cub_dashboard_spares]
It is not very important, just some measures and crossjoined dimensions. Now I would need to add e.g. 2 extra columns; I don't care whether this would be a measure with null/0 values or another crossjoined dimension. Can I do this in some easy way without inserting any data into my cube?
In SQL I could just write SELECT 0 or SELECT 'dummy1', but here that is not possible, neither in the ON ROWS nor in the ON COLUMNS part of the query.
Thank you very much for your help,
Regards,
Peter
PS: So far I can only repeat some measure multiple times, but I am interested in whether there is a possibility to insert a really "dummy" column.
Your query just has the measures dimension on columns. The easiest way to extend it by some columns would be to repeat the last measure as many times as needed to get the correct number of columns.
Another possibility, which may be more efficient in case the last measure is complex to calculate, would be to use
WITH member Measures.dummy as NULL
SELECT {[Measures].[AFR],[Measures].[IB],[Measures].[IC All],[Measures].[IC_without_material],[Measures].[Nonconformance_PO],[Measures].[Nonconformance_GPT],[Measures].[PM_GPT_Weighted_Targets],[Measures].[PM_PO_Weighted_Targets], [Measures].[AVG_LC_Costs],[Measures].[AVG_MC_Costs],
Measures.dummy, Measures.dummy, Measures.dummy
}
ON COLUMNS,
([dim_ProductModel].[PLA].&[SME])
* ORDER( {([dim_ProductModel].[Warranty Group].children)} , ([Measures].[Nonconformance_GPT],[Dim_Date].[Date Full].&[2014-01-01]) ,desc)
* ([dim_ProductModel].[PLA Text].members - [dim_ProductModel].[PLA Text].[All])
* {[Dim_Date].[Date Full].&[2013-01-01]:[Dim_Date].[Date Full].&[2014-01-01]}
ON ROWS
FROM [cub_dashboard_spares]
i.e. adding a dummy measure, which should not need much computation, to the end of the columns as many times as you need it.
Ok,
I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows. But when we are getting >10-20k rows we are getting SQL timeouts; the execution plan tells me that 76% of my query cost comes from this line, and removing this line increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4-digit alpha-numeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table and issued to each customer as a bar-code; the bar-code scanning app at the door will have the same list of 5000. The reason for using a 4-digit alpha-numeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want a large number of characters. Customers love the last bit, btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
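From memory, the approach in that article filters on a per-row pseudo-random value rather than sorting. A rough sketch (the table and column names are my assumptions, not the article's), keeping roughly 4% of 2.6 million rows to get ~100k:
SELECT code
FROM codes
-- NEWID() inside the checksum gives a different pseudo-random value per row and per run
WHERE (ABS(CAST(BINARY_CHECKSUM(code, NEWID()) AS int)) % 100) < 4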
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,616 (which is the number of unique four-digit alphanumeric codes, ignoring case - 2.6 million rows must have some duplicates) and convert them to your four-digit codes.
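A number in that range can be turned into a code by treating it as base 36. A quick sketch (assuming case-insensitive 0-9/A-Z codes, which is my reading of the question):
-- Convert an int in [0, 1679615] to a four-character base-36 code (36^3 = 46656, 36^2 = 1296)
DECLARE @n int
DECLARE @alphabet char(36)
SET @n = 123456
SET @alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
SELECT SUBSTRING(@alphabet, (@n / 46656) % 36 + 1, 1)
     + SUBSTRING(@alphabet, (@n / 1296) % 36 + 1, 1)
     + SUBSTRING(@alphabet, (@n / 36) % 36 + 1, 1)
     + SUBSTRING(@alphabet, (@n % 36) + 1, 1) AS code   -- 123456 -> '2N9C'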
You don't have to sort the whole table; a threshold filter gets you most of the way there:
DECLARE @RowCount int
DECLARE @Threshold float
SELECT @RowCount = COUNT(*) FROM customers
-- 50000.0, not 50000, so this isn't integer division (which would give 0)
SELECT @Threshold = 50000.0 / @RowCount
-- RAND() alone is evaluated once per statement, so seed it per row from NEWID()
SELECT TOP 50000 * FROM customers WHERE RAND(CHECKSUM(NEWID())) < @Threshold ORDER BY newid()
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
One thought is to break the process down into steps. Add a GUID column to the table, then run an UPDATE statement to populate it with GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,2,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,3,1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode,4,1)) as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code will take all of your alpha-numeric venue codes, convert them to an integer, and then split the entire table into 500,000 buckets, of which you are taking the top 50000 that fall between 0 and 50000. You can play with the number after the % sign (500,000) and you can play with the BETWEEN range. This should randomize it for you. Not sure if the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY, there is no guarantee of the order (if you have multiple CPUs and threading).
I'm currently working my way through the exam study book, Querying Microsoft SQL Server 2012. I've been learning SQL over the last few months and I am currently looking over windowing functions. I came to this application question and it got me thinking about another question, which I'll list below:
So in the columns diffprev and diffnext, it only lists the difference between the previous and the next value. How could I list the maximum difference between subsequent values across all of the rows (partitioned by custid)? Just scanning the table, I see that in custid 1's history the greatest difference between subsequent rows is $548, and for custid 2 the greatest difference is $390.95. I could see these values appearing in a maxdiff column across all the rows pertaining to the partition.
Thank you for aiding my studying!
If you're just looking for the value, this should work:
with cte as (
  select custid,
         val - lag(val) over (partition by custid order by orderdate, orderid) as diff
  from Sales.OrderValues
)
select custid, max(abs(diff)) as maxdiff
from cte
group by custid
If you want the details of the rows that attain that maximum, it'll be a bit more work.
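If instead you want the maximum repeated on every row of the partition (the maxdiff column described in the question), a windowed MAX over the same difference should do it -- a sketch, again assuming the book's Sales.OrderValues columns:
with cte as (
  select custid, orderid, orderdate, val,
         val - lag(val) over (partition by custid order by orderdate, orderid) as diff
  from Sales.OrderValues
)
select custid, orderid, orderdate, val, diff,
       max(abs(diff)) over (partition by custid) as maxdiff
from cte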
Bonus tip - pictures of text are the worst. You're more likely to get help if the people helping don't need to type your code out. Even better though would be a fully functioning example (complete with table definitions and sample data) so we can verify against your data!