SQL Server aggregating data that may contain multiple copies

I am working on some software where I need to do large aggregations of data using SQL Server. The software helps people play poker better. The query I am using at the moment looks like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
where (H IN (1164, 1165, 1166)) AND
(V IN (1260, 1311))
Group by H;
This works fine and is the fastest way I have found to do what I am trying to achieve. The problem is that I need to enhance the functionality so that the aggregation can include multiple instances of V. For example, in the above query, instead of including the data for 1260 and 1311 just once, it may need to include 1260 twice and 1311 three times. But obviously just saying
V IN (1260, 1260, 1311, 1311, 1311)
won't work because each unique value is only counted once in an IN clause.
I have come up with a solution to this problem which works but seems rather clunky. I have created another lookup table which takes the values between 0 and 1325 and assigns them to a field called V1, and for each V1 there are 100 V2 values; e.g. for V1 = 1260 the V2 values range from 126000 through 126099. Then in the main query I join to this table and do the lookup like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join [FlopLookup].[dbo].[VillainJoinTable] on [FlopLookup].[dbo].[VillainJoinTable].[V1] = [FlopLookup].[dbo].[_5c3c3d].[V]
where (H IN (1164, 1166, 1165)) AND
(V2 IN (126000, 126001, 131100, 131101, 131102))
Group by H;
So although it works, it is quite slow. It feels inefficient because it is adding the same data multiple times, when multiplication would probably be more appropriate: i.e. instead of passing in 126000, 126001, 126002, 126003, 126004, 126005, 126006, 126007, I would pass in 1260 in the original query and then multiply its contribution by 8. But I have not been able to work out a way to do this.
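One possible shape for that multiplication approach (an untested sketch; the inline VALUES list of (V, Multiplier) pairs is an assumption about how the weights could be supplied) is to join to a derived table of weights and scale both the numerator and the denominator:

Select H, sum(w.Multiplier * (WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush)) / sum(w.Multiplier) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join (values (1260, 2), (1311, 3)) as w(V, Multiplier) on w.V = [FlopLookup].[dbo].[_5c3c3d].[V]
where (H IN (1164, 1165, 1166))
Group by H;

Here sum(w.Multiplier) plays the role that Count(WinStraightFlush) plays in the original query, so each villain hand is weighted rather than physically duplicated.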
Any help would be appreciated. Thanks.
EDIT - Added more information at the request of Livius in the comments
H stands for "Hero" and is stored in the table _5c3c3d as a smallint representing the two cards the player is holding (e.g. AcKd, Js4h etc.). V stands for "Villain" and similarly represents the cards the opponent is holding, encoded the same way. The encoding and decoding take place in the code. These two fields form the clustered index for the _5c3c3d table. The remaining field in this table is Id, another smallint, which is used to join to the table Lookup5c3c3d, which contains all the equity information for the hero's hand against the villain's hand for the flop 5c3c3d.
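From that description, the shape of _5c3c3d is roughly the following (a sketch inferred from the text above, not the actual DDL):

-- Sketch inferred from the description: H and V encode hole cards,
-- (H, V) is the clustered index, Id joins to Lookup5c3c3d.
CREATE TABLE [FlopLookup].[dbo].[_5c3c3d] (
    H  smallint NOT NULL, -- Hero's hole cards, encoded
    V  smallint NOT NULL, -- Villain's hole cards, encoded
    Id smallint NOT NULL, -- joins to Lookup5c3c3d.id
    CONSTRAINT PK_5c3c3d PRIMARY KEY CLUSTERED (H, V)
);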
V2 is just a field in a table I created to try to resolve the problem described above. That table, VillainJoinTable, has V1 (which maps directly to V in _5c3c3d via a join) and V2, which can contain up to 100 numbers per V1 (e.g. when V1 is 1260 it can contain 126000, 126001 ... 126099). This allows me to build an "IN" clause that effectively looks up the equity information for the same V multiple times.
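For illustration, a table like that could be generated along these lines (a sketch; the DDL and the use of sys.all_columns as a row source are assumptions, not the script actually used):

-- Illustrative sketch only: one row per (V1, V2) pair,
-- 100 V2 values per V1, with V2 = V1 * 100 + 0..99.
CREATE TABLE [FlopLookup].[dbo].[VillainJoinTable] (
    V1 smallint NOT NULL,
    V2 int NOT NULL PRIMARY KEY
);

INSERT INTO [FlopLookup].[dbo].[VillainJoinTable] (V1, V2)
SELECT v.V1, v.V1 * 100 + d.n
FROM (SELECT TOP (1326) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS V1
      FROM sys.all_columns) AS v
CROSS JOIN (SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
            FROM sys.all_columns) AS d;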
Here are some screenshots showing the structure of the three tables and some sample data from _5c3c3d, Lookup5c3c3d and VillainJoinTable.

Related

Solve system of equations with data loaded, loop through group IDs and different observations

I have data for a large number of group IDs, and each group ID has anywhere from 4 to 30 observations. I would like to solve a (linear or nonlinear, depending on the approach) system of equations in Matlab using this data. I want to solve a system of three equations and three unknowns, while also loading in data for the known variables. I need observations 2 through 4 in order to solve this, but I would also like to move to the next set of 3 observations (if it exists) to see how the solutions change, and I would like to record these calculations as well.
What is the best way to accomplish this? I have a standard idea of how to solve the system using fsolve, but what is the best way to loop through group IDs with varying numbers of observations?
Here is some sample code I have written when thinking about this issue:
%%Load Data
Data = readtable('dataset.csv'); % Full Dataset
%Define Variables
%Main Data
groupID = Data{:,1};
Known1 = Data{:,7};
Known2 = Data{:,8};
Known3 = Data{:,9};
%%%%% Function %%%%%
% Unknowns: f = [A; B; C]; D2, D3 and D4 are known constants defined elsewhere
% Define the function handle for the system of equations
fun = @(f) [f(1)^2 + f(2)*Known3 - 2*f(3)*Known1 + 1/Known2 - D2;
            f(1) + (f(2)^2)*Known3 - f(3)*Known1 + 1/Known2 - D3;
            f(1) - f(2)*Known3 + (f(3)^2)*Known1 + 1/Known2 - D4];
% Define the initial guess for the solution
f0 = [0; 0; 0];
% Solve the nonlinear system of equations
f = fsolve(fun, f0);
%%%% Create Loop %%%%
% Set the number of observations to load at a time
numObservations = 3;
% Set the initial group ID
groupID = 1;
% Set the maximum number of groups
maxGroups = 100;
% Seed the solution with the initial guess
x = f0;
% Loop through the groups of data
while groupID <= maxGroups
    % Load the data for the current group (loadData is a placeholder for
    % however the group's observations are actually retrieved)
    data = loadData(groupID, numObservations);
    % Update the solution using the new data
    x = fsolve(fun, x);
    % Print the updated solution
    disp(x);
    % Move on to the next group of data
    groupID = groupID + 1;
end
What are the pitfalls with writing the code like this, and how can I improve it?

Weighted Average w/ Array Formula & Query That Pulls From A Separate Sheet

Link To Sheet
So I've got an array formula which I've included below. I need to adjust this so that it becomes a weighted average based on variables stored on a sheet titled Variables.
Current Formula:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E;
PROPER(ADP!J3:J),ADP!S3:S;
PROPER(ADP!Z3:Z),ADP!AG3:AG},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
Here's what I thought would work but doesn't:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E*(Variables!$F$11/Variables!$F$14);
PROPER(ADP!J3:J),ADP!S3:S*(Variables!$F$12/Variables!$F$14);
PROPER(ADP!Z3:Z),ADP!AG3:AG*(Variables!$F$13/Variables!$F$14)},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
What I'm trying to get is the value pulled in K to be multiplied by the value in Variables!F11, the value pulled in Y to be multiplied by Variables!F12, and the value in AL to be multiplied by Variables!F13, and then have that numerator divided by the value in Variables!F14.
After our extensive chat, I'm providing the answer we came up with here, on the chance it might help someone else. The issue in your case was less about the technicalities of the formula and more about the structuring of multiple data sources and the associated logic to pull the data together.
Here is the main formula:
={"Adjusted
Ranking
by " & Variables!F21;
arrayformula(
if(A2:A<>"",
( if(((D2:D>0) * Source1Used),D2:D,Variables!$F$21)*Variables!$F$12
+ if(((F2:F>0) * Source2Used),F2:F,Variables!$F$21)*Variables!$F$13
+ if(((H2:H>0) * Source3Used),H2:H,Variables!$F$21)*Variables!$F$14
+ if(((J2:J>0) * Source4Used),J2:J,Variables!$F$21)*Variables!$F$15
+ if(((L2:L>0) * Source5Used),L2:L,Variables!$F$21)*Variables!$F$16
+ if(((N2:N>0) * Source6Used),N2:N,Variables!$F$21)*Variables!$F$17 )) / Variables!$F$18) }
A2:A is the list of players' names. The D2:D>0 condition tests whether that player has a rating obtained from a particular data source.
Source1Used is a named range for a tickbox cell, where the user can indicate whether that data source is to be included in the calculations.
This formula creates an average value, using from 1 to 6 possible sources, user selectable.
The formula that gave the rating value for one specific source is as follows:
={"Rating in
Source1";ArrayFormula(if(A2:A<>"",if(C2:C,vlookup(A2:A,indirect("ADP!$" & ADP!E3 & "$10:" & ADP!E5),ADP!E6-ADP!E4+1,0),0),""))}
This takes a name in column A, checks if it is listed in a specific source's data, and if so, it pulls back the rating value from the data source. INDIRECT is used since the column locations for each data source may vary, but are obtained from a fixed table, in cells ADP!E3 and E5. E4 and E6 are the numeric values of the column letters.

Dimension for geozones or Lat & Long in data warehouse

I have a DimPlace dimension that has the name of the place (manually entered by the user) and the latitude and longitude of the place (automatically captured). Since the places are entered manually, the same place could be in there multiple times with different names; additionally, two distinct places could be very close to each other.
We want to be able to analyze the MPG between two "places" but we want to group them to make a larger area - i.e. using lat & long put all the various spellings of one location, as well as distinct but very close locations, in one record.
I am planning on making a new dimension for this - something like DimPlaceGeozone. I am looking for a resource to help with loading all the lat & long values mapped to ... something?? Maybe postal code, or city name? Sometimes you can find a script to load common dimensions (like DimTime) - I would love something similar for lat & long values in North America?
I've done something similar in the past... The one stumbling block I hit up front was that 2 locations straddling a border could be physically closer together than 2 locations that are both in the same area.
I got around it by creating a "double grid" system that causes each location to fall into 4 areas. That way, if 2 locations share at least 1 "area", you know they are within range of each other.
Here's an example, covering most of the United States...
IF OBJECT_ID('tempdb..#LatLngAreas', 'U') IS NOT NULL
    DROP TABLE #LatLngAreas;
GO
WITH
cte_Lat AS (
    SELECT
        t.n,
        BegLatRange = -37.9 + (t.n / 10.0),
        EndLatRange = -37.7 + (t.n / 10.0)
    FROM
        dbo.tfn_Tally(1030, 0) t
),
cte_Lng AS (
    SELECT
        t.n,
        BegLngRange = -159.7 + (t.n / 10.0),
        EndLngRange = -159.5 + (t.n / 10.0)
    FROM
        dbo.tfn_Tally(3050, 0) t
)
SELECT
    Area_ID = ROW_NUMBER() OVER (ORDER BY lat.n, lng.n),
    lat.BegLatRange,
    lat.EndLatRange,
    lng.BegLngRange,
    lng.EndLngRange
INTO #LatLngAreas
FROM
    cte_Lat lat
    CROSS JOIN cte_Lng lng;

SELECT
    b3.Branch_ID,
    b3.Name,
    b3.Lat,
    b3.Lng,
    lla.Area_ID
FROM
    dbo.ContactBranch b3 -- replace with DimPlace
    JOIN #LatLngAreas lla
        ON b3.Lat BETWEEN lla.BegLatRange AND lla.EndLatRange
        AND b3.Lng BETWEEN lla.BegLngRange AND lla.EndLngRange;
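Note: dbo.tfn_Tally above is a user-defined numbers (tally) function that isn't shown. A minimal stand-in matching the way it's called here (row count, starting value) might look like this; the exact signature is an assumption:

-- Hypothetical stand-in for dbo.tfn_Tally: returns @NumRows sequential
-- integers (column n) starting at @StartWith.
CREATE FUNCTION dbo.tfn_Tally (@NumRows BIGINT, @StartWith BIGINT)
RETURNS TABLE
AS
RETURN
    SELECT TOP (@NumRows)
        n = @StartWith - 1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2; -- cross join supplies plenty of rows
GO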
HTH,
Jason

How do I match a substring of variable length?

I am importing data into my SQL database from an Excel spreadsheet.
The imp table is the imported data, the app table is the existing database table.
app.ReceiptId is formatted as "A" followed by some numbers. Formerly it was 4 digits, but now it may be 4 or 5 digits.
Examples:
A1234
A9876
A10001
imp.ref is a free-text reference field from Excel. It consists of some arbitrary length description, then the ReceiptId, followed by an irrelevant reference number in the format " - BZ-0987654321" (which is sometimes cropped short, or even missing entirely).
Examples:
SHORT DESC A1234 - BZ-0987654321
LONGER DESCRIPTION A9876 - BZ-123
REALLY LONG DESCRIPTION A2345 - B
REALLY REALLY LONG DESCRIPTION A23456
The code below works for a 4-digit ReceiptId, but will not correctly capture a 5-digit one.
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId = right(right(rtrim(replace(replace(imp.ref,'-',''),'B','')),5)
+ rtrim(left(imp.ref,charindex(' - BZ-',imp.ref))),5)
How can I change the code so it captures either 4 (A1234) or 5 (A12345) digits?
As ughai rightfully wrote in his comment, it's not recommended to use anything other than columns in the ON clause of a join.
The reason is that using functions prevents SQL Server from using any indexes on the columns that it might have used without the functions.
Therefore, I would suggest adding another column to the imp table that will hold the actual ReceiptId and is calculated during the import process itself.
I think the best way of extracting the ReceiptId from the ref column is using substring with patindex, as demonstrated in this fiddle:
SELECT ref,
RTRIM(SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9]%', ref), 6)) As ReceiptId
FROM imp
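As a sketch of that import-time approach (assuming the imp table can simply be altered; the column name, length and index name here are illustrative):

-- Illustrative only: persist the extracted ReceiptId once, at import time,
-- so the join can use a plain, indexable column.
ALTER TABLE imp ADD ReceiptId varchar(6) NULL;
GO
UPDATE imp
SET ReceiptId = RTRIM(SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9]%', ref), 6));

CREATE INDEX IX_imp_ReceiptId ON imp (ReceiptId);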
Update
After the conversation with t-clausen-dk in the comments, I came up with this:
SELECT ref,
CASE WHEN PATINDEX('%[ ]A[0-9][0-9][0-9][0-9][0-9| ]%', ref) > 0
OR PATINDEX('A[0-9][0-9][0-9][0-9][0-9| ]%', ref) = 1 THEN
SUBSTRING(ref, PATINDEX('%A[0-9][0-9][0-9][0-9][0-9| ]%', ref), 6)
ELSE
NULL
END As ReceiptId
FROM imp
fiddle here
This will return null if there is no match. A match is a substring containing A followed by 4 or 5 digits, separated by spaces from the rest of the string, and it can be found at the start, middle or end of the string.
Try this; it will remove all characters before the A[number][number][number][number] pattern and take the first 5 or 6 characters from that point:
UPDATE app
SET
[...]
FROM imp
INNER JOIN app
ON app.ReceiptId in
(
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 5),
left(stuff(ref,1, patindex('%A[0-9][0-9][0-9][0-9][0-9][ ]%', imp.ref + ' ') - 1, ''), 6)
)
When comparing with equals, trailing spaces are not evaluated, so the extra space captured for a 4-digit id does not break the match.

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split, a record reader is created which starts reading records from the input split.
If there are n records in an input split, the map method in the mapper is called n times, each time reading a key-value pair using the record reader.
Now coming to the databases perspective
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfiguration and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know :
How are the InputSplits getting created in the above case (database) ?
What does the input split creation depend on, the number of records which my sql query generates or the total number of records in the table (database) ?
How many DBRecordReaders are getting created in the above case (database) ?
How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
    DBInputSplit split;
    if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
    else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
    splits.add(split);
}
There is the how, but to understand what it depends on let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count = SELECT COUNT(*) FROM tableName and divides it by chunks = mapred.map.tasks, or 1 if that is not defined in the configuration. For example, with 70 rows returned by the count query and mapred.map.tasks = 7, each of the 7 splits covers 10 rows.
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance MySQLDBRecordReader for a MySQL database.
For more info check out the source
It appears @Engineiro explained it well by looking at the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So, each map task's record reader has the meta information to construct the query to get a subset of data from the table. Each record reader executes a pagination type of SQL similar to the below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format, the number of record readers equals the number of map tasks.
This post talks about all of this: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
