Dimension for geozones or Lat & Long in data warehouse - sql-server

I have a DimPlace dimension that has the name of the place (manually entered by the user) and the latitude and longitude of the place (captured automatically). Since the places are entered manually, the same place could be in there multiple times with different names; additionally, two distinct places could be very close to each other.
We want to be able to analyze the MPG between two "places", but we want to group them into a larger area - i.e. using lat & long, put all the various spellings of one location, as well as distinct but very close locations, into one record.
I am planning on making a new dimension for this - something like DimPlaceGeozone. I am looking for a resource to help with loading all the lat & long values mapped to ... something - maybe a postal code, or a city name? Sometimes you can find a script to load common dimensions (like DimTime) - I would love something similar for lat & long values in North America.

I've done something similar in the past... The one stumbling block I hit up front was that 2 locations straddling a border could be physically closer together than 2 locations that are both in the same area.
I got around it by creating a "double grid" system that causes each location to fall into 4 overlapping areas. That way, if 2 locations share at least 1 "area", you know they are within range of each other.
Here's an example, covering most of the United States...
IF OBJECT_ID('tempdb..#LatLngAreas', 'U') IS NOT NULL
DROP TABLE #LatLngAreas;
GO
WITH
cte_Lat AS (
    -- 0.2-degree latitude bands stepped by 0.1 degrees, so adjacent bands overlap
    SELECT
        t.n,
        BegLatRange = -37.9 + (t.n / 10.0),
        EndLatRange = -37.7 + (t.n / 10.0)
    FROM
        dbo.tfn_Tally(1030, 0) t -- tally (numbers) function; see the sketch below
),
cte_Lng AS (
    -- 0.2-degree longitude bands stepped by 0.1 degrees, also overlapping
    SELECT
        t.n,
        BegLngRange = -159.7 + (t.n / 10.0),
        EndLngRange = -159.5 + (t.n / 10.0)
    FROM
        dbo.tfn_Tally(3050, 0) t
)
SELECT
    Area_ID = ROW_NUMBER() OVER (ORDER BY lat.n, lng.n),
    lat.BegLatRange,
    lat.EndLatRange,
    lng.BegLngRange,
    lng.EndLngRange
INTO #LatLngAreas
FROM
    cte_Lat lat
    CROSS JOIN cte_Lng lng;

-- Because the bands overlap, each location matches 4 areas (2 lat bands x 2 lng bands)
SELECT
    b3.Branch_ID,
    b3.Name,
    b3.Lat,
    b3.Lng,
    lla.Area_ID
FROM
    dbo.ContactBranch b3 -- replace with DimPlace
    JOIN #LatLngAreas lla
        ON b3.Lat BETWEEN lla.BegLatRange AND lla.EndLatRange
        AND b3.Lng BETWEEN lla.BegLngRange AND lla.EndLngRange;
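The code assumes a dbo.tfn_Tally numbers (tally) function. The original function isn't shown here, so the following is only a minimal sketch of the common inline-TVF pattern, with a (@MaxN, @MinN) signature inferred from the calls above:

CREATE FUNCTION dbo.tfn_Tally (@MaxN BIGINT, @MinN BIGINT)
RETURNS TABLE
AS
RETURN (
    -- Stacked CROSS JOINs of a 10-row constant list yield up to 10,000 rows
    -- (stack further for more); numbered and offset to run @MinN..@MaxN inclusive.
    WITH n1 AS (SELECT n FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) v(n)),
         n2 AS (SELECT 0 AS n FROM n1 a CROSS JOIN n1 b), -- 100 rows
         n4 AS (SELECT 0 AS n FROM n2 a CROSS JOIN n2 b)  -- 10,000 rows
    SELECT TOP (@MaxN - @MinN + 1)
           n = @MinN + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
    FROM n4
);

With each place matched to 4 Area_ID values, places that share at least one Area_ID are candidates to group into a single geozone record.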
HTH,
Jason

Related

Weighted Average w/ Array Formula & Query That Pulls From A Separate Sheet

Link To Sheet
So I've got an array formula which I've included below. I need to adjust this so that it becomes a weighted average based on variables stored on a sheet titled Variables.
Current Formula:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E;
PROPER(ADP!J3:J),ADP!S3:S;
PROPER(ADP!Z3:Z),ADP!AG3:AG},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
Here's what I thought would work but doesn't:
=ARRAYFORMULA(QUERY(
{PROPER(ADP!A3:A),ADP!E3:E*(Variables!$F$11/Variables!$F$14);
PROPER(ADP!J3:J),ADP!S3:S*(Variables!$F$12/Variables!$F$14);
PROPER(ADP!Z3:Z),ADP!AG3:AG*(Variables!$F$13/Variables!$F$14)},
"select Col1, Sum(Col2)
where
Col2 is not null and
Col1 is not null
group by Col1
order by Sum(Col2)
label
Col1 'PLAYER',
Sum(Col2) 'ADP AVG'"))
What I'm trying to get is the value pulled in K multiplied by the value in Variables!F11, the value pulled in Y multiplied by Variables!F12, and the value in AL multiplied by Variables!F13 - and then that numerator divided by the value in Variables!F14.
After our extensive chat, I'm providing the answer we came up with here, on the chance it might help someone else. The issue in your case was less about the technicalities of the formula, and more about the structuring of multiple data sources and the associated logic to pull the data together.
Here is the main formula:
={"Adjusted
Ranking
by " & Variables!F21;
arrayformula(
if(A2:A<>"",
( if(((D2:D>0) * Source1Used),D2:D,Variables!$F$21)*Variables!$F$12
+ if(((F2:F>0) * Source2Used),F2:F,Variables!$F$21)*Variables!$F$13
+ if(((H2:H>0) * Source3Used),H2:H,Variables!$F$21)*Variables!$F$14
+ if(((J2:J>0) * Source4Used),J2:J,Variables!$F$21)*Variables!$F$15
+ if(((L2:L>0) * Source5Used),L2:L,Variables!$F$21)*Variables!$F$16
+ if(((N2:N>0) * Source6Used),N2:N,Variables!$F$21)*Variables!$F$17 )) / Variables!$F$18) }
A2:A is the list of players' names. D2:D>0 tests whether that player has a rating obtained from a particular data source.
Source1Used is a named range for a tickbox cell, where the user can indicate whether that data source is to be included in the calculations.
This formula creates an average value from 1 to 6 possible sources, selectable by the user.
The formula that gave the rating value for one specific source is as follows:
={"Rating in
Source1";ArrayFormula(if(A2:A<>"",if(C2:C,vlookup(A2:A,indirect("ADP!$" & ADP!E3 & "$10:" & ADP!E5),ADP!E6-ADP!E4+1,0),0),""))}
This takes a name in column A, checks whether it is listed in a specific source's data, and if so, pulls back the rating value from that data source. INDIRECT is used since the column locations for each data source may vary; they are obtained from a fixed table, in cells ADP!E3 and ADP!E5, while E4 and E6 hold the numeric values of those column letters.

How can I add values that do not exist to a chart as 0 in Google Data Studio?

I have got 4 tables in BigQuery that keep statistics for messages in a message queue. The tables are: receivedMessages, processedMessages, skippedMessages and failedMessages. Each table has, among other things, a header.processingMetadata.approximateArrivalTimestamp field which, as you might have guessed, is a timestamp.
My purpose is to create 4 charts, one for each of these tables, aggregated on this field, as well as a 5th chart that displays the daily percentage of each message category relative to receivedMessages, together with the unknown-status messages, using the following formula:
UNKNOWN_STATUS_MESSAGES = TOTAL_RECEIVED_MESSAGES - (TOTAL_PROCESSED_MESSAGES + TOTAL_SKIPPED_MESSAGES + TOTAL_FAILED_MESSAGES)
However, some days have no skipped or failed messages, so there are no records in BigQuery in those two tables for those dates. This results in those two charts having missing dates, and in UNKNOWN_STATUS_MESSAGES not displaying correctly in the 5th chart.
I also used the following code as a metric in my charts, with no success (changing the variable name appropriately each time).
CASE WHEN TOTAL_FAILED_MESSAGES IS NULL THEN 0 ELSE TOTAL_FAILED_MESSAGES END
Is there a way to make Google Data Studio fill the dates that have no data with 0s so I can display the charts correctly?
As long as you know the date boundaries of your chart, you can fill those holes with zeros. For instance, if you want to generate your report for the last 30 days:
with dates as (
  select
    x as date
  from
    unnest(generate_date_array(date_sub(current_date(), interval 30 day), current_date())) as x
),
-- pre-aggregate each table per day so the joins below stay one row per date
-- (joining the raw tables directly would cross-multiply the daily counts)
received as (
  select date(header.processingMetadata.approximateArrivalTimestamp) as date, count(*) as cnt
  from dataset.receivedMessages
  group by 1
),
processed as (
  select date(header.processingMetadata.approximateArrivalTimestamp) as date, count(*) as cnt
  from dataset.processedMessages
  group by 1
),
skipped as (
  select date(header.processingMetadata.approximateArrivalTimestamp) as date, count(*) as cnt
  from dataset.skippedMessages
  group by 1
),
failed as (
  select date(header.processingMetadata.approximateArrivalTimestamp) as date, count(*) as cnt
  from dataset.failedMessages
  group by 1
)
select
  d.date,
  coalesce(r.cnt, 0) as received_messages,
  coalesce(p.cnt, 0) as processed_messages,
  coalesce(s.cnt, 0) as skipped_messages,
  coalesce(f.cnt, 0) as failed_messages,
  coalesce(r.cnt, 0) - (coalesce(p.cnt, 0) + coalesce(s.cnt, 0) + coalesce(f.cnt, 0)) as unknown_messages
from dates d
left join received r on r.date = d.date
left join processed p on p.date = d.date
left join skipped s on s.date = d.date
left join failed f on f.date = d.date
order by d.date
1) I recommend doing a join in BigQuery with a date master table to return 0 for those date values.
2) Otherwise, in Data Studio, make sure there is a field X that has values for all dates, then create calculated fields with the formulas X - X + TOTAL_SKIPPED_MESSAGES and X - X + TOTAL_FAILED_MESSAGES.
As I found out, it is also possible to do this without fixed dates by using date parameters (@DS_START_DATE and @DS_END_DATE are the date-range parameters Data Studio passes into a BigQuery custom query). The first part of khan's answer can then be rewritten as:
WITH dates AS (
  select *
  from unnest(generate_date_array(PARSE_DATE('%Y%m%d', @DS_START_DATE), PARSE_DATE('%Y%m%d', @DS_END_DATE), interval 1 day)) as day
)
)

PostGIS ST_Distance_Spheroid or Haversine

I used ST_Distance_Spheroid in PostgreSQL (with PostGIS) to calculate the distance between Woking and Edinburgh, as follows:
CREATE TABLE pointsTable (
id serial NOT NULL,
name varchar(255) NOT NULL,
location Point NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO pointsTable (name, location) VALUES
( 'Woking', '(51.3168, -0.56)' ),
( 'Edinburgh', '(55.9533, -3.1883)' );
SELECT ST_Distance_Spheroid(geometry(a.location), geometry(b.location), 'SPHEROID["WGS 84",6378137,298.257223563]')
FROM pointsTable a, pointsTable b
WHERE a.id=1 AND b.id=2;
I got a result of 592 km (592,053.100454442 meters).
Unfortunately, when I used various sources on the web to make the same calculation, I consistently got around the 543 km mark, which differs by 8.2%.
source 1 - 338 miles (543.958 km)
source 2 - 544.410km
source 3 - 543.8km
Luckily, the third source clarified that they were using the haversine formula. I am not sure about the other two sources.
Did I do something wrong in my queries or is this down to a difference in the formulas used? If so, which calculation is closest to the shortest distance a crow could fly, keeping a constant elevation?
You swapped the latitude and the longitude: PostGIS points take the x coordinate (longitude) first. If you put them in the right order, you would get 544,430 m. The distance computation uses great-circle arcs, which give the true shortest distance between points over a sphere.
WITH src AS (
select st_geomfromtext('POINT(-0.56 51.3168)',4326) pt1,
st_geomfromtext('POINT(-3.1883 55.9533)',4326) pt2)
SELECT
ST_DistanceSpheroid(pt1, pt2, 'SPHEROID["WGS 84",6378137,298.257223563]') Dist_sphere,
ST_Distance(pt1::geography, pt2::geography) Dist_great_circle
FROM src;
dist_sphere | dist_great_circle
------------------+-------------------
544430.941199621 | 544430.94119962
(1 row)
On a side note, there is a warning
ST_Distance_Spheroid signature was deprecated in 2.2.0. Please use
ST_DistanceSpheroid
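One way to avoid the axis-order trap altogether is to build the points explicitly with ST_MakePoint, which takes longitude (x) first. A minimal sketch using the coordinates from the question:

select ST_Distance(
    ST_SetSRID(ST_MakePoint(-0.56, 51.3168), 4326)::geography,  -- Woking: lon, lat
    ST_SetSRID(ST_MakePoint(-3.1883, 55.9533), 4326)::geography -- Edinburgh: lon, lat
) as dist_meters;

Casting to geography makes ST_Distance return meters over the spheroid rather than degrees.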

SQL Server aggregating data that may contain multiple copies

I am working on some software where I need to do large aggregation of data using SQL Server. The software is helping people play poker better. The query I am using at the moment looks like this:
Select H,
    sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair
      + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush
      + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads
      + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d]
    on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
where (H IN (1164, 1165, 1166))
  AND (V IN (1260, 1311))
Group by H;
This works fine and is the fastest way I have found to do what I am trying to achieve. The problem is that I need to enhance the functionality in a way that allows the aggregation to include multiple instances of V. So, for example, in the above query, instead of just including the data from 1260 and 1311 once, it may need to include 1260 twice and 1311 three times. But obviously just saying
V IN (1260, 1260, 1311, 1311, 1311)
won't work because each unique value is only counted once in an IN clause.
I have come up with a solution to this problem which works, but seems rather clunky. I have created another lookup table which takes the values between 0 and 1325 and assigns them to a field called V1, and for each V1 there are 100 V2 values - e.g. for V1 = 1260 the V2 values range from 126000 through 126099. Then in the main query I join to this table and do the lookup like this:
Select H,
    sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair
      + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush
      + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads
      + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d]
    on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join [FlopLookup].[dbo].[VillainJoinTable]
    on [FlopLookup].[dbo].[VillainJoinTable].[V1] = [FlopLookup].[dbo].[_5c3c3d].[V]
where (H IN (1164, 1165, 1166))
  AND (V2 IN (126000, 126001, 131100, 131101, 131102))
Group by H;
So although it works, it is quite slow. It feels inefficient because it is adding data multiple times, when what would probably be more appropriate is a way of doing this using multiplication - i.e. instead of passing in 126000, 126001, 126002, 126003, 126004, 126005, 126006, 126007, I would pass in 1260 in the original query and then multiply it by 8. But I have not been able to work out a way to do this.
Any help would be appreciated. Thanks.
EDIT - Added more information at the request of Livius in the comments
H stands for "Hero" and is in the table _5c3c3d as a smallint representing the two cards the player is holding (e.g. AcKd, Js4h etc.). V stands for "Villain" and is similar to Hero, but represents the cards the opponent is holding, similarly encoded. The encoding and decoding take place in the application code. These two fields form the clustered index for the _5c3c3d table. The remaining field in this table is Id, another smallint, which is used to join with the table Lookup5c3c3d; that table contains all the equity information for the hero's hand against the villain's hand for the flop 5c3c3d.
V2 is just a field in a table I created to try to resolve the problem: VillainJoinTable has V1 (which maps directly to V in _5c3c3d via a join) and V2, which can contain up to 100 numbers per V1 (e.g. when V1 is 1260 it could contain 126000, 126001 ... 126099). This allows me to build an "IN" clause that effectively looks up the equity information for the same V multiple times.
Here are some screenshots: [images showing the structure of the three tables and sample data from _5c3c3d, Lookup5c3c3d and VillainJoinTable]
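A minimal sketch of the multiplication idea the question is reaching for, assuming SQL Server 2008+ and the tables above (untested against the real schema): join V to an inline table of (V, Multiplier) pairs and weight both the numerator and the row count, so 1260 counts twice and 1311 three times without duplicating any rows.

Select H,
    sum(w.Multiplier * (WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair
      + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush
      + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads
      + WinStraightFlush + ChopStraightFlush))
      / sum(w.Multiplier) as ResultTotal -- weighted row count replaces Count(WinStraightFlush)
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d]
    on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join (values (1260, 2), (1311, 3)) as w(V, Multiplier)
    on w.V = [FlopLookup].[dbo].[_5c3c3d].[V]
where H IN (1164, 1165, 1166)
Group by H;

The arithmetic mirrors the original query (including its integer division), and the VALUES list plays the role of VillainJoinTable without the 100-slot indirection.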

Time values of minutes & seconds only in SSMS

I'm creating a database of music & video files. I would like one of the columns to be the "duration" or "runtime" of the file. Is there a way to show only minutes and seconds in SSMS?
I'm trying to avoid a column that looks like 00:17:30 and would rather have it appear as 17:30.
You can store the amount of time of a music/video fragment in several ways. I'll list some, from what I think is the best way to store it to the worst:
As an INT. Store the length in seconds or milliseconds, whatever resolution you need. Can go up to 2^31 - 1 seconds/milliseconds. (A storage sketch follows this list.)
As a TIME. Denotes a time of day, restricted to 23:59:59.9999999. Resolution depends on the width of the TIME column. Problematic if your music/video fragment is longer than 24 hours.
As a VARCHAR. Not really a good storage type; acceptable only if all you ever want to do with the time is present it. If you want to run queries based on the duration, you'll have to convert it to another type first. Not preferred.
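For illustration, a minimal sketch of the INT option (the table and column names here are hypothetical):

CREATE TABLE dbo.MediaFile (
    Id INT IDENTITY(1,1) PRIMARY KEY,
    Title NVARCHAR(255) NOT NULL,
    DurationSeconds INT NOT NULL -- whole seconds; a 17:30 runtime is stored as 1050
);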
In terms of presentation, a VARCHAR would be easiest, as you wouldn't need to format it any further (that is, if you stored it the way you want it shown). A TIME value would still need tweaking if you want to format it from a query. An INT would also need preparation to present the value you want.
I'd argue that presentation is best left to the presentation layer, so that would be my advice. If you still insist on selecting the value as it should be presented, here is a way to do it for an INT column holding the length in seconds:
DECLARE @total_seconds INT = 2460;
SELECT
    -- hours, emitted only when non-zero
    CASE WHEN (@total_seconds / (60*60)) = 0
        THEN ''
        ELSE (CASE WHEN (@total_seconds / (60*60)) < 10 THEN '0' ELSE '' END) + CAST(@total_seconds / (60*60) AS VARCHAR) + ':'
    END +
    -- minutes, emitted whenever the total reaches a minute (so an hours part is never
    -- followed directly by seconds)
    CASE WHEN (@total_seconds / 60) = 0
        THEN ''
        ELSE (CASE WHEN ((@total_seconds % (60*60)) / 60) < 10 THEN '0' ELSE '' END) + CAST((@total_seconds % (60*60)) / 60 AS VARCHAR) + ':'
    END +
    -- seconds, always emitted, zero-padded
    (CASE WHEN ((@total_seconds % (60*60)) % 60) < 10 THEN '0' ELSE '' END) + CAST((@total_seconds % (60*60)) % 60 AS VARCHAR);
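For what it's worth, on SQL Server 2012+ a shorter route is to build a TIME value and use FORMAT - a sketch, assuming durations under an hour (it drops any hours component, unlike the expression above):

DECLARE @total_seconds INT = 2460;
-- DATEADD turns the seconds into a TIME of day; FORMAT then emits minutes and seconds only
SELECT FORMAT(DATEADD(SECOND, @total_seconds, CAST('00:00' AS TIME)), N'mm\:ss');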
