Using the Snowflake HAVERSINE() function to dynamically map data from the closest locations - snowflake-cloud-data-platform
We are trying to find a way to use the HAVERSINE() function to map locations to data. We have a set of store locations, and we have data for various cities. For each store, we want to identify the closest city for which we have data and then merge that city's data with the store's data (example below). Obviously we could write a Python script to find the closest city and add it as a column to one table, but we were hoping to accomplish this with a query or view, so that when we add new stores or cities we don't need to re-run a mapping script. The only approach I could think of is a correlated subquery in the select list, which I don't believe Snowflake supports. Is there another way to do this?
Thanks,
J
Stores Table:
+-----------+-------+-----------+------------+--------------+
| City      | State | Store_Num | Lat        | Lon          |
+-----------+-------+-----------+------------+--------------+
| Buckhead  | GA    | 1         | 33.8399734 | -84.4701434  |
| Villanova | PA    | 2         | 40.0369415 | -75.365584   |
| Boulder   | CO    | 3         | 40.0294202 | -105.3101889 |
+-----------+-------+-----------+------------+--------------+
Data Table:
+--------+-------+---------------+------------+--------------+
| Date   | Value | City          | Lat        | Lon          |
+--------+-------+---------------+------------+--------------+
| 1/1/20 | 10    | Atlanta       | 33.7678357 | -84.4908155  |
| 1/2/20 | 15    | Atlanta       | 33.7678357 | -84.4908155  |
| 1/3/20 | 13    | Atlanta       | 33.7678357 | -84.4908155  |
| 1/1/20 | 11    | Denver        | 39.7645183 | -104.9955382 |
| 1/2/20 | 12    | Denver        | 39.7645183 | -104.9955382 |
| 1/3/20 | 14    | Denver        | 39.7645183 | -104.9955382 |
| 1/1/20 | 20    | Philadelphia  | 40.0026763 | -75.258455   |
| 1/2/20 | 25    | Philadelphia  | 40.0026763 | -75.258455   |
| 1/3/20 | 22    | Philadelphia  | 40.0026763 | -75.258455   |
| 1/1/20 | 5     | Atlantic City | 39.376672  | -74.4879282  |
| 1/2/20 | 7     | Atlantic City | 39.376672  | -74.4879282  |
| 1/3/20 | 10    | Atlantic City | 39.376672  | -74.4879282  |
+--------+-------+---------------+------------+--------------+
Desired Outcome:
+--------+-----------+--------------+------------+---------------+
| Date   | Store_Num | Data_City    | Data_Value | Data_Distance |
+--------+-----------+--------------+------------+---------------+
| 1/1/20 | 1         | Atlanta      | 10         | 8,248         |
| 1/2/20 | 1         | Atlanta      | 15         | 8,248         |
| 1/3/20 | 1         | Atlanta      | 13         | 8,248         |
| 1/1/20 | 3         | Denver       | 11         | 39,864        |
| 1/2/20 | 3         | Denver       | 12         | 39,864        |
| 1/3/20 | 3         | Denver       | 14         | 39,864        |
| 1/1/20 | 2         | Philadelphia | 20         | 9,889         |
| 1/2/20 | 2         | Philadelphia | 25         | 9,889         |
| 1/3/20 | 2         | Philadelphia | 22         | 9,889         |
+--------+-----------+--------------+------------+---------------+
I don't know where Atlantic City has gone in your desired output (presumably it drops out because it is not the closest city to any store), but if your data set is small, you can use the following query:
WITH stores (City, State, Store_Num, Lat, Lon) AS (
    SELECT * FROM VALUES
        ('Buckhead', 'GA', 1, 33.8399734, -84.4701434),
        ('Villanova', 'PA', 2, 40.0369415, -75.365584),
        ('Boulder', 'CO', 3, 40.0294202, -105.3101889)
),
data_table (Date, Value, City, Lat, Lon) AS (
    SELECT * FROM VALUES
        ('1/1/20', 10, 'Atlanta', 33.7678357, -84.4908155),
        ('1/2/20', 15, 'Atlanta', 33.7678357, -84.4908155),
        ('1/3/20', 13, 'Atlanta', 33.7678357, -84.4908155),
        ('1/1/20', 11, 'Denver', 39.7645183, -104.9955382),
        ('1/2/20', 12, 'Denver', 39.7645183, -104.9955382),
        ('1/3/20', 14, 'Denver', 39.7645183, -104.9955382),
        ('1/1/20', 20, 'Philadelphia', 40.0026763, -75.258455),
        ('1/2/20', 25, 'Philadelphia', 40.0026763, -75.258455),
        ('1/3/20', 22, 'Philadelphia', 40.0026763, -75.258455),
        ('1/1/20', 5, 'Atlantic City', 39.376672, -74.4879282),
        ('1/2/20', 7, 'Atlantic City', 39.376672, -74.4879282),
        ('1/3/20', 10, 'Atlantic City', 39.376672, -74.4879282)
)
SELECT d.Date, s.Store_Num, d.City, d.Value,
       HAVERSINE(s.Lat, s.Lon, d.Lat, d.Lon) AS distance
FROM data_table d
CROSS JOIN stores s
QUALIFY ROW_NUMBER() OVER (
           PARTITION BY d.City, d.Date
           ORDER BY HAVERSINE(s.Lat, s.Lon, d.Lat, d.Lon)
       ) = 1;
For details on QUALIFY, see the Snowflake documentation:
https://docs.snowflake.com/en/sql-reference/constructs/qualify.html
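In short, QUALIFY filters rows after window functions have been evaluated, much as HAVING filters after aggregation, which is what lets the ROW_NUMBER() = 1 trick above keep only the nearest match per partition. A minimal sketch of the pattern (the table and column names here are made up for illustration):

SELECT item, price
FROM products
QUALIFY ROW_NUMBER() OVER (PARTITION BY item ORDER BY price) = 1;  -- keeps the cheapest row per item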
The output is:
+--------+-----------+---------------+-------+---------------+
| DATE | STORE_NUM | CITY | VALUE | DISTANCE |
+--------+-----------+---------------+-------+---------------+
| 1/1/20 | 1 | Atlanta | 10 | 8.245620101 |
| 1/2/20 | 1 | Atlanta | 15 | 8.245620101 |
| 1/3/20 | 1 | Atlanta | 13 | 8.245620101 |
| 1/1/20 | 2 | Atlantic City | 5 | 105.009087658 |
| 1/2/20 | 2 | Atlantic City | 7 | 105.009087658 |
| 1/3/20 | 2 | Atlantic City | 10 | 105.009087658 |
| 1/1/20 | 3 | Denver | 11 | 39.851626235 |
| 1/2/20 | 3 | Denver | 12 | 39.851626235 |
| 1/3/20 | 3 | Denver | 14 | 39.851626235 |
| 1/1/20 | 2 | Philadelphia | 20 | 9.886319193 |
| 1/2/20 | 2 | Philadelphia | 25 | 9.886319193 |
| 1/3/20 | 2 | Philadelphia | 22 | 9.886319193 |
+--------+-----------+---------------+-------+---------------+
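Two notes on the output above. First, Snowflake's HAVERSINE returns kilometers, so these distances are km; your desired output looks like meters, so multiply by 1000 if that is the unit you want. Second, this query partitions by city, keeping the nearest store for every city, which is why Atlantic City still appears. To match your desired outcome exactly (the nearest city for each store, with Atlantic City dropping out), partition by the store instead, and wrap the query in a view so new stores and cities are picked up automatically with no mapping script to re-run. A sketch, assuming your real tables are called stores and data_table and the view name is up to you:

CREATE OR REPLACE VIEW store_city_data AS
SELECT d.Date,
       s.Store_Num,
       d.City AS Data_City,
       d.Value AS Data_Value,
       HAVERSINE(s.Lat, s.Lon, d.Lat, d.Lon) * 1000 AS Data_Distance  -- meters, to match your example
FROM data_table d
CROSS JOIN stores s
QUALIFY ROW_NUMBER() OVER (
           PARTITION BY s.Store_Num, d.Date   -- nearest city per store, not per city
           ORDER BY HAVERSINE(s.Lat, s.Lon, d.Lat, d.Lon)
       ) = 1;

The cross join computes a distance for every store/city pair, which is fine for hundreds of locations but will get expensive on very large tables.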