Split Address column value using a regular expression in Snowflake SQL
There is a column named Address in a Snowflake table. I need to split that column into multiple columns.
Below is a sample regular expression used in some Python code:
apt_pattern = r'(?i)(?P<StreetNum>[0-9]+)(?P<StreetName>.*)\s(?P<UnitType>APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING)\s(?P<Unit>.*)$'
Likewise, I need to split the Address column into StreetNum, StreetName, UnitType, and Unit using Snowflake SQL.
Below is some sample data:
Address
616 NE CHERY DR UNIT A1008
740 NE 3RD ST # 3-1999
13456 SW HAKS BTPD ST APT 1052
460 MAIN ST BUILDING C STE 480
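You can use the REGEXP_SUBSTR function to extract the subgroups from the matched expression. The 'e' parameter tells it to extract a sub-match, and the last argument selects which capture group to return (use 'ie' instead of 'e' if the match should also be case-insensitive, like (?i) in the Python pattern):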
select
REGEXP_SUBSTR( address, '([0-9]+) (.*) (APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING) (.*)',1,1,'e',1 ) StreetNum,
REGEXP_SUBSTR( address, '([0-9]+) (.*) (APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING) (.*)',1,1,'e',2 ) StreetName,
REGEXP_SUBSTR( address, '([0-9]+) (.*) (APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING) (.*)',1,1,'e',3 ) UnitType,
REGEXP_SUBSTR( address, '([0-9]+) (.*) (APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING) (.*)',1,1,'e',4 ) Unit
from values
('616 NE CHERY DR UNIT A1008'),
('740 NE 3RD ST # 3-1999'),
('13456 SW HAKS BTPD ST APT 1052'),
('460 MAIN ST BUILDING C STE 480') tmp (address);
+-----------+--------------------+----------+--------+
| STREETNUM | STREETNAME | UNITTYPE | UNIT |
+-----------+--------------------+----------+--------+
| 616 | NE CHERY DR | UNIT | A1008 |
| 740 | NE 3RD ST | # | 3-1999 |
| 13456 | SW HAKS BTPD ST | APT | 1052 |
| 460 | MAIN ST BUILDING C | STE | 480 |
+-----------+--------------------+----------+--------+
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html
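For comparison, here is a minimal Python sketch that runs the pattern from the question against the same four sample addresses, which can be handy for checking the SQL output. The names `APT_PATTERN`, `ADDRESSES`, and `split_address` are made up for this illustration, and the `.strip()` call is an added assumption: the Python pattern has no separator between the street number and the street name, so the captured StreetName starts with a space.

```python
import re

# Pattern copied from the question; each named group becomes one output column.
APT_PATTERN = re.compile(
    r'(?i)(?P<StreetNum>[0-9]+)(?P<StreetName>.*)\s'
    r'(?P<UnitType>APT|#|UNIT|NBR|STE|SUITE|BLDG|BUILDING)\s(?P<Unit>.*)$'
)

# Sample data from the question.
ADDRESSES = [
    "616 NE CHERY DR UNIT A1008",
    "740 NE 3RD ST # 3-1999",
    "13456 SW HAKS BTPD ST APT 1052",
    "460 MAIN ST BUILDING C STE 480",
]


def split_address(address):
    """Return the four parts as a dict, or None when the pattern does not match."""
    match = APT_PATTERN.search(address)
    if match is None:
        return None
    # StreetName is captured with the space that follows the number, so strip it.
    return {name: value.strip() for name, value in match.groupdict().items()}


rows = [split_address(address) for address in ADDRESSES]
for row in rows:
    print(row)
```

Because `.*` is greedy, the last unit keyword in the string wins, so "460 MAIN ST BUILDING C STE 480" yields the street name "MAIN ST BUILDING C" and the unit type "STE", the same as the SQL result above.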
Related
Traversing and Getting Nodes in Graph without Loop
I have a person table which keeps some personal info, like the table below.
+----+------+----------+----------+--------+
| ID | name | motherID | fatherID | sex    |
+----+------+----------+----------+--------+
| 1  | A    | NULL     | NULL     | male   |
| 2  | B    | NULL     | NULL     | female |
| 3  | C    | 1        | 2        | male   |
| 4  | X    | NULL     | NULL     | male   |
| 5  | Y    | NULL     | NULL     | female |
| 6  | Z    | 5        | 4        | female |
| 7  | T    | NULL     | NULL     | female |
+----+------+----------+----------+--------+
I also keep marriage relationships between people, like:
+-----------+--------+
| HusbandID | WifeID |
+-----------+--------+
| 1         | 2      |
| 4         | 5      |
| 1         | 5      |
| 3         | 6      |
+-----------+--------+
With this information we can imagine the relationship graph.
The question is: how can I get all connected people by giving the ID of any one of them? For example:
When I give ID=1, it should return 1,2,3,4,5,6 (order is not important).
Likewise, when I give ID=6, it should return 1,2,3,4,5,6 (order is not important).
Likewise, when I give ID=7, it should return 7.
Please note: the relationships (edges) between person nodes may form a loop anywhere in the graph. The example above shows a small part of my data. The person and marriage tables may consist of thousands of rows, and we do not know where loops may occur.
Similar questions were asked in:
PostgreSQL SQL query for traversing an entire undirected graph and returning all edges found
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=118319
But I can't write working SQL for it. Thanks in advance. I am using SQL Server.
From SQL Server 2017 and Azure SQL DB you can use the new graph database capabilities and the new MATCH clause to answer queries like this, e.g.
SELECT FORMATMESSAGE ( 'Person %s (%i) has mother %s (%i) and father %s (%i).', person.userName, person.personId, mother.userName, mother.personId, father.userName, father.personId ) msg
FROM dbo.persons person, dbo.relationship hasMother, dbo.persons mother, dbo.relationship hasFather, dbo.persons father
WHERE hasMother.relationshipType = 'mother'
AND hasFather.relationshipType = 'father'
AND MATCH ( father-(hasFather)->person<-(hasMother)-mother );
Full script available here.
For your specific questions, the current release does not include transitive closure (the ability to loop through the graph n number of times) or polymorphism (find any node in the graph), and answering these queries may involve loops, recursive CTEs or temp tables. I have attempted this in my sample script and it works for your sample data, but it's just an example. I'm not 100% sure it will work with other sample data.
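To make the traversal itself concrete, here is a small Python sketch of what a loop or recursive CTE has to compute: a breadth-first search over an undirected graph, with a visited set so that cycles cannot cause endless looping. It is not SQL Server code and not the answerer's script. It only uses the sample rows from the question, and the names `PERSONS`, `MARRIAGES`, `build_adjacency`, and `connected_people` are hypothetical.

```python
from collections import defaultdict, deque

# Sample rows from the question: (ID, motherID, fatherID) and (HusbandID, WifeID).
PERSONS = [(1, None, None), (2, None, None), (3, 1, 2), (4, None, None),
           (5, None, None), (6, 5, 4), (7, None, None)]
MARRIAGES = [(1, 2), (4, 5), (1, 5), (3, 6)]


def build_adjacency(persons, marriages):
    """Treat parent/child and marriage links as undirected edges."""
    adjacency = defaultdict(set)
    for person_id, mother_id, father_id in persons:
        adjacency[person_id]  # touch the key so isolated people still get a node
        for parent_id in (mother_id, father_id):
            if parent_id is not None:
                adjacency[person_id].add(parent_id)
                adjacency[parent_id].add(person_id)
    for husband_id, wife_id in marriages:
        adjacency[husband_id].add(wife_id)
        adjacency[wife_id].add(husband_id)
    return adjacency


def connected_people(start_id, adjacency):
    """Breadth-first search; the visited set keeps cycles from looping forever."""
    visited = {start_id}
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current] - visited:
            visited.add(neighbour)
            queue.append(neighbour)
    return sorted(visited)


ADJACENCY = build_adjacency(PERSONS, MARRIAGES)
for start in (1, 6, 7):
    print(start, connected_people(start, ADJACENCY))
```

In SQL the same idea is usually expressed by repeatedly joining the set of people found so far against the edge list until no new IDs appear.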
Transforming to 3NF
I need to reduce a model DB to 3NF. However, there is a column in the data that's very ambiguous. The database has the following columns. (Apologies for formatting, I did try)
Employer ID | ContractNo | Hours | emp Name | workNo | workLocation
123         | A1         | 10    | J Smith  | W36    | New York
124         | A1         | 7     | P Jones  | W36    | New York
125         | A2         | 9     | R Lewis  | W37    | Los Angeles
123         | A2         | 9     | J Smith  | W37    | Los Angeles
Each employee has a unique ID, an employee can work at more than one location, and each location has a unique workNo. I'm just a bit stuck on where to include the ContractNo. There is no indication in the question of what it is actually for.
So my first step was splitting it up into a table with EmployerID, employee Name and hours, and a second table with WorkNo and WorkLocation. But what do I make of that bloody ContractNo?
I expect the contract is likely a separate entity, capturing the nature of the relationship between contractor and contractee. Image from QuickDBD, where I work.
Trying to do a Sum(MAX) calculation in SQL Server
Thanks for all the help that you all provided. Though it was an eye opener, unfortunately it did not produce the results I was looking for. To better get the help I need, I will try to explain what I'm looking to achieve.
I think the main columns of focus are "IN", "AA_Now", "STF_Now", "dbo.Sheet1$.LOB_name", "dbo.Sheet1$.LifeCycleName" and "dbo.Sheet1$.AreaOfBusiness".
Each "IN" has an "AA_Now" and "STF_Now". A group of "IN" rolls up under "dbo.Sheet1$.LOB_name". Under "dbo.Sheet1$.LOB_name" I just want the max value of the group of "IN" that is rolled up.
Now "dbo.Sheet1$.LOB_name" is rolled up under "dbo.Sheet1$.LifeCycleName", and what I want is the sum of the max values that are rolled up under "dbo.Sheet1$.LOB_name" to show in the rollup of "dbo.Sheet1$.LifeCycleName".
Finally, "dbo.Sheet1$.LifeCycleName" rolls up to "dbo.Sheet1$.AreaOfBusiness". As before, what I'm looking for is the sum of "dbo.Sheet1$.LifeCycleName" to show.
This applies only to the columns "AA_Now" and "STF_Now". I tried doing it from a pivot table but to no avail, and figured that it would be best to sort it out in the raw data.
I'm trying to do a SUM(MAX) calculation in SQL Server and get the following error when executing the command:
Msg 130, Level 15, State 1, Line 6
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
I'm sure the error is caused by both
,SUM(MAX(convert(float,replace([AA_Now], 'N/A','0')))) As [AA2_Now]
and
,SUM(MAX(convert(float,replace([STF_Now], 'N/A','0')))) As [STF2_Now]
but I have no idea how to rewrite it without causing an error. Below is the full code.
SELECT dbo.CCA_Merged.id, dbo.CCA_Merged.timeStamp, dbo.CCA_Merged.name, dbo.CCA_Merged.lN
,dbo.CCA_Merged.type, dbo.CCA_Merged.id2, dbo.CCA_Merged.aG
,dbo.CCA_Merged.regionId, dbo.CCA_Merged.sgcc
,convert(float,replace([SLC_Today],'N/A','0')) As [SLC_Today]
,convert(float,replace([AA_Now],'N/A','0')) As [AA_Now]
,SUM(MAX(convert(float,replace([AA_Now],'N/A','0')))) As [AA2_Now]
,convert(float,replace([SLCO_Today],'N/A','0')) As [SLCO_Today]
,convert(float,replace([CABN_Today],'N/A','0')) As [CABN_Today]
,convert(float,replace([COF_Today],'N/A','0')) As [COF_Today]
,convert(float,replace([HT_Today],'N/A','0')) As [HT_Today]
,convert(float,replace(replace([CH_Today],'N/A','0'),'-','0')) As [CH_Today]
,convert(float,replace([SLC_Now],'N/A','0')) As [SLC_Now]
,convert(float,replace([SLCO_Now],'N/A','0')) As [SLCO_Now]
,convert(float,replace([SLC_Thirty],'N/A','0')) As [SLC_Thirty]
,convert(float,replace(replace([SLCO_Thirty],'N/A','0'),'-','0')) As [SLCO_Thirty]
,convert(float,replace([ACWT_Today],'N/A','0')) As [ACWT_Today]
,convert(float,replace([CQ_Now],'N/A','0')) As [CQ_Now]
,convert(float,replace([LCQ_Now],'N/A','0')) As [LCQ_Now]
,convert(float,replace([SLCH_Now],'N/A','0')) As [SLCH_Now]
,convert(float,replace([STF_Now],'N/A','0')) As [STF_Now]
,SUM(MAX(convert(float,replace([STF_Now],'N/A','0')))) As [STF2_Now]
,dbo.Sheet1$.AreaOfBusiness, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.LOB_name
FROM dbo.Sheet1$
RIGHT OUTER JOIN dbo.CCA_Merged ON dbo.Sheet1$.Skill_Name = dbo.CCA_Merged.lN
Group by ROLLUP (stf_now)
,dbo.CCA_Merged.id, dbo.CCA_Merged.timeStamp, dbo.CCA_Merged.name, dbo.CCA_Merged.lN
,dbo.CCA_Merged.type, dbo.CCA_Merged.id2, dbo.CCA_Merged.aG
,dbo.CCA_Merged.regionId
,dbo.CCA_Merged.sgcc,AA_Now,SLC_Today,SLCO_Today,CABN_Today,COF_Today,HT_Today,CH_Today
,SLC_Now,SLCO_Now,SLC_Thirty,SLCO_Thirty,ACWT_Today,CQ_Now,LCQ_Now,SLCH_Now
,dbo.Sheet1$.AreaOfBusiness, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.LOB_name
I'm relatively new to SQL Server and any help would be greatly appreciated. Thanks in advance.
Updated
Stripped-down script:
SELECT dbo.CCA_Merged.lN
,convert(float,replace([STF_Now],'N/A','0')) As [STF_Now]
,dbo.Sheet1$.LOB_name, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.AreaOfBusiness
FROM dbo.Sheet1$
RIGHT OUTER JOIN dbo.CCA_Merged ON dbo.Sheet1$.Skill_Name = dbo.CCA_Merged.lN
Group by stf_now
,AA_Now,dbo.CCA_Merged.lN,dbo.Sheet1$.AreaOfBusiness, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.LOB_name
Order by AreaOfBusiness DESC
+----+---------+----------+---------------+----------------+
| LN | STF_Now | LOB_name | LifeCycleName | AreaOfBusiness |
+----+---------+----------+---------------+----------------+
| A  | 46      | BSW      | BS            | Business       |
| B  | 46      | BSW      | BS            | Business       |
| C  | 0       | BOSS     | BS            | Business       |
| D  | 112     | MSD      | BS            | Business       |
| E  | 112     | MSD      | BS            | Business       |
| F  | 42      | BHV      | BR            | Business       |
| G  | 23      | BCR      | BR            | Business       |
| H  | 23      | BHV      | BR            | Business       |
| I  | 55      | BSW2     | BS            | Business       |
| J  | 1       | BSW2     | BS            | Business       |
| K  | 46      | BSW      | BS            | Business       |
| L  | 112     | MSD      | BS            | Business       |
| M  | 112     | MSD      | BS            | Business       |
| N  | 57      | BSW      | BS            | Business       |
| O  | 0       | BOSS     | BS            | Business       |
| P  | 38      | MSD      | BS            | Business       |
| Q  | 38      | MSD      | BS            | Business       |
| R  | 19      | BHV      | BR            | Business       |
| S  | 0       | BCR      | BR            | Business       |
| T  | 19      | BHV      | BR            | Business       |
| U  | 2       | BSW      | BS            | Business       |
| V  | 1       | BSW      | BS            | Business       |
| W  | 57      | BSW      | BS            | Business       |
| X  | 38      | MSD      | BS            | Business       |
| Y  | 38      | MSD      | BS            | Business       |
+----+---------+----------+---------------+----------------+
Below are the expected results in 3 added columns.
LOB_Name2 (this is the Max of STF_Now resulting from LN):
57 BSW
0 BOSS
112 MSD
42 BHV
23 BCR
55 BSW2
LifeCycleName2 (this is the Sum of the Max of the rollup of LOB_Name2):
224 BS
65 BR
AreaOfBusiness2 (this is the Sum of the rollup of LifeCycleName2):
289 Business
You can't sum a max because it would be the same amount anyhow, if you have the same group by. You probably need inner and outer parts with different group by clauses, something like:
select product_group, sum(max_cost)
from (
select product, product_group, max(cost) as max_cost
from orders
group by product_group, product
) X
group by product_group
This imaginary SQL will fetch the maximum cost for each product, and then sum them up to the product group level. That's the only way I can figure out that you'd actually need to sum a max.
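As a quick check of the inner/outer GROUP BY pattern against the question's sample table, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server purely for illustration; the table name `skills` and the simplified column names are assumptions, and only STF_Now is modelled.

```python
import sqlite3

# (LN, STF_Now, LOB_name, LifeCycleName) rows from the question's sample table.
ROWS = [
    ("A", 46, "BSW", "BS"), ("B", 46, "BSW", "BS"), ("C", 0, "BOSS", "BS"),
    ("D", 112, "MSD", "BS"), ("E", 112, "MSD", "BS"), ("F", 42, "BHV", "BR"),
    ("G", 23, "BCR", "BR"), ("H", 23, "BHV", "BR"), ("I", 55, "BSW2", "BS"),
    ("J", 1, "BSW2", "BS"), ("K", 46, "BSW", "BS"), ("L", 112, "MSD", "BS"),
    ("M", 112, "MSD", "BS"), ("N", 57, "BSW", "BS"), ("O", 0, "BOSS", "BS"),
    ("P", 38, "MSD", "BS"), ("Q", 38, "MSD", "BS"), ("R", 19, "BHV", "BR"),
    ("S", 0, "BCR", "BR"), ("T", 19, "BHV", "BR"), ("U", 2, "BSW", "BS"),
    ("V", 1, "BSW", "BS"), ("W", 57, "BSW", "BS"), ("X", 38, "MSD", "BS"),
    ("Y", 38, "MSD", "BS"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE skills (ln TEXT, stf_now INTEGER, lob_name TEXT, life_cycle TEXT)"
)
conn.executemany("INSERT INTO skills VALUES (?, ?, ?, ?)", ROWS)

# Inner query: MAX per LOB_name. Outer query: SUM of those maxima per LifeCycleName.
QUERY = """
SELECT life_cycle, SUM(max_stf) AS stf2_now
FROM (
    SELECT life_cycle, lob_name, MAX(stf_now) AS max_stf
    FROM skills
    GROUP BY life_cycle, lob_name
) AS per_lob
GROUP BY life_cycle
ORDER BY life_cycle
"""
totals = dict(conn.execute(QUERY).fetchall())
print(totals)
```

For the sample rows this gives 224 for BS and 65 for BR, matching the expected LifeCycleName2 values in the question, and the two add up to the expected 289 for Business.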
You're having issues because you're trying to do two nested aggregates. This will roughly give you what you're after, if SUM(MAX) is actually what you're trying to do. However, as James pointed out, you'll need some group by logic that pulls this together.
SELECT ...
, SUM(AA2_Now_Max)
...
, SUM(STF2_Now_Max)
FROM(
SELECT dbo.CCA_Merged.id, dbo.CCA_Merged.timeStamp, dbo.CCA_Merged.name, dbo.CCA_Merged.lN
,dbo.CCA_Merged.type, dbo.CCA_Merged.id2, dbo.CCA_Merged.aG
,dbo.CCA_Merged.regionId, dbo.CCA_Merged.sgcc
,convert(float,replace([SLC_Today],'N/A','0')) As [SLC_Today]
,convert(float,replace([AA_Now],'N/A','0')) As [AA_Now]
,MAX(convert(float,replace([AA_Now],'N/A','0'))) As [AA2_Now_Max]
,convert(float,replace([SLCO_Today],'N/A','0')) As [SLCO_Today]
,convert(float,replace([CABN_Today],'N/A','0')) As [CABN_Today]
,convert(float,replace([COF_Today],'N/A','0')) As [COF_Today]
,convert(float,replace([HT_Today],'N/A','0')) As [HT_Today]
,convert(float,replace(replace([CH_Today],'N/A','0'),'-','0')) As [CH_Today]
,convert(float,replace([SLC_Now],'N/A','0')) As [SLC_Now]
,convert(float,replace([SLCO_Now],'N/A','0')) As [SLCO_Now]
,convert(float,replace([SLC_Thirty],'N/A','0')) As [SLC_Thirty]
,convert(float,replace(replace([SLCO_Thirty],'N/A','0'),'-','0')) As [SLCO_Thirty]
,convert(float,replace([ACWT_Today],'N/A','0')) As [ACWT_Today]
,convert(float,replace([CQ_Now],'N/A','0')) As [CQ_Now]
,convert(float,replace([LCQ_Now],'N/A','0')) As [LCQ_Now]
,convert(float,replace([SLCH_Now],'N/A','0')) As [SLCH_Now]
,convert(float,replace([STF_Now],'N/A','0')) As [STF_Now]
,MAX(convert(float,replace([STF_Now],'N/A','0'))) As [STF2_Now_Max]
,dbo.Sheet1$.AreaOfBusiness, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.LOB_name
FROM dbo.Sheet1$
RIGHT OUTER JOIN dbo.CCA_Merged ON dbo.Sheet1$.Skill_Name = dbo.CCA_Merged.lN
Group by ROLLUP (stf_now)
,dbo.CCA_Merged.id, dbo.CCA_Merged.timeStamp, dbo.CCA_Merged.name, dbo.CCA_Merged.lN
,dbo.CCA_Merged.type, dbo.CCA_Merged.id2, dbo.CCA_Merged.aG
,dbo.CCA_Merged.regionId
,dbo.CCA_Merged.sgcc,AA_Now,SLC_Today,SLCO_Today,CABN_Today,COF_Today,HT_Today,CH_Today
,SLC_Now,SLCO_Now,SLC_Thirty,SLCO_Thirty,ACWT_Today,CQ_Now,LCQ_Now,SLCH_Now
,dbo.Sheet1$.AreaOfBusiness, dbo.Sheet1$.LifeCycleName, dbo.Sheet1$.LOB_name
) x
GROUP BY ...
This other question is similar, so follow the pattern: SQL: SUM the MAX values of results returned
You need to add another query level to SUM your MAX values. The idea is to MAX in one select and then SUM the results using an outer query. I use AVG and MAX in the example below; however, any aggregate function could be used. Note that the inner query has to return LocationID so that the outer query can group by it.
SELECT LocationID, MaxAverageSalePriceByLocation=MAX(AverageSalePriceByUserLocation)
FROM (
SELECT UserID, LocationID, AverageSalePriceByUserLocation=AVG(SalePrice)
FROM MyTable
GROUP BY UserID,LocationID
) AS A
GROUP BY LocationID
If you want to sum the Max values of all rows, use OVER():
Sum(Max(CONVERT(FLOAT, Replace([STF_Now], 'N/A', '0')))) OVER() AS [STF2_Now]
If you want to sum the Max values of the rows within each group, use OVER(PARTITION BY ...):
Sum(Max(CONVERT(FLOAT, Replace([STF_Now], 'N/A', '0')))) OVER(partition by grp1,grp2,..) AS [STF2_Now]
Note: converting the numeric data to float could lead to approximation issues. Use NUMERIC with a precision and scale instead.
Table cursor in Perl
I want to iterate over a big table that doesn't fit in memory. In Java, I can use a cursor and load the contents as needed without overflowing the memory. How do I do the same with Perl? I'm using PostgreSQL with DBI.
Just use a database cursor in PostgreSQL. An example from the manual:
BEGIN WORK;
-- Set up a cursor:
DECLARE liahona SCROLL CURSOR FOR SELECT * FROM films;
-- Fetch the first 5 rows in the cursor liahona:
FETCH FORWARD 5 FROM liahona;
 code  |          title          | did | date_prod  |   kind   |  len
-------+-------------------------+-----+------------+----------+-------
 BL101 | The Third Man           | 101 | 1949-12-23 | Drama    | 01:44
 BL102 | The African Queen       | 101 | 1951-08-11 | Romantic | 01:43
 JL201 | Une Femme est une Femme | 102 | 1961-03-12 | Romantic | 01:25
 P_301 | Vertigo                 | 103 | 1958-11-14 | Action   | 02:08
 P_302 | Becket                  | 103 | 1964-02-03 | Drama    | 02:28
-- Fetch the previous row:
FETCH PRIOR FROM liahona;
 code  |  title  | did | date_prod  |  kind  |  len
-------+---------+-----+------------+--------+-------
 P_301 | Vertigo | 103 | 1958-11-14 | Action | 02:08
-- Close the cursor and end the transaction:
CLOSE liahona;
COMMIT WORK;
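I used a PostgreSQL cursor and fetched from it in batches of 100 rows:
my $sql = "SOME QUERY HERE";
$dbh->do("DECLARE csr CURSOR WITH HOLD FOR $sql");
my $sth = $dbh->prepare("fetch 100 from csr");
$sth->execute;
my $count = 0;
while (my $ref = $sth->fetchrow_hashref()) {
    # ... processing here
    $count++;
    # After each full batch of 100 rows, fetch the next batch from the cursor.
    if ($count % 100 == 0) {
        $sth->execute;
    }
}
# A WITH HOLD cursor outlives its transaction, so close it explicitly.
$dbh->do("CLOSE csr");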
What's wrong with:
my $s = $h->prepare("select ...");
$s->execute;
while (my $row = $s->fetchrow_arrayref) {
    ; # do something
}
Look at the DBD::Pg docs for an example. Use the DBI fetchrow_* functions in a while() loop for smaller memory allocation, and avoid fetchall_*.
Other database options related to memory usage:
LongReadLen - maximum length of 'long' type fields (LONG, BLOB, CLOB, MEMO, etc.)
RowCacheSize (not used in DBD::Pg) - a hint to the driver indicating the size of the local row cache that the application would like the driver to use for future "SELECT" statements.
Why is this query returning unwanted results?
Good morning, I have a problem with this query:
SELECT P.txt_nome AS Pergunta, IP.nome AS Resposta, COUNT(*) AS Qtd
FROM tb_resposta_formulario RF
INNER JOIN formularios F ON F.id_formulario = RF.id_formulario
INNER JOIN tb_pergunta P ON P.id_pergunta = RF.id_pergunta
INNER JOIN tb_resposta_formulario_combo RFC ON RFC.id_resposta_formulario = RF.id_resposta_formulario
INNER JOIN itens_perguntas IP ON IP.id_item_pergunta = RFC.id_item_pergunta
WHERE RF.id_formulario = 2
GROUP BY P.txt_nome, IP.nome
This is the actual result of this query:
|Pergunta| Resposta  |Qtd|
|Produto |Combo 1MB  | 3 |
|Produto |Combo 2MB  | 5 |
|Produto |Combo 4MB  | 1 |
|Produto |Combo 6MB  | 1 |
|Produto |Combo 8MB  | 4 |
|Região  |MG         | 3 |
|Região  |PR         | 2 |
|Região  |RJ         | 3 |
|Região  |SC         | 1 |
|Região  |SP         | 5 |
These are the results I was expecting:
|Produto   | Região |Qtd|
|Combo 1MB | MG     | 3 |
|Combo 2MB | SP     | 5 |
|Combo 4MB | SC     | 1 |
|Combo 6MB | RJ     | 1 |
|Combo 8MB | PR     | 2 |
I am using the PIVOT and UNPIVOT operators but the result is not satisfactory. Has anyone faced this situation before? Do you have any insight you can offer?
I already analyzed these links:
SQL Server 2005 Pivot on Unknown Number of Columns
Transpose a set of rows as columns in SQL Server 2000
SQL Server 2005, turn columns into rows
Pivot Table and Concatenate Columns
PIVOT in sql 2005
Regards, Pelegrini
The "obvious" answer is: because the query is incorrect. We really know nothing about the table structure and what you're trying to achieve.
Concerning at least one very basic problem in your query: you're expecting the columns |Produto | Região |Qtd| in your result, yet the query unambiguously selects the columns Pergunta, Resposta and Qtd, which is exactly the result you're getting.
How well acquainted are you with SQL? It may be worth reading an introductory text. I'd suggest this as a good introduction. (It uses Oracle, but the principles are the same.)