Extract only uppercase text from db field

I "inherited" a db with a field in a table where there are lowercase and uppercase mixed together, eg.
gateway 71, HOWARD BLVD, Chispa, NY
They aren't easily "splittable" with code because they don't always come in this form. What I need is a way to extract only the uppercase parts. Is this possible with SQLite?

This is a case where it may just be easier (and perhaps quicker) to SELECT with any additional WHERE requirements you may have, then create a cursor to iterate over the results and do your uppercase checks in code.
Another option with SQLite would be to create a custom function, so you could do something like:
SELECT foo FROM some_table WHERE MYISALLUPPERFUNC(foo) = 1;
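For instance, with Python's sqlite3 module you can register such a function on the connection; a minimal sketch (the table and column names here are made up for illustration):
import sqlite3

def is_all_upper(value):
    # 1 if the value is entirely uppercase, else 0
    return 1 if value is not None and value.isupper() else 0

conn = sqlite3.connect(":memory:")
conn.create_function("MYISALLUPPERFUNC", 1, is_all_upper)
conn.execute("create table t (foo)")
conn.executemany("insert into t (foo) values (?)",
                 [("HOWARD BLVD",), ("gateway 71",)])
print(conn.execute("select foo from t where MYISALLUPPERFUNC(foo) = 1").fetchall())
# [('HOWARD BLVD',)]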

As NuSkooler mentions, this is probably easier and faster to do using a cursor; an especially attractive option if you will only have to do this once.
Here's a quick example (using the built-in sqlite3 module from the Python REPL):
import sqlite3

with sqlite3.connect(":memory:") as conn:
    conn.execute('create table t (c, newc);')
    conn.execute('insert into t (c) values (?);', ('testing MAIN ST',))
    conn.commit()
    results = conn.execute('select c from t;').fetchall()
    for line in results:
        tokens = line[0].split()
        filtered_tokens = [i for i in tokens if i.isupper()]
        newc = ' '.join(filtered_tokens)
        # update the matching row (note the WHERE clause, so only that row changes)
        conn.execute('update t set newc = ? where c = ?;', (newc, line[0]))
    conn.commit()
    print(conn.execute('select c, newc from t;').fetchone())
    # ('testing MAIN ST', 'MAIN ST')

Related

Query Snowflake Named Internal Stage by Column NAME and not POSITION

My company is attempting to use Snowflake Named Internal Stages as a data lake to store vendor extracts.
There is a vendor that provides an extract that is 1000+ columns in a pipe delimited .dat file. This is a canned report that they extract. The column names WILL always remain the same. However, the column locations can change over time without warning.
Based on my research, a user can only query a file in a named internal stage using the following syntax:
--problematic because the order of the columns can change.
select t.$1, t.$2 from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]dat.gz') t;
Is there any way to use the column names instead?
E.g.,
select t.first_name from @mystage1 (file_format => 'myformat', pattern=>'.data.[.]csv.gz') t;
I appreciate everyone's help and I do realize that this is an unusual requirement.
You could read these files with a UDF. Parse the CSV inside the UDF with code aware of the headers. Then output either multiple columns or one variant.
For example, let's create a .CSV inside Snowflake we can play with later:
create or replace temporary stage my_int_stage
file_format = (type=csv compression=none);
copy into '@my_int_stage/fx3.csv'
from (
select *
from snowflake_sample_data.tpcds_sf100tcl.catalog_returns
limit 200000
)
header=true
single=true
overwrite=true
max_file_size=40772160
;
list @my_int_stage
-- 34MB uncompressed CSV, because why not
;
Then this is a Python UDF that can read that CSV and parse it into an Object, while being aware of the headers:
create or replace function uncsv_py()
returns table(x variant)
language python
imports=('@my_int_stage/fx3.csv')
handler = 'X'
runtime_version = '3.8'
as $$
import csv
import sys

IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

class X:
    def process(self):
        # yield one object per CSV row, keyed by the header names
        with open(import_dir + 'fx3.csv', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                yield (row,)
$$;
And then you can query this UDF, which outputs a table:
select *
from table(uncsv_py())
limit 10
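Since each row comes back as an object keyed by the header names, you can then address columns by name instead of position; a sketch, assuming the TPC-DS headers (e.g. CR_ORDER_NUMBER, CR_RETURN_AMOUNT) survive into the file:
select x:CR_ORDER_NUMBER::number as cr_order_number,
       x:CR_RETURN_AMOUNT::number as cr_return_amount
from table(uncsv_py())
limit 10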
A limitation of what I showed here is that the Python UDF needs an explicit file name (for now), as it doesn't take a whole folder. Java UDFs do - it will just take longer to write an equivalent UDF.
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-tabular-functions.html
https://docs.snowflake.com/en/user-guide/unstructured-data-java.html

Oracle ROWTOCOL Function oddities

I have a requirement to pull data in a specific format and I'm struggling slightly with the ROWTOCOL function and was hoping a fresh pair of eyes might be able to help.
I'm using 10g Oracle DB (10.2), so LISTAGG, which appears to do what I need to achieve, is not an option.
I need to aggregate a number of usernames into a string delimited with a '$', but I also need to concatenate another column to build up email addresses.
select
rowtocol('select username_id from username where user_id = '||s.user_id||' order by USERNAME_ID asc','@'||d.domain_name||'$')
from username s, domain d
where s.user_id = d.user_id
(I've simplified the query down to just this function, as the actual query is quite large and all works except for this particular function.)
in the DOMAIN Table I have a number of domains such as 'hotmail.com','gmail.com' etc
I need to concatenate the username, an '@' symbol, followed by the domain, all delimited with a '$',
such as ...
joe.bloggs@gmail.com$joeblogs@gmail.com$joe_bloggs@gmail.com
I've battled with this and I've got close, but in reverse?!...
gmail.com$joe.bloggs@gmail.com$joeblogs@gmail.com$joe_bloggs
I've also noticed that if I play around with the delimiter (,'@'||d.domain_name||'$') it has a tendency to drop the first character, as can be seen above: the preceding '@' has been dropped from the first email address.
Can anyone offer any suggestions as to how to get this working?
Many Thanks in advance!
Assuming you're using the rowtocol function from OTN, and have tables something like:
create table username (user_id number, username_id varchar2(20));
create table domain (user_id number, domain_name varchar2(20));
insert into username values (1, 'joe.bloggs');
insert into username values (1, 'joebloggs');
insert into username values (1, 'joe_bloggs');
insert into domain values (1, 'gmail.com');
Then your original query gets three rows back:
gmail.com$joe.bloggs
gmail.com$joe_bloggs@gmail.com$joebloggs
gmail.com$joe_bloggs@gmail.com$joebloggs
You're passing the data from each of your user IDs to a separate call to rowtocol, which isn't really what you want. You can get the result I think you're after by reversing it; pass the main query that joins the two tables as the select argument to the function, and have that passed query do the username/domain concatenation - that is a separate step from the string aggregation:
select
rowtocol('select s.username_id || ''@'' || d.domain_name from username s join domain d on d.user_id = s.user_id', '$')
from dual;
which gets a single result:
joe.bloggs@gmail.com$joe_bloggs@gmail.com$joebloggs@gmail.com
Whether that fits into your larger query, which you haven't shown, is a separate question. You might need to correlate it with the rest of your query.
There are other ways to do string aggregation in Oracle, but this function is one way, and you already have it installed. I'd look at alternatives though, such as ThomasG's answer, which I think makes it a bit clearer what's going on.
As Alex told you in comments, this ROWTOCOL isn't a standard function, so if you don't show its code, there's nothing we can do to fix it.
However you can accomplish what you want in Oracle 10 using the XMLAGG built-in function.
Try this:
SELECT
rtrim (xmlagg (xmlelement (e, s.username_id || '@' || d.domain_name || '$')).extract ('//text()'), '$') whatever
FROM username s
INNER JOIN domain d ON s.user_id = d.user_id
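If the order of the aggregated addresses matters, note that XMLAGG accepts an ORDER BY clause inside the call, so you can make the aggregation deterministic; a sketch based on the query above:
SELECT
rtrim (xmlagg (xmlelement (e, s.username_id || '@' || d.domain_name || '$') order by s.username_id).extract ('//text()'), '$') whatever
FROM username s
INNER JOIN domain d ON s.user_id = d.user_id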

R RODBCext and Parameterizing IN statement?

I've been working to parameterize a SQL statement that uses the IN operator in the WHERE clause. I'm using the RODBCext library for parameterizing, but it seems to lack expansion of a list.
I was hoping to write code such as
sqlExecute("SELECT * FROM table WHERE name IN (?)", c("paul","ringo","john", "george")
I'm using the following code but wondered if there's an easier way.
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
qmarks <- replicate(length(names), "?")
stringmarks <- paste(qmarks, collapse = ",")
sql <- paste("SELECT * FROM tableA WHERE name IN (", stringmarks, ")")
# expand to Columns - seems to be the magic step required
bindnames <- rbind(names)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, bindnames, fetch = TRUE)
RODBC::odbcClose(dbhandle)
It works, but I feel I'm using R to expand the strings in the wrong way (bit new to R - so many ways to do the same thing wrong). Somebody will probably say "that creates factors - never do that" :-)
I found this article which suggests I'm on the right track, but it doesn't discuss having to expand the "?" and turn the list into columns of a data.frame:
R RODBC putting list of numbers into an IN() statement
Thank you.
UPDATE: As Benjamin shows below, the sqlExecute function can handle a list() of inputs. However, upon inspection of the resulting SQL I discovered that it uses cursors to roll up the results. This significantly increases the CPU and I/O over the sample code I show above.
While the library can indeed solve this for you, for large results it may be too expensive. There are two answers and it depends upon your needs.
Since your only parameter in the query is the collection for IN, you could get away with:
sqlExecute(dbhandle,
           "SELECT * FROM table WHERE name IN (?)",
           list(c("paul", "ringo", "john", "george")),
           fetch = TRUE)
sqlExecute will bind the values in the list to the question mark. Here, it will actually repeat the query four times, once for each value in the vector. It may seem kind of silly to do it this way, but when trying to pass strings, it's a lot safer in many ways to let the binding take care of setting up the appropriate quote structure rather than trying to paste it in yourself. You will generate fewer errors this way and avoid a lot of database security concerns.
What if you declare a table variable in a character object and then concatenate it with the query?
library(RODBC)
library(RODBCext)
# Search inputs
names <- c("paul", "ringo", "john", "george")
# Build SQL statement
sql_top <- paste0("SET NOCOUNT ON \r\n DECLARE @LST_NAMES TABLE (ID NVARCHAR(20)) \r\n INSERT INTO @LST_NAMES VALUES ('", paste(names, collapse = "'), ('"), "')")
sql_body <- paste("SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)")
sql <- paste0(sql_top, "\r\n", sql_body)
# Execute SQL statement
dbhandle <- RODBC::odbcDriverConnect(connectionString)
result <- RODBCext::sqlExecute(dbhandle, sql, fetch = TRUE)  # no bind parameters; the values are inlined above
RODBC::odbcClose(dbhandle)
The query will be (the SET NOCOUNT ON is important to retrieve the results):
SET NOCOUNT ON
DECLARE @LST_NAMES TABLE (ID NVARCHAR(20))
INSERT INTO @LST_NAMES VALUES ('paul'), ('ringo'), ('john'), ('george')
SELECT * FROM tableA WHERE name IN (SELECT id FROM @LST_NAMES)

How to add a parameter to an existing stored procedure in SQL Server

I want to add a new parameter to an existing stored procedure. The body of this procedure may already have been customized by users, so I can't drop and recreate it. I don't need to modify the body, just the signature.
So I thought to do a replacement of the last existing parameter by itself + the new parameter.
replace(OBJECT_DEFINITION (OBJECT_ID(id)),'@last_param varchar(max)=null','@last_param varchar(max)=null, @new_param varchar(max)=null')
It works fine if the following string is found
@last_param varchar(max)=null
but doesn't work if there are spaces in the string.
I would like to use a regex to be sure it works in all cases but I'm not sure it's possible in SQL Server.
Can you help me please ?
Thanks
SQL Server does not natively support regular expressions. You'll have to look at more manual string-analyzing with the available string functions. Something like this:
set @obDef = OBJECT_DEFINITION(OBJECT_ID(id))
set @startLastParam = PATINDEX('%@last_param%varchar%(%max%)%=%null%', @obDef)
if @startLastParam = 0 begin
    -- handle lastParam not found
end else begin
    set @endLastParam = CHARINDEX('null', @obDef, @startLastParam) + 4 -- 4 = len('null')
    set @newDef = STUFF(@obDef, @endLastParam, 0, ', @new_param varchar(max)=null')
end
This isn't very fool-proof/safe though. PATINDEX() only gives you the same % wildcard you know from LIKE; it may match no characters, or it may match half the stored proc to find the word max somewhere entirely outside the signature.
So don't just run this in your customers' production ;) but if you are certain about the current stored proc signature, this might just do the trick for you.
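To actually apply the edited definition you would still need to turn the CREATE into an ALTER and execute it; a rough sketch (assuming @newDef holds the definition built above, and that the first occurrence of CREATE is the one in the signature):
-- swap CREATE for ALTER and run the new definition
set @newDef = STUFF(@newDef, CHARINDEX('CREATE', @newDef), LEN('CREATE'), 'ALTER')
exec (@newDef)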

How to retrieve multiple rows from stored procedure with Scala?

Say you have a stored procedure or function returning multiple rows, as discussed in How to return multiple rows from the stored procedure? (Oracle PL/SQL)
What would be a good way, using Scala, to "select * from table (all_emps);" (taken from URL above) and read the multiple rows of data that would be the result?
As far as I can see it is not possible to do this using Squeryl. Is there a scalaified tool like Squeryl that I can use, or do I have to drop to JDBC?
Functions that return tables are an Oracle-specific feature; I doubt an ORM (be it Scala or even Java) would have support for such a proprietary extension.
So I think you're more or less on your own :).
Probably the easiest way is to use a plain JDBC java.sql.Statement and execute "select * from table (all_emps)" with the executeQuery method.
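A minimal sketch of that JDBC route (the connection details and column name are made up for illustration):
import java.sql.DriverManager

object AllEmps {
  def main(args: Array[String]): Unit = {
    // placeholder connection details
    val conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/service", "user", "pass")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery("select * from table (all_emps)")
      while (rs.next()) {
        // emp_name is a hypothetical column in the returned rows
        println(rs.getString("emp_name"))
      }
    } finally {
      conn.close()
    }
  }
}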
To address the second part of your question about a way to select from a table in a more Scala-esque way, I am using Slick. Quoting from their example documentation:
case class Coffee(name: String, supID: Int, price: Double)
implicit val getCoffeeResult = GetResult(r => Coffee(r.<<, r.<<, r.<<))
Database.forURL("...") withSession {
  Seq(
    Coffee("Colombian", 101, 7.99),
    Coffee("Colombian_Decaf", 101, 8.99),
    Coffee("French_Roast_Decaf", 49, 9.99)
  ).foreach(c => sqlu"""
    insert into coffees values (${c.name}, ${c.supID}, ${c.price})
  """.execute)

  val sup = 101
  val q = sql"select * from coffees where sup_id = $sup".as[Coffee]
  // $sup above is a bind variable, preventing SQL injection
  q.foreach(println)
}
Though I am not sure how it deals (if at all) with stored procs/functions.
