Finding "hidden patterns" in string in MSSQL data using regex (Positive Lookahead) - sql-server

I'm working on finding obfuscated data in string fields. I have a regex that works in a Python script, but I realized that the matching would likely be more efficient if done directly in the database. However, MSSQL doesn't support lookaround (positive lookahead).
Eventually I want to dynamically feed a SQL statement a list of targets to search for in candidates, but for now I am just trying to figure out how to get the match working.
Target:
"4X5G"
Candidates:
"Ipsum 47 loreix 5-g blue scuba rock." The characters of 4X5G span 13 characters, so I would want to return this as a potential candidate.
"Ipsum 47 loreix blue scuba 5-g rock." The characters of 4X5G span 24 characters, so I would NOT want to return this as a potential candidate.
Traditional REGEX:
(?=.{9,14}g)4.*?x.*?5.*?g
What DOESN'T work:
WHERE [field] LIKE '%(?=.{9,14}g)4.*?x.*?5.*?g%'
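For reference, the intended lookahead behaviour can be checked outside the database. The following is only a sketch in Python: `build_pattern` and its `min_gap`/`max_gap` parameters are hypothetical names (not part of any library), and the 9 to 14 window is copied from the regex in the question. Matching here is case-sensitive, as in the original pattern.

```python
import re


def build_pattern(target, min_gap, max_gap):
    """Build a lookahead regex for `target` (hypothetical helper).

    The lookahead requires between min_gap and max_gap characters,
    counted from the first target character, before the last target
    character. The rest matches the target characters in order.
    """
    chars = [re.escape(c) for c in target]
    lookahead = "(?=.{%d,%d}%s)" % (min_gap, max_gap, chars[-1])
    return lookahead + ".*?".join(chars)


pattern = build_pattern("4x5g", 9, 14)

candidates = [
    "Ipsum 47 loreix 5-g blue scuba rock.",
    "Ipsum 47 loreix blue scuba 5-g rock.",
]
# Keep only the candidates where the pattern matches somewhere.
hits = [c for c in candidates if re.search(pattern, c)]
```

With these inputs the generated pattern is the same as the hand-written one, and only the first candidate is kept.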

By using the information here: LINK I was able to add a CLR and create a scalar function in my database that works great. I can feed it complicated regex statements and get back results.
Thanks!

Related

Salesforce CASESAFEID(Id): Last 03 digits of all the record ids are coming up same

I am testing the CASESAFEID(Id) function to get the 18-character IDs in my report. I created a formula field and used that field in a report. I am noticing that the last 3 characters of this field are the same for most of the records. I could not find the reason or logic behind these 3 characters with a Google search, so I am posting it here.
My formula field:
My report:
I am using trailhead playground for this testing.
Yes, that can happen. IDs that have uppercase letters in the same positions will have the same 3-"digit" suffix. You don't have to worry about that. There are some posts if you're really interested in the algorithm.
https://astadiaemea.wordpress.com/2010/06/21/15-or-18-character-ids-in-salesforce-com-%E2%80%93-do-you-know-how-useful-unique-ids-are-to-your-development-effort/
https://salesforce.stackexchange.com/questions/1653/what-are-salesforce-ids-composed-of
https://developer.salesforce.com/docs/atlas.en-us.object_reference.meta/object_reference/field_types.htm (scroll down to ID field type)
They're essentially a checksum-type value to ensure that valid Salesforce Ids do not differ from one another only in case. This provides safety for tools like Excel that treat abc and AbC as the same value.
The behavior you are observing is normal. There's no need to test this formula function as such; it's a standard part of the platform.
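To make the "same uppercase positions, same suffix" point concrete, here is a small Python sketch of the suffix calculation as it is commonly described in posts like the ones linked above. It is an illustration under that assumption, not an official implementation; `case_safe_suffix` is a hypothetical name and the two sample IDs are made up.

```python
SUFFIX_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"


def case_safe_suffix(id15):
    """Return the 3-character suffix for a 15-character ID (sketch)."""
    if len(id15) != 15:
        raise ValueError("expected a 15-character ID")
    suffix = ""
    for start in (0, 5, 10):
        chunk = id15[start:start + 5]
        # Bit i is set when the i-th character of the chunk is uppercase.
        bits = sum(1 << i for i, ch in enumerate(chunk) if "A" <= ch <= "Z")
        suffix += SUFFIX_ALPHABET[bits]
    return suffix


# Two made-up IDs with uppercase letters in the same positions.
a = case_safe_suffix("001Ab00000XyZ12")
b = case_safe_suffix("001Cd00000QrS34")
```

Both calls return the same suffix because only the positions of the uppercase letters feed into it, not the letters themselves.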

Text Manipulation in SQL

I have a column named body(ntext,null). Basically anything in the body of the message will come out as one string of text. See example:
Report Count SITE Type ACCOUNT NUMBER STMT CD COLL SCHEME Previously Touched Resi Aging 98 Cleveland - 609 Former 22449903 1 RQ-1 1160201
I want the result to look like this:
Report Count SITE Type ACCOUNT NUMBER STMT CD
98 Cleveland - 609 Former 22449903 1 RQ-1 1160201
How can I get this output? Would it be easier to do in Excel using VBA versus SQL?
I am not an expert in SQL. I am still learning.
You COULD try to get this out of SQL, but I think most would agree that SQL is not designed for extensive text formatting.
As a DBA, I would steer you towards making those fields discrete if possible using normalization or at the very least having a key/value pair table rather than a blob of text that represents both fields and data.
You could also consider a datatype of XML if you find that you need to store different fields and different responses for each row.

SSIS Fuzzy Lookup Transform not matching values with spaces in one and not in the other

I'm working on an SSIS project to transfer data from a legacy system to a new system. This is the first time I've used SSIS but so far all has gone well, until now.
I have to match product names from the old and new system. Some are clean matches and a standard lookup catches those. Some are not, and I'm using a fuzzy lookup on the no-match output of the main lookup to try to catch those afterwards. This works on some things, but it completely misses what seem like the most obvious matches. For example:
Source data: FG 45J
Target data: FG45J
This is not matched by the fuzzy lookup. I've tried ticking and unticking the spaces delimiter box to no avail. The threshold is set to zero so everything gets through, but similarity and confidence are zero on the relevant output records. Some others do return non-zero similarity and confidence, but they don't have spaces. Matches to return is set to one, although I've tried raising this to four to see if it made any difference, and it didn't. I expect I've missed something but I can't work out what.
Any help would be greatly appreciated.

python encoding characters in jinja2

This is similar to one of my other questions. I'm trying to resolve the side effects of the first one.
I have stored a few non-ASCII characters in my database. With some encoding and decoding I managed to get the database queries working, but I have another problem.
If I use
self.response.out.write(mystring)
on one of my entities (which looks like this -> u'\u0395\u03c0\u03b9\u03c3\u03c4\u03ae\u03bc\u03b5\u03c2'),
I can see it without any problem. However, I have a JavaScript script that creates a graph and needs a list of those strings. If I pass the list to the JavaScript as it comes from the database, the JavaScript doesn't work at all. If I use
tag2 = tag.encode("utf-8")
for every entity in the list and then pass the new list, I see all the non-ASCII characters like this -> ÎÏιÏÏήμεÏ
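One plausible reading of that garbled output is that the UTF-8 bytes are being displayed as if each byte were a Latin-1 character. The snippet below is a minimal sketch of that effect, written for Python 3 (the question appears to use Python 2). The choice of Latin-1 as the wrong decoding is an assumption, not something the question confirms.

```python
original = "\u0395\u03c0\u03b9\u03c3\u03c4\u03ae\u03bc\u03b5\u03c2"

utf8_bytes = original.encode("utf-8")      # 2 bytes per Greek letter here
mojibake = utf8_bytes.decode("latin-1")    # each byte shown as one character

# Reversing the wrong decoding step recovers the original text.
recovered = mojibake.encode("latin-1").decode("utf-8")
```

If that is what is happening, the encoded bytes themselves are fine and the mismatch is in how the receiving side interprets them.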

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects by using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords' and then using CONTAINS to search for the keywords the user inputs in the search box in that index.
So, for example, let's say you have the Movie object, and you want to let the user search for keywords in the title and plot of the movie. We'd index both the Title and Plot columns, and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') -- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that, removed the "noise-words" from the list, and now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up, my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle is the same), I don't get anything, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum, where people who specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While CONTAINS finds every Terminator title, the LIKE condition will eliminate 'Terminator 1'.
Of course the engine is smart enough to start with the CONTAINS not the like condition.
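The two-stage idea (a coarse full-text match followed by an exact substring check) can be illustrated with a toy model in Python. This is only a sketch of the principle, not of how SQL Server's full-text engine works internally. The noise-word set, the helper names, and the sample titles are all made up for the example.

```python
import re

NOISE_WORDS = set("0123456789")  # assumption: single digits are noise words


def tokens(text):
    return re.findall(r"\w+", text.lower())


def coarse_match(title, query):
    """Mimic a full-text match that silently drops noise words."""
    wanted = [t for t in tokens(query) if t not in NOISE_WORDS]
    have = set(tokens(title))
    return all(t in have for t in wanted)


def search(titles, query):
    candidates = [t for t in titles if coarse_match(t, query)]
    # Second pass, like the LIKE '%...%' condition: exact substring check.
    return [t for t in candidates if query.lower() in t.lower()]


titles = ["The Terminator", "Terminator 2: Judgment Day", "Terminator 3"]
coarse = [t for t in titles if coarse_match(t, "Terminator 2")]
exact = search(titles, "Terminator 2")
```

In this model the coarse pass returns all three titles because the "2" is dropped, and the substring pass narrows the result to the one title that actually contains "Terminator 2".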
