Identify Unique Sentences in SQL String with Repeat Sentences - sql-server

I have a VARCHAR(MAX) field in SQL Server that, in some instances, contains up to 100k characters (by LEN). Let's call it [Notes_Comments].
The field holds notes and comments, and in several instances the same notes and comments have been copied and pasted in multiple times. There is no delineation of where the old and new parts start or end. Punctuation is also hit and miss.
I would like to figure out how to identify the unique sentences, or strings/substrings, within the [Notes_Comments] field, to get a clean, deduplicated view of the contents.
I built a SplitString function and was able to start parsing out 2-, 3-, and 5-word phrases, but that isn't really getting me to full sentences.
Example [Notes_Comments]:
I want to get this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
From this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words Notes Some duplicate stuff 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
Here is my SplitString function:
CREATE FUNCTION dbo.SplitString
(
    @str NVARCHAR(MAX),
    @separator CHAR(1)
)
RETURNS TABLE
AS
RETURN (
    WITH tokens(p, a, b) AS (
        SELECT
            CAST(1 AS BIGINT),
            CAST(1 AS BIGINT),
            CHARINDEX(@separator, @str)
        UNION ALL
        SELECT
            p + 1,
            b + 1,
            CHARINDEX(@separator, @str, b + 1)
        FROM tokens
        WHERE b > 0
    )
    SELECT
        p - 1 AS ItemIndex,
        SUBSTRING(
            @str,
            a,
            CASE WHEN b > 0 THEN b - a ELSE LEN(@str) END) AS word
    FROM tokens
);
GO

To help going forward, I would recommend moving that "Notes" column off the table it is currently on and creating a new table of "EntityId" and "Note", EntityId being that of the parent record you want to associate the note with. That way you have a one-to-many relationship between your entity and its notes.
This will require quite a bit of work to change over, but maintaining it will be much easier, and it will prevent issues like the one you are currently having.
For a purely SQL approach, I would change the table relationships first. Perform your split and insert each individual message into the new table. After that, you can write a separate procedure that identifies duplicates and deletes all but one of each.
I should note that what follows is an application approach.
If you choose to update your table relationships, implement that first; de-duping then becomes a two-step process, but one that is relatively simple to implement.
For de-duping, you have to account for the fact that, as you said, punctuation is hit or miss.
I would split on the :?? AM and :?? PM syntax of your timestamps. You'll have to work out the exact pattern, but that should help you identify where an entry begins and ends. Split the entries into a list, Trim() each one as you add it, and, if you don't need the punctuation at all, also Replace() "." and "," and any other punctuation characters you don't want as you add them to your collection.
Now you have two options. If you went with the table update, you can insert all of the values and then filter out duplicates afterwards, in either SQL or code.
Or you can create a new list, do some looping, and add only the unique items to the new list. If you kept your single column, concatenate them back into one string and update your db; if you went with the one-to-many design, perform an insert for each unique entry.
If you want to do less exact matching, build a hash for each substring, perhaps "timestamp" + " " + first X chars + last Y chars, and then compare the hashes for equality, again keeping only one of each. I recommend this pattern because it looks from your example like duplicate notes have duplicate timestamps. If not, try just the first X and last Y characters as the hash.
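To make that concrete, here is a rough sketch of the split-and-hash idea in Python (illustration only, not the C# below; the timestamp regex and the 6-character hash windows are assumptions you would tune against your real data):

import re

# Hypothetical sample; the real text would come from [Notes_Comments]
notes = "10:46 AM Talked to blah 11:46 AM Said no 10:46 AM Talked to blah"

# Split just before each timestamp such as "10:46 AM" or "1:46 PM"
entries = [e.strip() for e in re.split(r"(?=\b\d{1,2}:\d{2} [AP]M)", notes) if e.strip()]

seen = set()
deduped = []
for entry in entries:
    # Cheap positional hash: first and last 6 characters (X and Y are tunable)
    key = (entry[:6], entry[-6:])
    if key not in seen:
        seen.add(key)
        deduped.append(entry)

print(deduped)  # the two distinct entries, in first-seen order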
I highly recommend updating your table structure to save yourself similar headaches in the future. The rest is a little bit of regex, string splitting, and comparison. Also, if you try the hashing, experiment in debug to make sure you don't cut out too much before committing the results to your db.

If you are going to split the strings in code you'll need to start by putting together a regex. I would match on the substring ":.. AM" or ":.. PM". Matching on just a ":" seems a little dangerous given that this is a note field and users can probably type just about anything they want.
Please note this is untested, but the logic should be sound. You may need to tinker with the match object accessors and test that the index value is splitting exactly where you want, etc.
Start by reading your field into a string. Then use a regex to find the messages within the string. Something like the following:
string pattern = @":.. [AP]M";
Regex rgx = new Regex(pattern, RegexOptions.None);
MatchCollection matches = rgx.Matches(input);
The match objects in the matches collection contain an index for the starting point where the match was found. You'll want to loop through them and extract the substrings into individual strings. Something like:
List<string> allNotes = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
    string nextNote = string.Empty;
    // Back up over the hour digits; guard against negative values in real code
    int currentIndex = matches[i].Index - 2;
    int endIndex = -1;
    if (i + 1 < matches.Count)
    {
        endIndex = matches[i + 1].Index - 2;
    }
    if (endIndex != -1)
    {
        nextNote = input.Substring(currentIndex, endIndex - currentIndex);
    }
    else
    {
        nextNote = input.Substring(currentIndex);
    }
    if (!string.IsNullOrEmpty(nextNote))
    {
        allNotes.Add(nextNote);
    }
}
These substrings should span the beginning of the timestamp through the character before the next time stamp.
You can then do another function to de-dupe the list. Something like:
Dictionary<string, string> dedupedDict = new Dictionary<string, string>();
foreach (string note in allNotes)
{
    // Hash on the first and last 6 characters of each note
    string noteHash = string.Concat(note.Substring(0, 6), note.Substring(note.Length - 6));
    if (!dedupedDict.ContainsKey(noteHash))
    {
        dedupedDict.Add(noteHash, note);
    }
}
return dedupedDict.Values.ToList();
You might need to include some checks that the strings meet a minimum length to avoid index-out-of-bounds exceptions, and/or change the hash size (the example uses the first and last 6 characters). You may also want to Trim() and/or Replace(".", "") on the punctuation at some point if you think it will help your matching.
Then use the list of deduped notes to populate your new CallNotes table. You'll need your call Id as well.
Once you've cleaned up your current values, you can use the same methods to intercept new notes before they are put into your database.


AT NEW with substring access?

I have a solution that includes a LOOP which I would like to avoid. So I wonder whether you know a better way to do this.
My goal is to loop through an internal, alphabetically sorted standard table. This table has two columns: a name and a table; let's call the latter subtable. For every subtable I want to do some stuff (open an XML page in my XML framework).
Now, every subtable has a corresponding name, and I want to group the subtables according to the first letter of that name (meaning, put the pages of these subtables on one main page, one main page for each letter). By grouping of subtables I mean that, while looping through the table, I want to handle the subtables differently according to the first letter of their name.
So far I came up with the following solution:
TYPES: BEGIN OF l_str_tables_extra,
         first_letter(1) TYPE c,
         name TYPE string,
         subtable TYPE REF TO if_table,
       END OF l_str_tables_extra.

DATA: ls_tables_extra TYPE l_str_tables_extra.
DATA: lt_tables_extra TYPE TABLE OF l_str_tables_extra.

FIELD-SYMBOLS: <ls_tables> TYPE str_table. "Like LINE OF lt_tables.
FIELD-SYMBOLS: <ls_tables_extra> TYPE l_str_tables_extra.

*"--- PROCESSING LOGIC ------------------------------------------------
SORT lt_tables ASCENDING BY name.

"Add a first_letter column in order to use AT NEW later on
"This is the loop I would like to avoid
LOOP AT lt_tables ASSIGNING <ls_tables>.
  ls_tables_extra-first_letter = <ls_tables>-name+0(1). "new column
  ls_tables_extra-name = <ls_tables>-name.
  ls_tables_extra-subtable = <ls_tables>-subtable.
  APPEND ls_tables_extra TO lt_tables_extra.
ENDLOOP.

LOOP AT lt_tables_extra ASSIGNING <ls_tables_extra>.
  AT NEW first_letter.
    "Do something with the subtables that share this first_letter.
  ENDAT.
ENDLOOP.
I wish I could use
AT NEW name+0(1)
instead of
AT NEW first_letter
, but offsets and lengths are not allowed.
You see, I have to include this first loop just to add another column to my table, which is kind of unnecessary because no new information is gained.
In addition, I am interested in other solutions because I get into trouble with the framework later on for different reasons; a different way to do this might help me out there, too.
I am happy to hear any thoughts about this! I could not find anything related to this here on Stack Overflow, but I might have used suboptimal search terms ;)
Maybe the GROUP BY addition on LOOP could help you in this case:
LOOP AT i_tables
    INTO DATA(wa_line)
    " group lines by condition
    GROUP BY (
      " substring( ) because a normal offset would be evaluated immediately
      name = substring( val = wa_line-name len = 1 )
    ) INTO DATA(o_group).

  " begin of the loop over all tables starting with o_group-name(1)
  " loop over the group object, which contains the group's lines
  LOOP AT GROUP o_group
      ASSIGNING FIELD-SYMBOL(<fs_table>).
    " <fs_table> contains your table
  ENDLOOP.
  " end of loop

ENDLOOP.
Why not use an IF comparison?
data: lf_prev_first_letter(1) type c.

loop at lt_table assigning <ls_table>.
  if <ls_table>-name(1) <> lf_prev_first_letter. "= AT NEW
    "do something
    lf_prev_first_letter = <ls_table>-name(1).
  endif.
endloop.
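Just to illustrate the control-break idea outside ABAP, here is the same grouping sketched in Python with itertools.groupby (invented data; sort first, then one pass with one group per first letter):

from itertools import groupby

# Stand-in for lt_tables: (name, subtable) pairs
tables = [("Apple", "sub2"), ("Alpha", "sub1"), ("Beta", "sub3")]
tables.sort(key=lambda row: row[0])  # SORT lt_tables ASCENDING BY name

# One group per first letter -- the break that AT NEW expresses
for letter, group in groupby(tables, key=lambda row: row[0][0]):
    # "Do something" once per letter, e.g. create the main page
    print(letter, [name for name, subtable in group])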

Search entries in Go GAE datastore using partial string as a filter

I have a set of entries in the datastore, and I would like to search/retrieve them as the user types a query. If I have the full string it's easy:
q := datastore.NewQuery("Products").Filter("Name =", name).Limit(20)
but I have no idea how to do it with a partial string. Please help.
q := datastore.NewQuery("Products").Filter("Name >", name).Limit(20)
There is no like operation on app engine but instead you can use '<' and '>'
example:
'moguz' > 'moguzalp'
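This works because of plain lexicographic ordering: every string that starts with a given prefix sorts at or after the prefix itself and before the prefix plus the largest code point. A quick check of that property in Python (the names are made up):

# Prefix range: name <= s < name + U+FFFD, mirroring Filter(">=") / Filter("<")
name = "moguz"
candidates = ["moguzalp", "moguz", "mogux", "zeta"]
upper = name + "\ufffd"
print([s for s in candidates if name <= s < upper])  # ['moguzalp', 'moguz']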
EDIT: GAH! I just realized that your question is Go-specific. My code below is for Python. Apologies. I'm also familiar with the Go runtime, and I can work on translating the Python to Go later on. However, if the principles described are enough to get you moving in the right direction, let me know and I won't bother.
Such an operation is not directly supported on the AppEngine datastore, so you'll have to roll your own functionality to meet this need. Here's a quick, off-the-top-of-my-head possible solution:
class StringIndex(db.Model):
    matches = db.StringListProperty()

    @classmethod
    def GetMatchesFor(cls, query):
        found_index = cls.get_by_key_name(query[:3])
        if found_index is not None:
            if query in found_index.matches:
                # Since we only key on the first three characters,
                # we have to roll through the result set to find all
                # of the strings that match the query. We keep the
                # list sorted, so this is not hard.
                all_matches = []
                looking_at = found_index.matches.index(query)
                matches_len = len(found_index.matches)
                while looking_at < matches_len and found_index.matches[looking_at].startswith(query):
                    all_matches.append(found_index.matches[looking_at])
                    looking_at += 1
                return all_matches
        return None

    @classmethod
    def AddMatch(cls, match):
        # We index off of the first 3 characters only
        index_key = match[:3]
        index = cls.get_or_insert(index_key, matches=[match])
        if match not in index.matches:
            # The index entity was not newly created, so
            # we will have to add the match and save the entity.
            index.matches.append(match)
            index.matches.sort()
            index.put()
To use this model, you would need to call the AddMatch method every time you add an entity that could potentially be searched on. In your example, you have a Product model and users will be searching on its Name. In your Product class, you might have a method AddNewProduct that creates a new entity and puts it into the datastore; you would add StringIndex.AddMatch(new_product_name) to that method.
Then, in the request handler that gets called from your AJAXy search box, you would use StringIndex.GetMatchesFor(name) to see all of the stored products whose names begin with the string in name, and you return those values as JSON or whatever.
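Wired together, the two call sites would look roughly like this (hypothetical glue code; Product, new_product_name, and the request/response objects stand in for whatever your app actually uses):

import json

# Write path, e.g. inside your AddNewProduct method:
product = Product(name=new_product_name)
product.put()
StringIndex.AddMatch(new_product_name)

# Read path, in the AJAX search handler:
matches = StringIndex.GetMatchesFor(self.request.get('name')) or []
self.response.out.write(json.dumps(matches))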
What's happening inside the code is that the first three characters of the name are used as the key_name of an entity that contains a list of strings: all of the stored names that begin with those three characters. Using three (as opposed to some other number) is absolutely arbitrary. The correct number for your system depends on the amount of data you are indexing. There is a limit to the number of strings that can be stored in a StringListProperty, but you also want to balance the number of StringIndex entities in your datastore. A little bit of math will give you a reasonable number of characters to work with.
If the number of keywords is limited you could consider adding an indexed list property of partial search strings.
Note that you are limited to 5000 indexes per entity, and 1MB for the total entity size.
But you could also wait for Cloud SQL and the Full Text Search API to become available for the Go runtime.

Lua string library choices for finding and replacing text

I'm new to Lua programming, having come over from Python to make a small addon for World of Warcraft for a friend. I'm looking into various ways of finding a section of text within a rather large string of plain text, extracting the information I need, and then processing it in the usual way.
The surrounding text could contain almost anything, but below is what we are looking to extract and process:
-- GSL --
items = ["itemid":"qty" ,"itemid":"qty" ,"itemid":"qty" ,]
-- ENDGSL --
We want to strip the whole block of text from a potentially large block of text surrounding it, then remove the -- GSL -- and -- ENDGSL -- to be left with:
items = ["itemdid":"qty …
I've looked into various methods, and can't seem to get my head around any of them.
Anyone have any suggestions on the best method to tackle this problem?
EDIT: Additional problem,
Based on the accepted answer I've changed the code slightly to the following.
function GuildShoppingList:GUILDBANKFRAME_OPENED()
    -- Actions to be taken when the guild bank frame is opened.
    if debug == "True" then self:Print("Debug mode on, guild bank frame opened") end
    gslBankTab = GetCurrentGuildBankTab()
    gslBankInfo = GetGuildBankText(gslBankTab)
    p1 = gslBankInfo:match('%-%- GSL %-%-%s+(.*)%s+%-%- ENDGSL %-%-')
    self:Print(p1)
end
The string has now changed slightly; the information we are parsing is now
{itemid:qty, itemid:qty, itemid:qty, itemid:qty}
This is the string held in p1. I need to update the match so that it also strips the { }, and to iterate over each item and its qty (separated by commas), so I'm left with
itemid:qty
itemid:qty
itemid:qty
itemid:qty
Then I can identify each line individually and place it where it needs to go.
try
s=[[-- GSL --
items = ["itemid":"qty" ,"itemid":"qty" ,"itemid":"qty" ,]
-- ENDGSL --]]
print(s:match('%-%- GSL %-%-%s+(.*)%s+%-%- ENDGSL %-%-'))
The key point is probably that - is a magic character in Lua patterns and needs to be escaped with % when you want a literal hyphen. More info on patterns in the Lua Reference Manual, chapter 5.4.1.
Edit:
To the additional problem of looping through the keys of what is almost an array, you could do one of two things.
Either loop over it as a string, assuming both key and quantity are integers:
p="{1:10, 2:20, 3:30}"
for id,qty in p:gmatch('(%d+):(%d+)') do
--do something with those keys:
print(id,qty)
end
Or slightly change the string, evaluate it as a Lua table:
p="{1:10, 2:20, 3:30}"
p=p:gsub('(%d+):','[%1]=') -- replace : by = and enclose keys with []
t=loadstring('return '..p)() -- at this point, the anonymous function
-- returned by loadstring get's executed
-- returning the wanted table
for k,v in pairs(t) do
print(k,v)
end
If the format of the keys or quantities is not simply integer, adjusting the patterns should be trivial.

App engine - easy text search

I was hoping to implement an easy but effective text search for App Engine that I could use until official text search capabilities are released. I see there are libraries out there, but it's always a hassle to install something new. I'm wondering if this is a valid strategy:
1) Break each property that needs to be text-searchable into a set(list) of text fragments
2) Save record with these lists added
3) When searching, just use equality filters on the list properties
For example, if I had a record:
{
firstName="Jon";
lastName="Doe";
}
I could save a property like this:
{
firstName="Jon";
lastName="Doe";
// not case sensitive:
firstNameSearchable=["j","o","n","jo","on","jon"];
lastNameSearchable=["d","o","e","do","oe","doe"];
}
Then to search, I could do this and expect it to return the above record:
//pseudo-code:
SELECT person
WHERE firstNameSearchable=="jo" AND
lastNameSearchable=="oe"
Is this how text searches are implemented? How do you keep the index from getting out of control, especially if you have a paragraph or something? Is there some other compression strategy that is usually used? I suppose if I just want something simple, this might work, but it's nice to know the problems I might run into.
Update:
OK, so it turns out this concept is probably legitimate. This blog post also refers to it: http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html
Note: the source code in the blog post above does not work with the current version of Lucene. I installed the older version (2.9.3) as a quick fix, since Google is supposed to come out with its own text search for App Engine soon enough anyway.
The solution suggested in the response below is a nice quick fix, but due to BigTable's limitations it only works if you are querying on one field, because you can only use inequality operators on one property in a query:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
If you want to query on more than one property, you can save indexes for each property. In my case, I'm using this for some auto-suggest functionality on small text fields, not actually searching for word and phrase matches in a document (you can use the blog post's implementation above for that). It turns out this is pretty simple, and I don't really need a library for it. Also, I anticipate that if someone is searching for "Larry", they'll start by typing "La..." rather than starting in the middle of the word ("arry"). So if the property is a person's name or something similar, the index only needs the substrings starting with the first letter, so the index for "Larry" would just be {"l", "la", "lar", "larr", "larry"}.
I did something different for data like phone numbers, where you may want to search starting from the beginning or from middle digits. In this case I just stored the entire set of substrings of length 3 and up, so the phone number "123-456-7890" would give: {"123", "234", "345", ..., "123456789", "234567890", "1234567890"}, a total of (10*(10+1)/2) - (10+9) = 36 indexes. (What I actually did was a little more complex, to remove some unlikely-to-be-used substrings, but you get the idea.)
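A quick sketch of how those index lists could be generated (plain Python; the function names are just for illustration):

def prefix_index(name):
    # "Larry" -> ["l", "la", "lar", "larr", "larry"]
    name = name.lower()
    return [name[:i] for i in range(1, len(name) + 1)]

def substring_index(digits, min_len=3):
    # Every substring of length >= min_len; 36 entries for 10 digits
    return [digits[i:j]
            for i in range(len(digits))
            for j in range(i + min_len, len(digits) + 1)]

print(prefix_index("Larry"))               # 5 prefixes
print(len(substring_index("1234567890")))  # 36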
Then your query would be:
(Pseudo code)
SELECT * FROM Person WHERE
firstNameSearchIndex == "lar" AND
phonenumberSearchIndex == "1234"
The way App Engine works, if the query value matches any of the substrings stored in the list property, that counts as a match.
In practice, this won't scale. For a string of n characters, you need n(n+1)/2 index entries to cover every possible substring. A 500-character string would need 125,250 of them, well past the 5,000-index-entry limit per entity, so the entity could never even be written to the datastore.
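A quick check of that count:

n = 500
print(n * (n + 1) // 2)  # 125250 substring positions for a 500-char string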
Implementations like search.SearchableModel create one index entry per word, which is a bit more realistic. You can't search for arbitrary substrings, but there is a trick that lets you match prefixes:
From the docs:
db.GqlQuery("SELECT * FROM MyModel
WHERE prop >= :1 AND prop < :2",
"abc", u"abc" + u"\ufffd")
This matches every MyModel entity with
a string property prop that begins
with the characters abc. The unicode
string u"\ufffd" represents the
largest possible Unicode character.
When the property values are sorted in
an index, the values that fall in this
range are all of the values that begin
with the given prefix.

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns: one has the ID and the other the abstract of a document, about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify any acronyms in it that are also present in table 2.
Finally, the program should create a separate table where the first column contains the first column of table 1 (i.e. the ID) and the second contains the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(documents.abstract)
Given the desired semantics you can use the most straight forward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass  # word not an acronym
This is a straightforward implementation; however, it has roughly cubic running time, because acronyms.index performs a linear search (over our largest list, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zero discusses the value of ABC in the context of..."), ...]
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass  # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
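For example, with the two tables loaded into SQLite the join can be written directly (a sketch only; the schema is invented, and the space-padded LIKE only matches acronyms that stand alone between spaces):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, abstract TEXT);
    CREATE TABLE acronym  (id INTEGER PRIMARY KEY, value TEXT);
""")
conn.execute("INSERT INTO document VALUES (0, 'NGF and EPO were elevated')")
conn.executemany("INSERT INTO acronym VALUES (?, ?)",
                 [(1, 'NGF'), (2, 'EPO'), (3, 'TPO')])

# Join where the acronym appears as a whitespace-delimited token;
# note that SQLite's LIKE is case-insensitive for ASCII by default
rows = conn.execute("""
    SELECT acronym.id, document.id
    FROM acronym JOIN document
    ON ' ' || document.abstract || ' ' LIKE '% ' || acronym.value || ' %'
""").fetchall()
print(rows)  # [(1, 0), (2, 0)]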
Thanks a lot for the quick response.
I assume the pseudo SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and the third are Python, I assume. Can I feed the acronyms and documents in as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
