Lua string library choices for finding and replacing text - arrays

I'm new to Lua programming, having come over from Python to make a small addon for World of Warcraft for a friend. I'm looking into various ways of finding a section of text inside a rather large string of plain text, so I can extract the information I need and then process it in the usual way.
The surrounding text could contain almost anything, but the block below is what we are looking to extract and process:
-- GSL --
items = ["itemid":"qty" ,"itemid":"qty" ,"itemid":"qty" ,]
-- ENDGSL --
We want to strip the whole block of text from a potentially large block of text surrounding it, then remove the -- GSL -- and -- ENDGSL -- to be left with:
items = ["itemdid":"qty …
I've looked into various methods, and can't seem to get my head around any of them.
Anyone have any suggestions on the best method to tackle this problem?
EDIT: Additional problem.
Based on the accepted answer, I've changed the code slightly to the following.
function GuildShoppingList:GUILDBANKFRAME_OPENED()
    -- Actions to be taken when guild bank frame is opened.
    if debug == "True" then self:Print("Debug mode on, guild bank frame opened") end
    gslBankTab = GetCurrentGuildBankTab()
    gslBankInfo = GetGuildBankText(gslBankTab)
    p1 = gslBankInfo:match('%-%- GSL %-%-%s+(.*)%s+%-%- ENDGSL %-%-')
    self:Print(p1)
end
The string has now changed slightly; the information we are parsing is now
{itemid:qty, itemid:qty, itemid:qty, itemid:qty}
Now, this is the string held in p1. I need to update the match pattern to strip the { } as well, and iterate over each item and its quantity (the pairs are separated by commas), so I'm left with
itemid:qty
itemid:qty
itemid:qty
itemid:qty
Then I can identify each line individually and place it where it needs to go.

try
s=[[-- GSL --
items = ["itemid":"qty" ,"itemid":"qty" ,"itemid":"qty" ,]
-- ENDGSL --]]
print(s:match('%-%- GSL %-%-%s+(.*)%s+%-%- ENDGSL %-%-'))
The key probably is that - is a magic character in Lua patterns (the lazy-repetition modifier), so it needs to be escaped with % if you want a literal hyphen. More info on patterns is in the Lua Reference Manual, section 5.4.1.
Edit:
To the additional problem of looping through the keys of what is almost an array, you could do one of two things:
Either loop over it as a string, assuming both key and quantity are integers:
p="{1:10, 2:20, 3:30}"
for id,qty in p:gmatch('(%d+):(%d+)') do
--do something with those keys:
print(id,qty)
end
Or slightly change the string and evaluate it as a Lua table:
p="{1:10, 2:20, 3:30}"
p=p:gsub('(%d+):','[%1]=') -- replace : by = and enclose keys with []
t=loadstring('return '..p)() -- at this point, the anonymous function
                             -- returned by loadstring gets executed,
                             -- returning the wanted table
for k,v in pairs(t) do
    print(k,v)
end
If the format of the keys or quantities is not simply an integer, changing the patterns accordingly should be trivial.
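For example, if the ids or quantities can contain letters (as in the original quoted "itemid":"qty" form), a minimal sketch with made-up sample data only needs a looser capture pattern:
s = '["6948":"20" ,"hearthstone":"1" ,]'   -- hypothetical sample data
for id, qty in s:gmatch('"([^"]+)"%s*:%s*"([^"]+)"') do
    print(id, qty) -- one id/qty pair per line
end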

Related

Identify Unique Sentences in SQL String with Repeat Sentences

I have a VARCHAR(MAX) string field in SQL that, in some instances, contains up to 100k characters. Let's call it [Notes_Comments].
The contents of the field, [Notes_Comments], are notes and comments, and in several instances those notes and comments have been copied/pasted in multiple times. There is no delineation as to where the old and new parts start/end. Punctuation is also hit and miss.
I would like to figure out how to identify the unique sentences, or strings/substrings, within the [Notes_Comments] field to essentially get a pure, deduplicated view of its contents.
I built a SplitString function and was able to start parsing out 2-, 3- and 5-word phrases, but it isn't really getting me to full sentences.
Example [Notes_Comments]:
I want to get this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
From this
10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words Notes Some duplicate stuff 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. 10:46 AM Talked to blah about blah 11:46 AM They said they didn’t know what I meant 12:46 PM Built a SplitString function and parsed words into a table with a word index 1:46 PM Good progress to identify words but not full strings 2:46 PM Removing noise words 3:46 PM Removing known categories of words 4:46 PM Trying to figure out how to remove repeat parts of the one string field. Notes Some duplicate stuff. Now some new stuff has been added. Notes Added a little more. Must deduplicate portions of text that appear more than once. Notes One more section.
Here is my SplitString function:
CREATE FUNCTION dbo.SplitString
(
    @str NVARCHAR(MAX),
    @separator CHAR(1)
)
RETURNS TABLE
AS
RETURN (
    WITH tokens(p, a, b) AS (
        SELECT
            CAST(1 AS bigint),
            CAST(1 AS bigint),
            CHARINDEX(@separator, @str)
        UNION ALL
        SELECT
            p + 1,
            b + 1,
            CHARINDEX(@separator, @str, b + 1)
        FROM tokens
        WHERE b > 0
    )
    SELECT
        p - 1 AS ItemIndex,
        SUBSTRING(
            @str,
            a,
            CASE WHEN b > 0 THEN b - a ELSE LEN(@str) END)
        AS word
    FROM tokens
);
GO
To help going forward I would recommend taking that "Notes" column off of the table it is currently on and making a new table of "EntityId" and "Note", EntityId being that of the parent record you want to associate the note to. In this way you have a one-to-many relationship between your entity and its notes.
This will require quite a bit of work to change over, but maintaining it will be much easier and prevent similar issues to the one you are currently having.
For a purely sql approach I would change the table relationships first. Perform your split, and insert each individual message into the new table. After doing that you can write a separate procedure for identifying duplicates and perform deletes on all but one of those.
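A minimal sketch of that structure (the table and column names here are just placeholders):
CREATE TABLE dbo.EntityNotes
(
    NoteId INT IDENTITY(1,1) PRIMARY KEY,
    EntityId INT NOT NULL,         -- parent record the note belongs to
    Note NVARCHAR(MAX) NOT NULL
);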
I should note that below is an application approach.
If you choose to update your table relationships, implement that first; de-duping then becomes a two-step process, but one that is relatively simple to implement.
For de-duping, you have to account for the fact that you said punctuation is hit or miss.
I would split on the :?? AM and :?? PM syntax of your timestamps. You'll have to work out the exact pattern, but that should help you identify where an entry begins and ends. Split those into a list and Trim() each one as you add it; if you don't need the punctuation at all, also do a replace on "." and "," and any other punctuation characters you don't want as you add them to your collection.
Now you have two options, if you went with the table update, you could insert all of the values and then filter for duplicates after, in either sql or in code.
Or you can create a new list, do some looping and add only unique items to this new list. If you kept your single column concat them into one string and update your db, or if you went with the one-to-many design perform an insert for each unique entry.
If you want to do less exact matching, build a hash for each substring. Perhaps "timestamp" + " " + first X chars + last Y chars. Then compare your hashes for equality, again keeping only one. I recommend this pattern because it looks from your example like duplicate notes have duplicate timestamps. If not, try only the first X and last Y characters as the hash.
I highly recommend updating your table structure to save yourself similar headaches in the future. The rest is a little bit of regex, string splitting, and comparison. Also, if you try the hashing, experiment with it in debug to make sure you don't cut out too much before committing it to your db.
If you are going to split the strings in code you'll need to start by putting together a regex. I would match on the substring ":.. AM" or ":.. PM". Matching on just a ":" seems a little dangerous given that this is a note field and users can probably type just about anything they want.
Please note this is untested, but the logic should be sound. You may need to tinker with the match object accessors, and test that the index value is splitting exactly where you want, etc etc.
Start by reading your field into a string. Then utilize regex to find the messages within the string. Something like the following.
string pattern = @":.. [AP]M";
Regex rgx = new Regex(pattern, RegexOptions.None);
MatchCollection matches = rgx.Matches(input);
The match objects in the matches collection contain an index for the starting point where the match was found. You'll want to loop through them and extract the substrings into individual strings. Something like:
List<string> allNotes = new List<string>();
for (int i = 0; i < matches.Count; i++)
{
    string nextNote = string.Empty;
    // back up two characters so the hour digits before the ":" are included
    int currentIndex = matches[i].Index - 2;
    int endIndex = -1;
    if (i + 1 < matches.Count)
    {
        endIndex = matches[i + 1].Index - 2;
    }
    if (endIndex != -1)
    {
        nextNote = input.Substring(currentIndex, endIndex - currentIndex);
    }
    else
    {
        nextNote = input.Substring(currentIndex);
    }
    if (!string.IsNullOrEmpty(nextNote))
    {
        allNotes.Add(nextNote);
    }
}
These substrings should span the beginning of the timestamp through the character before the next time stamp.
You can then do another function to de-dupe the list. Something like:
Dictionary<string, string> dedupedDict = new Dictionary<string, string>();
foreach (string note in allNotes)
{
    // hash = first 6 + last 6 characters of the note
    string noteHash = string.Concat(note.Substring(0, 6), note.Substring(note.Length - 6));
    if (!dedupedDict.ContainsKey(noteHash))
    {
        dedupedDict.Add(noteHash, note);
    }
}
return dedupedDict.Values.ToList();
You might need to include some checking for strings being the minimum length to avoid index-out-of-bounds exceptions, and/or change the hash size (my example uses the first and last 6 characters). You may also want to Trim() and/or Replace(".", "") on the punctuation at some point if you think it will help your matching.
Then use the list of deduped notes to populate your new CallNotes table. You'll need your call Id as well.
Once you've cleaned up your current values, you can use the same methods to intercept new notes before they are put into your database.

Power Query M loop table / lookup via a self-join

First of all, I'm new to Power Query, so I'm taking my first steps. But I need to try to deliver something at work so I can gain some breathing time to learn.
I have the following table (example):
Orig_Item Alt_Item
5.7 5.10
79.19 79.60
79.60 79.86
10.10
And I need to create a column that will loop the table and display the final Alt_Item. So the result would be the following:
Orig_Item Alt_Item Final_Item
5.7 5.10 5.10
79.19 79.60 79.86
79.60 79.86 79.86
10.10
Many thanks
Actually, this is far too complicated for a first Power Query experience.
If that's what you've got to do, then so be it, but you should be aware that you are starting with a quite difficult task.
Small detail: I would expect the last Final_Item to be 10.10. According to the example, Final_Item will be null if Alt_Item is null. If that is not correct, adjusting the code below accordingly would be a nice first exercise for you.
You can create a new blank query, copy and paste this code in the Advanced Editor (replacing the default code) and adjust the Source to your table name.
let
    Source = Table.Buffer(Table1),
    AddedFinal_Item =
        Table.AddColumn(
            Source,
            "Final_Item",
            each if [Alt_Item] = null
                 then null
                 else List.Last(
                     List.Generate(
                         () => [Final_Item = [Alt_Item], Continue = true],
                         each [Continue],
                         each [Final_Item =
                                   Table.First(
                                       Table.SelectRows(
                                           Source,
                                           (x) => x[Orig_Item] = [Final_Item]),
                                       [Alt_Item = "not found"]
                                   )[Alt_Item],
                               Continue = Final_Item <> "not found"],
                         each [Final_Item])))
in
    AddedFinal_Item
This code uses function List.Generate to perform the looping.
For performance reasons, the table should always be buffered in memory (Table.Buffer), before invoking List.Generate.
List.Generate is one of the most complex Power Query functions.
It requires 4 arguments, each of which is a function in itself.
In this case the first argument starts with () and the other 3 with each (it should be clear from the outline above: they are aligned).
Argument 1 defines the initial values: a record with fields Final_Item and Continue.
Argument 2 is the condition to continue: if an item is found.
Argument 3 is the actual transformation in each iteration: the Source table is searched (with Table.SelectRows) for an Orig_Item equal to Alt_Item. This is wrapped in Table.First, which returns the first record (if any is found) and accepts a default value if nothing is found, in this case a record with field Alt_Item with value "not found". From this result the value of record field [Alt_Item] is returned, which is either the value from the first record found, or "not found" from the default value.
If the value is "not found", then Continue becomes false and the iterations will stop.
Argument 4 is the value that will be returned: Final_Item.
List.Generate returns a list of all values from each iteration. Only the last value is required, so List.Generate is wrapped in List.Last.
Final remark: actual looping is rarely required in Power Query and I think it should be avoided as much as possible. In this case, however, it is a feasible solution as you don't know in advance how many Alt_Items will be encountered.
An alternative to List.Generate is using a recursive function.
Also List.Accumulate is close to looping, but that has a fixed number of iterations.
This can be solved simply with a self-join, the open question is how many layers of indirection you'll be expected to support.
Assuming just one level of indirection, no duplicates on Orig_Item, the solution is:
let
    Source = #"Input Table",
    SelfJoin1 = Table.NestedJoin( Source, {"Alt_Item"}, Source, {"Orig_Item"}, "_tmp_" ),
    Expand1 = Table.ExpandTableColumn( SelfJoin1, "_tmp_", {"Alt_Item"}, {"_lkp_"} ),
    ChkJoin1 = Table.AddColumn( Expand1, "Final_Item", each (if [_lkp_] = null then [Alt_Item] else [_lkp_]), type number)
in
    ChkJoin1
This is doable with the regular UI, using Merge Queries, then Expand Column and adding a custom column.
If you want to support more than one level of indirection, turn it into a function to be called X times. For data-driven levels of indirection, you wrap the calls in a List.Generate that drops the intermediate tables in a structured column, though that's a much more advanced level of PQ.

easier use of loops and vectors in spss to combine variables

I have a student who has gathered data in an online survey whereby each response was given its own variable, rather than the variable holding whatever the response was. We need a scoring algorithm which reads the statements and integrates them. I can do this with IF statements per item, e.g.,
if Q1_1=1 var1=1.
if Q1_2=1 var1=2.
if Q1_3=1 var1=3.
if Q1_4=1 var1=4.
Doing this for a 200 item survey (now more like 1000) will be a drag and subject to many typos unless automated. I have no experience of vectors and loops in SPSS, but some reading suggests this is the way to approach the problem.
I would like to run if statements as something like (pseudocode):
for items=1 to 30
for responses=1 to 4
if Q1_1=1 a=1.
if Q1_2=1 a=2.
if Q1_3=1 a=3.
if Q1_4=1 a=4.
compute newitem(items)=a.
next response.
next item.
Which I would hope would produce a new variable (newitem1 to newitem30), each holding one of the 4 responses from its original corresponding 4 variables.
Never written serious spss code before: please advise!
This will do the job:
* creating some sample data.
data list free (",")/Item1_1 to Item1_4 Item2_1 to Item2_4 Item3_1 to Item3_4.
begin data
1,,,,,1,,,,,1,,
,1,,,1,,,,1,,,,
,,,1,,,1,,,,,1,
end data.
* now looping over the items and constructing the "NewItems".
do repeat Item1=Item1_1 to Item1_4
/Item2=Item2_1 to Item2_4
/Item3=Item3_1 to Item3_4
/Val=1 to 4.
if Item1=1 NewItem1=Val.
if Item2=1 NewItem2=Val.
if Item3=1 NewItem3=Val.
end repeat.
execute.
In this way you run all your loops simultaneously.
Note that "ItemX_1 to ItemX_4" will only work if these four variables are consecutive in the dataset. If they aren't, you have to name each of them separately - "ItemX_1 ItemX_2 ItemX_3 itemX_4".
Now if you have many such item sets, all named regularly as in the example, the following macro can shorten the process:
define !DoItems (ItemList=!cmdend)
!do !Item !in (!ItemList)
do repeat !Item=!concat(!Item,"_1") !concat(!Item,"_2") !concat(!Item,"_3") !concat(!Item,"_4")/Val=1 2 3 4.
if !item=1 !concat("New",!Item)=Val.
end repeat.
!doend
execute.
!enddefine.
* now you just have to call the macro and list all your Item names:
!DoItems ItemList=Item1 Item2 Item3.
The macro will work with any item name, as long as the variables are named ItemName_1, ItemName_2, etc.

Reduced Survey Frequency - Salesforce Workflow

Hoping you can help me review the logic below for errors. I am looking to create a workflow that will send a survey out to end users on a reduced frequency. Basically, it will check the Account object of the Case for a field, 'Reduced Survey Frequency', which contains a number of days, and will not send a survey until that many days have passed since the date set on the Contact field 'Last Survey Date'. Please review the code and let me know any recommended changes!
AND(
  OR(ISPICKVAL(Status,"Closed"), ISPICKVAL(Status,"PM Sent")),
  OR(CONTAINS(RecordType.Name,"Portal Case"),
     CONTAINS(RecordType.Name,"Standard Case"),
     CONTAINS(RecordType.Name,"Portal Closed"),
     CONTAINS(RecordType.Name,"Standard Closed")),
  NOT( Don_t_sent_survey__c ),
  OR((TODAY() - Contact.Last_Survey_Date__c) >= Account.Reduced_Survey_Frequency__c,
     Account.Reduced_Survey_Frequency__c == 0,
     ISBLANK(Account.Reduced_Survey_Frequency__c),
     ISBLANK(Contact.Last_Survey_Date__c)
  )
)
Thanks,
Brian H.
Personally I prefer the syntax where && and || are used instead of the AND(), OR() functions. It just reads a bit nicer to me: no need to trace so many commas, and it's easier to keep track of indentation in the more complex logic... But if you're more used to this Excel-like flow - go for it. In the end it has to be readable for YOU.
Also I'd consider reordering this a bit - simple checks, most likely to fail first.
The first part - irrelevant to your question
Don't use RecordType.Name because these Names can be translated to say French and it will screw your logic up for users who will select non-English as their preferred language. Use RecordType.DeveloperName, it's safer.
CONTAINS - do you really have so many record types that share this part of their name? What's wrong with a normal = comparison? You could check whether the formula would be more readable with a CASE() statement. Or maybe flip the logic if there are, say, 6 record types and you've explicitly listed 4 (this might have to be reviewed though when you add a new record type). If you find yourself copy-pasting this block of 4 checks frequently - consider making a helper formula field with it...
The second part
ISBLANK checks could be skipped if you properly use the "treat nulls as blanks / as zeroes" setting at the bottom of the formula editor, because you're making a check like
OR(...,
Account.Reduced_Survey_Frequency__c==0,
ISBLANK(Account.Reduced_Survey_Frequency__c),
...
)
which is essentially what this setting was designed for. I'd flip it to "treat nulls as zeroes" (but that means the ISBLANK check will never "fire"). If you're not comfortable with that - you can also "safely compare or subtract" by using
BLANKVALUE(Account.Reduced_Survey_Frequency__c,0)
Which will have the similar "treat null as zero" effect but only in this one place.
So... I'd end up with something like this:
(ISPICKVAL(Status,'Closed') || ISPICKVAL(Status, 'PM Sent')) &&
(RecordType.DeveloperName = 'Portal_Case' ||
RecordType.DeveloperName = 'Standard_Case' ||
RecordType.DeveloperName = 'Portal_Closed' ||
RecordType.DeveloperName = 'Standard_Closed'
) &&
NOT(Don_t_sent_survey__c) &&
(Contact.Last_Survey_Date__c + Account.Reduced_Survey_Frequency__c < TODAY())
No promises though ;)
You can easily test them by enabling debug logs. You'll see there the workflow formula together with values that are used to evaluate it.
Another option is to make a temporary formula field with same logic and observe (in a report?) where it goes true/false for mass spot check.

Search entries in Go GAE datastore using partial string as a filter

I have a set of entries in the datastore and I would like to search/retrieve them as the user types a query. If I have the full string it's easy:
q := datastore.NewQuery("Products").Filter("Name =", name).Limit(20)
but I have no idea how to do it with a partial string; please help.
q := datastore.NewQuery("Products").Filter("Name >", name).Limit(20)
There is no LIKE operation on App Engine, but instead you can use '<' and '>' range filters.
example:
'moguzalp' > 'moguz', so every Name that starts with 'moguz' sorts at or after 'moguz'
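A sketch of the usual prefix-range trick (assuming the classic appengine/datastore package used in the question; "\ufffd" is simply an upper bound that sorts after any character you expect in a name):
package products // hypothetical package name for the sketch

import "appengine/datastore"

// prefixQuery returns a query for Products whose Name starts with prefix.
func prefixQuery(prefix string) *datastore.Query {
        return datastore.NewQuery("Products").
                Filter("Name >=", prefix).
                Filter("Name <", prefix+"\ufffd"). // cap the range so only names with this prefix match
                Limit(20)
}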
EDIT: GAH! I just realized that your question is Go-specific. My code below is for Python. Apologies. I'm also familiar with the Go runtime, and I can work on translating the Python to Go later on. However, if the principles described are enough to get you moving in the right direction, let me know and I won't bother.
Such an operation is not directly supported on the AppEngine datastore, so you'll have to roll your own functionality to meet this need. Here's a quick, off-the-top-of-my-head possible solution:
class StringIndex(db.Model):
    matches = db.StringListProperty()

    @classmethod
    def GetMatchesFor(cls, query):
        found_index = cls.get_by_key_name(query[:3])
        if found_index is not None:
            if query in found_index.matches:
                # Since we only query on the first three characters,
                # we have to roll through the result set to find all
                # of the strings that match query. We keep the
                # list sorted, so this is not hard.
                all_matches = []
                looking_at = found_index.matches.index(query)
                matches_len = len(found_index.matches)
                while looking_at < matches_len and found_index.matches[looking_at].startswith(query):
                    all_matches.append(found_index.matches[looking_at])
                    looking_at += 1
                return all_matches
        return None

    @classmethod
    def AddMatch(cls, match):
        # We index off of the first 3 characters only
        index_key = match[:3]
        index = cls.get_or_insert(index_key, matches=[match])
        if match not in index.matches:
            # The index entity was not newly created, so
            # we will have to add the match and save the entity.
            index.matches.append(match)
            index.matches.sort()
            index.put()
To use this model, you would need to call the AddMatch method every time you add an entity that could potentially be searched on. In your example, you have a Product model and users will be searching on its Name. In your Product class, you might have a method AddNewProduct that creates a new entity and puts it into the datastore. You would add StringIndex.AddMatch(new_product_name) to that method.
Then, in your request handler that gets called from your AJAXy search box, you would use StringIndex.GetMatchesFor(name) to see all of the stored products that begin with the string in name, and you return those values as JSON or whatever.
What's happening inside the code is that the first three characters of the name are used for the key_name of an entity that contains a list of strings: all of the stored names that begin with those three characters. Using three (as opposed to some other number) is absolutely arbitrary. The correct number for your system is dependent on the amount of data that you are indexing. There is a limit to the number of strings that can be stored in a StringListProperty, but you also want to balance the number of StringIndex entities that are in your datastore. A little bit of math will give you a reasonable number of characters to work with.
If the number of keywords is limited you could consider adding an indexed list property of partial search strings.
Note that you are limited to 5000 indexes per entity, and 1MB for the total entity size.
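In Go, that idea could look roughly like this sketch (the NamePrefixes field name is made up, and the byte slicing assumes ASCII names):
package products // hypothetical package name for the sketch

type Product struct {
        Name         string
        NamePrefixes []string // every leading prefix of Name, filled in before Put
}

// prefixesOf builds the list of prefixes to store alongside the entity.
func prefixesOf(name string) []string {
        prefixes := make([]string, 0, len(name))
        for i := 1; i <= len(name); i++ {
                prefixes = append(prefixes, name[:i])
        }
        return prefixes
}
A lookup is then an ordinary equality filter, e.g. datastore.NewQuery("Product").Filter("NamePrefixes =", userInput).Limit(20), since an equality filter on a slice property matches any of its elements.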
But you could also wait for Cloud SQL and the Full Text Search API to be available for the Go runtime.
