Complex string match SPSS (v20)

Complex string match SPSS (v20) - arrays

I have a problem I can't figure out using SPSS (v20).
There is a master list with 10,000 strings. Think of it as an array like so:
['Sao Paolo S.P.', 'IDE MUNICH', '1_New YORK', 'BabylonX', ...]
I have a dataset with a variable that contains strings similar to those in the aforementioned array, but not exactly the same.
Like so:
What I need to do is: check if OldString (from the dataset) is part of any of the strings in the master array.
Obviously 123 Babylon (from the dataset) will be related to BabylonX (from the array).
Obviously 1234 Sao Paolo S (from the dataset) will be related to Sao Paolo S.P. (from the array).
and so on...
If a match is detected, the string from the master array should be written to a new variable NewString.
Is there any way to achieve that? Using VBA, Perl or PHP this is dead easy, but in SPSS I have no clue how to combine those steps.

The following syntax is one possible way to loop the match comparison using char.index.
*First I'm turning your master list into a dataset -
this can be done differently depending how the data is
stored right now. In this example I just copy-pasted
from your post into the syntax.
data list free/masterstring (a20).
begin data
'Sao Paolo S.P.', 'IDE MUNICH', '1_New YORK', 'BabylonX'
end data.
*now I create a new syntax with a comparison command for each string in the list.
cd "c:\your path".
string cmd (a100).
compute cmd=concat("if char.index('",lower(rtrim(masterstring)),
 "', lower(rtrim(mystring)))>0 matchedstr='",rtrim(masterstring),"'.").
write out="check strings.sps"/cmd.
exe.
* the syntax is ready; at this point you go back to your original dataset -
for the example I'm creating a small example dataset.
data list free/mystring (a20).
begin data
"123 Babylon" "babylon" "Sao Paolo" "1234 Sao Paolo S"
end data.
*now we can run the syntax created earlier on the present dataset.
string matchedstr (a50).
insert file="check strings.sps".
exe.
In the result you should see that "babylon" was recognized as part of "BabylonX" (both sides are lower-cased before comparing, so upper/lower case differences are ignored) and therefore "BabylonX" appears in matchedstr. The same goes for "Sao Paolo" and "Sao Paolo S.P.".
Note: if mystring matches more than one string in the list, this syntax will only capture the last match.
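For readers who want to see the matching rule outside SPSS, here is a minimal Python sketch of the same logic. It is only an illustration and not part of the SPSS solution: the function name match_master is made up for this example. Like the generated syntax, it tests whether the dataset string is contained in a master string, ignoring case, and lets later matches overwrite earlier ones.

```python
def match_master(old_strings, master_list):
    """Map each dataset string to the last master string that contains it.

    Hypothetical helper for illustration; mirrors the char.index check:
    the dataset string must be a substring of the master string.
    """
    results = {}
    for old in old_strings:
        needle = old.strip().lower()
        matched = ""  # stays empty when nothing matches
        for master in master_list:
            if needle and needle in master.strip().lower():
                matched = master  # later matches overwrite earlier ones
        results[old] = matched
    return results


master = ["Sao Paolo S.P.", "IDE MUNICH", "1_New YORK", "BabylonX"]
dataset = ["123 Babylon", "babylon", "Sao Paolo", "1234 Sao Paolo S"]
for old, new in match_master(dataset, master).items():
    print(repr(old), "->", repr(new))
```

Under this rule "123 Babylon" and "1234 Sao Paolo S" stay unmatched, because neither is fully contained in a master string; handling those would need some extra cleaning step (for example stripping leading digits), which is beyond this sketch.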

Related

SQL Server: STRING_SPLIT() result in a computed column

I couldn't find good documentation on this, but I have a table that has a long string as one of its columns. Here's some example data of what it looks like:
Hello:Goodbye:Apple:Orange
Example:Seagull:Cake:Chocolate
I would like to create a new computed column using the STRING_SPLIT() function to return the third value in the string table.
Result #1: "Apple"
Result #2: "Cake"
What is the proper syntax to achieve this?

At this time, what you are asking for is not possible.
The output rows might be in any order. The order is not guaranteed to
match the order of the substrings in the input string.
STRING_SPLIT reference
There is no way to guarantee which item was the third item in the list using string_split and the order may change without warning.
If you're willing to build your own, I'd recommend reading up on the work done by
Brent Ozar and Jeff Moden.
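To make the requirement concrete: a home-built splitter has to return each item together with its position, which is exactly what the documentation quoted above says STRING_SPLIT does not promise. Here is a minimal Python sketch of that idea, purely as an illustration (split_with_ordinal is a made-up name, and this is not the T-SQL code by the authors mentioned above):

```python
def split_with_ordinal(value, sep=":"):
    """Return (position, item) pairs, with positions starting at 1."""
    return [(i, part) for i, part in enumerate(value.split(sep), start=1)]


pairs = split_with_ordinal("Hello:Goodbye:Apple:Orange")
# Pick the item by its position instead of relying on row order.
third = next(item for pos, item in pairs if pos == 3)
print(third)
```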

You shouldn't be storing data like that in the first place. This points to a potentially serious database design problem. But you could convert this string into JSON by replacing : with ",", surrounding it with [" and "], and retrieving the third array element, e.g.:
declare @value nvarchar(200)='Example:Seagull:Cake:Chocolate'
select json_value('["' + replace(@value,':','","' )+ '"]','$[2]')
The string manipulations convert the string value to :
["Example","Seagull","Cake","Chocolate"]
After that, JSON_VALUE parses the JSON string and retrieves the 3rd item in the array using a JSON PATH expression.
Needless to say, this will be slow and can't take advantage of indexing. If those values are meant to be read or written individually, they should be stored in separate columns. They'll probably take less space than one long string.
If you have a lot of optional fields but only a subset contain values at any time, you could use sparse columns. This way you could have thousands of columns, only a few of which would contain data at any time.
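The string-to-JSON trick above can be traced step by step in a few lines of Python. This is only a sketch to show what the REPLACE and JSON_VALUE calls do: the helper names are made up, and it assumes the values contain no double quotes or backslashes, which would break the hand-built JSON.

```python
import json


def to_json_array(value, sep=":"):
    """Build the JSON array text the same way the T-SQL expression does."""
    return '["' + value.replace(sep, '","') + '"]'


def nth_item(value, index, sep=":"):
    """Return the item at a zero-based index, or None if it is out of range."""
    items = json.loads(to_json_array(value, sep))
    return items[index] if 0 <= index < len(items) else None


print(to_json_array("Example:Seagull:Cake:Chocolate"))
print(nth_item("Example:Seagull:Cake:Chocolate", 2))
```

As in the JSON path $[2], the index is zero-based, so 2 selects the third item.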

Excel formula to search for one of multiple strings and return the string

I have names that are listed with salutations (e.g. Mr., Mrs., Dr.). I am struggling with a formula that will search for the existence of those text strings and, if one exists, return the salutation.
So, I would like the formula to look at "Dr. Nancy Briggs," and return "Dr."
Versions I have been trying include:
=IF(ISNUMBER(SEARCH({"Mr.","Mrs.","Dr."},C13)),LEFT(C13,FIND(" ",C13,1)-1),"")
=IF(OR(ISNUMBER(SEARCH("Mr.",C24)),ISNUMBER(SEARCH("Mrs.",C24)),ISNUMBER(SEARCH("Dr.",C24))),LEFT(C24,FIND(" ",C24,1)-1),"")
The salutations are always at the front, so I can extract them using the LEFT function. But, ideally, I would like to extract them from anywhere.
The second formula works, but is clunky. Your help is very much appreciated!

You can use a nested IF:
=IF(ISNUMBER(SEARCH("Mr.",C13)),"Mr.",IF(ISNUMBER(SEARCH("Mrs.",C13)),"Mrs.",IF(ISNUMBER(SEARCH("Dr.",C13)),"Dr.","No Salutation")))
If you have Office 365:
You can use CONCAT as an array formula:
=CONCAT(IF(ISNUMBER(SEARCH({"Mr.","Mrs.","Dr."},C13)),{"Mr.","Mrs.","Dr."},""))
Being an array formula, it needs to be confirmed with Ctrl-Shift-Enter instead of Enter.
Or IFS():
=IFS(ISNUMBER(SEARCH("Mr.",C13)),"Mr.",ISNUMBER(SEARCH("Mrs.",C13)),"Mrs.",ISNUMBER(SEARCH("Dr.",C13)),"Dr.",TRUE,"No Salutation")
The only real change is that I am not trying to pull the return value out of the string, since you only want the actual salutation.
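The logic shared by all three formulas (test each salutation in turn, return the first one found anywhere in the text, otherwise a default) can be sketched in Python. This is an illustration only: find_salutation and its defaults are made up for this example, and the comparison is lower-cased because SEARCH is case-insensitive.

```python
SALUTATIONS = ("Mr.", "Mrs.", "Dr.")


def find_salutation(name, salutations=SALUTATIONS, default="No Salutation"):
    """Return the first salutation found anywhere in name, else the default."""
    lowered = name.lower()
    for salutation in salutations:
        if salutation.lower() in lowered:
            return salutation
    return default


print(find_salutation("Dr. Nancy Briggs,"))
print(find_salutation("Nancy Briggs"))
```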

Remove Duplicate adjacent Sub-String from String in Microsoft SQL Server

I am using SQL Server 2008 and I have a column in a table, which has values like below. It basically shows departure and arrival information.
-->Heathrow/Dublin*Dublin/Heathrow
-->Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick
-->Heathrow/Dublin*Liverpool/Heathrow
(The 3rd example shown above is slightly different: the person did not depart from Dublin, but from Liverpool instead.)
This makes the column too lengthy, and I want to remove only the adjacent duplicates, so the information can be shown like below:
-->Heathrow/Dublin/Heathrow
-->Gatwick/Liverpool/Carlisle/Gatwick
-->Heathrow/Dublin***Liverpool/Heathrow
So, this would still show the correct travel route, but omit only the contiguous duplicates. Also, in the 3rd case, since the arrival and departure locations are not the same, I would like to show the break as ***.
I found a post here that removes all duplicates (Find and Remove Repeated Substrings) but this is slightly different from the solution that I need.
Could someone share any thoughts please?

The first step is to adapt the process defined in the following link so that it splits based on /:
T-SQL split string
This returns a table which you would then loop through checking if the value contains an *. In that case you would get the text values before and after the * and compare them. Use CHARINDEX to get the position of the *, and SUBSTRING to get the values before and after. Once you have those check both values and append to your output string accordingly.

So you have a database column that contains this text string? Is your concern to display the data to the user in a new format, or to update the data in your database table with a new value?
Do you have access to the original data from which this text string was built? It would probably be easier to re-create the string in the format you desire than it would be to edit the existing string programmatically.
If you don't have access to this data, it would probably be a lot simpler to update your data (or reformat it for display) if you do the string manipulation in a high-level language such as C# or Java.
If you're reformatting it for display, write the string manipulation code in whatever language is appropriate, right before displaying it. If you're updating your table, you could write a program to process the table, reading each record, building the replacement string, and updating the record before moving on to the next one.
The bottom line is that T-SQL is just not a good language for doing this sort of string examination and manipulation. If you can build a fresh string from the original data, or do your manipulation in a high-level language, you'll have an easier job of it and end up with more maintainable code.
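To illustrate how short this becomes in a high-level language, here is a minimal Python sketch of the rule described in the question. It is an assumption-laden illustration rather than a drop-in solution: collapse_route is a made-up name, legs are assumed to be separated by * and locations by /, and stray spaces around names are trimmed.

```python
def collapse_route(route):
    """Join legs, dropping an adjacent duplicate location or marking a gap with ***."""
    legs = [leg.strip() for leg in route.split("*") if leg.strip()]
    if not legs:
        return ""
    first = [stop.strip() for stop in legs[0].split("/")]
    result = "/".join(first)
    previous_arrival = first[-1]
    for leg in legs[1:]:
        stops = [stop.strip() for stop in leg.split("/")]
        if stops[0] == previous_arrival:
            # Same place: skip the repeated departure.
            rest = stops[1:]
            if rest:
                result += "/" + "/".join(rest)
        else:
            # Arrival and next departure differ: show the break.
            result += "***" + "/".join(stops)
        previous_arrival = stops[-1]
    return result


for example in (
    "Heathrow/Dublin*Dublin/Heathrow",
    "Gatwick/Liverpool*Liverpool/Carlisle *Carlisle/Gatwick",
    "Heathrow/Dublin*Liverpool/Heathrow",
):
    print(collapse_route(example))
```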

I wrote some code for the first example you gave. You still need to
extend it for the rest ...
DECLARE @STR VARCHAR(50)='Heathrow/Dublin*Dublin/Heathrow'
-- Compare the arrival before the * with the departure right after it.
IF (SELECT SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1)) =
(SELECT SUBSTRING(@STR,CHARINDEX('*',@STR)+1,LEN(SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1))))
BEGIN
-- Same location: remove the * together with the duplicated name.
SELECT STUFF(@STR,CHARINDEX('*',@STR),LEN(SUBSTRING(@STR,CHARINDEX('/',@STR)+1,CHARINDEX('*',@STR)-CHARINDEX('/',@STR)-1))+1,'')
END
ELSE
BEGIN
-- Different locations: replace only the single * with ***.
SELECT STUFF(@STR,CHARINDEX('*',@STR),1,'***')
END

How to use NSPredicate with NSPredicateEditor on different data (Multiple Predicates?)

I've got an array of file paths and an NSPredicateEditor set up in my UI, where the user can build an NSPredicate to find a file. They should be able to filter by name, type, size and date.
There are several problems I have now:
I can only get one predicate object from the editor. When I use "predicateForRow:" it returns (null).
If the user wants to filter the file by name AND size or date, I can't just use this predicate on my array anymore because that information is not contained in it.
Can I split up a predicate into different predicates without converting it into an NSString object, then searching for every @" OR " | @" AND ", separating the components into an array, and then converting every NSString into a new predicate?
In the NSPredicateEditor settings I've some options for the "left Expression":
Keypaths, Constant Values, Strings, Integer Numbers, Floating Point Numbers and Dates. I want to display a dropdown menu to the user with "name", "type", "date", "size". But then the generated predicate automatically looks like this:
"name" MATCHES[c] "nameTest" OR "type" MATCHES[c] "jpg" OR size == 100
Because the array is filled with strings, and those strings do not respond to @"myString".name, a search for "name", "type" etc. always returns 0 objects. Is there a way to show Name, Type, Size and Date in the menu, but write "self" into the predicate without doing it by hand?
I've already searched the official Apple tutorials, Stack Overflow, Google, and even YouTube to find a clue. This problem has troubled me for almost a week now. Thanks for your time! If you need more information, please let me know!

You have come to the right place! :)
I can only get one predicate object from the editor.
Correct. It is an NSPredicateEditor, not an NSPredicatesEditor. ;)
When I use "predicateForRow:" it returns (null)
I'm not sure I would use that method. My general rule of thumb is to largely ignore that NSPredicateEditor is a subclass of NSRuleEditor, mainly because it's such a highly specialized subclass that many of the superclass methods don't make that much sense on a predicate editor (like all the stuff about criteria, row selection, etc). It's possible that they're somehow relevant, but if they are, I haven't figured out how yet.
To get the predicate from the editor, you do:
NSPredicate *predicate = [myPredicateEditor objectValue];
If the user wants to filter the file by name AND size or date
You mean (name = [something]) AND (size = [something] OR date = [something])?
If so, NSPredicateEditor can do that if you've set the nesting mode to "Compound".
I can't just use this predicate on my array anymore because that information is not contained in it
What information do you need?
Can I split up a predicate into different predicates without converting it into an NSString object, then searching for every @" OR " | @" AND ", separating the components into an array, and then converting every NSString into a new predicate?
Yes, but that is a BAD idea. It's bad because NSPredicate already contains all the information you need, and converting it to a different format and doing string manipulations just isn't necessary and can potentially lead to complications (like if someone can type in a value for "name", what happens if they type in " OR "?).
I'm having a hard time trying to figure out what it is you're trying to do. It sounds like you have an array of NSString objects that you want to filter based on a predicate that the user creates? If so, then what do these name, date, and size key paths mean? What are you trying to do?

SSIS suitability

I'm trying to create an SSIS package to import some dataset files. However, given that I seem to be hitting a brick
wall every time I achieve a small part of the task, I need to take a step back and perform a sanity check on what I'm
trying to achieve. If you good people can advise whether SSIS is the way to go about this, I would
appreciate it.
These are my questions from this morning :-
debugging SSIS packages - debug.writeline
Changing an SSIS dts variables
What I'm trying to do is have a For..Each container enumerate over the files in a share on the SQL Server. For each
file it finds a script task runs to check various attributes of the filename, such as looking for a three letter
code, a date in CCYYMM, the name of the data contained therein, and optionally some comments. For example:-
ABC_201007_SalesData_[optional comment goes here].csv
I'm looking to parse the name using a regular expression and put the values of 'ABC', '201007', and
'SalesData' in variables.
I then want to move the file to an error folder if it doesn't meet certain criteria :-
Three character code
Six character date
Dataset name (e.g. SalesData, in this example)
CSV extension
I then want to lookup the Character code, the date (or part thereof), and the Dataset name against a lookup table
to mark off a 'checklist' of received files from each client.
Then, based on the entry in the checklist, I want to kick off another SSIS package.
So, for example, I may have a table called 'Checklist' with these columns :-
Client code   Dataset     SSIS_Package
ABC           SalesData   NorthSalesData.dtsx
DEF           SalesData   SouthSalesData.dtsx
If anyone has a better way of achieving this I am interested in hearing about it.
Thanks in advance

That's an interesting scenario, and should be relatively easy to handle.
First, your choice of the Foreach Loop is a good one. You'll be using the Foreach File Enumerator. You can restrict the files you iterate over to be just CSVs so that you don't have to "filter" for those later.
The Foreach File Enumerator puts the filename (full path or just file name) into a variable - let's call that "FileName". There are (at least) two ways you can parse that - expressions or a Script Task. It depends on which one you're more comfortable with. Either way, you'll need to create three variables to hold the "parts" of the filename - I'll call them "FileCode", "FileDate", and "FileDataset".
To do this with expressions, you need to set the EvaluateAsExpression property on FileCode, FileDate, and FileDataset to true. Then in the expressions, you need to use FINDSTRING and SUBSTRING to carve up FileName as you see fit. Expressions don't have Regex capability.
To do this in a Script Task, pass the FileName variable in as a ReadOnly variable, and the other three as ReadWrite. You can use the Regex capabilities of .NET, or just manually use IndexOf and Substring to get what you need.
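As a rough idea of what the regular expression for the Script Task could look like, here is a sketch in Python (an actual Script Task would use the .NET Regex class; the pattern itself should carry over with little change). Everything here is an assumption based on the example filename in the question: the pattern, the function name parse_filename, and the extra check that the last two digits of the date are a valid month.

```python
import re

# Three-letter code, CCYYMM date, dataset name, optional comment, .csv extension.
FILENAME_RE = re.compile(
    r"^(?P<code>[A-Za-z]{3})_(?P<date>\d{6})_(?P<dataset>[A-Za-z0-9]+)"
    r"(?:_(?P<comment>.*))?\.csv$",
    re.IGNORECASE,
)


def parse_filename(filename):
    """Return (code, date, dataset), or None if the name fails the checks."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    month = int(match.group("date")[4:])
    if not 1 <= month <= 12:
        return None  # not a valid CCYYMM value
    return match.group("code"), match.group("date"), match.group("dataset")


print(parse_filename("ABC_201007_SalesData_[optional comment goes here].csv"))
print(parse_filename("AB_201007_SalesData.csv"))
```

A None result corresponds to the case where the file would be moved to the error folder; the three returned values map to the FileCode, FileDate and FileDataset variables.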

Unfortunately, you have just missed the SQLLunch livemeeting on the ForEach loop: http://www.bidn.com/blogs/BradSchacht/ssis/812/sql-lunch-tomorrow
They are recording the session, however.
