Organizing and searching for dates, strings, and countries in MATLAB arrays

I just finished a project in R and am now doing some work with MATLAB.
I need to make 3 vectors:
DOD
Country
Age
I need to count and store the data from a .txt list with 236 data points. The data in the text file looks like this:
Unknown woman
Cause of death: found dead, with eyes removed.
Location of death: Jardim dos Ipês Itaquaquecetuba, São Paulo, Brazil
Date of death: August 9th, 2014
Cris
Cause of death: multiple gunshot wounds
Location of death: Portal da Foz, Foz do Iguaçu, Brazil
Date of death: September 13th, 2014
Betty Skinner (52 years old)
Cause of death: blunt force trauma to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 4th, 2013
Brittany Stergis (22 years old)
Cause of death: gunshot wound to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 5th, 2013
I have no idea how to look for strings and organize them, but would appreciate any ideas on how to get started.

You can use textscan to read the file into a cell array of strings, and then use regexp to parse the strings to get your desired fields.
First, we read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
Next, we can unleash the awesome power of regular expressions to parse your string of dead women:
format = {
'(?<name>[ \w]*)'
' \('
'(?<age>[\d]*)'
' years old\) - Cause of death: '
'(?<cause>[ \w]*)'
' - Location of death: '
'(?<city>[ \w]*)'
', '
'(?<province>[ \w]*)'
', '
'(?<country>[ \w]*)'
' - Date of death: '
'(?<date>[ ,\w]*)'
};
format = [format{:}];
Here we're just defining a format string. I've broken it up like this to make it a little clearer what's going on. Let's go through it line-by-line:
(?<name>[ \w]*) The parentheses indicate that this is a chunk of text (a.k.a. a "token") that we wish to capture. The ?<name> says that we will call this token "name". Finally, the [ \w]* specifies what kind of text to match. The stuff inside the square brackets specifies which characters to look for: spaces ( ) and/or alphanumeric characters (\w). The * outside the square brackets indicates that we will accept any number of these characters.
 \( Next we are looking for a space and an open parenthesis. The backslash in front of the parenthesis is to indicate that we are looking for a literal parenthesis, i.e. this parenthesis should not be interpreted as the start of another token to capture.
(?<age>[\d]*) Another token to capture. This one is called "age" and contains any number of \d (numeric characters).
 years old\) - Cause of death: More text to look for. Again, we will be matching this text, but we will not be capturing it (because it is not enclosed in unescaped parentheses).
(?<city>[ \w]*) Another token to capture. This one is called "city" and contains any number of spaces and/or alphanumeric characters.
, Comma, space
(?<province>[ \w]*), (?<country>[ \w]*) - Date of death: You get the idea
(?<date>[ ,\w]*) Our final token, called "date", which contains any number of spaces, commas, and/or alphanumeric characters.
Then we parse the strings into a struct array:
parsed_fields = regexp(text_array, format, 'names');
parsed_fields = [parsed_fields{:}]'
This is what the output should look like:
>> parsed_fields(1)
ans =
name: 'Jacqueline Cowdrey'
age: '50'
cause: 'unknown'
city: 'Worthing'
province: 'West Sussex'
country: 'United Kingdom'
date: 'November 20th, 2013'
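For readers coming from other languages, the same named-token idea can be sketched in Python with the re module. This is only an illustration under assumptions: the single-line entry below is assembled from the example output above, and the pattern mirrors the MATLAB format string (Python spells named groups as (?P<name>...)).

```python
import re

# Pattern mirroring the MATLAB format string, piece by piece.
ENTRY_PATTERN = re.compile(
    r"(?P<name>[ \w]*)"
    r" \((?P<age>\d*) years old\) - Cause of death: "
    r"(?P<cause>[ \w]*)"
    r" - Location of death: "
    r"(?P<city>[ \w]*), (?P<province>[ \w]*), (?P<country>[ \w]*)"
    r" - Date of death: "
    r"(?P<date>[ ,\w]*)"
)

# Hypothetical single-line entry, built from the example output.
line = (
    "Jacqueline Cowdrey (50 years old) - Cause of death: unknown"
    " - Location of death: Worthing, West Sussex, United Kingdom"
    " - Date of death: November 20th, 2013"
)

match = ENTRY_PATTERN.match(line)
# groupdict() plays the role of the struct returned by the 'names' option.
fields = match.groupdict() if match else {}
print(fields)
```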
So you can get your vector of countries pretty straightforward-ly:
Country = {parsed_fields.country}';
Age is a simple numeric conversion:
Age_str = {parsed_fields.age};
Age = cellfun(@str2double, Age_str)';
Date as a string is pretty easy:
Date_str = {parsed_fields.date}';
But it's nice to have it as a MATLAB "serial date number", which allows arithmetic computations and reformatting into different types of representation formats. Unfortunately, having the day as "20th" instead of "20" is incompatible with the conversion functions, so we'll need to first strip off the "st", "nd", "rd" from "1st", "2nd", "3rd", etc:
Date_str = regexprep(Date_str, '(?<day>[\d]+)(st|nd|rd|th)', '$<day>');
Date_num = datenum(Date_str, 'mmmm dd, yyyy');
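The strip-then-parse step can also be sketched in Python, purely as an illustration. The helper name parse_death_date is made up for this example, and the month-name parsing assumes an English locale.

```python
import re
from datetime import datetime


def parse_death_date(text):
    """Drop the ordinal suffix after the day number, then parse the date."""
    # "20th" -> "20"; the digits are required, so "August" is left alone.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
    # %B expects a full month name (English under the default locale).
    return datetime.strptime(cleaned, "%B %d, %Y")


parsed = parse_death_date("November 20th, 2013")
print(parsed.date())  # 2013-11-20
```

Once the value is a real date object, arithmetic and reformatting work much as they do with a serial date number.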
Some other notes:
If the file is very large, you may wish to use fgetl to read it one line at a time (and then also parse it one line at a time) rather than reading the entire file into memory as we did above.
In your example, it looks like the entries are separated by an extra newline. I'm not sure if that's the case in your actual data or if that's just a Stack Overflow thing, but if you need to remove these newlines you can do so with:
is_empty_line = cellfun(@isempty, text_array);
text_array = text_array(~is_empty_line);
In your example, there were a lot of typos (an extra space here and there, sometimes the colons or dashes were other symbols). If these typos exist in your actual data, you will need to adjust the format specification to account for this. For example, instead of using ' - ' to match (space, dash, space), you can use \s*\W\s* to match (any number of whitespace characters, a single non-alphanumeric character, any number of whitespace characters).
If syntax like format = [format{:}]; or Country = {parsed_fields.country}'; looks strange to you, these are equivalent to:
format = [format{1} format{2} format{3} ... format{end}];
Country = cell(length(parsed_fields),1);
for ii = 1:length(parsed_fields)
Country{ii} = parsed_fields(ii).country;
end
MATLAB R2014b added a new datetime class, so there may be a better way to deal with that nowadays.

Sorry about my previous answer; I had misunderstood how exactly the data is formatted.
As before, let's first read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
In the sample data you posted, each entry is 4 lines (name, cause, location, date) followed by an empty line. As long as we can rely on this formatting, this provides an easy way to split up the data (instead of the regexp parsing I proposed in my previous answer).
name_str_array = text_array(1:5:end);
cause_str_array = text_array(2:5:end);
loc_str_array = text_array(3:5:end);
date_str_array = text_array(4:5:end);
So for example, name_str_array is going to be every 5th line, starting with line #1. Likewise, cause_str_array is every 5th line, starting with line #2. Just be careful that there are not any extra or missing lines in the data.
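The every-fifth-line split can be illustrated with a small Python sketch. The list below is a hypothetical stand-in for the file contents (two entries from the sample data, each followed by a blank line), and the final check is one possible way to notice extra or missing lines.

```python
# Hypothetical file contents: four data lines per entry, then one blank line.
lines = [
    "Betty Skinner (52 years old)",
    "Cause of death: blunt force trauma to the head",
    "Location of death: Cleveland, Ohio, USA",
    "Date of death: December 4th, 2013",
    "",
    "Brittany Stergis (22 years old)",
    "Cause of death: gunshot wound to the head",
    "Location of death: Cleveland, Ohio, USA",
    "Date of death: December 5th, 2013",
    "",
]

# Python indexing is zero-based, so MATLAB's 1:5:end becomes [0::5].
names = lines[0::5]
causes = lines[1::5]
locations = lines[2::5]
dates = lines[3::5]

# If a line is missing or duplicated, the stride drifts and this list fills up.
misaligned = [d for d in dates if not d.startswith("Date of death:")]
print(names)
print(misaligned)
```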
Next we will parse each of these to get the information that we want. In my previous answer, I proposed parsing all of the strings at once, but I think it would be easier to understand if we went through it one entry at a time. For example, let's consider the first entry.
name_str = name_str_array{1};
loc_str = loc_str_array{1};
date_str = date_str_array{1};
Let's start with the easiest one: parsing the date.
date_format = 'Date of death:\s*(?<date>.*)';
parsed_fields = regexp(date_str, date_format, 'names');
DOD = parsed_fields.date;
The format we're looking for is the string Date of death:, followed by any number of whitespace characters (\s*), followed by the chunk of text (aka "token") that we wish to capture: (?<date>.*)
The parentheses indicate that this is a token we wish to capture, the ?<date> indicates that we wish to call this token "date", and the .* specifies which characters to look for. The . is the universal wildcard, i.e. it matches all possible characters. The * indicates that we are interested in any number of repeats. So in essence, this .* means "match all remaining characters in the string".
Calling regexp with the names option causes it to return a struct with the named tokens as its fields.
Next, let's do the country. This one is a little trickier because there is a variable number of city/region specifiers. But the country will always be the last one, so that's the one we'll grab.
country_format = '(?<country>\w[ \w]*)$';
parsed_fields = regexp(loc_str, country_format, 'names');
Country = parsed_fields.country;
This format specification is the token (?<country>\w[ \w]*) followed by the end of the string (denoted by the special character $). In the token specification we are matching an alphanumeric character (\w) followed by any number of spaces and/or alphanumeric characters ([ \w]*). The reason for specifying this leading \w is so that we don't match the space between the previous comma and the start of the country name.
Finally, let's do the age. This one is tricky because not every entry has an age. At least it's easy because the age (if it exists) is the only numeric data in the line. Hence:
age_format = '(?<age>[\d]+)';
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age = -1;
else
Age = str2double(parsed_fields.age);
end
The format specification is simply the token (?<age>[\d]+), which specifies that we are looking for numeric characters (\d), and we are looking for one or more of them (+).
After parsing, we check whether or not there was a match. If not (parsed_fields is empty), then we assign Age a value of -1. Otherwise, we convert the parsed age field into a number.
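The same optional-match logic, sketched in Python for comparison. The function name parse_age is hypothetical; the -1 sentinel follows the convention used above.

```python
import re


def parse_age(name_line):
    """Return the first run of digits as an int, or -1 when no age is given."""
    match = re.search(r"\d+", name_line)
    return int(match.group()) if match else -1


print(parse_age("Betty Skinner (52 years old)"))  # 52
print(parse_age("Unknown woman"))                 # -1
```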
So putting it all together:
date_format = 'Date of death:\s*(?<date>.*)';
country_format = '(?<country>\w[ \w]*)[\W]?$';
age_format = '(?<age>[\d]+)';
nEntries = length(date_str_array);
DOD = cell(nEntries, 1);
Country = cell(nEntries, 1);
Age = zeros(nEntries, 1);
for ii = 1:nEntries
name_str = name_str_array{ii};
loc_str = loc_str_array{ii};
date_str = date_str_array{ii};
parsed_fields = regexp(date_str, date_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse date from:\n%s', date_str);
DOD{ii} = parsed_fields.date;
parsed_fields = regexp(loc_str, country_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse country from:\n%s', loc_str);
Country{ii} = parsed_fields.country;
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age(ii) = -1;
else
Age(ii) = str2double(parsed_fields.age);
end
end
I added the assert statements to help debug what's going on if you encounter errors in parsing.
For example, you may also notice that I added an [\W]? to the country format. This is because while running it on your example data, I encountered one country that contained a period at the end of the line (i.e. it ended with "Brazil." instead of just "Brazil"). So now we're looking to match a non-alphanumeric character (\W) repeated zero or 1 times (?), and it's outside of the parentheses so it is not being captured as part of the "country" token.
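As a quick way to experiment with the country pattern outside MATLAB, here is a minimal Python sketch. The helper name parse_country is made up, and note one difference: Python's \w is Unicode-aware by default, so accented letters count as word characters.

```python
import re

# Same idea as country_format: a word character, then spaces/word characters,
# then at most one trailing non-word character before the end of the line.
COUNTRY_PATTERN = re.compile(r"(?P<country>\w[ \w]*)\W?$")


def parse_country(location_line):
    match = COUNTRY_PATTERN.search(location_line)
    if match is None:
        raise ValueError("Could not parse country from: " + location_line)
    return match.group("country")


print(parse_country("Location of death: Cleveland, Ohio, USA"))
print(parse_country("Location of death: Portal da Foz, Foz do Iguaçu, Brazil."))
```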

Related

Regex string with 2+ different numbers and some optional characters in Snowflake syntax

I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, at least 3 or more times
The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
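Checking for a second digit that is not the same as the first would normally call for a lookahead with a backreference. Snowflake does not support backreferences in regular expression patterns (they are only available in the replacement string of REGEXP_REPLACE).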
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
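To check the rules outside the database, the non-contiguous variant can be sketched in Python. This is an illustration only: the function name is_valid is hypothetical, and it reads the three requirements as "at least three characters, only the allowed characters, at least two distinct digits anywhere in the string".

```python
import re

# Allowed characters: digits, upper/lowercase letters, underscore, dash.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def is_valid(value):
    """Apply the three stated rules to one string (None is rejected)."""
    if value is None or len(value) < 3:
        return False
    if not ALLOWED.fullmatch(value):
        return False
    # A set keeps each digit once, like `new Set(m)` in the UDF.
    distinct_digits = set(re.findall(r"[0-9]", value))
    return len(distinct_digits) >= 2


samples = ["abc123def567ghi1111_123", "xyz111asdf", "123", "111222",
           "abc 111111111 abc", "12", "asdf", "123 456", None]
for sample in samples:
    print(repr(sample), is_valid(sample))
```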
Hope that's helpful!

Codename One - String replace with empty character

I like to normalize the phone numbers I get from the contacts in the local phone book. To do that, I want to remove any spaces, dashes, plus signs etc from the number.
CN1 only offers the String.replace(oldchar, newchar) function, instead of String operations. From this post,
How to represent empty char in Java Character class, this should be the way to go:
primaryPhoneNumber = primaryPhoneNumber.replace(' ', Character.MIN_VALUE);
however, this approach has several implications.
The char in the console output looks like a space, but it's not; it's a string terminator.
+49 234-63446
0 234 63446
When using this normalized string, which includes Character.MIN_VALUE, in a database, the database query involving this string crashes:
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
How to properly remove spaces and other chars and replace them with a "nothing" character?
You can use:
String p = StringUtil.replaceAll(phone, " ", "");
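The underlying idea is to delete the unwanted characters rather than substitute a placeholder such as the NUL character. A language-neutral sketch of that idea in Python (the helper name normalize_phone is hypothetical, and keeping digits only is an assumption about what "normalized" should mean here):

```python
def normalize_phone(number):
    """Keep ASCII digits only; everything else is dropped, not replaced."""
    return "".join(ch for ch in number if ch in "0123456789")


normalized = normalize_phone("+49 234-63446")
print(normalized)            # 4923463446
print("\x00" in normalized)  # False
```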

How to Convert Get Text Value to ArrayList in Robot framework

I would like to know how to convert this value to ArrayList?
${doc1}=    Open Excel Document    filename=${OpenExcel}    doc_id=doc1
${view_bicccicmdu}=    Read Excel Row    row_num=1    max_num=6    sheet_name=UpperTT
${view_bicccicmduCheckLength}=    Get Length    ${view_bicccicmdu}
${HG}=    Get Text    ${ClickAV.CheckColumn}
${HGLenght}=    Get Line Count    ${HG}
Should Be Equal    ${HGLenght}    ${view_bicccicmduCheckLength}
Should Contain    ${HG}    ${view_bicccicmdu}    ignore_case=True
Close Excel Document
But the result is
${HG} = Nodename
Transdate
BICC Support FAX Detection
Trunk Group Number
Bill Trunk Group Number
MGW Name Trunk
Group Name
Sub-Route Name
Circuit Type
Group Direction
Circuit Selection Mode
I need to convert it to be ArrayList and should count to be 11 Records, What should I do?
Your data is separated by line breaks, so you can use Split String from the String library with \n as the separator to split the string into a list. From the keyword documentation:
Splits the string using separator as a delimiter string.
If a separator is not given, any whitespace string is a separator. In
that case also possible consecutive whitespace as well as leading and
trailing whitespace is ignored.
Split words are returned as a list. If the optional max_split is
given, at most max_split splits are done, and the returned list will
have maximum max_split + 1 elements
You can do the following.
*** Test Cases ***
Test
    ${HG} =    Set Variable    Nodename\n ransdate\n ICC Support FAX Detection\n Trunk Group Number\n Bill Trunk Group Number\n MGW Name Trunk\n Group Name\n Sub-Route Name\n Circuit Type\n Group Direction\n Circuit Selection Mode\n
    @{words} =    Split String    ${HG}    \n
    ${HGLenght}=    Get Length    ${words}
    Log    ${words}
Results
${HGLenght} = 11
${words} = ['Nodename', 'ransdate', 'ICC Support FAX Detection', 'Trunk Group Number', 'Bill Trunk Group Number', 'MGW Name Trunk', 'Group Name', 'Sub-Route Name', 'Circuit Type', 'Group Direction', 'Circuit Selection Mode']
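Robot Framework's String library is written in Python, so the behaviour of splitting on a newline can be explored with plain str.split. A minimal sketch, using the eleven column names from the question as assumed input:

```python
# Assumed input: the text returned by Get Text, one column name per line.
text = "\n".join([
    "Nodename",
    "Transdate",
    "BICC Support FAX Detection",
    "Trunk Group Number",
    "Bill Trunk Group Number",
    "MGW Name Trunk",
    "Group Name",
    "Sub-Route Name",
    "Circuit Type",
    "Group Direction",
    "Circuit Selection Mode",
])

words = text.split("\n")
print(len(words))  # 11

# A trailing newline adds an empty last element with split("\n");
# splitlines() does not have that side effect.
print(len((text + "\n").split("\n")))   # 12
print(len((text + "\n").splitlines()))  # 11
```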
Hope This Helps
Thank you again, @WojTek T
My final code is
${HG}=    Get Text    ${ClickAV.CheckColumn}
@{words} =    Split String    ${HG}    \n
${UPPER1}=    Evaluate    "${words}".upper()
${UPPER2}=    Evaluate    "${view_dnc}".upper()
${HGLenght}=    Get Line Count    ${HG}
Should Be Equal    ${HGLenght}    ${view_dncCheckLength}
Should Contain    ${UPPER1}    ${UPPER2}
I tried to use "Get List Item" with the table name but it didn't work. I should have used this solution for the last question I asked you, haha.
Thank you again.

SQL Select statement until a character

I'm looking to extract all the text up until a '\' (backslash).
The substring is required to remove all preceding characters (17 in total), so I would like to return everything after the 17th character until it comes across a backslash.
I've tried using CHARINDEX, but it doesn't seem to stop at the \; it returns the characters afterward as well. My code is as follows:
SELECT path, substring(path,17, CHARINDEX('\',Path)+ LEN(Path)) As Data
FROM [Table].[dbo].[Projects]
WHERE Path like '\ENQ%\' AND
Deleted = '0'
Example
The below screen shot shows the basic query and result i.e the whole string
I then use substring to remove the first X characters, as there will always be the same number of preceding characters
But what I'm actually after is (based on the above result) the "Testing 1", "Testing 2" and "Testing ABC" section
The substring is required to remove all preceding characters (17 in total) and so I would like to return all after the 17th until it comes across a backslash.
select
substring(path, 17, CHARINDEX('\', Path, 17) - 17)
from
table
To overcome the "Invalid length parameter passed to the LEFT or SUBSTRING function" error, you can use CASE:
select
substring(path, 17,
CASE WHEN CHARINDEX('\', Path, 17) > 0
THEN CHARINDEX('\', Path, 17) - 17
ELSE VA END
)
from
table
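The position arithmetic is easy to get wrong, so here is the same idea as a small Python sketch. The path value is hypothetical (16 prefix characters, so the wanted text starts at 1-based position 17, as in the queries), and returning the rest of the string when no backslash follows is an assumption made only for this example.

```python
def text_until_backslash(path, start=17):
    """Return the text from 1-based position `start` up to the next backslash."""
    begin = start - 1               # SQL positions are 1-based, Python's are 0-based
    end = path.find("\\", begin)    # like CHARINDEX('\', path, start), but -1 if absent
    if end == -1:
        return path[begin:]         # assumed fallback: no later backslash
    return path[begin:end]


# Hypothetical path matching the LIKE '\ENQ%\' filter.
path = "\\ENQ\\2015\\00001\\Testing 1\\"
print(text_until_backslash(path))  # Testing 1
```

Searching from position 1 instead of position 17 would find the leading backslash first, which is what produces a negative length in the SQL version.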

Parsing currency value in Excel VBA

Ok... I guess my brain has finally gone on vacation without me today. I have extracted 2 fields from a website and I get this string
vData = "Amount Owed [EXTRACT]$125.00[EXTRACT]"
vData was declared as an array (string).
I split vData on [EXTRACT] and I end up with 2 string variables like this:
varA = "Amount Owed"
varB = "$125.00"
To parse varB, I use
Dim varC as Currency
varC = val(varB)
I want to use varC in an If statement:
If Val(VarC) <> 0 Then
I was expecting varC to be 125, but when I look at varC it is 0. I can't figure out why varC = 0 and not 125.
Use CCur(varB) instead of Val(varB). CCur converts to a Currency type, so it knows about $ and other things.
Explanation from the MS docs:
... when you use CCur, different decimal separators, different thousand separators, and various currency options are properly recognized depending on the locale setting of your computer.
Val is only looking for straight numbers:
The Val function stops reading the string at the first character it can't recognize as part of a number. Symbols and characters that are often considered parts of numeric values, such as dollar signs and commas, are not recognized.
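The difference between the two behaviours can be imitated in a few lines of Python. Both helpers below are hypothetical and only approximate the VBA functions: leading_number mimics "stop at the first unrecognized character", while parse_currency strips an assumed currency symbol and thousands separator before converting.

```python
import re
from decimal import Decimal


def leading_number(text):
    """Rough imitation of Val: read a number from the start, else give 0."""
    match = re.match(r"\s*[-+]?\d*\.?\d+", text)
    return float(match.group()) if match else 0.0


def parse_currency(text, symbol="$", thousands=","):
    """Rough imitation of CCur for one fixed symbol and separator."""
    cleaned = text.strip().replace(symbol, "").replace(thousands, "")
    return Decimal(cleaned)


print(leading_number("$125.00"))  # 0.0, the "$" stops the read immediately
print(parse_currency("$125.00"))  # 125.00
```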
