Tokenizing a raw string in C

I would like to tokenize a string, but in a very special way.
I have the following string, formed by 3 groups of words, separated by a space:
string = abc def ghi
The thing is that I would like to load into another string all the content of string variable until the second space. That is, I would like to get:
result = abc def
And not only abc (that solution was in other forums). Please, note that the length of each word could differ.
How would I do that?

I would like to load in one string all the content of string variable
until the second space
How about:
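/* needs <string.h>; result is assumed to be a buffer at least as large as string */
char *space = strchr(string, ' ');        /* first space */
if (!space)
    return -1;                            /* fewer than two words */
space = strchr(space + 1, ' ');           /* second space */
if (!space)
    return -1;                            /* fewer than three words */
size_t len = (size_t)(space - string);    /* length of "abc def" */
memcpy(result, string, len);
result[len] = '\0';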
Or if you know there will always be exactly 3 words, do a single strrchr (reverse). Or maybe do 2 sscanfs and then join the strings, or 2 strtoks etc.

Related

How to loop assigning characters in a string to variable?

I need to take a string and assign each character to a new string variable for a Text To Speech engine to read out each character separately, mainly to control the speed at which it's read out by adding pauses in between each character.
The string contains a number which can vary in length from 6 digits to 16 digits, and I've put the below code together for 6 digits but would like something neater to handle any different character count.
I've done a fair bit of research but can't seem to find a solution, plus I'm new to Groovy / programming.
OrigNum= "12 34 56"
Num = OrigNum.replace(' ','')
sNum = Num.split("(?!^)")
sDigit1 = sNum[0]
sDigit2 = sNum[1]
sDigit3 = sNum[2]
sDigit4 = sNum[3]
sDigit5 = sNum[4]
sDigit6 = sNum[5]
Edit: The reason for needing a new variable for each character is that the app I'm using doesn't let the TTS engine run any code. I have to declare a variable beforehand for it to be read out.
Sample TTS input: "The number is [var:sDigit1] [pause] [var:sDigit2] [pause]..."
I've tried using [var:sNum[0]] [var:sNum[1]] to read from the array instead, but it is not recognised.
Read this about dynamically creating variable names.
You could use a map in your situation, which is cleaner and more Groovy-like:
Map digits = [:]
OrigNum.replaceAll("\\s","").eachWithIndex { digit, index ->
digits[index] = digit
}
println digits[0] //first element == 1
println digits[digits.size() - 1] //last element == 6 (a map has no negative indexing; digits[-1] would look up the key -1)
println digits.size() // 6
Not 100% sure what you need, but to convert your input String to output you could use:
String origNum = "12 34 56"
String out = 'The number is ' + origNum.replaceAll( /\s/, '' ).collect{ "[var:$it]" }.join( ' [pause] ' )
gives:
The number is [var:1] [pause] [var:2] [pause] [var:3] [pause] [var:4] [pause] [var:5] [pause] [var:6]

Codename One - String replace with empty character

I like to normalize the phone numbers I get from the contacts in the local phone book. To do that, I want to remove any spaces, dashes, plus signs etc from the number.
CN1 only offers the String.replace(oldchar, newchar) function, not the other String replace operations. According to this post,
How to represent empty char in Java Character class, this should be the way to go:
primaryPhoneNumber = primaryPhoneNumber.replace(' ', Character.MIN_VALUE);
However, this approach has several problems.
The char in the console output looks like a space, but it's not; it's a string terminator (NUL).
+49 234-63446
0 234 63446
When this normalized string, which includes Character.MIN_VALUE, is used in a database, the query involving it crashes:
Caused by: org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
How to properly remove spaces and other chars and replace them with a "nothing" character?
You can use:
String p = StringUtil.replaceAll(phone, " ", ""); // com.codename1.util.StringUtil
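Another way to avoid the NUL character entirely is to build a new string containing only the characters you want to keep. The following is a minimal sketch in plain Java, assuming digits are the only characters worth keeping. The class and method names (PhoneNormalizer, digitsOnly) are made up for illustration, and it has not been checked against the Codename One API subset, though it relies only on String and StringBuilder.

```java
// Hypothetical helper: keeps only the ASCII digits of a phone number.
class PhoneNormalizer {

    // Returns a copy of 'phone' with every non-digit character dropped.
    static String digitsOnly(String phone) {
        StringBuilder sb = new StringBuilder(phone.length());
        for (int i = 0; i < phone.length(); i++) {
            char c = phone.charAt(i);
            if (c >= '0' && c <= '9') {
                sb.append(c); // keep digits, skip spaces, dashes, plus signs, etc.
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+49 234-63446")); // prints 4923463446
    }
}
```

Because nothing is substituted for the removed characters, the result contains no 0x00 bytes and can be stored in the database as usual.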

Octave - Adding '\n' to String Array is Not Creating a New Line

I want to change ',' character to '\n' and save it to the text file
All files are in this format:
546,234,453,685,.....,234
I want to make it like:
546
234
453
685
...
234
My initial attempt looks like this:
fid=fopen(files{i});
strArr=fscanf(fid,'%s');
newstrArr=strrep(strArr,',','\n');
% Take each .txt input
for j=1:length(newstrArr)
Array=[Array newstrArr(j)];
endfor
Let me explain step by step:
1st I open the current text file
fid=fopen(files{i});
2nd I find the strings in text file
strArr=fscanf(fid,'%s');
Please Note that you can't replace %s with %d. (Correct me if I am wrong)
3rd I replace commas with newline character
newstrArr=strrep(strArr,',','\n');
4th I add each character to a new array with for loop
for j=1:length(newstrArr)
Array=[Array newstrArr(j)];
endfor
However, when I display the result using
disp(Array);
the output still contains the literal characters \n instead of line breaks.
How can I properly replace the commas with newlines?
Regards
The issue is that you are inserting a literal '\n' (the characters \ and n) and not a newline character. This is because in Octave, a single-quote enclosed string ignores escape sequences. If you want Octave to respect escape sequences you could use a double-quoted string which will convert \n into a newline.
strrep(strArr, ',', "\n");
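Or if you want your code to be MATLAB-compatible, you'll want to instead use char(10) (an actual newline character). This is because MATLAB does not interpret escape sequences in string literals, whether single- or double-quoted (and releases before R2017a have no double-quoted strings at all).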
output = strrep(strArr, ',', char(10));
Another option would be to split your input at the , and use sprintf to add the newlines (it'll treat \n as a newline)
values = strsplit(strArr, ',');
output = sprintf('%s\n', values{:});
If you just want to save each entry to a new line in a file, you can use fprintf instead.
values = strsplit(strArr, ',');
fout = fopen('output.txt', 'w');
fprintf(fout, '%s\n', values{:});
fclose(fout);
If you really just want to replace "," with newline simply do
in = fileread ("yourfile");
out = strrep (in, ",", "\n")
out = 546
234
453
685
234
Btw, see the difference between "\n" (in GNU Octave a newline) and '\n' (literally \n)
Another option is to use regexprep(), this has the advantage of being MATLAB compatible. Assuming that the newline convention you want is \n, then
regexprep('123,456,789',',','\n')
ans = 123
456
789
When output to a file via fprintf() the result looks like
123
456
789
provided the text editor understands the newline convention.

Two strings look the same, but comparing them returns false

I am comparing two strings. I read string 1 (expectedResult) from an Excel sheet, and I get string 2 (actualResult) from a web page using getElementByXPath("errorMsg_userPass").getText();
But when I compare the two strings, the result is false even though they look the same.
I don't know why this is happening. Please help.
Use trim() to remove leading and trailing spaces.
I recommend looking at the exact bytes of the actual and expected strings. There might, for instance, be a non-breaking space (U+00A0) instead of a regular space; the strings will then look the same, but equals() will return false.
You can see the difference by running the following snippet:
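// needs: import java.util.Arrays; import java.nio.charset.StandardCharsets;
String a = "a\u00A0b"; // non-breaking space between a and b
String b = "a b";      // regular space
System.out.println(a + "|" + Arrays.toString(a.getBytes(StandardCharsets.UTF_8)));
System.out.println(b + "|" + Arrays.toString(b.getBytes(StandardCharsets.UTF_8)));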
Which will output:
a b|[97, -62, -96, 98]
a b|[97, 32, 98]
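If the difference does turn out to be a non-breaking space, note that trim() alone will not help, since it only strips characters up to U+0020. Below is a minimal sketch of normalizing both strings before comparing them. The class and method names are hypothetical, and it assumes U+00A0 is the only look-alike character involved.

```java
// Hypothetical helper: compares two strings after normalizing whitespace look-alikes.
class TextCompare {

    // Replaces non-breaking spaces with regular spaces, then trims both ends.
    static String normalize(String s) {
        return s.replace('\u00A0', ' ').trim();
    }

    // True if both strings are equal once normalized.
    static boolean sameText(String expected, String actual) {
        return normalize(expected).equals(normalize(actual));
    }

    public static void main(String[] args) {
        String expected = "a b";         // e.g. read from the spreadsheet
        String actual = " a\u00A0b ";    // e.g. read from the web page
        System.out.println(expected.equals(actual));    // prints false
        System.out.println(sameText(expected, actual)); // prints true
    }
}
```

If other invisible characters show up in the byte dump, they would need their own replace step in normalize().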

Organizing and Searching for (dates , strings , countries) in matlab

I just got done doing a project here on R and am now doing some work with matlab.
I need to make 3 vectors :
DOD
Country
Age
I need to count and store these from a .txt list with 236 data points. The data in the text file looks like this:
Unknown woman
Cause of death: found dead, with eyes removed.
Location of death: Jardim dos Ipês Itaquaquecetuba, São Paulo, Brazil
Date of death: August 9th, 2014
Cris
Cause of death: multiple gunshot wounds
Location of death: Portal da Foz, Foz do Iguaçu, Brazil
Date of death: September 13th, 2014
Betty Skinner (52 years old)
Cause of death: blunt force trauma to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 4th, 2013
Brittany Stergis (22 years old)
Cause of death: gunshot wound to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 5th, 2013
I have no idea how to look for string and organize them but would appreciate any ideas how to get started.
You can use textscan to read the file into a cell array of strings, and then use regexp to parse the strings to get your desired fields.
First, we read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
Next, we can use regular expressions to parse each entry:
format = {
'(?<name>[ \w]*)'
' \('
'(?<age>[\d]*)'
' years old\) - Cause of death: '
'(?<cause>[ \w]*)'
' - Location of death: '
'(?<city>[ \w]*)'
', '
'(?<province>[ \w]*)'
', '
'(?<country>[ \w]*)'
' - Date of death: '
'(?<date>[ ,\w]*)'
};
format = [format{:}];
Here we're just defining a format string. I've broken it up like this to make it a little clearer what's going on. Let's go through it line-by-line:
(?<name>[ \w]*) The parentheses indicate that this is a chunk of text (a.k.a. a "token") that we wish to capture. The ?<name> says that we will call this token "name". Finally, the [ \w]* specifies what kind of text to match. The stuff inside the square brackets specifies which characters to look for: spaces ( ) and/or alphanumeric characters (\w). The * outside the square brackets indicates that we will accept any number of these characters.
\( Next we are looking for a space and an open parenthesis. The backslash in front of the parenthesis is to indicate that we are looking for a literal parenthesis, i.e. this parenthesis should not be interpreted as the start of another token to capture.
(?<age>[\d]*) Another token to capture. This one is called "age" and contains any number of \d (numeric characters).
years old\) - Cause of death: More text to look for. Again, we will be matching this text, but we will not be capturing it (because it is not enclosed in unescaped parentheses).
(?<city>[ \w]*) Another token to capture. This one is called "city" and contains any number of spaces and/or alphanumeric characters.
, Comma, space
(?<province>[ \w]*), (?<country>[ \w]*) - Date of death: You get the idea
(?<date>[ ,\w]*) Our final token, called "date", which contains any number of spaces, commas, and/or alphanumeric characters.
Then we parse the strings into a struct array:
parsed_fields = regexp(text_array, format, 'names');
parsed_fields = [parsed_fields{:}]'
This is what the output should look like:
>> parsed_fields(1)
ans =
name: 'Jacqueline Cowdrey'
age: '50'
cause: 'unknown'
city: 'Worthing'
province: 'West Sussex'
country: 'United Kingdom'
date: 'November 20th, 2013'
So you can get your vector of countries pretty straightforward-ly:
Country = {parsed_fields.country}';
Age is a simple numeric conversion:
Age_str = {parsed_fields.age};
Age = cellfun(@str2double, Age_str)';
Date as a string is pretty easy:
Date_str = {parsed_fields.date}';
But it's nice to have it as a MATLAB "serial date number", which allows arithmetic computations and reformatting into different representation formats. Unfortunately, having the day as "20th" instead of "20" is incompatible with the conversion functions, so we'll first need to strip off the "st", "nd", "rd", or "th" suffix from "1st", "2nd", "3rd", "4th", etc:
Date_str = regexprep(Date_str, '(?<day>[\d]+)(st|nd|rd|th)', '$<day>');
Date_num = datenum(Date_str, 'mmmm dd, yyyy');
Some other notes:
If the file is very large, you may wish to use fgetl to read it one line at a time (and then also parse it one line at a time) rather than reading the entire file into memory as we did above.
In your example, it looks like the entries are separated by an extra newline. I'm not sure if that's case in your actual data or if that's just a stackoverflow thing, but if you need to remove these newlines you can do so with:
is_empty_line = cellfun(@isempty, text_array);
text_array = text_array(~is_empty_line);
In your example, there were a lot of typos (an extra space here and there, sometimes the colons or dashes were other symbols). If these typos exist in your actual data, you will need to adjust the format specification to account for this. For example, instead of using ' - ' to match (space, dash, space), you can use \s*\W\s* to match (any number of whitespace characters, a single non-alphanumeric character, any number of whitespace characters).
If syntax like format = [format{:}]; or Country = {parsed_fields.country}'; looks strange to you, these are equivalent to:
format = [format{1} format{2} format{3} ... format{end}];
Country = cell(length(parsed_fields),1);
for ii = 1:length(parsed_fields)
Country{ii} = parsed_fields(ii).country;
end
MATLAB R2014b added a new datetime class, so there may be a better way to deal with that nowadays.
Sorry about my previous answer; I had misunderstood how exactly the data is formatted.
As before, let's first read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
In the sample data you posted, each entry is 4 lines (name, cause, location, date) followed by an empty line. As long as we can rely on this formatting, this provides an easy way to split up the data (instead of the regexp parsing I proposed in my previous answer).
name_str_array = text_array(1:5:end);
cause_str_array = text_array(2:5:end);
loc_str_array = text_array(3:5:end);
date_str_array = text_array(4:5:end);
So for example, name_str_array is going to be every 5th line, starting with line #1. Likewise, cause_str_array is every 5th line, starting with line #2. Just be careful that there are no extra or missing lines in the data.
Next we will parse each of these to get the information that we want. In my previous answer, I proposed parsing all of the strings at once, but I think it would be easier to understand if we went through it one entry at a time. For example, let's consider the first entry.
name_str = name_str_array{1};
loc_str = loc_str_array{1};
date_str = date_str_array{1};
Let's start with the easiest one: parsing the date.
date_format = 'Date of death:\s*(?<date>.*)';
parsed_fields = regexp(date_str, date_format, 'names');
DOD = parsed_fields.date;
The format we're looking for is the string Date of death:, followed by any number of whitespace characters (\s*), followed by the chunk of text (aka "token") that we wish to capture: (?<date>.*)
The parentheses indicate that this is a token we wish to capture, the ?<date> indicates that we wish to call this token "date", and the .* specifies which characters to look for. The . is the universal wildcard, i.e. it matches all possible characters. The * indicates that we are interested in any number of repeats. So in essence, this .* means "match all remaining characters in the string".
Calling regexp with the names option causes it to return a struct with the named tokens as its fields.
Next, let's do the country. This one is a little trickier because there is a variable number of city/region specifiers. But the country will always be the last one, so that's the one we'll grab.
country_format = '(?<country>\w[ \w]*)$';
parsed_fields = regexp(loc_str, country_format, 'names');
Country = parsed_fields.country;
This format specification is the token (?<country>\w[ \w]*) followed by the end of the string (denoted by the special character $). In the token specification we are matching an alphanumeric character (\w) followed by any number of spaces and/or alphanumeric characters ([ \w]*). The reason for specifying this leading \w is so that we don't match the space between the previous comma and the start of the country name.
Finally, let's do the age. This one is tricky because not every entry has an age. At least it's easy because the age (if it exists) is the only numeric data in the line. Hence:
age_format = '(?<age>[\d]+)';
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age = -1;
else
Age = str2double(parsed_fields.age);
end
The format specification is simply the token (?<age>[\d]+), which specifies that we are looking for numeric characters (\d), and we are looking for one or more of them (+).
After parsing, we check whether or not there was a match. If not (parsed_fields is empty), then we assign Age a value of -1. Otherwise, we convert the parsed age field into a number.
So putting it all together:
date_format = 'Date of death:\s*(?<date>.*)';
country_format = '(?<country>\w[ \w]*)[\W]?$';
age_format = '(?<age>[\d]+)';
nEntries = length(date_str_array);
DOD = cell(nEntries, 1);
Country = cell(nEntries, 1);
Age = zeros(nEntries, 1);
for ii = 1:nEntries
name_str = name_str_array{ii};
loc_str = loc_str_array{ii};
date_str = date_str_array{ii};
parsed_fields = regexp(date_str, date_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse date from:\n%s', date_str);
DOD{ii} = parsed_fields.date;
parsed_fields = regexp(loc_str, country_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse country from:\n%s', loc_str);
Country{ii} = parsed_fields.country;
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age(ii) = -1;
else
Age(ii) = str2double(parsed_fields.age);
end
end
I added the assert statements to help debug what's going on if you encounter errors in parsing.
For example, you may also notice that I added an [\W]? to the country format. This is because while running it on your example data, I encountered one country that contained a period at the end of the line (i.e. it ended with "Brazil." instead of just "Brazil"). So now we're looking to match a non-alphanumeric character (\W) repeated zero or 1 times (?), and it's outside of the parentheses so it is not being captured as part of the "country" token.
