File Reading In C with a specific format - c

I wanted read a text file in C and I want to perform search on this file. This is the content of the text file:
(EDIT: The original format looks a bit different as there are no newlines in the file. It has been reformatted to remove the whitespace between the text strings and filtered through a multicolumn program for a 80col screen.)
^%1~3~31225~2999 ^%1~8~33983~5304
~MAC100 ~MAC100
~RAJU ~LATHA CHERIAN
CR ~ELIM VILLA
~CHEMPOLA ~1
~VT : 2999 ~9847569922
~9847569922 ~32166
~29408 ~Message for bill gro~1960.0
~Message for bill gro~750.0 ~160.0
~250.0 ~0.0
~0.0 ~1~scheme name
~1~scheme name ~0
~0 ~June
~June ~VA019_95784~-
~VA019_93159~- ~0.0
~0.0 ~0~amc date 1~amc date 2~990
~1~amc date 1~amc date 2~990 ~15.0
~15.0 ~150.0
~150.0 ~narration
~narration ^%1~9~31588~3235
^%1~5~30882~2496 ~MAC100
~MAC100 ~BABU
~VISWAMPARAN T. P. ~NADUMPARMBIL
~THALAKOTTUCHALIL ~0
~C 4771 ~9847569922
~9847569922 ~29771
~29065 ~Message for bill gro~3304.0
~Message for bill gro~4320.0 ~160.0
~160.0 ~0.0
~0.0 ~1~scheme name
~1~scheme name ~0
~0 ~June
~June ~VA019_93516~-
~VA019_92833~- ~0.0
~0.0 ~0~amc date 1~amc date 2~990
~0~amc date 1~amc date 2~990 ~15.0
~15.0 ~150.0
~150.0 ~narration
~narration ^?
This is database in format of billing system. I want to do a generic search function this file based on the name and the id (which is the ^%1~9~**31588**~3235, here 31588 like that). This is file of records. Each record are begin with ^%1.~ the ~ is used to separate column values of each record. The first and last characters are not needed (^%1 in of each records and ^? at last of the file). Please help me to do this.

You first should define (or understand) precisely your input format (what are the possible & forbidden characters), perhaps using some EBNF notation.
Then you could process your input line by line (using fgets or getline) and parse each line individually (using sscanf or strtol and extra manual parsing)

Read Line by Line using fgets(), and then tokenize using strtok on the basis "~". hope that works

Related

How to Convert Get Text Value to ArrayList in Robot framework

I would like to know how to convert this value to ArrayList?
${doc1}= Open Excel Document filename=${OpenExcel} doc_id=doc1
${view_bicccicmdu}= Read Excel Row row_num=1 max_num=6 sheet_name=UpperTT
${view_bicccicmduCheckLength}= Get Length ${view_bicccicmdu}
${HG}= Get Text ${ClickAV.CheckColumn}
${HGLenght}= Get Line Count ${HG}
Should Be Equal ${HGLenght} ${view_bicccicmduCheckLength}
Should Contain ${HG} ${view_bicccicmdu} ignore_case=True
Close Excel Document
But the result is
${HG} = Nodename
Transdate
BICC Support FAX Detection
Trunk Group Number
Bill Trunk Group Number
MGW Name Trunk
Group Name
Sub-Route Name
Circuit Type
Group Direction
Circuit Selection Mode
I need to convert it to be ArrayList and should count to be 11 Records, What should I do?
You can use the String Library and Split the string using \n as your separator, because in your case your data is separated by a line break, You can split the string into a list.
Splits the string using separator as a delimiter string.
If a separator is not given, any whitespace string is a separator. In
that case also possible consecutive whitespace as well as leading and
trailing whitespace is ignored.
Split words are returned as a list. If the optional max_split is
given, at most max_split splits are done, and the returned list will
have maximum max_split + 1 elements
You can do the following.
*** Test Cases ***
Test
${HG} = Set Variable Nodename\n ransdate\n ICC Support FAX Detection\n Trunk Group Number\n Bill Trunk Group Number\n MGW Name Trunk\n Group Name\n Sub-Route Name\n Circuit Type\n Group Direction\n Circuit Selection Mode\n
#{words} = Split String ${HG} \n
${HGLenght}= Get length ${words}
log ${words}
Results
${HGLenght} = 11
${words} = ['Nodename', 'ransdate', 'ICC Support FAX Detection', 'Trunk Group Number', 'Bill Trunk Group Number', 'MGW Name Trunk', 'Group Name', 'Sub-Route Name', 'Circuit Type', 'Group Direction', 'Circuit Selection Mode']
Hope This Helps
Thank you again, #WojTek T
My final code is
`${HG}= Get Text ${ClickAV.CheckColumn}
#{words} = Split String ${HG} \n
${UPPER1}= Evaluate "${words}".upper()
${UPPER2}= Evaluate "${view_dnc}".upper()
${HGLenght}= Get Line Count ${HG}
Should Be Equal ${HGLenght} ${view_dncCheckLength}
Should Contain ${UPPER1} ${UPPER2}`
I try to use "Get List Item" with Table name but It doesn't work, I should do this solution for my last question that I asking u. haha
Thank you again.

Ui-mask definition for alphanumeric

I am using ui-mask="AA/AAA/9999999/999/9999999" in my angularjs to read PF number
the format will be Format : (Region Code/Office Code/Est Code/Extn No/Member Acc No) Example HR/FBD/0003256/000/0000125.
Now, istead of Extn No 000 i need to allow alphanumeric like 000,00A,00R for last chanracter.how to do it pls suggest.

Replace values only if they are different

I have a vcf file like this:
http://www.1000genomes.org/node/101
Here's the example from that site:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
After the header lines, each line has fields that contain genotypes starting with the 10th field. The 10th field is below the NA0001 heading; the 11th field is genotype NA0002, etc. I have a file with 123 different genotypes, so going from position 10 to 133 (NA0001 until NA0123). What is shown in these fields can be 0/0, 0/1, 0/2 .... till 8/9 for instance. Now I want to replace all the non-equal ones. So I would like to keep 0/0, 1/1, 2/2, etc. And replace 0/1, 0/2, 1/2, 4/5, 4/6 etc by ./.
I would like to write this in a C script. Thought about using sed y/regexp/replacement/ but no idea how to write all those unequal values in a regular expression. And on other positions in the file there could also be these values, so really only positions 10 till 133 should be replaced. And it needs to be replaced; I will be needing the rest of the file with the new values.
Hope it is clear. Anyone any idea how to do this?
This regex should do what you want: \s(\d)[|\/](?!\1)\d: Replace matches with ./.:
Breakdown:
\s(\d) matches a space followed by a single digit, capturing the digit in capture group #1
[|\/] matches a pipe or slash (since it seems that the VCF format allows either)
(?!\1)\d uses a negative lookahead to ensure that the next character is not the same as capture group #1, and matches the digit
Caveats:
I matched a leading space and trailing : to try to ensure it matches only the intended values. I couldn't work out a good way to limit it to fields 10 and after.
Example using perl:
perl -pe 's#\s(\d)[|/](?!\1)\d:# ./.:#g' testfile.vcf > testfile_afterchange.vcf
Note: I used # as the delimiter to avoid having to escape the / characters in the regex.

Organizing and Searching for (dates , strings , countries) in matlab

I just got done doing a project here on R and am now doing some work with matlab.
I need to make 3 vectors :
DOD
Country
Age
Count and store a .txt list with 236 data points the data in the text file looks like this:
Unknown woman
Cause of death: found dead, with eyes removed.
Location of death: Jardim dos Ipês Itaquaquecetuba, São Paulo, Brazil
Date of death: August 9th, 2014
Cris
Cause of death: multiple gunshot wounds
Location of death: Portal da Foz, Foz do Iguaçu, Brazil
Date of death: September 13th, 2014
Betty Skinner (52 years old)
Cause of death: blunt force trauma to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 4th, 2013
Brittany Stergis (22 years old)
Cause of death: gunshot wound to the head
Location of death: Cleveland, Ohio, USA
Date of death: December 5th, 2013
I have no idea how to look for string and organize them but would appreciate any ideas how to get started.
You can use textscan to read the file into a cell array of strings, and then use regexp to parse the strings to get your desired fields.
First, we read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
Next, we can unleash the awesome power of regular expressions to parse your string of dead women:
format = {
'(?<name>[ \w]*)'
' \('
'(?<age>[\d]*)'
' years old\) - Cause of death: '
'(?<cause>[ \w]*)'
' - Location of death: '
'(?<city>[ \w]*)'
', '
'(?<province>[ \w]*)'
', '
'(?<country>[ \w]*)'
' - Date of death: '
'(?<date>[ ,\w]*)'
};
format = [format{:}];
Here we're just defining a format string. I've broken it up like this to make it a little clearer what's going on. Let's go through it line-by-line:
(?<name>[ \w]*) The parentheses indicate that this is a chunk of text (a.k.a. a "token") that we wish to capture. The ?<name> says that we will call this token "name". Finally, the [ \w]* specifies what kind of text to match. The stuff inside the square brackets specifies which characters to look for: spaces () and/or alphanumeric characters (\w). The * outside the square brackets indicates that we will accept any number of these characters.
\( Next we are looking for a space and an open parenthesis. The backslash in front of the parenthesis is to indicate that we are looking for a literal parenthesis, i.e. this parenthesis should not be interpreted as the start of another token to capture.
(?<age>[\d]*) Another token to capture. This one is called "age" and contains any number of \d (numeric characters).
years old \) - Cause of death: More text to look for. Again, we will be matching this text, but we will not capturing it (because it is not enclosed in parentheses).
(?<city>[ \w]*) Another token to capture. This one is called "city" and contains any number of spaces and/or alphanumeric characters.
, Comma, space
(?<province>[ \w]*), (?<country>[ \w]*) - Date of death: You get the idea
(?<date>[ ,\w]*) Our final token, called "date", which contains any number of spaces, commas, and/or alphanumeric characters.
Then we parse the strings into a struct array:
parsed_fields = regexp(text_array, format, 'names');
parsed_fields = [parsed_fields{:}]'
This is what the output should look like:
>> parsed_fields(1)
ans =
name: 'Jacqueline Cowdrey'
age: '50'
cause: 'unknown'
city: 'Worthing'
province: 'West Sussex'
country: 'United Kingdom'
date: 'November 20th, 2013'
So you can get your vector of countries pretty straightforward-ly:
Country = {parsed_fields.country}';
Age is a simple numeric conversion:
Age_str = {parsed_fields.age};
Age = cellfun(#str2double, Age_str)';
Date as a string is pretty easy:
Date_str = {parsed_fields.date}';
But it's nice to have it as a MATLAB "serial date number", which allows arithmetic computations and reformatting into different types of representation formats. Unfortunately, having the day as "20th" instead of "20" is incompatible with the conversion functions, so we'll need to first strip off the "st", "nd", "rd" from "1st", "2nd", "3rd", etc:
Date_str = regexprep(Date_str, '(?<day>[\d]+)(st|nd|rd|th)', '$<day>');
Date_num = datenum(Date_str, 'mmmm dd, yyyy');
Some other notes:
If the file is very large, you may wish to use fgetl to read it one line at a time (and then also parse it one line at a time) rather than reading the entire file into memory as we did above.
In your example, it looks like the entries are separated by an extra newline. I'm not sure if that's case in your actual data or if that's just a stackoverflow thing, but if you need to remove these newlines you can do so with:
is_empty_line = cellfun(#isempty, text_array);
text_array = text_array(~is_empty_line);
In your example, there were a lot of typos (an extra space here and there, sometimes the colons or dashes were other symbols). If these typos exist in your actual data, you will need to adjust the format specification to account for this. For example, instead of using - to match (space, dash, space), you can use \s*\W\s* to match (any number of whitespace characters, a single non-alphanumeric character, any number of whitespace characters).
If syntax like format = [format{:}]; or Country = {parsed_fields.country}'; look strange to you, these are equivalent to:
format = [format{1} format{2} format{3} ... format{end}];
Country = cell(length(parsed_fields),1);
for ii = 1:length(parsed_fields)
Country{ii} = parsed_fields(ii).country;
end
MATLAB R2014b added a new datetime class, so there may be a better way to deal with that nowadays.
Sorry about my previous answer; I had misunderstood how exactly the data is formatted.
As before, let's first read the text file into a cell array of strings:
fid = fopen('deaths.txt');
scanned_fields = textscan(fid, '%s', 'Delimiter','\n');
text_array = scanned_fields{1};
fclose(fid);
While textscan is capable of some rudimentary parsing, it's not sophisticated enough for what we're doing. So we're just using it to read each line as a single string: format %s means we are expecting a string, and setting Delimiter to \n means that the strings are separated by newline characters.
In the sample data you posted, each entry is 4 lines (name, cause, location, date) followed by an empty line. As long as we can rely on this formatting, this provides an easy way to split up the data (instead of the regexp parsing I proposed in my previous answer).
name_str_array = text_array(1:5:end);
cause_str_array = text_array(2:5:end);
loc_str_array = text_array(3:5:end);
date_str_array = text_arary(4:5:end);
So for example, name_strs is going to be every 5th line, starting with line #1. Likewise, cause_strs is every 5th line, starting with line #2. Just be careful that there are not any extra or missing lines in the data.
Next we will parse each of these to get the information that we want. In my previous answer, I proposed parsing all of the strings at once, but I think it would be easier to understand if we went through it one entry at a time. For example, let's consider the first entry.
name_str = name_str_array{1};
loc_str = loc_str_array{1};
date_str = date_str_array{1};
Let's start with the easiest one: parsing the date.
date_format = 'Date of death:\s*(?<date>.*)';
parsed_fields = regexp(date_str, date_format, 'names');
DOD = parsed_fields.date;
The format we're looking for is the string Date of death:, followed by any number of whitespace characters (\s*), followed by the chunk of text (aka "token") that we wish to capture: (?<date>.*)
The parentheses indicate that this is a token we wish to capture, the ?<date> indicates that we wish to call this token "date", and the .* specifies which characters to look for. The . is the universal wildcard, i.e. it matches all possible characters. The * indicates that we are interested in any number of repeats. So in essence, this .* means "match all remaining characters in the string".
Calling regexp with the names option causes it to return a struct with the named tokens as its fields.
Next, let's do the country. This one is a little trickier because there is a variable number of city/region specifiers. But the country will always be the last one, so that's the one we'll grab.
country_format = '(?<country>\w[ \w]*)$';
parsed_fields = regexp(loc_str, country_format, 'names');
Country = parsed_fields.country;
This format specification is the token (?<country>\w[ \w]*) followed by the end of the string (denoted by the special character $). In the token specification we are matching an alphanumeric character (\w) followed by any number of spaces and/or alphanumeric characters ([ \w]*). The reason for specifying this leading \w is so that we don't match the space between the previous comma and the start of the country name.
Finally, let's do the age. This one is tricky because not every entry has an age. At least it's easy because the age (if it exists) is the only numeric data in the line. Hence:
age_format = '(?<age>[\d]+)';
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age = -1;
else
Age = str2double(parsed_fields.age);
end
The format specification is simply the token (?<age>[\d]+), which specifies that we are looking for numeric characters (\d), and we are looking for one or more of them (+).
After parsing, we check whether or not there was a match. If not (parsed_fields is empty), then we assign Age a value of -1. Otherwise, we convert the parsed age field into a number.
So putting it all together:
date_format = 'Date of death:\s*(?<date>.*)';
country_format = '(?<country>\w[ \w]*)[\W]?$';
age_format = '(?<age>[\d]+)';
nEntries = length(date_str_array);
DOD = cell(nEntries, 1);
Country = cell(nEntries, 1);
Age = zeros(nEntries, 1);
for ii = 1:nEntries
name_str = name_str_array{ii};
loc_str = loc_str_array{ii};
date_str = date_str_array{ii};
parsed_fields = regexp(date_str, date_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse date from:\n%s', date_str);
DOD{ii} = parsed_fields.date;
parsed_fields = regexp(loc_str, country_format, 'names');
assert(~isempty(parsed_fields), 'Could not parse country from:\n%s', loc_str);
Country{ii} = parsed_fields.country;
parsed_fields = regexp(name_str, age_format, 'names');
if isempty(parsed_fields)
Age(ii) = -1;
else
Age(ii) = str2double(parsed_fields.age);
end
end
I added the assert statements to help debug what's going on if you encounter errors in parsing.
For example, you may also notice that I added an [\W]? to the country format. This is because while running it on your example data, I encountered one country that contained a period at the end of the line (i.e. it ended with "Brazil." instead of just "Brazil"). So now we're looking to match a non-alphanumeric character (\W) repeated zero or 1 times (?), and it's outside of the parentheses so it is not being captured as part of the "country" token.

Lex/Flex - Split the phone number Up?

I am making a program which got to split the phone-number apart, each part has been divided by a hyphen (or spaces, or '( )' or empty).
Exp: Input: 0xx-xxxx-xxxx or 0xxxxxxxxxx or (0xx)xxxx-xxxx
Output: code 1: 0xx
code 2: xxxx
code 3: xxxx
But my problem is: sometime "Code 1" is just 0x -> so "Code 2" must be xxxxx (1st part always have hyphen or a parenthesis when 2 digit long)
Anyone can give me a hand, It would be grateful.
According to your comments, the following regex will extract the information you need
^\(?(0\d{1,2})\)?[- ]?(\d{4,5})[- ]?(\d{4})$
Break down:
^\(?(0\d{1,2})\)? matches 0x, 0xx, (0xx) and (0x) at he beggining of the string
[- ]? as parenthesis can only be used for the first group, the only valid separators left are space and the hyphen. ? means 0 or 1 time.
(\d{4,5}) will match the second group. As the length of the 3rd group is fixed (4 digits), the regex will automatically calculate the length of the Group1 and 2.
(\d{4})$ matches the 4 digits at the end of the number.
See it in action
You can the extract data from capture group 1,2 and 3
Note: As mentionned in the comments of the OP, this only extracts data from correctly formed numbers. It will match some ill-formed numbers.

Resources