I have to parse some tables from an ASCII text file. Here's a partial sample:
QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212
RECKITTBEN 192.50 209.00 192.50 201.80 5.21 34 2850 5.707
RUPALIINS 150.00 159.00 150.00 156.25 6.29 4 80 .125
SALAMCRST 164.00 164.75 163.00 163.25 -.45 80 8250 13.505
SINGERBD 779.75 779.75 770.00 773.00 -.89 8 95 .735
SONARBAINS 68.00 69.00 67.50 68.00 .74 11 3050 2.077
The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.
I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:
(\S+)\s+(\s+[\d\.\-]+){8}
But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:
You repeated the capturing group
itself. The group will capture only
the last iteration. Put a capturing
group around the repeated group to
capture all iterations.
I've consulted their help file, but I don't have a clue as to how to solve this.
How can I capture each column separately?
In C# (modified from this example):
string input = "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212";
string pattern = #"^(\S+)\s+(\s+[\d.-]+){8}$";
Match match = Regex.Match(input, pattern, RegexOptions.MultiLine);
if (match.Success) {
Console.WriteLine("Matched text: {0}", match.Value);
for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
Console.WriteLine(" Group {0}: {1}", ctr, match.Groups[ctr].Value);
int captureCtr = 0;
foreach (Capture capture in match.Groups[ctr].Captures) {
Console.WriteLine(" Capture {0}: {1}",
captureCtr, capture.Value);
captureCtr++;
}
}
}
Output:
Matched text: QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212
...
Group 2: 1.212
Capture 0: 11.00
Capture 1: 11.10
Capture 2: 11.00
...etc.
If you want to know what the warning is appearing for, it's because your capture group matches multiple times (8, as you specified) but the capture variable can only have one value. It is assigned the last value matched.
As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.
The warning suggests that you could put another group around the whole set, like this:
(\S+)\s+((\s+[\d\.\-]+){8})
You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.
Unfortunately you need to repeat the (…) 8 times to get each column separately.
^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$
If code is possible, you can first match those numeric columns as a whole
>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)
then split the columns by spaces
>>> [[p] + q.split() for p, q in allres]
Related
I have browsed almost all possible pages on the subject and I still can't find a way to extract a matched data dataset with the MatchThem package.
By analogy, MatchIt allows via the function match.data() to extract the dataset of matched data for example 3:1. Although MatchThem's complete() function is the equivalent, this function apparently does not allow to extract exclusively the imputed AND matched dataset.
Here is an example of multiple imputation with 3:1 matching from which I am trying to extract multiple matched datasets:
library(mice)
library(MatchThem)
#Multiple imputations
mids_object <- mice(data, maxit = 5, m=3, seed= 20211022, printFlag = F) # m=3 is voluntarily low for this example.
#Matching
mimids_object <- matchthem(primary_subtype ~ age + bmi + ps, data = mids_object, approach = "within" ,ratio= 3, method = "optimal")
#Details of matched data
print(mimids_object)
Printing | dataset: #1
A matchit object
method: Variable ratio 3:1 optimal pair matching
distance: Propensity score
- estimated with logistic regression
number of obs: 761 (original), 177 (matched)
target estimand: ATT
covariates: age, bmi, ps
#Extracting matched dataset
complete(mimids_object, action = "long") -> complete_mi_matched
#Summary of extracted dataset to check correct number of match
summary(complete_mi_matched$primary_subtype)
classic ADK SRC
702 59
It should show the matched proportion 3:1 with 177 matched (177 classic ADK and 59 SRC)
I am missing something. Thanks in advance for your help or suggestions.
I just learned about the "do" loop today and would like to try using it for data entry in SAS. I have tried most examples online, but I still cannot figure it out.
My dataset in an experiment with 6 treatments (1 to 6) using 2 sets of cues, 3 each, Visual and Audio. There's lag measured in seconds, which are 5, 10, and 15, which there are 2 sets.
Basically it looks like this:
Table
The entries I want are:
1. Obs_no, ranging from 1 to 18 (total of 18 observations, this allows me to easily delete outliers with an IF THEN)
2. Treatment type, which are Auditory and Visual.
3.Treatment number, 1 to 6, 3 sets.
4. Lag, 5, 10 or 15.
5. And the data itself
So far, my code makes 2 and 5 possible, it also makes the rest possible with an IF THEN statement and input statement, although I assume there's a way easier method:
data AVCue;
do cue = 'Auditory','Visual';
do i = 1 to 3;
input AVCue ##;
output;
end;
end;
datalines;
.204 .167 .202 .257 .283 .256
.170 .182 .198 .279 .235 .281
.181 .187 .236 .269 .260 .258
;
Lag and the rest was made possible using an IF THEN statement and the crude method of input:
data AVCue;
set AVCue;
IF i=1 THEN Lag=5;
IF i=2 THEN Lag=10;
IF i=3 THEN Lag=15;
input obs_no treatment;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 1
8 2
9 3
10 4
11 5
12 6
13 1
14 2
15 3
16 4
17 5
18 6
;
proc print data=AVCue;
run;
The IF THEN should be fine, but the input statement here is just in my opinion counterproductive, and defeats the purpose of using loops, which is to me, to save time. If done this way, I might as well just put the data into excel and import it, or type everything out with ample copy and paste of the text in the
input obs_no treatment;
cards;
section.
My coding knowledge is basic, so sorry if this question sounds silly, I want to know:
1. How would I make a list of numbers using the "do" loops in SAS? I've made several attempts and all I get is a list containing the next number. I know why this happens, the loop counts to x and the value assigned would just be x. I just don't know how to get around that. Somehow this didn't happen in the datalines section, I guess SAS knows there's 18 numbers and the entry i is stored accordingly... or something?
2. How would I go about assigning in this case, the numbers 1 to 6 to each entry?
Thanks!
It is certainly much easier to read in the actual dataset instead of having to impute some of the variables based on the order the values have in the source data. You might be able to combine a SET statement and an INPUT statement in the same data step and get it to work, but it is probably NOT worth the effort. Just make two datasets and merge them.
Looking at the photograph you posted it looks like TREATMENT is not an independent variable. Instead it is just a label for the combination of CUE and LAG. To make it cycle from 1 to 6 just reset it back to 1 when it gets too large.
data AVCue;
do cue = 'Auditory','Visual';
do lag= 5, 10, 15 ;
treatment+1;
if treatment=7 then treatment=1;
obsno+1;
input AVCue ##;
output;
end;
end;
datalines;
.204 .167 .202 .257 .283 .256
.170 .182 .198 .279 .235 .281
.181 .187 .236 .269 .260 .258
;
You can get in trouble if you just let SAS guess at how you want to define your variables. For example if you change the order of the CUE values do cue = 'Visual','Auditory'; then SAS will make CUE with length $5 instead of $8. Add a LENGTH statement to define your variables before you use them.
length obsno 8 treatment 8 cue $8 lag 8 AVCue 8 ;
This will also let you control the order they are created in the dataset.
If you really did already have a SAS dataset and you wanted to add a variable like TREATMENT that cycled from 1 to 6 (or really any DO loop construct) then could nest the SET statement inside the DO loop. Just remember to add the explicit OUTPUT statement.
data new ;
do treatment=1 to 6 ;
set old;
output;
end;
run;
Sorry for the wording on the question as it's not brilliantly clear but this is what I'm trying to do.
I have an array containing 7 pipe separated values in each element. I need to count the combinations but not purely all unique combinations (there would be 896 of those).
Here is what the fields can be
Field1 = 001 or 002
Field2 = N or C
Field3 = Y or N
Field4 = Y or N
Field5 = Y or N
Field6 = C or not C
Field7 = (L1 or blank) or (not L1 or blank) (i.e. a binary value as it is either one of those 2 or it isn't one of those 2)
If I've got the math correct, by switching them all to, effectively, a binary choice, I have reduced my combinations down to 128 as I can happily then split it into 2 subsets where field 2 is N or C and for the C ones I just need separate counts of the 2 possible values of Field 1 (i.e. 001 or 002) that would then reduce my further combinations down to 64 however, that's where I get stuck. For the two where field 2 is C I will just do something like,
$count1 = ($data|Select-String "^001\|C" | Measure-Object).count
So I would assume there is a way I could then do a a similar command to get the other counts, I just don't know how to handle Field 6 and 7.
When I know the values I can happily do the select but in the case of Field 6 I don't know how to do the "not C" and in the case of Field 7 where one of two values is one count and not one of those two values is the other then I'm struggling.
I have a table that has the following structure
[zip] = <zip, nvarchar(4),>
[cityZipID] = <cityZipID, int,>
In the zip column there is a string containing 4 digits and this is a number between 1000 an 1239 gut stored as a string.
For some reason I need to calculate an other value out of this so I need to convert the string into an integer and store it into an other column called cityZipID. I want to do this using SQL Server Management Studio because it has to convert about 32000 lines so I cannot easily do it by hand.
I tried the following but get only an error message when trying to execute it
UPDATE [MyTestData].[dbo].[Addresses]
SET [cityZipID] = ((int)[zip])/10 -100
WHERE [city] = 'Wien'
The column of cityZipID is null in the moment and should be filled with numbers for the districts like the plzl for the first district is 1010 the 12th district is 1120 So the calculation would result in 1120 / 10 = 112 -100 = 12 and this would be the wanted result.
Any help would be appreciated.
Jonny
Try this query
UPDATE [MyTestData].[dbo].[Addresses]
SET [cityZipID] = (Convert(Int,[zip])/10) -100
WHERE [city] = 'Wien'
I have a vcf file like this:
http://www.1000genomes.org/node/101
Here's the example from that site:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
After the header lines, each line has fields that contain genotypes starting with the 10th field. The 10th field is below the NA0001 heading; the 11th field is genotype NA0002, etc. I have a file with 123 different genotypes, so going from position 10 to 133 (NA0001 until NA0123). What is shown in these fields can be 0/0, 0/1, 0/2 .... till 8/9 for instance. Now I want to replace all the non-equal ones. So I would like to keep 0/0, 1/1, 2/2, etc. And replace 0/1, 0/2, 1/2, 4/5, 4/6 etc by ./.
I would like to write this in a C script. Thought about using sed y/regexp/replacement/ but no idea how to write all those unequal values in a regular expression. And on other positions in the file there could also be these values, so really only positions 10 till 133 should be replaced. And it needs to be replaced; I will be needing the rest of the file with the new values.
Hope it is clear. Anyone any idea how to do this?
This regex should do what you want: \s(\d)[|\/](?!\1)\d: Replace matches with ./.:
Breakdown:
\s(\d) matches a space followed by a single digit, capturing the digit in capture group #1
[|\/] matches a pipe or slash (since it seems that the VCF format allows either)
(?!\1)\d uses a negative lookahead to ensure that the next character is not the same as capture group #1, and matches the digit
Caveats:
I matched a leading space and trailing : to try to ensure it matches only the intended values. I couldn't work out a good way to limit it to fields 10 and after.
Example using perl:
perl -pe 's#\s(\d)[|/](?!\1)\d:# ./.:#g' testfile.vcf > testfile_afterchange.vcf
Note: I used # as the delimiter to avoid having to escape the / characters in the regex.