I have a VCF file in the format described here:
http://www.1000genomes.org/node/101
Here's the example from that site:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
After the header lines, each line contains genotype fields starting at the 10th field. The 10th field is below the NA00001 heading, the 11th field is the genotype for NA00002, and so on. My file has 123 different genotypes, so they run from position 10 to 133 (NA00001 to NA00123). The values in these fields can be 0/0, 0/1, 0/2, and so on up to 8/9, for instance. I want to replace every genotype whose two alleles differ: keep 0/0, 1/1, 2/2, etc., and replace 0/1, 0/2, 1/2, 4/5, 4/6, etc. with ./.
I would like to write this in a C script. I thought about using sed s/regexp/replacement/ but have no idea how to write all those unequal values as a regular expression. These values can also occur at other positions in the file, so only positions 10 through 133 should be changed. The values really need to be replaced in the file; I will need the rest of the file together with the new values.
I hope that is clear. Does anyone have an idea how to do this?
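This regex should do what you want: \s(\d)[|\/](?!\1)\d: (the trailing colon is part of the pattern). Replace each match with ./.: preceded by the whitespace character that was matched.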
Breakdown:
\s(\d) matches a space followed by a single digit, capturing the digit in capture group #1
[|\/] matches a pipe or slash (since it seems that the VCF format allows either)
(?!\1)\d uses a negative lookahead to ensure that the next character is not the same as capture group #1, and matches the digit
Caveats:
I matched a leading space and trailing : to try to ensure it matches only the intended values. I couldn't work out a good way to limit it to fields 10 and after.
Example using perl:
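perl -pe 's#\s\K(\d)[|/](?!\1)\d:#./.:#g' testfile.vcf > testfile_afterchange.vcf
Note: I used # as the delimiter to avoid having to escape the / characters in the regex. The \K after \s keeps the matched whitespace character (a tab in a real VCF file) out of the replaced text, so the field separator is left untouched.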
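On the caveat about limiting the change to fields 10 and after: one option is to split each record into fields rather than rely on a single regex. The following is a minimal Python sketch of that idea, not the Perl one-liner itself. It assumes the file is tab-delimited and that GT is the first sub-field of every sample column; the function name mask_heterozygous and the parameter first_sample_col are made up for illustration.

```python
import re

# Genotype at the start of a sample field: allele, separator, allele,
# followed by ':' or the end of the field.
GT = re.compile(r"^(\d+)[|/](\d+)(?=:|$)")


def mask_heterozygous(line, first_sample_col=10):
    """Replace unequal genotypes with ./. in sample columns only (hypothetical helper)."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    body = line.rstrip("\n")
    newline = line[len(body):]  # keep the original line ending, if any
    fields = body.split("\t")
    for i in range(first_sample_col - 1, len(fields)):
        m = GT.match(fields[i])
        if m and m.group(1) != m.group(2):
            # Keep everything after the genotype (e.g. :GQ:DP:HQ values).
            fields[i] = "./." + fields[i][m.end():]
    return "\t".join(fields) + newline
```

Applied to every line of the input file and written back out, this leaves columns 1 to 9 untouched even if they happen to contain text such as 1/2.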
I am trying to search through text and find every instance of "Z" followed by a number. If the number is 40 or higher, then it will be replaced with 32.
So for example
N170G00Z58
N280G81X9.1787Y15.1981Z2.3803R4.6F.75L0.0
N300G00Z15.0
N580G03X-12.125Y6.7311Z52.775I-12.5J6.7311F35.0
Would produce
N170G00Z32
N280G81X9.1787Y15.1981Z2.3803R4.6F.75L0.0
N300G00Z15.0
N580G03X-12.125Y6.7311Z32I-12.5J6.7311F35.0
We are only looking at and changing the Z values.
I have tried the following code, but it removes all Z values instead.
The "%VarOne%%MS201%" is just the file I output previously, which I am using as the source.
set "INTEXTFILE=%VarOne%%MS201%"
for /f "delims=Z*" %%a in ('type "%INTEXTFILE%"') do (
SET s=%%a
IF s GTR Z40 SET s=!s:Z32!
echo !s!>>new.txt
)
I need to do this with other values as well (any Y value over 40 needs changed to "Y40"), so hopefully, the solution is expandable and understandable by me. I am fully aware that I do not fully know what I am doing, but I am trying.
One possible solution is using the batch/JScript hybrid JREPL.BAT with the command line:
call "%~dp0jrepl.bat" "Z(?:[1-9][0-9]{2,}|[4-9][0-9])(?:\.[0-9]+)?" "Z32" /F "%VarOne%%MS201%" /O New.txt
Use - instead of New.txt after /O to apply the replacements directly to the file whose name is defined by %VarOne%%MS201%.
jrepl.bat is referenced here via %~dp0, the path of the batch file containing this command line and the definitions of the environment variables VarOne and MS201. This means jrepl.bat must be in the same directory as that batch file.
The search expression Z(?:[1-9][0-9]{2,}|[4-9][0-9])(?:\.[0-9]+)? means:
Z ... matches this letter literally; the search is case-sensitive.
(?:...) ... is a non-marking group used here for an OR expression.
[1-9][0-9]{2,} ... after Z there must be a digit in the range 1 to 9 followed by two or more digits in the range 0 to 9. So this expression matches numbers from 100 upward, with no upper limit.
| ... means OR; a second expression is needed for numbers lower than 100 after Z.
[4-9][0-9] ... matches a number with exactly two digits, where the first digit must be in the range 4 to 9 and the second digit can be in the range 0 to 9. So this expression matches numbers in the range 40 to 99.
(?:...)? ... is once again a non-marking group, used here to apply the multiplier ? to the entire expression inside the group, which means it is applied zero times or exactly once. In other words, the expression inside this group is optional.
\.[0-9]+ ... matches a dot, escaped with a backslash so that it is interpreted as a literal character, followed by one or more digits in the range 0 to 9. This optionally applied expression matches the decimal point and the decimal places of a floating point value.
For a replacement of all Z values with value 32 or higher the group with the OR expression must be extended by one more expression:
call "%~dp0jrepl.bat" "Z(?:[1-9][0-9]{2,}|[4-9][0-9]|3[2-9])(?:\.[0-9]+)?" "Z32" /F "%VarOne%%MS201%" /O New.txt
|3[2-9] ... is a third OR expression matching numbers in range 32 to 39.
So the three expressions in the OR group match numbers 100 or higher, 40 to 99 and 32 to 39 as integers or as floating point values with a decimal point and one or more decimal places with the optionally applied expression in second non-marking group.
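Because the question asks for something that can be extended to other letters and limits, here is a rough Python sketch of the same replacement done with a numeric comparison instead of a digit-pattern OR group. It is a different technique from the JREPL command line, and the function name clamp_axis and its parameters are assumptions made for illustration.

```python
import re


def clamp_axis(text, axis="Z", limit=40.0, replacement="32"):
    """Replace every <axis><number> whose number is >= limit (hypothetical helper)."""
    # The axis letter followed by an unsigned integer or decimal number.
    pattern = re.compile(re.escape(axis) + r"(\d+(?:\.\d+)?)")

    def repl(match):
        if float(match.group(1)) >= limit:
            return axis + replacement
        return match.group(0)  # below the limit: keep the original text

    return pattern.sub(repl, text)


print(clamp_axis("N170G00Z58"))
print(clamp_axis("N580G03X-12.125Y6.7311Z52.775I-12.5J6.7311F35.0"))
```

Like the search expression above, this ignores negative values such as Z-50. Under these assumptions, the Y case from the question would be a second call such as clamp_axis(text, axis="Y", limit=40.0, replacement="40"); whether the comparison should be >= or > depends on what "over 40" is meant to include.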
I'm trying to define, using the data in the products.txt file, a data set with the delimiter *.
products.txt data:
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
I tried to use the delimiter *:
data produse;
infile '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
run;
The DSD option changes the space into a comma (,). I want the option that changes the space into *.
The DSD option, in addition to the other things it does, changes the DEFAULT delimiter from space to comma. But you can override the default delimiter to any list of characters you want by using the DLM= (also known as DELIMITER=) option, whether or not you are using the DSD option.
From the comments it sounds like you just want to do text manipulation. Just change the spaces to stars. Make sure to remove any trailing spaces (unless you want those to also become stars).
data _null_;
infile '/home/u47505185/produse.txt';
input;
file '/home/u47505185/produse_star.txt';
_infile_=translate(trimn(_infile_),'*',' ');
put _infile_;
run;
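For comparison, the same text manipulation can be sketched outside SAS. The Python snippet below is only an illustration of the step above (drop trailing whitespace, then turn every remaining space into a star); the function name spaces_to_stars is made up.

```python
def spaces_to_stars(line):
    """Strip trailing whitespace, then replace each space with '*' (hypothetical helper)."""
    return line.rstrip().replace(" ", "*")


for row in ["hartie 2 birotica   ", "tricou 100 haine"]:
    print(spaces_to_stars(row))
```

As with TRANSLATE, every single space becomes a star, so two consecutive spaces in the input produce two consecutive stars.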
To display missing numeric values as an asterisk (*) in output or data viewers, use this setting:
OPTIONS MISSING='*';
The INFILE DLM= option is for specifying what character(s) in the data file are to be used to separate the variables being INPUT.
DLM does NOT specify a replacement value for missing values.
You told SAS to use * as a field separator.
So what is happening? The LOG will tell you. Essentially, Nume was read as an 8-character variable (the default length) and the delimiter never appeared. So Pret, a numeric variable, had nothing to be read from and was assigned a missing value. When viewed in output or a data viewer, the value appears as a period (.).
data want;
infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
input Nume $ Pret Categorie $;
datalines;
hartie 2 birotica
creione 10 birotica
apa 6 alimente
ceai 8 alimente
tricou 100 haine
;
Log
25 data want;
26 infile datalines dlm='*'; * '/home/u47505185/produse.txt' dlm='*';
27 input Nume $ Pret Categorie $;
28 datalines;
NOTE: Invalid data for Pret in line 30 1-80.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+--
31 apa 6 alimente
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=hartie 2 Pret=. Categorie=apa 6 al _ERROR_=1 _N_=1
NOTE: Invalid data for Pret in line 33 1-80.
NOTE: LOST CARD.
34 ;
NOTE: Invalid data errors for file CARDS occurred outside the printed range.
NOTE: Increase available buffer lines with the INFILE n= option.
Nume=ceai 8 a Pret=. Categorie= _ERROR_=1 _N_=2
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.WANT has 1 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
By default, what is shown to you when a value is missing?
Numeric variables, . or the current setting for session option MISSING="<one-char>"
Character variables, a blank. The missing value for character variables is a single space.
I would like to check if a specific column in one of my tables meets the following conditions:
String must contain at least three characters
String must contain at least two different numbers [e.g. 123 would work but 111 would not]
Characters which are allowed in the string:
Numbers (0-9)
Uppercase letters
Lowercase letters
Underscores (_)
Dashes (-)
I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?
What I have so far:
SELECT string,
LENGTH(string) AS length
FROM tbl
WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
ORDER BY length;
Thanks!
Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:
Must start with one or more digits, repeated 3 or more times
The confusing part to me is that '+' is itself a quantifier, so it should not be quantifiable with {3,}, yet somehow it doesn't produce an error for me
Optionally followed by either a dash or plus sign
Followed by an uppercase character zero or one times (giving back as needed)
Followed by and ending with a lowercase character zero or one times (giving back as needed)
Questions
You say that your string must contain at least 3 characters and at least 2 different numbers. Numbers are characters, but I'm not sure whether you mean 3 letters...
Are you considering the numbers to be characters?
Does the order of the characters matter?
Can you provide an example of the error you are receiving?
Notes
Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.
One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.
Example
Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.
Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.
select randstr(36,random(123)) as r_string
,length(r_string) AS length
,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
,regexp_substr(r_string,'(\\d)',1,1) as first_digit
,regexp_substr(r_string,'(\\d)',1,2) as second_digit
,first_digit <> second_digit as digits_1st_not_equal_2nd
,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
from table(generator(rowcount => 10))
//where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
// and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
// and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
;
Assuming the digits need to be contiguous, you can use a JavaScript UDF to find the number in a string with the largest number of distinct digits:
create or replace function f(S text)
returns float
language javascript
returns null on null input
as
$$
const m = S.match(/\d+/g)
if (!m) return 0
const lengths = m.map(m=> [...new Set (m.split(''))].length)
const max_length = lengths.reduce((a,b) => Math.max(a,b))
return max_length
$$
;
Combined with WHERE-clause, this does what you want, I believe:
select column1, f(column1) max_length
from t
where max_length>1 and length(column1)>2 and column1 rlike '[\\w\\d-]+';
Yielding:
COLUMN1 | MAX_LENGTH
------------------------+-----------
abc123def567ghi1111_123 | 3
123 | 3
111222 | 2
Assuming this input:
create or replace table t as
select * from values ('abc123def567ghi1111_123'), ('xyz111asdf'), ('123'), ('111222'), ('abc 111111111 abc'), ('12'), ('asdf'), ('123 456'), (null);
The function is even simpler if the digits don't have to be contiguous (i.e. count the distinct digits in a string). Then core logic changes to:
const m = S.match(/\d/g)
if (!m) return 0
const length = [...new Set (m)].length
return length
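To make the three requirements from the question concrete, here is a small Python sketch of the non-contiguous interpretation (at least three characters, only the allowed characters, at least two different digits anywhere in the string). It runs outside Snowflake and is only meant to show the logic; the name is_valid is an assumption.

```python
import re

# Only digits, ASCII letters, underscores and dashes; at least three characters.
ALLOWED = re.compile(r"[A-Za-z0-9_-]{3,}")


def is_valid(s):
    """Check length, allowed characters, and two distinct digits (hypothetical helper)."""
    if ALLOWED.fullmatch(s) is None:
        return False
    distinct_digits = set(re.findall(r"[0-9]", s))
    return len(distinct_digits) >= 2


for value in ["123", "111", "abc123def567ghi1111_123", "123 456", "12"]:
    print(value, is_valid(value))
```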
Hope that's helpful!
I have this string:
{"name": "Fancy HaXXor123Name","profession": 1,"race": 2,"map_id": 1052,"world_id": 268435461,"team_color_id": 0,"commander": false,"fov": 0.768}
I want to get an array back which includes the following information (from left to right from the string):
Fancy HaXXor123Name
1
2
1052
268435461
0
false
0.768
I experimented with RegexBuddy and got a promising pattern, which looks like this:
(\d{1,}).(\d{1,})|(\d{1,})|(?i)"(.*?)"
This is what I got back
name
Fancy HaXXor123Name
profession
1
race
2
map_id
10
2
world_id
2684354
1
team_color_id
0
commander
fov
0
768
So there are large gaps between the pieces of information, the numbers are torn apart, and the false is missing. I can't fix this problem and I'm completely new to StringRegExp.
I'm using AutoIt, which uses the PCRE regex engine (at least, that is what I think).
You may use a regex like the following:
"\s*:\s*(?:"\K[^"]*|\K[^][\s,{}]+)
See the regex demo
Details:
"\s*:\s* - a literal ", 0+ whitespaces, :, 0+ whitespaces
(?:"\K[^"]*|\K[^][\s,{}]+) - A non-capturing group matching 2 alternatives:
"\K[^"]* - a ", then \K zeros the text matched so far, and then matches 0+ chars other than " with [^"]*
\K[^][\s,{}]+ - \K drops the text matched so far, and [^][\s,{}]+ matches 1+ chars other than ], [, whitespace, ,, { and }.
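For readers who want to try the idea in an engine without \K, the same values can be collected with capture groups instead. The Python sketch below uses that substitute technique (Python's re module does not support \K), so it is not the AutoIt pattern itself; the function name extract_values is made up for illustration.

```python
import re

# Group 1: contents of a quoted value. Group 2: an unquoted value
# (anything up to whitespace, a comma, a bracket or a brace).
VALUE = re.compile(r'"\s*:\s*(?:"([^"]*)|([^\]\[\s,{}]+))')


def extract_values(text):
    """Return the values that follow each key, in order (hypothetical helper)."""
    values = []
    for m in VALUE.finditer(text):
        values.append(m.group(1) if m.group(1) is not None else m.group(2))
    return values


sample = ('{"name": "Fancy HaXXor123Name","profession": 1,"race": 2,'
          '"map_id": 1052,"world_id": 268435461,"team_color_id": 0,'
          '"commander": false,"fov": 0.768}')
print(extract_values(sample))
```

All values come back as strings, in the order in which they appear in the input.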
I am writing a program that has to split a phone number into its parts. The parts are separated by a hyphen (or a space, or '( )', or nothing).
Example: Input: 0xx-xxxx-xxxx or 0xxxxxxxxxx or (0xx)xxxx-xxxx
Output: code 1: 0xx
code 2: xxxx
code 3: xxxx
But my problem is that sometimes "Code 1" is just 0x, so "Code 2" must be xxxxx (the 1st part always has a hyphen or parentheses when it is 2 digits long).
Can anyone give me a hand? I would be grateful.
According to your comments, the following regex will extract the information you need
^\(?(0\d{1,2})\)?[- ]?(\d{4,5})[- ]?(\d{4})$
Break down:
^\(?(0\d{1,2})\)? matches 0x, 0xx, (0xx) and (0x) at the beginning of the string
[- ]? as parenthesis can only be used for the first group, the only valid separators left are space and the hyphen. ? means 0 or 1 time.
(\d{4,5}) will match the second group. As the length of the 3rd group is fixed (4 digits), the regex engine will work out the lengths of groups 1 and 2 by backtracking.
(\d{4})$ matches the 4 digits at the end of the number.
See it in action
You can then extract the data from capture groups 1, 2 and 3.
Note: As mentioned in the comments on the original post, this is meant only for extracting data from correctly formed numbers. It will also match some ill-formed numbers.
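As a quick way to try the pattern, the sketch below applies it in Python to a few sample numbers in the formats from the question. The helper name split_phone and the sample numbers are invented for illustration.

```python
import re

PHONE = re.compile(r"^\(?(0\d{1,2})\)?[- ]?(\d{4,5})[- ]?(\d{4})$")


def split_phone(number):
    """Return (code 1, code 2, code 3), or None if the pattern does not match."""
    m = PHONE.match(number)
    return m.groups() if m else None


for number in ["012-3456-7890", "01234567890", "(012)3456-7890", "01-23456-7890"]:
    print(number, split_phone(number))
```

For the number without separators, the engine first tries five digits for the second group, finds only three left for the last group, and backtracks to four.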