I need to delete leading 0s from a string. I found that there is no built-in function like LTRIM as in C.
I'm thinking of the below AWK script to do that:
awk -F"," 'BEGIN { a[$1] }
for (v in a) {
{if ($v == 0) {delete a[$v]; print a;} else exit;}
}'
But I guess I'm not declaring the array correctly, and it throws an error. Sorry, I'm new to AWK programming. Can you please help me put it together?
Using awk, as requested:
#!/usr/bin/awk -f
/^0$/ { print; next; }
/^0*[^0-9]/ { print; next; }
/^0/ { sub("^0+", "", $0); print; next; }
{ print $0; }
This provides for not trimming a plain "0" to an empty string, as well as avoiding the (probably) unwanted trimming of non-numeric fields. If the latter is actually desired behavior, the second pattern/action can be commented out. In either case, substitution is the way to go, since adding a number to a non-numeric field will generate an error.
Input:
0
0x
0000x
00012
Output:
0
0x
0000x
12
Output trimming non-numeric fields:
0
x
x
12
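Assembled as a one-liner (a minimal sketch feeding the sample input via printf), the script behaves as shown above:

```shell
# three pattern/action pairs: keep a lone "0", keep non-numeric fields,
# otherwise strip the leading zeros
printf '0\n0x\n0000x\n00012\n' | awk '
/^0$/ { print; next }
/^0*[^0-9]/ { print; next }
/^0/ { sub("^0+", "", $0); print; next }
{ print }'
```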
Here is a somewhat generic ltrim function that can be called as ltrim(s) or ltrim(s,c), where c is the character to be trimmed (assuming it is not a special regex character) and where c defaults to " ":
function ltrim(s,c) {if (c==""){c=" "} sub("^" c "*","",s); return s}
This can be called with 0, e.g. ltrim($0,0)
NOTE:
This will work for some special characters (e.g. "*"), but if you want to trim special characters, it would probably be simplest to call the appropriate sub() function directly.
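A short sketch of both cases: ltrim handles an ordinary character like "0", while a regex-special character like "." is simplest to handle with a direct, escaped sub() call:

```shell
# ltrim handles ordinary characters such as "0" fine:
echo "000abc" | awk '
function ltrim(s,c) {if (c==""){c=" "} sub("^" c "*","",s); return s}
{ print ltrim($0, 0) }'
# for a regex-special character like ".", call sub() directly with an escaped pattern:
echo "...abc" | awk '{ sub(/^\.+/, ""); print }'
```

Both commands print abc.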
Based on other recent questions you posted, you appear to be struggling with the basics of the awk language.
I will not attempt to answer your original question, but instead try to get you on the way in your investigation of the awk language.
It is true that the syntax of awk expressions is similar to C. However, there are some important differences.
I would recommend that you spend some time reading a primer on awk and find some exercises. Try for instance the GNU Awk Getting Started guide.
That said, there are two major differences with C that I will highlight here:
Types
Awk only uses strings and numbers -- it decides based on context whether it needs to treat input as text or as a number. In some cases you may need to force conversion to string or to a number.
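A small sketch of those forced conversions (adding 0 yields a number; concatenating "" yields a string):

```shell
echo "007" | awk '{ print $1 + 0 }'   # numeric context: prints 7
echo "007" | awk '{ print $1 "" }'    # string context: prints 007
# context matters for comparisons too: compared as strings, "10" sorts before "9"
awk 'BEGIN { print ("10" < "9") }'    # prints 1 (true, string comparison)
```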
Structure
An Awk program always follows the same structure: a series of patterns, each followed by an action enclosed in curly braces:
pattern { action }
pattern { action }
.
.
.
pattern { action }
Patterns can be regular expressions or comparisons of strings or numbers.
If a pattern evaluates as true, the associated action is executed.
An empty pattern matches every line. The { action } part is optional; when omitted, it defaults to { print }.
A pattern with an empty action ({ }) will do nothing.
Some patterns like BEGIN and END get special treatment. Before reading stdin or opening any files, awk will first collect all BEGIN statements in the program and execute their associated actions in order.
It will then start processing stdin or any files given and subject each line to all other pattern/action pairs in order.
Once all input is exhausted, all files are closed, and awk will process the actions belonging to all END patterns, again in order of appearance.
You can use BEGIN actions to initialize variables. END actions are typically used to report summaries.
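A minimal sketch of that lifecycle, initializing in BEGIN, accumulating per line, and summarizing in END:

```shell
# sums the first column of three input lines and reports at the end
printf '3\n4\n5\n' | awk '
BEGIN { sum = 0 }                      # runs before any input is read
{ sum += $1 }                          # runs once per input line
END { print NR " lines, sum " sum }'   # runs after input is exhausted
```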
A warning: Quite often we see people trying to pass data from the shell by partially unquoting the awk script, or by using double quotes. Don't do this; instead, use the awk -v option to pass on parameters into the program:
a="two"
b="strings"
awk -v a="$a" \
    -v b="$b" \
'BEGIN {
print a, b
}'
two strings
You can force awk to convert the field to a number, and the leading zeros will then be eliminated.
e.g.
$ echo 0001 | awk '{print $1+0}'
1
If I understand correctly, you just want to trim the leading '0's from a value in bash. You can use sed for precise regex control, or a simple loop works well and eliminates spawning a subshell for the external utility call. For example:
var=00104
Using sed:
$ echo "$var" | sed 's/^0*//'
104
or using a here-string to eliminate the pipe and the additional subshell (bash only):
$ sed 's/^0*//' <<<$var
104
Using a simple loop with string indexes:
while [ "${var:0:1}" = '0' ]; do
var="${var:1}"
done
var will contain 104 following 2 iterations of the loop.
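As an aside, and not part of the answers above: with bash's extglob option enabled, parameter expansion can strip the zeros in a single step; a sketch:

```shell
var=00104
shopt -s extglob          # enable extended glob patterns like +(...)
echo "${var##+(0)}"       # strips the longest run of leading zeros: 104
# note: like the sed version, this reduces an all-zero string to empty
```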
Related
Sample input:
a 54 65 43
b 45 12 98
c 99 0 12
d 3 23 0
Sample output:
c,d
Basically I want to check if there's a value of zero in each line; if yes, print the index (a, b, c, d).
My code:
for(i=1;i<=NF;i++)if(i==0){print$1}
I got a syntax error.
Thanks.
another approach
$ awk '/\y0\y/{print $1}' file
c
d
\y is the word-boundary operator. Might be only in gawk.
The code needs a set of braces.
awk '{ for(i=1;i<=NF;i++)if($i==0) print $1}' filename
(The print doesn't need braces so I took those out.)
If the first field doesn't ever contain a number, maybe start the loop from 2.
The general form of an Awk script is a sequence of
condition { action }
pairs, where the latter needs braces around it. In the absence of a condition, an action is taken on each line, unconditionally.
To make your code work, you need change it to:
$ awk '{for(i=1;i<=NF;i++)if($i==0) print $1}' file
c
d
You need to put the code inside a block ({} pair).
You have to use $i instead of i in the if condition, $i means the ith column.
Although it's not needed here, it's better to add a space between the command and its parameter (print $1).
And it's better to improve it a little bit:
awk '{for(i=1;i<=NF;i++)if($i==0) {print $1;next}}' file
Add next to avoid print $1 multiple times when there're more than one 0 in the line.
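To see the difference, feed it a line containing two zeros; the index prints once rather than twice:

```shell
# "e 0 0 5" has two zero fields, but next moves on after the first match
printf 'c 99 0 12\ne 0 0 5\n' | awk '{for(i=1;i<=NF;i++)if($i==0){print $1;next}}'
```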
Given the columns are space separated, you can do it this way too:
awk '/( |^)0( |$)/{print $1}' file
This one does not require GNU awk.
/( |^)0( |$)/ is a RegEx, and in the command it's short for $0 ~ /( |^)0( |$)/.
^ means line beginnings, $ line endings here.
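Applied to the sample input from the question:

```shell
# matches a standalone 0 field, whether at the start, middle, or end of the line
printf 'a 54 65 43\nb 45 12 98\nc 99 0 12\nd 3 23 0\n' |
awk '/( |^)0( |$)/{print $1}'
```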
I have a destination.properties file:
Port:22
10.52.16.156
10.52.16.157
10.52.16.158
10.52.16.159
10.52.16.160
10.52.16.161
10.52.16.162
10.52.16.163
10.52.16.164
10.52.16.165
10.52.16.166
10.52.16.167
10.52.16.168
10.52.16.169
Port:61900-61999
10.52.16.156
10.52.16.157
10.52.16.158
10.52.16.159
10.52.16.160
10.52.16.161
10.52.16.162
10.52.16.163
10.52.16.164
10.52.16.165
10.52.16.166
10.52.16.167
10.52.16.168
10.52.16.169
I want to use an awk command to store all of the line numbers of lines that contain the word 'Port:' in an array.
I have the following command, which stores all of the line numbers in the 1st array value, i.e. array[0]:
array=$( (awk '/Port:/ {print NR}' destinations.prop) )
To get them in a shell array, you can do:
array=( $(awk '/Port:/ {print NR}' destinations.prop) )
The parentheses assign the words within to successive array members. As usual, IFS controls the splitting of that command output, and file name globbing also happens if you happen to output wildcard characters. That is probably not an issue in this case.
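As an aside (a sketch assuming bash 4+, with a trimmed-down destinations.prop created inline): mapfile reads one output line per array element and sidesteps IFS splitting and globbing entirely:

```shell
# a trimmed-down sample of the question's destinations.prop
printf 'Port:22\n10.52.16.156\n10.52.16.157\nPort:61900-61999\n10.52.16.156\n' > destinations.prop

# one array element per output line; -t strips the trailing newlines
mapfile -t array < <(awk '/Port:/ {print NR}' destinations.prop)
echo "${#array[@]}"    # 2 matching lines
echo "${array[@]}"     # their line numbers: 1 4
```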
I'm having a problem with a csh script that I'm writing.
What I want to do is read a file and, for every new line of this file that I'm reading, assign a new value to a certain variable that I'll use later.
To make it simple, I have a string array called "zn" that contains 6 values. I can print every value using something like:
echo ${zn[$i]}
Then i try to use the values of this array with something like (easy example just to explain):
cat file1 | awk '{i=NR;n='${zn[i]}';print n,$0}' >! file2
or other attempt:
cat file1 | awk '{n='${zn[NR]}';print n,$0}' >! file2
Well, I tried almost every possible combination of brackets, apostrophes, and quotes... and I always get some errors like:
missing -.
Any help would be really appreciated; the solution is probably something pretty easy and obvious.
(I'm sorry if my syntax is not the best, but I'm kind of new to this.)
EDIT:
I ported the script in bash
...this is part of the script I use to prepare some text files to prepare a graphic in GMT:
cat crosspdf.dat |
awk '
BEGIN { n = int(('$dz')/('$dz_new')) }
{
z=$1
for (i=6;i<=NF;i++) {
if ($i!=0) {
for(j=1;j<=n;j++)
print (i-4)*'$dv', z+(j-n/2)*'$dz_new', $i
}
}
}
' >! temp
This works: the only thing you need to know is that $dz was a constant value, and now I want to change it in order to have a different value for each line of the file I'm scanning. I can easily prepare the array with the values, but I'm not able to include it somehow in the previous line. PS: thanks for the support.
EDIT 2
1) dv and dz_new are just parameters
2) dz would be an array of variable length containing just numbers (depth intervals: something like -6.0 1.0 5.0 10.0 ... 36.0)
3) crosspdf.dat contains some histogram-like data: each line corresponds to a different depth (depths were equally spaced, but now they're not anymore, which is why I need to use the dz array)
Let's start by re-writing your script:
cat crosspdf.dat |
awk '
BEGIN { n = int(('$dz')/('$dz_new')) }
{
z=$1
for (i=6;i<=NF;i++) {
if ($i!=0) {
for(j=1;j<=n;j++)
print (i-4)*'$dv', z+(j-n/2)*'$dz_new', $i
}
}
}
'
to pass shell variable values to awk the right way and clean up the UUOC (useless use of cat). The above should be written as:
awk -v dv="$dv" -v dz="$dz" -v dz_new="$dz_new" '
BEGIN { n = int(dz/dz_new) }
{
z=$1
for (i=6;i<=NF;i++) {
if ($i!=0) {
for(j=1;j<=n;j++)
print (i-4)*dv, z+(j-n/2)*dz_new, $i
}
}
}
' crosspdf.dat
Now some questions you need to answer are: which of your shell variables (dv, dz, and/or dz_new) is it you want to have different values for each line of the input file? What are some representative values of those shell variables? What values could crosspdf.dat contain? What would your expected output look like?
Update your question to show some small sample of crosspdf.dat, some settings of your array variable(s), and the expected output given all of that.
Actually - maybe this is all the hint you need:
$ cat file
abc
def
ghi
$ cat tst.sh
dz="12 23 17"
awk -v dz="$dz" '
BEGIN{ split(dz,dzA) }
{ print dzA[NR], $0 }
' file
$ ./tst.sh
12 abc
23 def
17 ghi
Questions?
Suppose that I have an array of such data:
arr[0] = "someText1 (x,y,z) a"
arr[1] = "someText2 (x,y,z) b"
How can I sort this array lexicographically [only taking the text into account] using Bash?
Join on newline, pass to sort.
(IFS=$'\n'; sort <<<"${arr[*]}")
sort <<<"fnord" simply sends the string "fnord" as the standard input to sort; this is a Bash convenience notation for the clumsier echo "fnord" | sort (plus it avoids the extra process) and similarly, sort <<<"${arr[*]}" feeds the array to sort.
Because array pasting depends on the value of IFS, we change it to a newline so that "${arr[*]}" will result in a newline-separated list (the default IFS would cause the entries in the array to be expanded to a space-separated list). In order to not change IFS permanently, we do this in a subshell; hence, the enclosing parentheses.
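Putting it together with the question's data:

```shell
arr[0]="someText2 (x,y,z) b"
arr[1]="someText1 (x,y,z) a"
# subshell keeps the IFS change local; "${arr[*]}" joins the elements on the new IFS
(IFS=$'\n'; sort <<<"${arr[*]}")
```

This prints the two entries in lexicographic order, someText1 before someText2.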
The Bash manual page is rather dense, but it's all there; or see the Reference Manual.
One way is to implement your own sorting algorithm; bubble-sort is pretty simple.
Another way is to use an external program, such as sort, to do your sorting. Here is a shell function that takes the array elements as arguments, and saves a sorted copy of the array into a variable named $SORTED:
function sort_array () {
SORTED=()
local elem
while IFS= read -r -d '' elem ; do
SORTED+=("$elem")
done < <(printf '%s\0' "$@" | sort -z)
}
(Note the use of null bytes as a delimiter, rather than newlines, so that your array elements are unrestricted. This is achieved by the -d '' option to read, the \0 in the printf format-string, and the -z option to sort.)
It can be used like this:
arr=('a b c' 'd e f' 'b c d' 'e f g' 'c d e')
printf '%s\n' "${arr[@]}" # prints elements, one per line
sort_array "${arr[@]}"
arr=("${SORTED[@]}")
printf '%s\n' "${arr[@]}" # same as above, but now it's sorted
This code lives in modules, but you could just include the needed functions from the companion file array.sh to make it complete:
https://github.com/konsolebox/bash-library/blob/master/array/sort.sh
The function is customizable like producing elements or indices, and specializing on strings or integers. Just try to use it.
And one thing, it doesn't depend on external binaries like sort, and doesn't cause possible reinterpretation of data.
I'm processing headers of a .fasta file (which is a file universally used in genetics/bioinformatics to store DNA/RNA sequence data). Fasta files have headers starting with a > symbol (which gives specific info), followed by the actual sequence data on the next line that the header describes. The sequence data extends indefinitely until the next \n after which is followed the next header and its respective sequence. For example:
>scaffold1.1_size947603
ACGCTCGATCGTACCAGACTCAGCATGCATGACTGCATGCATGCATGCATCATCTGACTGATG....
>scaffold2.1_size747567.2.603063_605944
AGCTCTGATCGTCGAAATGCGCGCTCGCTAGCTCGATCGATCGATCGATCGACTCAGACCTCA....
and so on...
So, I have a problem with the fasta headers of the genome for the organism with which I am working. Unfortunately the perl expertise needed to solve this problem seems to be beyond my current skill level :S So I was hoping someone on here could show me how it can be done.
My genome consists of about 25000 fasta headers and their respective sequences, the headers in their current state are giving me a lot of trouble with sequence aligners I am trying to use, so I have to simplify them significantly. Here is an example of my first few headers:
>scaffold1.1_size947603
>scaffold10.1_size550551
>scaffold100.1_size305125:1-38034
>scaffold100.1_size305125:38147-38987
>scaffold100.1_size305125:38995-44965
>scaffold100.1_size305125:76102-78738
>scaffold100.1_size305125:84171-87568
>scaffold100.1_size305125:87574-89457
>scaffold100.1_size305125:90495-305068
>scaffold1000.1_size94939
Essentially I would like to refine these to look like this:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a
Or perhaps even this (but this seems like it would be more complicated):
>scaffold1.1
>scaffold10.1
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1
What I'm doing here is getting rid of all the size data for each scaffold of the genome. For scaffolds that happen to be fragmented, I'd like to denote them with a,b,c,d etc. There are a few scaffolds with more than 26 fragments so perhaps I could denote them with x, y, z, A, B, C, D .... etc..
I was thinking to do this with a simple replace foreach loop similar to this:
#!/usr/bin/perl -w
### Open the files
$gen = './Hc_genome/haemonchus_V1.fa';
open(FASTAFILE, $gen);
@lines = <FASTAFILE>;
#print @lines;
###Add an @ symbol to the start of the label
my @refined;
foreach my $lines (@lines){
chomp $lines;
$lines =~ s/match everything after .1/replace it with a, b, c.. etc/g;
push @refined, $lines;
}
#print @refined;
###Push the array on to a new fasta file
open FILE3, "> ./Hc_genome/modded_haemonchus_V1.fa" or die "Cannot open output.txt: $!";
foreach (@refined)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
But I don't know how to build in the alphabetical label additions between the $1 and the \n in the match-and-replace operator, essentially because I'm not sure how to step sequentially through the alphabet for each fragment of a particular scaffold (all I could manage is to add an a to the start of each one...).
Please if you don't mind, let me know how I might achieve this!
Much appreciated!
Andrew
In Perl, the increment operator ++ has “magical” behaviour with respect to strings. E.g. my $a = "a"; $a++ increments $a to "b". This goes on until "z", where the increment will produce "aa", and so forth.
The headers of your file appear to be properly sorted, so we can just loop through each header. From the header, we extract the starting part (everything up to including the .1). If this starting part is the same as the starting part of the previous header, we increment our sequence identifier. Otherwise, we set it to "a":
use strict; use warnings; # start every script with these
my $index = "a";
my $prev = "";
# iterate over all lines (rather than reading all 25E3 into memory at once)
while (<>) {
# pass through non-header lines
unless (/^>/) {
print; # comment this line to remove non-header lines
next;
}
s/\.1\K.*//s; # remove everything after ".1". Implies chomping
# reset or increment $index
if ($_ eq $prev) {
$index++;
} else {
$index = "a";
}
# update the previous line
$prev = $_;
# output new header
print "$_$index\n";
}
Usage: $ perl script.pl <./Hc_genome/haemonchus_V1.fa >./Hc_genome/modded_haemonchus_V1.fa.
It is considered good style to write programs that accept input from STDIN and write to STDOUT, as this improves flexibility. Rather than hardcoding paths in your perl script, keep your script general, and use shell redirection operators like < to specify the input. This also saves you the hassle of manually opening the files.
Example Output:
>scaffold1.1a
>scaffold10.1a
>scaffold100.1a
>scaffold100.1b
>scaffold100.1c
>scaffold100.1d
>scaffold100.1e
>scaffold100.1f
>scaffold100.1g
>scaffold1000.1a