How to replace newline and ^M characters with ~~ in a file on Unix

I am not good at Unix.
I have a CSV file with multiple columns, one of which contains newline and ^M characters. I need to replace all of them between two " (which delimit a single cell value) with ~~ so that I can treat the cell value as a single field. Here is the sample file:
"id","notes"
"N001","this is^M
test.
Again test
"
"N002","this is perfect"
"N00345","this is
having ^M
problem"
I need the file to look like this:
"id","notes"
"N001","this is~~test.~~~~Again test~~~~"
"N002","this is perfect"
"N00345","this is~~~~having ~~problem"
So that the whole cell value can be read as a single field value.
I need to add one more case to this requirement, where the data within a cell itself contains " (double quotes). In this case we can identify the closing " because it is followed by a comma. Here is the updated sample data:
"id","notes"
"N001","this is^M
test. "Again test."
Again test
"
"N002","this is perfect"
"N00345","this is
having ^M
problem as it contains "
test"
We can keep the embedded " or remove it. The expected output is:
"id","notes"
"N001","this is~~test. "Again test."~~~~Again test~~~~"
"N002","this is perfect"
"N00345","this is ~~~~having ~~problem as it contains "~~test"

Try using sed
sed -i -e 's/^M//g' -e '/"$/!{:a N; s/\n/~~/; /"$/b; ba}' file
Note : To enter ^M, type Ctrl+V followed by Ctrl+M
File content after running command
"id","notes"
"N001","this is~~test.~~~~Again test~~~~"
"N002","this is perfect"
"N00345","this is~~~~having ~~problem"
Or using dos2unix followed by sed
dos2unix file
sed -i '/"$/!{:a N; s/\n/~~/; /"$/b; ba}' file
Short Description
The idea is to join every line that does not end with " onto the following lines, replacing each newline with ~~, until a line ending with " is reached.
sed -i ' # -i specifies in-place replace i.e. modifies file itself
/"$/!{ # if a line doesn't contain end pattern, " at the end of a line, then do following
:a # label 'a' for branching/looping
N; # append the next line of input into the pattern space
s/\n/~~/; # replace newline character '\n' with '~~' i.e. suppress new lines
/"$/b; # if a line contains end pattern then branch out i.e. break the loop
ba # branch to label 'a' i.e. this will create loop around label 'a'
}
' file # input file name
Refer to man sed for further details
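For comparison, the same joining idea can be sketched with awk instead of a sed loop. This is an illustrative alternative, not the command from the answer above. The function name join_cells is made up for the example, and the sketch assumes cells never contain embedded double quotes, so a record is treated as complete once it holds an even number of quotes. It emits one ~~ per joined newline after deleting the ^M characters.

```shell
#!/usr/bin/env bash
# join_cells FILE: hypothetical helper, prints FILE with multi-line cells joined by ~~.
# Assumption: no embedded " inside cells, so a complete record has an even quote count.
join_cells() {
  tr -d '\r' < "$1" | awk '
    {
      # Append the current line to the pending record, separated by ~~.
      buf = (pending ? buf "~~" $0 : $0)
      # gsub with "&" leaves buf unchanged and returns the number of quotes.
      quotes = gsub(/"/, "&", buf)
      if (quotes % 2 == 0) { print buf; buf = ""; pending = 0 }
      else                 { pending = 1 }
    }
    END { if (pending) print buf }'
}
```

Called as join_cells file > file.new, it leaves the original untouched, which makes it easy to compare the result before overwriting anything.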
EDIT
Sometimes data in the cell itself contains " within it.
Using sed
sed -i ':a N; s/\n/~~/; $s/"~~"/"\n"/g; ba' file
File content after running command for updated case data
"id","notes"
"N001","this is~~test. "Again test."~~~~Again test~~~~"
"N002","this is perfect"
"N00345","this is~~~~having ~~problem as it contains "~~test"
Using perl one-liner
perl -0777 -i -pe 's/\n/~~/g; s/"~~("|$)/"\n$1/g;' file

You can do this using the sed command.
To replace '^M' alone:
sed -i 's|^M|~~|g' file_name
Edit: to replace '^M' followed by a newline:
sed -i ':a;N;$!ba;s|^M\n|~~|g' file_name
To type '^M' in the console, press Ctrl+V followed by Ctrl+M.

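Use tr. Note that tr maps single characters one-for-one, so it can turn each ^M into one ~ but not into ~~:
$ tr '\r' '~' < file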

sed 's/\^M/~~/;t nextline
b
: nextline
N
s/\n/~~/
s/^[^"]*\("[^"]*"\}\{1,\}[^"]*$
t
b nextline
"
This changes not just the ^M but also the newlines between quotes.
^M is entered in a Unix session by pressing Ctrl+V followed by Ctrl+M.

Related

transform file with key value (comma separated) into a file with only values

I have the following file -
k11=v11,k12=v12,k13=v13,...,k1N=v1N
k21=v21,k22=v22,k23=v23,...,k2N=v2N
...
...
Note that key and values can be alphanumerics. I want a quick way to generate another file
with the following content -
v11,v12,v13,..,v1N
v21,v22,v23,..,v2N
....
Can I use sed or awk to do this quickly?
With GNU sed:
sed -r 's/(^|,)[^=]+=/\1/g'
Explanation:
(^|,) represents either the beginning of a line (^) or a comma. Owing to the parentheses, this is stored in register #1.
[^=]+= represents a string made of any character except the equal sign ([^=]), at least one character long (+) and directly followed by an equal sign
Both are replaced by the content of register #1 (s/.../\1/). In effect, this removes the second part
The g flag at the end repeats the operation as much as necessary
A shorter form: sed -r 's/[^,=]+=//g'
POSIX-friendly, if you don't have GNU sed (\+ is a GNU extension in basic regular expressions, \{1,\} is the portable spelling): sed 's/[^,=]\{1,\}=//g'
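As a quick check of the short form, here is a minimal sketch that wraps the portable expression in a filter and feeds it one sample line. The function name strip_keys is hypothetical, chosen only for this example.

```shell
#!/usr/bin/env bash
# strip_keys: hypothetical filter; reads key=value lines on stdin, writes values only.
strip_keys() {
  # \{1,\} is the POSIX basic-regex spelling of "one or more".
  sed 's/[^,=]\{1,\}=//g'
}

printf 'k11=v11,k12=v12,k13=v13\n' | strip_keys
# prints: v11,v12,v13
```

The expression deletes every run of non-comma, non-equals characters that is followed by =, so a value that itself contains = would be cut at that point.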
Here is an awk:
awk -F, '{for (i=1;i<=NF;i++) {
split($i,a,/=/)
printf "%s%s", a[2], (i<NF ? FS : ORS)
}
}' file > new_file
Another option using awk is to delete every run of characters other than a comma or equals sign that is followed by an equals sign, using the negated character class [^=,]+= (or, to match only alphanumeric keys, [[:alnum:]]+=).
$ cat file
k11=v11,k12=v12,k13=v13
k21=v21,k22=v22,k23=v23
Awk
awk '
gsub(/[^=,]+=/,"")
' file > output
cat output
Output
v11,v12,v13
v21,v22,v23

Bash: set IFS to Space after specific character only?

I'm using IFS=', ' to split a string of comma-delimited text into an array. The problem is that occasionally one of the comma-delimited items contains a space following a :. The resulting array contains that item as two separate array elements. Is it possible to set IFS to only split ', ' and ignore a comma-delimited item that contains ': ' (or any other character for that matter)?
See the comma-delimited string returned from the first command below, note the second item has the :. See the MarkerNames[1] and MarkerNames[2] to see the unwanted split in the second command below.
$ exiftool -s3 -TracksMarkersName audioFile.wav
Marker1, Tempo: 120.0, Silence, Marker2, Silence.1, Marker3, Silence.2, Marker4, Silence.3, Marker5
$ IFS=', ' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
$ declare -p MarkerNames
declare -a MarkerNames='([0]="Marker1" [1]="Tempo:" [2]="120.0" [3]="Silence" [4]="Marker2" [5]="Silence.1" [6]="Marker3" [7]="Silence.2" [8]="Marker4" [9]="Silence.3" [10]="Marker5")'
IFS is a list of characters, each of which acts as a field separator on its own. So ", " means that either a comma or a space ends a field, not only the two-character sequence ", ".
The simplest workaround I think would be to preprocess the output so you get the breaks where you want them.
IFS='~' MarkerNames=($(exiftool -s3 -TracksMarkersName audioFile.wav | sed 's/, /~/g'))
This of course requires you to find another IFS value which doesn't occur in your data. If Bash 4+ is available, maybe use a newline and readarray.
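The newline-plus-readarray variant mentioned above could look roughly like the following. It is only a sketch: the helper name split_markers is made up, it assumes Bash 4 or later, and it assumes the input contains no newlines of its own.

```shell
#!/usr/bin/env bash
# split_markers STRING: hypothetical helper; fills the global array MarkerNames.
split_markers() {
  local nl=$'\n'
  # Turn every ", " into a newline, then read one array element per line (Bash 4+).
  readarray -t MarkerNames <<< "${1//, /$nl}"
}

split_markers 'Marker1, Tempo: 120.0, Silence, Marker2'
declare -p MarkerNames
```

With the real data, the argument would be "$(exiftool -s3 -TracksMarkersName audioFile.wav)". Because the split happens on the two-character sequence and IFS is never touched, "Tempo: 120.0" stays in one element.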
You could split on commas and remove the leading / trailing spaces afterwards:
IFS=',' read -r -a MarkerNames <<< $(exiftool -s3 -TracksMarkersName audioFile.wav)
shopt -s extglob # Needed for extended glob
MarkerNames=( "${MarkerNames[@]/#*( )}" ) # Remove leading spaces
MarkerNames=( "${MarkerNames[@]/%*( )}" ) # Remove trailing spaces

Bash: convert text string into array with multiple \r\n as field separator

I have a windows text file in the format:
line\r\n
line\r\n
line\r\n
\r\n
line\r\n
line\r\n
line\r\n
\r\n
...
I want to put this text file into an array where the field separator is \r\n\r\n. I searched for an answer, but nothing I found and tried worked. awk, for example, is too complex for me, and FS= did not work as I expected.
Commands to read arrays in bash can (as far as I know) only use single characters as a field separator, not complete strings like \r\n\r\n.
Workaround
First replace the field separator \r\n\r\n with a single character which is not used in the string to be split. I found \x1e (the ASCII control character »Record Separator«) to work quite well.
Then read the array using the new (one character) field separator.
The field separator will always be removed when reading something to an array. But you can append the separator to each field.
Here is a pure bash solution to read the file file into the array array:
IFS=$'\x1e'
filecontent="$(< file)"
array=(${filecontent//$'\r\n\r\n'/$'\x1e'})
array=("${array[@]/%/$'\r\n\r\n'}")
IFS=$'\x1e' sets bash's field separator which is used to split strings into arrays. Depending on your script you may want to restore the old IFS afterwards (default is IFS=$' \t\n').
Results
For file
A B C\r\n
D E F\r\n
\r\n
G H I\r\n
\r\n
the resulting array will have two entries:
${array[0]}
A B C\r\n
D E F\r\n
\r\n
${array[1]}
G H I\r\n
\r\n
Known Problems
IFS at the beginning and end of the string will be trimmed. Repeated IFS will be squeezed. The file \r\n\r\n will result in an array without entries. Empty entries cannot be created.
\r\n\r\n is appended to all entries in all cases. The file A\r\n\r\nB will result an array with the two entries A\r\n\r\n and B\r\n\r\n.
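If empty entries matter, one possible variation is to keep the \x1e substitution but let readarray do the splitting, since it does not squeeze repeated delimiters. The sketch below is an assumption-laden alternative rather than the solution above: read_blocks and blocks are made-up names, readarray -d needs Bash 4.4 or later, and the separator is not re-appended to the entries.

```shell
#!/usr/bin/env bash
# read_blocks FILE: hypothetical helper; fills the global array blocks.
# Assumes Bash 4.4+ (readarray -d) and that \x1e never occurs in FILE.
read_blocks() {
  local content sep=$'\x1e' blank=$'\r\n\r\n'
  content=$(< "$1")
  # Drop any trailing run of \r and \n so the last entry ends cleanly.
  while [[ $content == *$'\r' || $content == *$'\n' ]]; do
    content=${content%?}
  done
  # Replace each \r\n\r\n with the one-character delimiter and split on it.
  readarray -d "$sep" -t blocks < <(printf '%s' "${content//$blank/$sep}")
}
```

After read_blocks file, declare -p blocks shows the entries. Two consecutive separators in the file produce an empty entry instead of being squeezed.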
In Linux all lines of files are terminated with \n.
So your problem is not the \r\n , it is just the \r. So just remove it:
$ tr -d '\r' <file >newfile
To verify that \r is removed you can do:
$ head -n2 newfile |od -t x1c
This will get the first two lines of the new file and the od tool will dump / convert those lines in ascii hex codes. In ascii hex \r is \x0d and \n is \x0a.
Once you have removed the \r from your file you can do anything you want.
You can use all linux tools (including awk) straight forward without special settings.
To build an array you can use:
$ while read -r line;do data+=("$line");done <newfile
If you want to skip blank lines , this one is enough:
$ while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done <file1
You can of course combine array creation with removal of the \r on the fly, without modifying your existing file, like this:
while read -r line;do [[ "$line" == "" ]] && continue;data+=("$line") ;done < <(tr -d '\r' <file1)
To see what is inside array "data" just use $ declare -p data
PS: By the way, using awk -v RS="\r\n" '{your awk code here}' should be enough to read the original file in awk as well. RS is the record (line) separator.
I made this script in pure bash, even if the answer from socowi is pure bash too:
exec < filern.txt
declare -a array
acc=""
lineno=0
cr=$(echo -en "\r")
while read line; do
line=${line%$cr}
if [ -z "$line" ]; then
let lineno=$lineno+1
array[$lineno]=$acc
acc=""
else
[ ! -z "$acc" ] && acc="$acc--" # you can use any separator here
acc="$acc$line"
fi
done
echo "Read file in array:"
for ((i=1; i<= ${#array[@]}; i++)) do
printf "%3.3d |%s|\n" $i "${array[$i]}"
done
It reads a "real" line of input at a time, and strips the trailing \r.
At this point, a sequence \r\n\r\n turns into an empty line, so that is used to assign the array elements one after the other.
The output from the example file is:
Read file in array:
001 |line--line--line|
002 |line--line--line|
The separator could also be a \r, or anything else. I couldn't find a way to strip the trailing \r directly with line=${line% ?? }, so I used a variable. The same trick can be used to add a "strange" separator to the variable acc. I hope it helps.
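For the trailing \r specifically, Bash's ANSI-C quoting can stand in for the helper variable. A minimal sketch, with strip_cr as a made-up name for illustration:

```shell
#!/usr/bin/env bash
# strip_cr STRING: hypothetical helper; prints STRING without one trailing carriage return.
strip_cr() {
  local line=$1
  line=${line%$'\r'}   # $'\r' is Bash ANSI-C quoting for a carriage return
  printf '%s' "$line"
}
```

Inside the loop above, the equivalent one-liner would be line=${line%$'\r'}.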

How can I use sed to make thousands of substitutions in a file using a reference file?

I have a big file with two columns like this:
tiago@tiago:~/$ head Ids.txt
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126999_c0_g1_i1 ENSMUST00000040656.6
TRINITY_DN126906_c0_g1_i1 ENSMUST00000126770.1
TRINITY_DN126907_c0_g1_i1 ENSMUST00000192613.1
TRINITY_DN126988_c0_g1_i1 ENSMUST00000032372.6
.....
and I have another file with data, like this:
"baseMean" "log2FoldChange" "lfcSE" "stat" "pvalue" "padj" "super" "sub" "threshold"
"TRINITY_DN41319_c0_g1" 178.721774751278 2.1974294626636 0.342621318593487 6.41358066008381 1.4214085388179e-10 5.54686423073089e-08 TRUE FALSE "TRUE"
"TRINITY_DN87368_c0_g1" 4172.76139849472 2.45766387851112 0.404014016558211 6.08311538160958 1.17869459181235e-09 4.02673069375893e-07 TRUE FALSE "TRUE"
"TRINITY_DN34622_c0_g1" 39.1949851245197 3.28758092748061 0.54255370348027 6.05945716781964 1.3658169042862e-09 4.62597265729593e-07 TRUE FALSE "TRUE"
.....
I was thinking of using sed to perform a translation of the values in the first column of the data file, using the first file as a dictionary.
That is, considering each line of the data file in turn, if the value in the first column matches a value in the first column of the dictionary file, then a substitution would be made; otherwise, the line would simply be printed.
Any suggestions would be appreciated.
You can turn your first file Ids.txt into a sed script:
$ sed -r 's| *(\S+) (\S+)|s/^"\1/"\2/|' Ids.txt > repl.sed
$ cat repl.sed
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126999_c0_g1_i1/"ENSMUST00000040656.6/
s/^"TRINITY_DN126906_c0_g1_i1/"ENSMUST00000126770.1/
s/^"TRINITY_DN126907_c0_g1_i1/"ENSMUST00000192613.1/
s/^"TRINITY_DN126988_c0_g1_i1/"ENSMUST00000032372.6/
This removes leading spaces and makes each line into a substitution command.
Then you can use this script to do the replacements in your data file:
sed -f repl.sed datafile
... with redirection to another file, or in-place with sed -i.
If you don't have GNU sed, you can use this POSIX conformant version of the first command:
sed 's| *\([^ ]*\) \([^ ]*\)|s/^"\1/"\2/|' Ids.txt
This uses basic instead of extended regular expressions and uses [^ ] for "not space" instead of \S.
Since the first file (the dictionary file) is large, using sed may be very slow; a much faster and not much more complex approach would be to use awk as follows:
awk -v col=1 -v dict=Ids.txt '
BEGIN {while((getline < dict) > 0){a["\""$1"\""]="\""$2"\""} }
$col in a {$col=a[$col]}; {print}' datafile
(Here, "Ids.txt" is the dictionary file, and "col" is the column number of the field of interest in the data file.)
This approach also has the advantage of not requiring any modification to the dictionary file.
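A variation on the same lookup idea is sketched below. It uses the common two-file awk idiom (NR == FNR) in place of getline, and sub() on the whole line so that the original spacing of the other columns is kept. The name translate_ids is hypothetical, and the sketch assumes a non-empty dictionary, a first column wrapped in double quotes, and replacement IDs that contain no & or backslash.

```shell
#!/usr/bin/env bash
# translate_ids DICT DATA: hypothetical helper; prints DATA with first-column IDs translated.
translate_ids() {
  awk '
    # First file (the dictionary): remember old -> new, both wrapped in double quotes.
    NR == FNR { map["\"" $1 "\""] = "\"" $2 "\""; next }
    # Second file (the data): swap the first field in place if it is a known ID.
    ($1 in map) { sub(/^[^ \t]+/, map[$1]) }
    { print }
  ' "$1" "$2"
}
```

It would be called as translate_ids Ids.txt datafile > translated. Lines whose first column has no entry in the dictionary pass through unchanged.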
#!/bin/bash
# Declare hash table
declare -A Ids
# Go though first input file and add key-value pairs to hash table
while read Id; do
key=$(echo $Id | cut -d " " -f1)
value=$(echo $Id | cut -d " " -f2)
Ids+=([$key]=$value)
done < $1
# Go through second input file and replace every first column with
# the corresponding value in the hash table if it exists
while read line; do
first_col=$(echo $line | cut -d '"' -f2)
new_id=${Ids[$first_col]}
if [ -n "$new_id" ]; then
sed -i s/$first_col/$new_id/g $2
fi
done < $2
I would call the script as
./script.sh Ids.txt data.txt

How to create a dynamic variable array name and fill it with multi-line text in bash script

I need to create a dynamic variable name for some arrays and fill it with multi-line text.
What I actually have is this :
#!/bin/bash
IFS=$'\n'
# Set an array with 1 item
ARRAY=("Item1")
# Get a description for the last item from a simple text file
DESCRIPTION=("$(cat test1.txt)")
# Get the number of items in the array
ARRAY_ITEMS_COUNT=${#ARRAY[@]}
# Create a variable name containing the number of items in the array as identifier
# and fill it with the description
eval ARRAY_ITEM${ARRAY_ITEMS_COUNT}_DESCRIPTION="(\"${DESCRIPTION[@]}\")"
# Display some results
echo "ARRAY_ITEM1_DESCRIPTION[@] = \"${ARRAY_ITEM1_DESCRIPTION[@]}\""
echo "ARRAY_ITEM1_DESCRIPTION[0] = \"${ARRAY_ITEM1_DESCRIPTION[0]}\""
echo "ARRAY_ITEM1_DESCRIPTION[1] = \"${ARRAY_ITEM1_DESCRIPTION[1]}\""
echo
# Same as above with a different text file
ARRAY=("Item1" "Item2")
DESCRIPTION=("$(cat test2.txt)")
ARRAY_ITEMS_COUNT=${#ARRAY[@]}
# Get an error here due to the ' character used in the text file
eval ARRAY_ITEM${ARRAY_ITEMS_COUNT}_DESCRIPTION="(\"${DESCRIPTION[@]}\")"
echo "ARRAY_ITEM2_DESCRIPTION[@] = \"${ARRAY_ITEM2_DESCRIPTION[@]}\""
echo "ARRAY_ITEM2_DESCRIPTION[0] = \"${ARRAY_ITEM2_DESCRIPTION[0]}\""
echo "ARRAY_ITEM2_DESCRIPTION[1] = \"${ARRAY_ITEM2_DESCRIPTION[1]}\""
The files "test1.txt" and "test2.txt" are as follow :
test1.txt
Simple text file with multi-lines used as
a test without special characters inside.
test2.txt
Simple text file with multi-lines used as
a test with single ' and double " quotes.
Expected result :
ARRAY_ITEM1_DESCRIPTION[@] = "Simple text file with multi-lines used as
a test without special characters inside."
ARRAY_ITEM1_DESCRIPTION[0] = "Simple text file with multi-lines used as"
ARRAY_ITEM1_DESCRIPTION[1] = "a test without special characters inside."
ARRAY_ITEM2_DESCRIPTION[@] = "Simple text file with multi-lines used as
a test with single ' and double " quotes."
ARRAY_ITEM2_DESCRIPTION[0] = "Simple text file with multi-lines used as"
ARRAY_ITEM2_DESCRIPTION[1] = "a test with single ' and double " quotes."
Current result :
ARRAY_ITEM1_DESCRIPTION[@] = "Simple text file with multi-lines used as
a test without special characters inside."
ARRAY_ITEM1_DESCRIPTION[0] = "Simple text file with multi-lines used as
a test without special characters inside."
ARRAY_ITEM1_DESCRIPTION[1] = ""
./test.sh: eval: line 28: unexpected EOF while looking for matching `"'
./test.sh: eval: line 29: syntax error: unexpected end of file
ARRAY_ITEM2_DESCRIPTION[@] = ""
ARRAY_ITEM2_DESCRIPTION[0] = ""
ARRAY_ITEM2_DESCRIPTION[1] = ""
I tried a lot of things but it never gives me what is expected, so can someone help me solve the 2 issues I have there please :
How to get proper array containing each line of the text file on each index
How to make it work even with quotes characters in the texts
EDIT: Working solution (Bash version 4 or later) is:
#!/bin/bash
# Set an array with 1 item
ARRAY=("Item1")
# Get the number of items in the array
ARRAY_ITEMS_COUNT=${#ARRAY[@]}
# Create a variable name containing the number of items in the array as identifier
# and fill it with the description
readarray -t "ARRAY_ITEM${ARRAY_ITEMS_COUNT}_DESCRIPTION" < test1.txt
# Display some results
echo "ARRAY_ITEM1_DESCRIPTION[@] = \"${ARRAY_ITEM1_DESCRIPTION[@]}\""
echo "ARRAY_ITEM1_DESCRIPTION[0] = \"${ARRAY_ITEM1_DESCRIPTION[0]}\""
echo "ARRAY_ITEM1_DESCRIPTION[1] = \"${ARRAY_ITEM1_DESCRIPTION[1]}\""
echo
# Same as above with a different text file
ARRAY=("Item1" "Item2")
ARRAY_ITEMS_COUNT=${#ARRAY[@]}
# No error here any more, even with the ' character used in the text file
readarray -t "ARRAY_ITEM${ARRAY_ITEMS_COUNT}_DESCRIPTION" < test2.txt
echo "ARRAY_ITEM2_DESCRIPTION[@] = \"${ARRAY_ITEM2_DESCRIPTION[@]}\""
echo "ARRAY_ITEM2_DESCRIPTION[0] = \"${ARRAY_ITEM2_DESCRIPTION[0]}\""
echo "ARRAY_ITEM2_DESCRIPTION[1] = \"${ARRAY_ITEM2_DESCRIPTION[1]}\""
Thanks for your help, have a nice day.
Slander
When you call cat in an array assignment, don't quote the command substitution if you want the file to be split into lines. Quoted, the whole content of the file is handled as a single string and ends up in one array element. With IFS=$'\n' set as in your script, just try:
DESCRIPTION=($(cat test1.txt))
Also, if you are using Bash version 4 or later, you could use the Bash builtin command readarray to generate an array:
readarray -t DESCRIPTION < "test1.txt"
For Bash version < 4 this could be an alternative to cat and readarray (note the empty argument to -d, which makes read consume the whole file):
IFS=$'\n' read -d '' -r -a DESCRIPTION < "test1.txt"
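To read such a dynamically named array back without eval, a nameref is one option. The following is only a sketch under stated assumptions: load_description and description_line are made-up helper names, readarray needs Bash 4 or later, and local -n needs Bash 4.3 or later.

```shell
#!/usr/bin/env bash
# load_description N FILE: hypothetical helper; reads FILE into the
# global array ARRAY_ITEM<N>_DESCRIPTION, one line per element.
load_description() {
  readarray -t "ARRAY_ITEM${1}_DESCRIPTION" < "$2"
}

# description_line N INDEX: prints one line of that array through a nameref (Bash 4.3+).
description_line() {
  local -n ref="ARRAY_ITEM${1}_DESCRIPTION"
  printf '%s\n' "${ref[$2]}"
}
```

For example, load_description 2 test2.txt followed by description_line 2 1 prints the second line of test2.txt, quotes included, because the text is never re-parsed as shell code.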
