remove enclosing brackets from a file - unix-text-processing

How can I efficiently remove enclosing brackets from a file with bash scripting (first occurrence of [ and last occurrence of ] in file)?
All brackets that are nested within the outer brackets and may extend over several lines should be retained.
Leading or trailing whitespace may be present.
content of file1
[
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
]
cat file1 | magicCommand
desired output
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
content of file2
[Lorem ipsum [dolor] sit [amet conse] sadip elitr]
cat file2 | magicCommand
desired output
Lorem ipsum [dolor] sit [amet conse] sadip elitr

If you want to edit the file in place to remove the brackets, use ed:
printf '%s\n' '1s/^\([[:space:]]*\)\[/\1/' '$s/\]\([[:space:]]*\)$/\1/' w | ed -s file1
If you want to pass on the modified contents of the file to something else as part of a pipeline, use sed:
sed -e '1s/^\([[:space:]]*\)\[/\1/' -e '$s/\]\([[:space:]]*\)$/\1/' file1
Both of these will, on the first line of the file, remove a [ at the start of the line (skipping over any initial whitespace before the opening bracket), and on the last line of the file (which can be the same line, as in your second example), remove a ] at the end of the line (ignoring any trailing whitespace after the closing bracket). Any leading or trailing whitespace is preserved in the result; use s/...// instead to remove it too.
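As a minimal sketch, the sed variant can be wrapped in a small shell function so that it acts as the magicCommand filter from the question. The function name strip_outer_brackets is made up for illustration; the sed expressions are the ones shown above, and the demo input is the question's file2.

```shell
# Hypothetical helper: reads stdin, removes a leading "[" on the first line
# and a trailing "]" on the last line, keeping the surrounding whitespace.
strip_outer_brackets() {
    sed -e '1s/^\([[:space:]]*\)\[/\1/' \
        -e '$s/\]\([[:space:]]*\)$/\1/'
}

# file2 from the question: everything sits on a single line.
printf '%s\n' '[Lorem ipsum [dolor] sit [amet conse] sadip elitr]' |
    strip_outer_brackets
```

For input like file1, where the outer brackets sit on lines of their own, this leaves an empty first and last line behind rather than deleting those lines.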

perl -0777 -pe 's/^\s*\[\s*//; s/\s*\]\s*$//' file
That's aggressive about removing all whitespace around the outer brackets, which isn't exactly what you show in your desired output.

With GNU sed for -E and -z:
$ sed -Ez 's/\[(.*)]/\1/' file1
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
$ sed -Ez 's/\[(.*)]/\1/' file2
Lorem ipsum [dolor] sit [amet conse] sadip elitr
The above will read the whole file into memory.

Related

Split long string into array and keep delimiter

I've got a strange edge-case.
I have a long string which contains \n (newline characters).
So the string looks something like:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
I need to split the string into an array, but keep the newline characters uninterpreted,
so the array/output looks like:
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam"
I couldn't find a way to split the string and preserve the \n characters.
If I use IFS=$"\n", the \n characters are deleted,
but if I use IFS="\n", the string gets split and every occurrence of n is deleted.
I tried it like:
IFS=$"\n" read -d '' -a arr <<<"$text"
How can I solve this?
Clarification/Update
The text is dynamic and can be very long 3000+ chars,
so creating the array like: declare -a arr=([0]=$'loremipsum\n'... is not an option.
The \n characters (0x5c + 0x6e in ASCII) should all be treated the same;
they should not be replaced with an actual newline.
The \n characters must be preserved,
because the program that receives the output looks for them in plain text.
The \n characters can appear at any position in a sentence,
including inside a word, like:
lor\nem or with spaces: Lorem \n ipsum
So the \n characters must be at the end of the elements inside the array, as shown above.
The text must be split only at \n, not at spaces etc.
My understanding from the sample (input/output) data given:
there is one actual newline character in text (between erat, and sed diam); this is to be removed, and assuming there is no (space) after erat, we need to add a (space), i.e., replace the actual newline character with a (space)
there are 4 literal strings of \ + n; we are to break the array after these literals; the literal \ + n are to remain in the text that is stored in the array
the output should have a leading space removed from array values
I'm assuming the final results should not include the double quotes (ie, OP included the double quotes in the desired output as a means of delimiting the array values for display purposes)
One idea:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
# convert actual newline character to a (space)
text=${text//$'\n'/ }
# add an actual newline character after the literal `\` + `n`
text=${text//\\n/\\n$'\n'}
# print our value, remove leading (space), and load into array
IFS=$'\n' arr=( $(printf "%s\n" "${text}." | sed 's/^ //g') )
# display array
typeset -p arr
declare -a arr=([0]="loremipsum\\n" [1]="Dollor sit atmet \\n" [2]="aliquyam erat, sed diam\\n" [3]="aliquyam erat \\n" [4]="sed diam.")
# loop through array and display individual strings; add double quotes as delimiters for display purposes
for i in "${!arr[@]}"
do
echo "\"${arr[${i}]}\""
done
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam."
You can use process substitution and echo, e.g.
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
You can also use printf in the process substitution as well, e.g.
< <(printf "$text")
Since the -t option is not given to readarray, the '\n' is included as part of each array element.
Example Use/Output
Adding a declare -p arr to output the array, you would have:
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
declare -p arr
declare -a arr=([0]=$'loremipsum\n' [1]=$'Dollor sit atmet \n' [2]=$' aliquyam erat, sed diam\n' [3]=$'aliquyam erat \n' [4]=$' sed diam\n')
If you want to trim a leading whitespace character, you can use the parameter expansion ${element#[[:space:]]}. Up to you.
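Both requirements from the question, splitting only at the literal \n and keeping it at the end of each element, can also be met in pure bash without sed or word splitting. The following is only a sketch: the function name split_keep_delim and the use of a global array arr are assumptions made for illustration, and real newlines are turned into spaces as in the first answer's reading of the sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fills the global array "arr" with the pieces of $1,
# split after each literal backslash + n, which stays at the end of its piece.
split_keep_delim() {
    local delim='\n'            # two characters: backslash and n
    local nl=$'\n'              # one character: a real newline
    local rest piece
    rest=${1//"$nl"/ }          # real newlines become spaces
    arr=()
    while [[ $rest == *"$delim"* ]]; do
        piece=${rest%%"$delim"*}    # text before the first literal \n
        rest=${rest#*"$delim"}      # text after it
        arr+=( "${piece# }$delim" ) # drop one leading space, re-attach \n
    done
    rest=${rest# }
    if [[ -n $rest ]]; then
        arr+=( "$rest" )
    fi
}

text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
split_keep_delim "$text"
printf '"%s"\n' "${arr[@]}"
```

Because the delimiter is held in a quoted variable, it is matched literally in the pattern, so neither a lone n nor a lone backslash triggers a split.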

Only keeping lines in a textfile that start with specific characters

How can we achieve deleting lines starting with specific characters in a huge text file using a batch file?
e.g:
oldfile.txt is:
line 1 Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
line 2 euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
line 3 aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor
line 4 vel illum dolore eu feugiat nulla.
deleting lines starting with "eui" and "ali";
newfile.txt becomes:
line 1 Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
line 4 vel illum dolore eu feugiat nulla.
Try:
@ECHO ON
findstr /v "Lorem euismod" "%USERPROFILE%\Desktop\test.txt" > "%USERPROFILE%\Desktop\outfile.txt"
PAUSE
It deletes line 1 and line 2 in test.txt. I used Lorem euismod as reference.
findstr /v on its own also deletes lines that contain a search term anywhere, not just at the beginning as requested. Add the /B option to match only at the start of the line:
findstr /B /V "eui ali" "input.txt" > "output.txt"
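For comparison only, the same filter on a Unix-like system could be sketched with grep instead of findstr. This swaps in a different tool from the one the question asks about; the function name drop_prefixed_lines and the shortened sample lines are made up for illustration.

```shell
# Hypothetical helper: reads stdin and drops every line that begins with
# "eui" or "ali"; -v inverts the match, ^ anchors it to the line start.
drop_prefixed_lines() {
    grep -v -e '^eui' -e '^ali'
}

printf '%s\n' \
    'Lorem ipsum dolor sit amet' \
    'euismod tincidunt ut laoreet' \
    'aliquip ex ea commodo consequat' \
    'vel illum dolore eu feugiat nulla.' |
    drop_prefixed_lines
```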

how can I use sed to delete second line?

I have a text file 3.txt, content like this
123.ooo
aaaa - bbbb ccccc, dddd dddd, eeee fff (hhhhh ii jjj) 890a fff rrr 98.jjjj.1234
444.kkk
read 3.txt into array VAR
#!/bin/bash
VAR=$(cat /tmp/3.txt)
LEN=${#VAR[@]}
for (( i=0; i<${LEN}; i++ ));
do
echo ${VAR[$i]}
sed -i -e "#"${VAR[$i]}"#d" /tmp/3.txt
cat /tmp/3.txt
done
I want to echo 123.ooo then delete the 123.ooo line in /tmp/3.txt
then echo aaaa - bbbb ccccc, dddd dddd, eeee fff (hhhhh ii jjj) 890a fff rrr 98.jjjj.1234 and delete this line in /tmp/3.txt
but second line have space, how can I use sed to delete second line?
Sorry about my poor English.
I want to echo first line, then delete first line in file.
echo second line, then delete second line in file.
echo third line, then delete third line in file.
Taking this as an exercise in using arrays and sed:
The issue with your code is that the file is read as a single value into the variable VAR, whereas you want to create an array with each line as its own array entry.
To assign to an array use
VAR=( ... )
To split the input on newlines, set the IFS variable to the value of a single newline.
You also have another issue with your sed editing script. You seem to want to use # as the delimiter for the regular expression (which is actually a "sed address"). If you want to do this, the first # has to be escaped (\#). It is only the arguments to the s and y sed editing commands that take any unescaped character as a delimiter.
#!/bin/bash
IFS=$'\n'
VAR=( $(cat /tmp/3.txt) )
LEN=${#VAR[@]}
for (( i=0; i<${LEN}; i++ ));
do
echo ${VAR[$i]}
sed -i -e "\#"${VAR[$i]}"#d" /tmp/3.txt
cat /tmp/3.txt
done
Or, with a bit of cleanup:
#!/bin/bash
IFS=$'\n'
VAR=( $(</tmp/3.txt) )
LEN=${#VAR[@]}
for (( i = 0; i < LEN; i++ )); do
echo ${VAR[$i]}
sed -i -e "\#${VAR[$i]}#d" /tmp/3.txt
cat /tmp/3.txt
done
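Matching each line by its content stops working as soon as a line contains the # delimiter or a regular-expression metacharacter. Since the stated goal is "echo a line, then delete that line", one alternative is to delete by line number instead. The following is a sketch under assumptions: the function name echo_and_delete_lines is hypothetical, mapfile needs bash 4 or later, and -i is used the way GNU sed accepts it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each line of a file, deleting it from the
# file as soon as it has been printed. Assumes GNU sed for -i.
echo_and_delete_lines() {
    local file=$1 line
    local -a lines
    mapfile -t lines < "$file"      # one array element per line
    for line in "${lines[@]}"; do
        printf '%s\n' "$line"
        sed -i '1d' "$file"         # the printed line is always line 1 now
    done
}

# Demo on a temporary file holding the sample data from the question.
demo=$(mktemp)
printf '%s\n' '123.ooo' \
    'aaaa - bbbb ccccc, dddd dddd, eeee fff (hhhhh ii jjj) 890a fff rrr 98.jjjj.1234' \
    '444.kkk' > "$demo"
echo_and_delete_lines "$demo"
rm -f "$demo"
```

Because the line content never reaches sed, spaces, parentheses, and dots in the data need no escaping.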

how to return an array from a script in Bash?

suppose I have a script called 'Hello'. something like:
array[0]="hello world"
array[1]="goodbye world"
echo ${array[*]}
and I want to do something like this in another script:
tmp=(`Hello`)
the result I need is:
echo ${tmp[0]} #prints "hello world"
echo ${tmp[1]} #prints "goodbye world"
instead I get
echo ${tmp[0]} #prints "hello"
echo ${tmp[1]} #prints "world"
or in other words, every word is put in a different spot in the tmp array.
how do I get the result I need?
Emit it as a NUL-delimited stream:
printf '%s\0' "${array[@]}"
...and, in the other side, read from that stream:
array=()
while IFS= read -r -d '' entry; do
array+=( "$entry" )
done
This often comes in handy in conjunction with process substitution; in the below example, the initial code is in a command (be it a function or an external process) invoked as generate_an_array:
array=()
while IFS= read -r -d '' entry; do
array+=( "$entry" )
done < <(generate_an_array)
You can also use declare -p to emit a string which can be evaled to get the content back:
array=( "hello world" "goodbye world" )
declare -p array
...and, on the other side...
eval "$(generate_an_array)"
However, this is less preferable -- it's not as portable to programming languages other than bash (whereas almost all languages can read a NUL-delimited stream), and it requires the receiving program to trust the sending program to return declare -p results and not malicious content.
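Putting the two halves of the NUL-delimited approach together gives a short end-to-end sketch. Here generate_an_array is written as a function that stands in for the Hello script from the question; its body is an assumption made so the example runs on its own.

```shell
#!/usr/bin/env bash
# Producer: stands in for the Hello script; emits each element followed
# by a NUL byte, so embedded spaces and newlines survive.
generate_an_array() {
    local array=( "hello world" "goodbye world" )
    printf '%s\0' "${array[@]}"
}

# Consumer: rebuild the array from the NUL-delimited stream.
tmp=()
while IFS= read -r -d '' entry; do
    tmp+=( "$entry" )
done < <(generate_an_array)

echo "${tmp[0]}"
echo "${tmp[1]}"
```

The loop reads from process substitution rather than a pipe, so tmp is filled in the current shell and is still set after the loop ends.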
Although there are workarounds, you can't really "return" an array from a bash function or script, since the normal way of "returning" a value is to send it as a string to stdout and let the caller capture it with command substitution. [Note 1] That's fine for simple strings or very simple arrays (such as arrays of numbers, where the elements cannot contain whitespace), but it's really not a good way to send structured data.
There are workarounds, such as printing a string with specific delimiters (in particular, with NUL bytes) which can be parsed by the caller, or in the form of an executable bash statement which can be evaluated by the caller with eval, but on the whole the simplest mechanism is to require that the caller provide the name of an array variable into which the value can be placed. This only works with bash functions, since scripts can't modify the environment of the caller, and it only works with functions called directly in the parent process, so it won't work with pipelines. Effectively, this is a mechanism similar to that used by the read built-in, and a few other bash built-ins.
Here's a simple example. The function split takes three arguments: an array name, a delimiter, and a string:
split () {
IFS=$2 read -a "$1" -r -d '' < <(printf %s "$3")
}
eg:
$ # Some text
$ lorem="Lorem ipsum dolor
sit amet, consectetur
adipisicing elit, sed do
eiusmod tempor incididunt"
# Split at the commas, putting the pieces in the array phrase
$ split phrase "," "$lorem"
# Print the pieces in a way that you can see the elements.
$ printf -- "--%s\n" "${phrase[@]}"
--Lorem ipsum dolor
sit amet
-- consectetur
adipisicing elit
-- sed do
eiusmod tempor incididunt
Notes:
Any function or script does have a status return, which is a small integer; this is what is actually returned by the return and exit special forms. However, the status return mostly works as a boolean value, and certainly cannot carry a structured value.
hello.sh
declare -a array # declares a global array variable
array=(
"hello world"
"goodbye world"
)
other.sh
. hello.sh
tmp=( "${array[@]}" ) # if you need to make a copy of the array
echo "${tmp[0]}"
echo "${tmp[1]}"
If you truly want a function to spit out values that your script will capture, do this:
hello.sh
#!/bin/bash
array=(
"hello world"
"goodbye world"
)
printf "%s\n" "${array[@]}"
other.sh
#!/bin/bash
./hello.sh | {
readarray -t tmp
echo "${tmp[0]}"
echo "${tmp[1]}"
}
# or
readarray -t tmp < <(./hello.sh)
echo "${tmp[0]}"
echo "${tmp[1]}"

Perl - Check if any elements in each different array matches a variable

I have a problem I am hoping someone can help with (greatly simplified for the purposes of explaining what I am trying to do)...
I have three different arrays:
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
I also have a variable that contains the content from a web page (using WWW::Mechanize):
my $variable = $r->content;
I now want to see if any of the elements in each of the arrays are found in the variable, and if so which array it comes from:
e.g
if ($variable =~ (any of the elements in @array1)) {
print "FOUND IN ARRAY1";
} elsif ($variable =~ (any of the elements in @array2)) {
print "FOUND IN ARRAY2";
} elsif ($variable =~ (any of the elements in @array3)) {
print "FOUND IN ARRAY3";
}
What is the best way to go about doing this using the arrays and iterating through each element in the arrays? Is there a better way this can be done?
your help is much appreciated, thanks
You can make a regex out of the array elements, but you'll most likely want to disable meta characters and make sure you do not get partial matches:
my $rx = join('\b|\b', map quotemeta, @array1);
if ($variable =~ /\b$rx\b/) {
print "matched array 1\n";
}
If you do want to get partial matches, such as FOXY below, simply remove all the \b sequences.
Demonstration:
use strict;
use warnings;
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
my %checks = (
array1 => join('\b|\b', map quotemeta, @array1),
array2 => join('\b|\b', map quotemeta, @array2),
array3 => join('\b|\b', map quotemeta, @array3),
);
while (<DATA>) {
chomp;
print "The string: '$_'\n";
for my $key (sort keys %checks) {
print "\t";
if (/\b$checks{$key}\b/) {
print "does";
} else {
print "does not";
}
print " match $key\n";
}
}
__DATA__
A DOG ATE MY RHINO
A FOXY HORSEY
Output:
The string: 'A DOG ATE MY RHINO'
does match array1
does not match array2
does match array3
The string: 'A FOXY HORSEY'
does not match array1
does not match array2
does not match array3
my $re1 = join '|', @array1;
say "found in array 1" if $variable =~ /$re1/;
Repeat for each additional array (or use an array of regexes and an array of arrays of terms).
First of all, when you find yourself adding an integer suffix to variable names, think "I should have used an array."
Therefore, first I am going to put the wordsets in an array of arrayrefs. That will help identify where the matched word came from.
Second, I am going to use Regex::PreSuf to make a pattern out of each word list because I always forget the right way to do that.
Third, note that using \b in regex patterns can lead to surprising results. So, instead, I am going to split up the content into individual sequences of \w characters.
Fourth, you say "I also have a variable that contains the content from a web page (using WWW::Mechanize)". Do you want to match words in the comments? In title attributes? If you don't, you should parse the HTML document either to extract full plain text or to restrict the match to within a certain element or set of elements.
Then, grep from the list of words in the text those that are in a wordset and map them to the wordset they matched.
#!/usr/bin/env perl
use strict; use warnings;
use Regex::PreSuf qw( presuf );
my @wordsets = (
[ qw( DOG CAT HAMSTER ) ],
[ qw( DONKEY FOX PIG HORSE ) ],
[ qw( RHINO LION ELEPHANT ) ],
);
my @patterns = map {
my $pat = presuf(@$_);
qr/\A($pat)\z/;
} @wordsets;
my $content = q{Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis ELEPHANT exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in HAMSTER
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in DONKEY qui officia deserunt mollit anim id
est laborum.};
my @contents = split /\W+/, $content;
use YAML;
print Dump [
map {
my $i = $_;
map +{$_ => $i },
grep { $_ =~ $patterns[$i] } @contents
} 0 .. $#patterns
];
Here, grep { $_ =~ $patterns[$i] } @contents extracts the words from @contents which are in the given wordset. Then, map +{$_ => $i } maps those words to the wordset from which they came. The outer map just loops over each wordset pattern.
Output:
---
- HAMSTER: 0
- DONKEY: 1
- ELEPHANT: 2
That is, you get a list of hashrefs where the key in each hashref is the word that was found and the value is the wordset that matched.
I assume $variable is not an array, in which case use a foreach statement.
foreach my $item (@array1) {
if ($item eq $variable) {
print "FOUND IN ARRAY1";
}
}
and repeat the above for each array, i.e. array2, array3...
EDIT: I think you could use perl's map function, something like this:
@a1matches = map { $variable =~ /$_/ ? $_ : (); } @array1;
print "FOUND IN ARRAY1\n" if $#a1matches >= 0;
@a2matches = map { $variable =~ /$_/ ? $_ : (); } @array2;
print "FOUND IN ARRAY2\n" if $#a2matches >= 0;
@a3matches = map { $variable =~ /$_/ ? $_ : (); } @array3;
print "FOUND IN ARRAY3\n" if $#a3matches >= 0;
A fun side effect is that @a1matches contains the elements of @array1 that were in $variable.
Regexp::Assemble may be helpful if you would like to use a module. It lets you assemble several regular expressions into a single regular expression that matches everything the individual expressions match.

Resources