I have a huge text file and I want to delete the portions of it that lie between two given words. For example:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat
volutpat. Ut wisi enim ad minim veniam, quis nostrud exerci tation
ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat.
Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse
molestie consequat, vel illum dolore eu feugiat nulla facilisis at
vero eros et accumsan et iusto odio dignissim qui blandit praesent
luptatum zzril delenit augue duis dolore te feugait nulla facilisi.
Nam liber tempor cum soluta nobis eleifend option congue nihil
imperdiet doming id quod mazim placerat facer possim assum. Typi non
habent claritatem insitam; est usus legentis in iis qui facit eorum
claritatem. Investigationes demonstraverunt lectores legere me lius
quod ii legunt saepius. Claritas est etiam processus dynamicus, qui
sequitur mutationem consuetudium lectorum. Mirum est notare quam
littera gothica.
After deleting everything between the words "quis" and "gothica", it becomes:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat. Ut wisi enim ad minim veniam, quis gothica.
Actually, the huge file contains many occurrences of "quis" and "gothica", and I have to get rid of the text between every such pair.
This can probably be achieved with a simple batch script, but I am new to the subject. Thanks in advance to anyone who can help.
Here is the simplest solution I came up with, which I'm sure has issues with things like special characters, but works with the given example. I used filenames input.txt and output.txt.
@echo off
setlocal disableDelayedExpansion
set "FLAG=FALSE"
:: Define LF to contain a newline character
set LF=^


:: Do not remove the two blank lines above!
> output.txt (
  for /f "eol= tokens=*" %%A in (input.txt) do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    for %%L in ("!LF!") do (
      for /f "eol= delims=., " %%W in ("!ln: =%%~L!") do (
        if "%%W"=="quis" (
          set "FLAG=TRUE"
          <nul set /p="%%W "
        ) else if "%%W"=="gothica" (
          <nul set /p="%%W "
          set "FLAG=FALSE"
        ) else if "!FLAG!"=="FALSE" (
          <nul set /p="%%W "
        )
      )
    )
    rem Carry FLAG across endlocal so a deletion can span several input lines
    for %%F in ("!FLAG!") do endlocal & set "FLAG=%%~F"
  )
)
This goes through each word and prints them out until it finds quis, and resumes printing out after it finds gothica. I used <nul set /p=%%W to echo without printing a newline (see 2nd and 3rd links), which has a side effect of printing out an extra space at the end of the file, so be aware of that.
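For comparison, the same idea can be sketched outside of batch. The Python below is not part of the answer above; it assumes the whole file fits in memory and that the two markers are matched as whole words, and the name delete_between is made up for illustration. It keeps both marker words and removes everything between them, including line breaks. Unlike the batch script, it leaves punctuation and line breaks outside the deleted regions untouched.

```python
import re


def delete_between(text, start="quis", end="gothica"):
    """Remove everything between each start/end pair, keeping both words."""
    pattern = re.compile(
        r"\b(" + re.escape(start) + r")\b.*?\b(" + re.escape(end) + r")\b",
        re.DOTALL,  # let '.' cross line breaks
    )
    # Non-greedy matching stops at the first end word after each start word.
    return pattern.sub(r"\1 \2", text)


sample = (
    "minim veniam, quis nostrud exerci\n"
    "tation littera gothica. Then quis foo gothica again."
)
print(delete_between(sample))
```

For a real file, the text would be read with open("input.txt").read() and the result written to output.txt, the same filenames the batch script uses.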
References:
Read words separated by space in a batch script
Hidden features of Windows batch files
Windows batch: echo without new line
How can I efficiently remove enclosing brackets from a file with bash scripting (first occurrence of [ and last occurrence of ] in file)?
All brackets that are nested within the outer brackets and may extend over several lines should be retained.
Leading or trailing whitespaces may be present.
content of file1
[
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
]
cat file1 | magicCommand
desired output
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
content of file2
[Lorem ipsum [dolor] sit [amet conse] sadip elitr]
cat file2 | magicCommand
desired output
Lorem ipsum [dolor] sit [amet conse] sadip elitr
If you want to edit the file in place to remove the brackets, use ed:
printf '%s\n' '1s/^\([[:space:]]*\)\[/\1/' '$s/\]\([[:space:]]*\)$/\1/' w | ed -s file1
If you want to pass on the modified contents of the file to something else as part of a pipeline, use sed:
sed -e '1s/^\([[:space:]]*\)\[/\1/' -e '$s/\]\([[:space:]]*\)$/\1/' file1
Both of these will, for the first line of the file, remove a [ at the start of the line (skipping over any initial whitespace before the opening bracket), and for the last line of the file (which can be the same line, as in your second example), remove a ] at the end of the line (not counting any trailing whitespace after the closing bracket). Any leading/trailing whitespace is preserved in the result; use s/...// instead to remove it too.
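The two substitutions can also be sketched in Python, which may help to see what each one anchors on. This is only an illustration under the assumption that the file fits in memory; the function name strip_outer_brackets is hypothetical. As with the sed and ed commands, a line that contained only a bracket is left behind as an empty line.

```python
import re


def strip_outer_brackets(text):
    """Drop a leading '[' on the first line and a trailing ']' on the last."""
    # First line: optional spaces/tabs, then '['; keep the whitespace.
    text = re.sub(r"\A([ \t]*)\[", r"\1", text, count=1)
    # Last line: ']' followed only by spaces/tabs and an optional final newline.
    text = re.sub(r"\]([ \t]*\n?)\Z", r"\1", text, count=1)
    return text


file1 = "[\nLorem ipsum\n[dolor] sit [amet\nconse] sadip elitr\n]\n"
file2 = "[Lorem ipsum [dolor] sit [amet conse] sadip elitr]\n"
print(strip_outer_brackets(file1), end="")
print(strip_outer_brackets(file2), end="")
```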
perl -0777 -pe 's/^\s*\[\s*//; s/\s*\]\s*$//' file
That's aggressive about removing all whitespace around the outer brackets, which isn't exactly what you show in your desired output.
With GNU sed for -E and -z:
$ sed -Ez 's/\[(.*)]/\1/' file1
Lorem ipsum
[dolor] sit [amet
conse] sadip elitr
$ sed -Ez 's/\[(.*)]/\1/' file2
Lorem ipsum [dolor] sit [amet conse] sadip elitr
The above will read the whole file into memory.
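The greedy whole-file match amounts to "find the first [ and the last ] and drop just those two characters". A minimal Python sketch of that reading, with the hypothetical name remove_enclosing and the same whole-file-in-memory assumption:

```python
def remove_enclosing(text):
    """Remove the first '[' and the last ']' in text, if both exist in order."""
    first = text.find("[")
    last = text.rfind("]")
    if first == -1 or last < first:
        return text  # nothing to remove
    # Keep everything except the two bracket characters themselves.
    return text[:first] + text[first + 1:last] + text[last + 1:]


print(remove_enclosing("[Lorem ipsum [dolor] sit [amet conse] sadip elitr]"))
```

Whitespace around the removed brackets is left as it was.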
I've got a strange edge-case.
I have a long string which contains \n (newline characters).
So the string looks something like:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
I need to split the string into an array, but keep the newline characters uninterpreted,
so the array/output looks like:
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam"
I couldn't find a way to split the string and preserve the \n characters.
If I use IFS=$"\n" the \n characters are deleted,
but if I use IFS="\n" it gets split and deletes all occurrences of n.
I tried it like:
IFS=$"\n" read -d '' -a arr <<<"$text"
How can I solve this?
Clarification/Update
The text is dynamic and can be very long (3000+ characters),
so creating the array like: declare -a arr=([0]=$'loremipsum\n'... is not an option.
The \n characters (0x5c + 0x6e in ASCII) should all be treated the same;
they should not be replaced with an actual newline.
The \n characters must be preserved,
because the program which gets the output looks for these in plaintext.
The \n characters can be at any position in a sentence,
also inside a word like:
lor\nem or with spaces: Lorem \n ipsum
So the \n characters must be at the end of the elements inside the array, as shown above.
The text must only be split at \n, not at spaces etc.
My understanding from the sample (input/output) data given:
there is one actual newline character in text (between erat, and sed diam); this is to be removed, and assuming there is no (space) after erat, we need to add a (space), i.e., replace the actual newline character with a (space)
there are 4 literal strings of \ + n; we are to break the array after these literals; the literal \ + n are to remain in the text that is stored in the array
the output should have a leading space removed from array values
I'm assuming the final results should not include the double quotes (ie, OP included the double quotes in the desired output as a means of delimiting the array values for display purposes)
One idea:
text="loremipsum\nDollor sit atmet \n aliquyam erat,
sed diam\naliquyam erat \n sed diam"
# convert actual newline character to a (space)
text=${text//$'\n'/ }
# add an actual newline character after the literal `\` + `n`
text=${text//\\n/\\n$'\n'}
# print our value, remove leading (space), and load into array
IFS=$'\n' arr=( $(printf "%s\n" "${text}." | sed 's/^ //g') )
# display array
typeset -p arr
declare -a arr=([0]="loremipsum\\n" [1]="Dollor sit atmet \\n" [2]="aliquyam erat, sed diam\\n" [3]="aliquyam erat \\n" [4]="sed diam.")
# loop through array and display individual strings; add double quotes as delimiters for display purposes
for i in "${!arr[@]}"
do
echo "\"${arr[${i}]}\""
done
"loremipsum\n"
"Dollor sit atmet \n"
"aliquyam erat, sed diam\n"
"aliquyam erat \n"
"sed diam."
You can use process substitution and echo, e.g.
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
You can also use printf in the process substitution as well, e.g.
< <(printf "$text")
Since the -t option is not given to readarray, the trailing newline is kept as part of each array element.
Example Use/Output
Adding a declare -p arr to output the array, you would have:
text="loremipsum\nDollor sit atmet \n aliquyam erat, sed diam\naliquyam erat \n sed diam"
readarray arr < <(echo -e "$text")
declare -p arr
declare -a arr=([0]=$'loremipsum\n' [1]=$'Dollor sit atmet \n' [2]=$' aliquyam erat, sed diam\n' [3]=$'aliquyam erat \n' [4]=$' sed diam\n')
If you want to trim leading whitespace, you can use the parameter expansion ${element#"${element%%[![:space:]]*}"}. Up to you.
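If the two-character sequence \n has to stay literal, as the question asks, the split can be sketched in Python as follows. This is an illustration only, not taken from either answer; the function name split_after_literal_newline is made up, and it follows the sample data in turning real line breaks into spaces and dropping leading spaces from each element.

```python
import re


def split_after_literal_newline(text):
    """Split after each literal backslash + n, keeping it on the element."""
    # Replace real line breaks with a space, as in the sample data.
    text = text.replace("\n", " ")
    # Each piece is either "up to and including a literal \n" or the remainder.
    pieces = re.findall(r".*?\\n|.+", text)
    # Drop leading spaces to match the desired output.
    return [piece.lstrip(" ") for piece in pieces]


text = (
    r"loremipsum\nDollor sit atmet \n aliquyam erat,"
    + "\n"
    + r"sed diam\naliquyam erat \n sed diam"
)
for element in split_after_literal_newline(text):
    print(element)
```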
I have an XML file of the format:
<classes>
<subject lb="Fall Sem 2020">
<name>Operating System</name>
<credit>3</credit>
<type>Theory</type>
<faculty>Prof. XYZ</faculty>
</subject>
<subject lb="Spring Sem 2020">
<name>Web Development</name>
<credit>3</credit>
<type>Lab</type>
</subject>
<subject lb="Fall Sem 2021">
<name>Computer Network</name>
<credit>3</credit>
<type>Theory</type>
<faculty>Prof. ABC</faculty>
</subject>
<subject lb="Spring Sem 2021">
<name>Software Engineering</name>
<credit>3</credit>
<type>Lab</type>
</subject>
</classes>
Expected Output:
Fall Sem 2020
Spring Sem 2020
Fall Sem 2021
Spring Sem 2021
I want to extract the values of lb in an array.
My try: I tried using sed -n "/lb="/,\/"/p" file.xml but this command is not giving me the values present for the particular label.
What could be the correct way to deal with this problem?
Getting an attribute value in xml element.
If no XML parser is available. With GNU sed:
sed -En 's/.* lb="([^"]+)".*/\1/p' file
Output:
Fall Sem 2020
Spring Sem 2020
Fall Sem 2021
Spring Sem 2021
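Where an XML parser is available, it is usually the safer route than a regular expression. A minimal sketch using Python's standard xml.etree.ElementTree, with a shortened copy of the sample document inlined as an assumption (for a real file, ET.parse("file.xml").getroot() would replace ET.fromstring):

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
xml_text = """<classes>
  <subject lb="Fall Sem 2020"><name>Operating System</name></subject>
  <subject lb="Spring Sem 2020"><name>Web Development</name></subject>
</classes>"""

root = ET.fromstring(xml_text)
# Collect the lb attribute of every <subject> element, in document order.
labels = [subject.get("lb") for subject in root.iter("subject")]
print(labels)
```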
Could you please try the following awk, assuming you don't have any XML tools available.
awk '
BEGIN{
OFS=","
}
/<subject lb="/{
match($0,/".*"/)
print substr($0,RSTART+1,RLENGTH-2)
}
' Input_file
How can we achieve deleting lines starting with specific characters in a huge text file using a batch file?
e.g:
oldfile.txt is:
line 1 Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
line 2 euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.
line 3 aliquip ex ea commodo consequat. Duis autem vel eum iriure dolor
line 4 vel illum dolore eu feugiat nulla.
deleting lines starting with "eui" and "ali";
newfile.txt becomes:
line 1 Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam
line 4 vel illum dolore eu feugiat nulla.
Try:
@ECHO ON
findstr /v "Lorem euismod" "%USERPROFILE%\Desktop\test.txt" > "%USERPROFILE%\Desktop\outfile.txt"
PAUSE
It deletes line 1 and line 2 in test.txt. I used Lorem euismod as reference.
Check out this link.
To avoid deleting lines with one or more of the search terms in the middle of the string somewhere, instead of just at the beginning as requested, add the /B option to fix that:
findstr /B /V "eui ali" "input.txt" > "output.txt"
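The effect of /B /V with two search strings is "keep every line that does not begin with any of the prefixes". A small Python sketch of that filter, for illustration only; the function name drop_lines_starting_with is hypothetical. Because it is a generator, it can be fed an open file object and will handle a huge file one line at a time.

```python
def drop_lines_starting_with(lines, prefixes):
    """Yield only the lines that do not begin with any of the prefixes."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of options
    for line in lines:
        if not line.startswith(prefixes):
            yield line


lines = [
    "Lorem ipsum dolor sit amet",
    "euismod tincidunt ut laoreet",
    "aliquip ex ea commodo",
    "vel illum dolore eu feugiat nulla.",
]
kept = list(drop_lines_starting_with(lines, ["eui", "ali"]))
print(kept)
```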
I have a problem I am hoping someone can help with (greatly simplified for the purposes of explaining what I am trying to do)...
I have three different arrays:
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
I also have a variable that contains the content from a web page (using WWW::Mechanize):
my $variable = $r->content;
I now want to see if any of the elements in each of the arrays are found in the variable, and if so which array it comes from:
e.g
if ($variable =~ (any of the elements in @array1)) {
    print "FOUND IN ARRAY1";
} elsif ($variable =~ (any of the elements in @array2)) {
    print "FOUND IN ARRAY2";
} elsif ($variable =~ (any of the elements in @array3)) {
    print "FOUND IN ARRAY3";
}
What is the best way to go about doing this using the arrays and iterating through each element in the arrays? Is there a better way this can be done?
Your help is much appreciated, thanks.
You can make a regex out of the array elements, but you'll most likely want to disable meta characters and make sure you do not get partial matches:
my $rx = join('\b|\b', map quotemeta, @array1);
if ($variable =~ /\b$rx\b/) {
    print "matched array 1\n";
}
If you do want to get partial matches, such as FOXY below, simply remove all the \b sequences.
Demonstration:
use strict;
use warnings;
my @array1 = ("DOG","CAT","HAMSTER");
my @array2 = ("DONKEY","FOX","PIG", "HORSE");
my @array3 = ("RHINO","LION","ELEPHANT");
my %checks = (
    array1 => join('\b|\b', map quotemeta, @array1),
    array2 => join('\b|\b', map quotemeta, @array2),
    array3 => join('\b|\b', map quotemeta, @array3),
);
while (<DATA>) {
    chomp;
    print "The string: '$_'\n";
    for my $key (sort keys %checks) {
        print "\t";
        if (/\b$checks{$key}\b/) {
            print "does";
        } else {
            print "does not";
        }
        print " match $key\n";
    }
}
__DATA__
A DOG ATE MY RHINO
A FOXY HORSEY
Output:
The string: 'A DOG ATE MY RHINO'
	does match array1
	does not match array2
	does match array3
The string: 'A FOXY HORSEY'
	does not match array1
	does not match array2
	does not match array3
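The same technique (escape each word, join with alternation, anchor on word boundaries) can be sketched in Python for readers who want to compare. This is an illustration, not part of the answer; the names word_pattern and matching_sets are made up.

```python
import re


def word_pattern(words):
    """Build one regex that matches any of the words as a whole word."""
    alternation = "|".join(re.escape(word) for word in words)
    return re.compile(r"\b(?:" + alternation + r")\b")


checks = {
    "array1": word_pattern(["DOG", "CAT", "HAMSTER"]),
    "array2": word_pattern(["DONKEY", "FOX", "PIG", "HORSE"]),
    "array3": word_pattern(["RHINO", "LION", "ELEPHANT"]),
}


def matching_sets(text):
    """Return the names of all word lists that have a whole-word match in text."""
    return [name for name, pattern in checks.items() if pattern.search(text)]


print(matching_sets("A DOG ATE MY RHINO"))
print(matching_sets("A FOXY HORSEY"))
```

As in the Perl version, dropping the \b anchors would allow partial matches such as FOXY.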
use feature 'say';  # needed for say
my $re1 = join '|', @array1;
say "found in array 1" if $variable =~ /$re1/;
Repeat for each additional array (or use an array of regexes and an array of arrays of terms).
First of all, when you find yourself adding an integer suffix to variable names, think "I should have used an array."
Therefore, first I am going to put the wordsets in an array of arrayrefs. That will help identify where the matched word came from.
Second, I am going to use Regex::PreSuf to make a pattern out of each word list because I always forget the right way to do that.
Third, note that using \b in regex patterns can lead to surprising results. So, instead, I am going to split up the content into individual sequences of \w characters.
Fourth, you say "I also have a variable that contains the content from a web page (using WWW::Mechanize)". Do you want to match words in the comments? In title attributes? If you don't, you should parse the HTML document either to extract full plain text or to restrict the match to within a certain element or set of elements.
Then, grep from the list of words in the text those that are in a wordset and map them to the wordset they matched.
#!/usr/bin/env perl
use strict; use warnings;
use Regex::PreSuf qw( presuf );
my @wordsets = (
    [ qw( DOG CAT HAMSTER ) ],
    [ qw( DONKEY FOX PIG HORSE ) ],
    [ qw( RHINO LION ELEPHANT ) ],
);
my @patterns = map {
    my $pat = presuf(@$_);
    qr/\A($pat)\z/;
} @wordsets;
my $content = q{Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim
ad minim veniam, quis ELEPHANT exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure dolor in reprehenderit in HAMSTER
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat
cupidatat non proident, sunt in DONKEY qui officia deserunt mollit anim id
est laborum.};
my @contents = split /\W+/, $content;
use YAML;
print Dump [
    map {
        my $i = $_;
        map +{$_ => $i },
            grep { $_ =~ $patterns[$i] } @contents
    } 0 .. $#patterns
];
Here, grep { $_ =~ $patterns[$i] } @contents extracts the words from @contents which are in the given wordset. Then, map +{$_ => $i } maps those words to the wordset from which they came. The outer map just loops over each wordset pattern.
Output:
---
- HAMSTER: 0
- DONKEY: 1
- ELEPHANT: 2
That is, you get a list of hashrefs where the key in each hashref is the word that was found and the value is the wordset that matched.
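The tokenize-then-look-up approach can also be sketched in Python, swapping the anchored regex patterns for plain set membership. This is an illustration under that substitution, not a port of Regex::PreSuf; the name find_wordset_hits and the shortened sample text are assumptions.

```python
import re

wordsets = [
    {"DOG", "CAT", "HAMSTER"},
    {"DONKEY", "FOX", "PIG", "HORSE"},
    {"RHINO", "LION", "ELEPHANT"},
]


def find_wordset_hits(content):
    """Return (word, wordset index) pairs for every word found in a wordset."""
    words = re.findall(r"\w+", content)  # runs of word characters only
    return [
        (word, index)
        for index, wordset in enumerate(wordsets)
        for word in words
        if word in wordset
    ]


content = "quis ELEPHANT exercitation in HAMSTER velit, sunt in DONKEY qui"
print(find_wordset_hits(content))
```

The pairs come out grouped by wordset index, matching the order of the YAML dump above.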
I assume $variable is not an array, in which case use a foreach statement.
foreach my $item (@array1) {
    if ($item eq $variable) {
        print "FOUND IN ARRAY1";
    }
}
and repeat the above for each array, i.e. array2, array3...
EDIT: I think you could use perl's map function, something like this:
my @a1matches = map { $variable =~ /$_/ ? $_ : (); } @array1;
print "FOUND IN ARRAY1\n" if $#a1matches >= 0;
my @a2matches = map { $variable =~ /$_/ ? $_ : (); } @array2;
print "FOUND IN ARRAY2\n" if $#a2matches >= 0;
my @a3matches = map { $variable =~ /$_/ ? $_ : (); } @array3;
print "FOUND IN ARRAY3\n" if $#a3matches >= 0;
A fun side effect is that @a1matches contains the elements of @array1 that were found in $variable.
Regexp::Assemble may be helpful if you would like to use a module. It lets you assemble several regular expressions into one regular expression that matches everything the individual expressions match.