How do I efficiently filter lines of input against an array of data?

I am trying to read a file into a temporary variable, filtering it based on items in an array. I do this by opening the file and, inside the while loop that reads it, running another loop (a very bad idea IMO) to check whether the line matches any item in the array; if it does, the line is discarded and we proceed to the next line.
It works, but it's slow when there are 20,000 lines of input. Filtering against an array of 10 items essentially turns it into 200,000 comparisons.
Is there a way to process this more quickly?

Assuming you want to discard a line if any item in your array is found, the any function from List::MoreUtils will stop searching through an array as soon as it has found a match.
use List::MoreUtils qw(any);
while (<>) {
    my $line = $_;
    next if any { $line =~ /$_/ } @list;
    # do your processing
}
If you happen to know which items in your array are more likely to occur in your lines, you could sort your array accordingly (most likely first), so the search short-circuits sooner.
You should also benchmark your approaches (the core Benchmark module makes this easy) to make sure your optimization efforts are worth it.
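For instance, here is a minimal sketch with the core Benchmark module; the pattern list and sample input are made up for illustration:
use Benchmark qw(cmpthese);
use List::MoreUtils qw(any);

my @patterns = qw(foo bar baz);                          # hypothetical filter items
my $alt      = join '|', map quotemeta, @patterns;
my $big_re   = qr/$alt/;
my @lines    = ('a foo line', 'a clean line') x 10_000;  # made-up input

cmpthese(-2, {
    any_loop  => sub { my @kept = grep { my $l = $_; !any { $l =~ /$_/ } @patterns } @lines },
    big_regex => sub { my @kept = grep { $_ !~ $big_re } @lines },
});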

Mash the array items together into a big regex: e.g., if your array is qw{red white green}, use /(red|white|green)/. The $1 variable will tell you which one matched. If you need exact matching, anchor the end-points: /^(red|white|green)$/.
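A sketch of that approach; the word list is illustrative, and quotemeta guards against items that contain regex metacharacters:
my @words = qw(red white green);
my $alt   = join '|', map quotemeta, @words;
my $re    = qr/($alt)/;    # capturing parens, so $1 reports which item matched

while (<>) {
    next if /$re/;    # line contains an item ($1 says which); discard it
    print;            # otherwise keep the line
}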

Related

I can't seem to splice an array from a reference in Perl

So I am passing an array reference into a function to clear out certain array elements:
The code is as follows:
if ($notes->[$x] !~ /[^CF]/)
{
    print "$notes->[$x]\n";
    splice (@{$notes}), $x, 1;
}
If I comment out the splice line, the loop works fine, showing me each element $x of the array. But if I leave the splice line in, it all fails: it won't print the element, nor will the splice work. I get:
Use of uninitialized value in pattern match (m//) at /var/www/cgi-bin/Funx.pm line 130.
Use of uninitialized value in concatenation (.) or string at /var/www/cgi-bin/Funx.pm line 132.
I'm totally unsure what's going on here. I can understand my splice line not being correct syntax, but I don't understand why it affects the line above it.
Any insight would be appreciated.
First of all
splice(@{$notes}), $x, 1;
should be
splice(@{$notes}, $x, 1);
That's not the error you asked about, but it's the only one you showed.
The bug that leads to the error messages you did get is likely an incorrect loop. I believe you are using something along the lines of
for (@$notes)
or
for (0..$#$notes)
The first is buggy because you are not allowed to add or remove elements from an array over which you are iterating.
The second is buggy because it will execute the loop body as many times as the array had elements originally, so you'll end up looping too many times.
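A sketch of a safe version, assuming a loop shaped like the question's: iterating the indices from the end means a splice never shifts elements you have yet to visit.
for my $x (reverse 0 .. $#$notes) {
    if ($notes->[$x] !~ /[^CF]/) {
        print "$notes->[$x]\n";
        splice @$notes, $x, 1;
    }
}
Alternatively, skip splice entirely and rebuild the array in one pass: @$notes = grep { /[^CF]/ } @$notes; keeps exactly the elements the loop above would not delete.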

How to write a Tcl script to delete the last line of a file

I have to delete the last line of a file using a Tcl script. I know the line's content, so replacing the content is also OK: it could be replaced by a space or a newline character, or simply deleted. And this job runs in a loop.
Please let me know which way is more efficient: capturing the entire file content on each loop iteration and replacing that string, or simply deleting the last line.
Please give some script code, because I am very new to Tcl.
Are we talking about removing the last line from the data on disk or the data in memory? It matters because the approach you use to do those two cases is entirely different.
In memory
Exactly how you manipulate things in memory depends on whether you're representing the data as list of lines or a big string. Both approaches work. (You could be doing something else too, I suppose, but these two are the common obvious ways.)
If you've got your data as a list of lines in memory, you can simply do (assuming you're holding the lines in a variable called theLines):
set theLines [lreplace $theLines end end]
For a particularly large list, there are a few tricks to make it more efficient, but they come down to careful management of references:
# Needs a new enough Tcl (8.5 or 8.6 IIRC)
set theLines [lreplace $theLines[set theLines ""] end end]
Try the first version instead of this one unless you know you need it. Also be aware that if you want to keep the original list of lines around, you should definitely use the first approach.
You might instead have the data in memory as a single big string. In that case, we can use some of Tcl's string searching capabilities to do the job.
set index [string last "\n" $theString end-1]
set theString [string range $theString 0 $index]
The optimisation mentioned above in relation to lreplace is also applicable here (with all the same caveats):
set index [string last "\n" $theString end-1]
set theString [string range $theString[set theString ""] 0 $index]
On disk
When working on disk, things are different. There you need to be much more careful since you can't undo changes easily. There are two general approaches:
Read the file into memory, do the change there (using the techniques above), and do a (destructive) ordinary write out. This is the approach you need when you are doing many other changes anyway (e.g., removing a line from the middle, adding a line to the middle, adding or removing characters from a line in the middle).
set filename "..."
# Open a file and read its lines into a list
set f [open $filename]
set theLines [split [read $f] "\n"]
close $f
# Transform (you should recognise this from above)
set theLines [lreplace $theLines end end]
# Write the file back out
set f [open $filename "w"]
puts -nonewline $f [join $theLines "\n"]
close $f
Find where the data you don't want starts as an offset in the file and truncate the file at that point. This is the right approach with a very large file, but it is rather more sophisticated.
set f [open $filename "r+"]; # NEED the read-write mode!
seek $f -1000 end; # Move to a little bit before the end of the file.
# Unnecessary, and guesswork, but can work and will
# speed things up for a big file very much
# Find the length that we want the file to become. We do this by building a list of
# offsets into the file.
set ptrList {}
while {![eof $f]} {
    lappend ptrList [tell $f]
    gets $f
}
# The length we want is one step back from the end of the list
set wantedLength [lindex $ptrList end-1]
# Do the truncation!
chan truncate $f $wantedLength
close $f
However you do the disk transformations, make sure you test on a trash file before applying it to anything real! In particular, I've not checked what the truncation method does on a file without a newline at the end. It probably works, but you should test.

Using Tcl, is it possible to read a file "backwards"?

I have a very large text file from which I have to extract some data. I read the file line by line and look for keywords. Since I know that the keywords I am looking for are much closer to the end of the file than to the beginning, I wonder if it is possible to read the file starting at the last row instead of the first. I would then use an additional keyword which indicates "everything beyond this word is not of interest" and stop reading.
Is that possible?
I don't know how performant this would be, but run the file through tac and read from that:
set fh [open "|tac filename"]
# read from last line to first
while {[gets $fh line] != -1} {
    ...
}
Another tactic would be to read the last, say, 5000 bytes of the file (using seek), split on newlines and examine those lines, then seek to position 10000 from the end and read the "next" 5000 bytes, etc.
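A rough sketch of that seek-backwards tactic; the chunk size, variable names, and binary-mode choice are assumptions rather than a tested recipe:
set f [open $filename]
fconfigure $f -translation binary  ;# byte-accurate seek/tell offsets
seek $f 0 end
set prev [tell $f]                 ;# file size; the scan starts here
set chunk 5000
set leftover ""
while {$prev > 0} {
    set offset [expr {max(0, $prev - $chunk)}]
    seek $f $offset start
    set data [read $f [expr {$prev - $offset}]]
    set lines [split $data$leftover "\n"]
    # The first element may be a partial line; carry it into the next round.
    set leftover [lindex $lines 0]
    foreach line [lreverse [lrange $lines 1 end]] {
        # examine $line; stop once the "nothing interesting past here" keyword is seen
    }
    set prev $offset
}
# $leftover now holds the file's very first line, if the scan got that far.
close $f
Note that max() in expr and lreverse both need Tcl 8.5 or later.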
No, it is not possible (in any runtime/language I'm aware of, Tcl included).
So decide on a buffer size and read your file by seeking backwards, trying to read a full buffer each time.
Note that you have to observe certain possibilities:
The file might be smaller than the size of your buffer.
It seems you're dealing with a text file, and you want to process it line-wise. If so, observe that if the code is cross-platform or has to work on Windows, you have to deal with the case where the data placed in the buffer by the last read operation starts with LF while the next read operation (of the preceding chunk) ends with CR; that is, your EOL marker can be split across two buffers.
You might want to take a look at the implementation of Tcl_GetsObj() in the generic/tclIO.c file in the Tcl source code—it deals with split CRLFs on normal ("forward") reading of a textual string from a file.
The simplest way to grab the end of a file for searching, assuming you don't know the size of the records (i.e., the line lengths), is to grab too much and work with that.
set f [open $filename]
# Pick some large value; the more you read, the slower
seek $f -100000 end
# Read to the end, split into lines and *DISCARD FIRST*
set lines [lrange [split [read $f] "\n"] 1 end]
Now you can search with lsearch. (Note that you won't know exactly where in the file your matched line is; if you need that, you have to do quite a lot more work.)
if {[lsearch -glob $lines "*FooBar*"] >= 0} {
    ...
}
The first line of the read section is discarded because you're probably starting to read halfway through a line; dropping that first “line” means you've only got genuine lines to deal with. (100kB isn't very much for any modern computer system to search through, but you may be able to constrain it further. It depends on the details of the data.)
package require struct::list
set fp [open "filename.txt"]
set lines [split [read -nonewline $fp] "\n"]
close $fp
foreach line [struct::list reverse $lines] {
    # do something with $line
}
To reverse the file, I read it into a variable "list" line by line, prepending the current line to $list. That way $list holds the lines in the reverse order of the file:
set list {}
while {[gets $in line] > -1} {
    if {[regexp "#" $line]} {
        continue
    }
    # prepend, building "list" in reverse order
    set list [linsert $list 0 $line]
}
foreach line $list {
    puts "line= $line"
    # *** process each line as you need ***
}

Getting array elements to end at a particular keyword and shifting the rest to the next line in Perl

I am reading a text file into an array in Perl and looping through the array to do stuff on it. Whenever there is a "begin", "end" or a ";" anywhere in the text, I want my array element to end there, and whatever comes after any of those keywords to go into the next element, to make life easier for me when I try to make sense of the elements later.
To achieve this I thought of reading the entire file into an array, replacing every "begin" with "begin\n", every "end" with "end\n" and every ";" with ";\n", writing this array back to a file, and then reading that file back into an array. Will this work?
Is there a more elegant way to do this, rather than messy extra writes to and reads from a file?
Is there a way to short-circuit (in the electrical sense, if you know what I mean!) a read filehandle and a write filehandle, so that I can skip writing to the text file entirely but still get the job done?
Gururaj
You can use split with parentheses to keep the separator in the result:
open my $FH, '<', 'file.txt' or die $!;
my @array = map { split /(begin|end|;)/ } <$FH>;
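To see what the capturing split produces, a small illustrative case (the input string is made up):
my @parts = split /(begin|end|;)/, 'begin a; end';
# @parts is ('', 'begin', ' a', ';', ' ', 'end'):
# the keywords survive as their own elements instead of being discarded.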
I would prefer to use a Perl one-liner and avoid manipulating arrays altogether:
$ perl -pi -e 's#(?<=begin)#\n#g; s#(?<=end)#\n#g; s#(?<=;)#\n#g;' file.txt

How do I get the output of an object which has more than one line in Perl?

@ver = $session->cmd("sh conf");
The variable here is @ver, which holds the configuration file, that is, it has more than one line. So how do I get the output of each line of @ver without putting it in a loop?
Your @ver variable is an array: each element will contain one line.
You cannot get all lines without (implicitly or explicitly) looping over the entire array.
You can have Perl do all the work for you, though, for example using join, grep or map, depending on what you want.
Examples:
# print all lines to a webpage
print join('<br />', @ver);
# print all lines with the word 'error' in it
print grep(/error/, @ver);
How about:
print join("\n", @ver);
