I have a very large text file from which I have to extract some data. I read the file line by line and look for keywords. Since I know that the keywords I am looking for are much closer to the end of the file than to the beginning, I wonder if it is possible to read the file starting at the last line instead of the first. I would then use an additional keyword which indicates "everything beyond this word is not of interest" and stop reading.
Is that possible?
I don't know how performant this would be, but run the file through tac and read from that:
set fh [open "|tac filename"]
# read from last line to first
while {[gets $fh line] >= 0} {
    # process $line here; lines arrive last-first
}
close $fh
Another tactic would be to read the last, say, 5000 bytes of the file (using seek), split on newlines and examine those lines, then seek to position 10000 from the end and read the "next" 5000 bytes, etc.
No, it is not possible (in any runtime/language I'm aware of, Tcl included).
So decide on a buffer size and read your file by seeking backwards and trying to read a full buffer each time.
Note that you have to observe certain possibilities:
The file might be smaller than the size of your buffer.
It seems you're dealing with a text file and want to process it line-wise. If so, note that if the code is cross-platform or has to work on Windows, you have to deal with the case where the data placed in the buffer by the last read operation starts with LF while the next read operation (of the preceding chunk) ends with CR; that is, your EOL marker may be split across buffers.
You might want to take a look at the implementation of Tcl_GetsObj() in the generic/tclIO.c file in the Tcl source code—it deals with split CRLFs on normal ("forward") reading of a textual string from a file.
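A minimal sketch of that backwards buffered read, assuming Tcl 8.5 or later (the proc name linesFromEnd and the 4096-byte chunk size are my own choices, not from the Tcl sources). It reads in binary so CR/LF bytes arrive untouched, copes with a file smaller than the buffer, and stitches together lines split across chunk boundaries:
proc linesFromEnd {filename {chunk 4096}} {
    set f [open $filename rb]   ;# binary mode: CR/LF bytes arrive untouched
    seek $f 0 end
    set pos [tell $f]
    set leftover ""
    set result {}
    while {$pos > 0} {
        # Never seek before the start: the file may be smaller than the buffer
        set n [expr {$pos < $chunk ? $pos : $chunk}]
        incr pos -$n
        seek $f $pos start
        set data "[read $f $n]$leftover"
        set parts [split $data "\n"]
        # The first element may be a line cut at the chunk boundary, possibly
        # even the CR half of a split CRLF; keep it and complete it with the
        # next (earlier) chunk instead of emitting it now.
        set leftover [lindex $parts 0]
        foreach line [lreverse [lrange $parts 1 end]] {
            lappend result [string trimright $line "\r"]
        }
    }
    close $f
    lappend result [string trimright $leftover "\r"]
    return $result
}
Lines come back last-first, so the caller can stop scanning at the first "everything beyond this is not of interest" keyword. Note that this sketch still accumulates every line in memory; a production version would invoke a callback per line instead.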
The simplest way to grab the end of a file for searching, assuming you don't know the size of the records (i.e., the line lengths), is to grab too much and work with that.
set f [open $filename]
# Pick some large value; the more you read, the slower
seek $f -100000 end
# Read to the end, split into lines and *DISCARD FIRST*
set lines [lrange [split [read $f] "\n"] 1 end]
Now you can search with lsearch. (Note that you won't know exactly where in the file your matched line is; if you need that, you have to do quite a lot more work.)
if {[lsearch -glob $lines "*FooBar*"] >= 0} {
...
}
The discarding of the first line from the read section is because you're probably starting to read halfway through a line; dropping the first “line” means that you've only got genuine lines to deal with. (100kB isn't very much for any modern computer system to search through, but you may be able to constrain it further. It depends on the details of the data.)
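That extra work might look something like this (my own sketch following on from the code above, not part of the original answer; like that code it assumes the file is at least 100kB, and it additionally assumes a single-byte encoding so that string lengths equal byte counts):
set f [open $filename rb]
seek $f -100000 end
set start [tell $f]   ;# where the oversized read began
set data [read $f]
close $f
set lines [lrange [split $data "\n"] 1 end]
set i [lsearch -glob $lines "*FooBar*"]
if {$i >= 0} {
    # Skip past the discarded partial first line (+1 for its newline)...
    set offset [expr {$start + [string first "\n" $data] + 1}]
    # ...then step over every line before the match (+1 each for the
    # newline that split removed).
    foreach l [lrange $lines 0 [expr {$i - 1}]] {
        incr offset [expr {[string length $l] + 1}]
    }
    # $offset is now the byte position where the matched line starts
}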
package require struct::list
set fp [open "filename.txt"]
set lines [split [read -nonewline $fp] "\n"]
foreach line [struct::list reverse $lines] {
    # do something with $line
}
close $fp
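As an aside, if your Tcl is 8.5 or newer, the built-in lreverse command does the same job without the extra package (my suggestion, not part of the answer above):
foreach line [lreverse $lines] {
    # do something with $line
}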
To reverse the file, I read it into a variable "list" line by line, prepending each current line to $list. That way $list holds the lines in reverse order of the file.
set list {}
while {[gets $in line] > -1} {
    if {[regexp "#" $line]} {
        continue
    }
    # prepend, so "list" ends up in reverse order of the file
    set list [linsert $list 0 $line]
}
set ln 0
foreach line $list {
    puts "line:[incr ln] line= $line"
    # *** process each line as you need ***
}
I am wondering where the newline (the 4th line in the example output) comes from in the following very simple Tcl code. Handling it with puts -nonewline is cumbersome. Is there any other Tcl command that influences this behavior?
set fid [open testout.txt w]
puts $fid 1
puts $fid 2
puts $fid 3
close $fid
Output:
#1:1
#2:2
#3:3
#4:
The puts command always appends a newline to the end of what you ask it to write, unless you pass the -nonewline option. It is a feature of that command, and is what you want most of the time. (The puts command is the only standard Tcl command that writes data to a channel; chan puts is just a different name for the same thing.)
In your case, maybe you don't want the newline at the end of the final line (and should use the option). Or maybe you want to trim the newline from the end before splitting the text into lines when reading it back in. Whether you can tolerate that newline character at the end of the text data in the file depends on what you're doing with it.
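For instance, to write the same three values from the question without the trailing newline, you could join them yourself and pass the option (a small sketch):
set fid [open testout.txt w]
# join puts "\n" between the items only; -nonewline stops puts from
# appending one more newline at the very end
puts -nonewline $fid [join {1 2 3} "\n"]
close $fid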
I have to delete the last line of a file using a Tcl script. I know the content, so content replacement is also OK; the content has to be replaced by a space or a newline character, or has to be deleted. And my job runs in a loop.
Please let me know which is the more efficient way: capturing the entire file content each time in the loop and replacing that string, or simply deleting the last line.
Please give some script code because I am very new to Tcl.
Are we talking about removing the last line from the data on disk or the data in memory? It matters because the approach you use to do those two cases is entirely different.
In memory
Exactly how you manipulate things in memory depends on whether you're representing the data as list of lines or a big string. Both approaches work. (You could be doing something else too, I suppose, but these two are the common obvious ways.)
If you've got your data as a list of lines in memory, you can simply do (assuming you're holding the lines in a variable called theLines):
set theLines [lreplace $theLines end end]
For a particularly large list, there are a few tricks to make it more efficient, but they come down to careful management of references:
# Needs a new enough Tcl (8.5 or 8.6 IIRC). Clearing the variable inside
# the substitution drops its reference to the list, so lreplace can
# modify it in place instead of copying it.
set theLines [lreplace $theLines[set theLines ""] end end]
Try the first version instead of this if you don't know you need it. Also be aware that if you're wanting to keep the original list of lines around, you should definitely use the first approach.
You might instead have the data in memory as a single big string. In that case, we can use some of Tcl's string searching capabilities to do the job.
set index [string last "\n" $theString end-1]
set theString [string range $theString 0 $index]
The optimisation mentioned above in relation to lreplace is also applicable here (with all the same caveats):
set index [string last "\n" $theString end-1]
set theString [string range $theString[set theString ""] 0 $index]
On disk
When working on disk, things are different. There you need to be much more careful since you can't undo changes easily. There are two general approaches:
Read the file into memory, do the change there (using the techniques above), and do a (destructive) ordinary write out. This is the approach you need when you are doing many other changes anyway (e.g., removing a line from the middle, adding a line to the middle, adding or removing characters from a line in the middle).
set filename "..."
# Open a file and read its lines into a list
set f [open $filename]
set theLines [split [read $f] "\n"]
close $f
# Transform (you should recognise this from above)
set theLines [lreplace $theLines end end]
# Write the file back out
set f [open $filename "w"]
puts -nonewline $f [join $theLines "\n"]
close $f
Find where the data you don't want starts as an offset in the file and truncate the file at that point. This is the right approach with a very large file, but it is rather more sophisticated.
set f [open $filename "r+"]; # NEED the read-write mode!
seek $f -1000 end; # Move to a little bit before the end of the file.
# Unnecessary, and guesswork, but it can work and will
# speed things up very much for a big file.
# Find the length that we want the file to become. We do this by building a list of
# offsets into the file.
set ptrList {}
while {![eof $f]} {
    lappend ptrList [tell $f]
    gets $f
}
# The length we want is one step back from the end of the list
set wantedLength [lindex $ptrList end-1]
# Do the truncation!
chan truncate $f $wantedLength
close $f
However you do the disk transformations, make sure you test on a trash file before applying it to anything real! In particular, I've not checked what the truncation method does on a file without a newline at the end. It probably works, but you should test.
I am writing a program in VHDL to read and write data. My program has to read data from a line, process it, and then save the new value in the old position. My code is something like:
WRITE_FILE: process (CLK)
    variable VEC_LINE : line;
    file VEC_FILE : text is out "results";
begin
    if CLK='0' then
        write (VEC_LINE, OUT_DATA);
        writeline (VEC_FILE, VEC_LINE);
    end if;
end process WRITE_FILE;
If I want to read line 15, how can I specify that? Then I want to clear line 15 and write new data there. LINE is an access type; will it accept integer values?
Russell's answer (using two files) is the way to go.
There isn't a good way to find the 15th line (i.e., to seek), but for VHDL's purposes, reading and discarding the first 14 lines is perfectly adequate. Just wrap it in a procedure named "seek" and carry on!
If you're on the 17th line already, you can't seek backwards or rewind to the beginning. What you can do is flush the output file: save the open line, copy the rest of the input file to it, close both files, and reopen them. (Naturally, this requires VHDL-93, not VHDL-87, syntax for file operations.) Just wrap that in a procedure called "rewind", and carry on!
Keep track of the current line number, and now you can seek to line 15, wherever you are.
It's not pretty and it's not fast, but it'll work just fine. And that's good enough for VHDL's purposes.
In other words you can write a text editor in VHDL if you must, (ignoring the problem of interactive input, though reading stdin should work) but there are much better languages for the job. One of them even looks a lot like an object-oriented VHDL...
Use two files: an input file and an output file.
file_open(vectors, "stimulus/input_vectors.txt", read_mode);
file_open(results, "stimulus/output_results.txt", write_mode);
while not endfile(vectors) loop
    readline(vectors, iline);
    read(iline, a_in);
    -- etc. for all your input data...
    write(oline, <output data>);
    writeline(results, oline);
end loop;
file_close(vectors);
file_close(results);
I am reading a text file into an array in Perl and looping through the array to do stuff with it. Whenever there is a "begin", "end", or a ";" anywhere in the text, I want my array element to end there, and whatever comes after any of those keywords to go into the next element, to make life easier when I try to make sense of the elements later.
To achieve this I thought of reading the entire file into an array, replacing every "begin" with "begin\n", every "end" with "end\n", and every ";" with ";\n", writing this array back to a file, and then reading that file back into an array. Will this work?
Is there a more elegant way to do this, rather than using messy extra writes and reads to a file?
Is there a way to short-circuit (in the electrical sense, if you know what I mean!) a read file handle and a write file handle, so that I can skip the whole writing-to-a-text-file step but still get the job done?
Gururaj
You can use split with parentheses to keep the separator in the result:
open my $FH, '<', 'file.txt' or die $!;
my @array = map { split /(begin|end|;)/ } <$FH>;
I would prefer to use a Perl one-liner and avoid manipulating arrays altogether:
$ perl -pi -e 's#(?<=begin)#\n#g; s#(?<=end)#\n#g; s#(?<=;)#\n#g;' file.txt
I am trying to read a file into a temporary variable, filtering it based on items in an array. I do this by opening the file and, inside the while loop that reads it, running another loop (a very bad idea IMO) to check whether the line matches anything in the array; if it does, the line is discarded and we proceed to the next one.
It works, but it is slow when there are 20,000 lines of input. With an array of 10 items, that is essentially 200,000 pattern checks.
Is there a way to process this more quickly?
Assuming you want to discard a line if any item in your array is found, the any function from List::MoreUtils will stop searching through an array as soon as it has found a match.
use List::MoreUtils qw(any);
while (<>) {
    my $line = $_;
    next if any { $line =~ /$_/ } @list;
    # do your processing
}
If you happen to know which items in your array are more likely to occur in your lines, you could sort your array accordingly.
You should also Benchmark your approaches to make sure your optimization efforts are worth it.
Mash the array items together into a big regex: e.g., if your array is qw{red white green}, use /(red|white|green)/. The $1 variable will tell you which one matched. If you need exact matching, anchor the end-points: /^(red|white|green)$/.