How to read a large file line by line using Tcl?

I've written a piece of code using a while loop, but it takes too much time to read the file line by line. Can anyone help me, please?
My code:
set a [open myfile r]
while {[gets $a line] >= 0} {
    # do something using the line variable
}

The code looks fine. It's pretty quick (if you're using a sufficiently new version of Tcl; historically, there were some minor versions of Tcl that had buffer management problems) and is how you read a line at a time.
It's a little faster if you can read in larger amounts at once, but then you need to have enough memory to hold the file. To put that in context, files that are a few million lines are usually no problem; modern computers can handle that sort of thing just fine:
set a [open myfile]
set lines [split [read $a] "\n"]
close $a;    # Saves a few bytes :-)
foreach line $lines {
    # do something with each line...
}

If it truly is a large file, you should do the following to read in only a line at a time; reading the whole file at once (as in the read/split approach above) will pull the entire contents into RAM.
https://www.tcl.tk/man/tcl8.5/tutorial/Tcl24.html
#
# Count the number of lines in a text file
#
set infile [open "myfile.txt" r]
set number 0
#
# gets with two arguments returns the length of the line,
# -1 if the end of the file is found
#
while { [gets $infile line] >= 0 } {
    incr number
}
close $infile
puts "Number of lines: $number"
#
# Also report it in an external file
#
set outfile [open "report.out" w]
puts $outfile "Number of lines: $number"
close $outfile

Related

Tcl strings to middle of binary file without overwriting its content

I have a binary file in which I am trying to add a string in the middle (let's say after 10 bytes).
I manage to overwrite the file with my string, but not to insert it; I'd appreciate it if someone could tell me how I can insert the string without overwriting what is already there.
Here is my code example:
proc write_bit_header {} {
    set bit_hdr "#Here is my new string to be added#"
    set bit_hdr_len [string length ${bit_hdr}]
    set outBinData [binary format a${bit_hdr_len} ${bit_hdr}]
    set fp [open "binfile" "a+b"]
    fconfigure $fp -translation binary
    seek $fp 10
    puts -nonewline $fp $outBinData
    close $fp
}
When you write to the middle of a file (which you'd use the mode r+b for), none of the other bytes in the file move around. They're still at exactly the same offsets within the file that they were beforehand. If you're writing a fixed-size binary record into the file, this is exactly what you want! However, if you're writing a variable sized record, you have to:
1. read all the data that is going to go after the bytes that you want to write,
2. seek to the place where you want to do the insert/replace,
3. write the data that you are inserting,
4. write the data that you read in step 1, and
5. truncate the file (in case what you wrote in step 3 is shorter than what you were replacing).
Yes, this is non-trivial!
proc insertData {filename dataToInsert insertionPoint {firstAfterByte ""}} {
    # If you don't give the end of the range to overwrite, it's zero-length
    if {$firstAfterByte eq ""} {
        set firstAfterByte $insertionPoint
    }
    set f [open $filename "r+b"]
    chan seek $f $firstAfterByte
    set suffixData [chan read $f]
    chan seek $f $insertionPoint
    chan puts -nonewline $f $dataToInsert
    chan puts -nonewline $f $suffixData
    chan truncate $f
    close $f
}
It's much easier when you're appending, as you are not having to move around any existing data and never need to truncate. And you can use the ab mode so that you don't need to seek explicitly.
proc appendData {filename dataToAppend} {
    set f [open $filename "ab"]
    puts -nonewline $f $dataToAppend
    close $f
}
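For the fixed-size record case mentioned at the start, overwriting in place really is simple. Here's a minimal sketch; the proc name and the assumption that the new record exactly fills the bytes it replaces are mine, not from the question:
proc overwriteRecord {filename record offset} {
    # Sketch only: assumes $record is exactly as long as the bytes it
    # replaces, so nothing after it moves and no truncation is needed.
    set f [open $filename "r+b"]
    seek $f $offset
    puts -nonewline $f $record
    close $f
}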
As you can see, the insertion code is quite a lot trickier. It also runs a real risk of going wrong partway through. It's better to use a working copy of the file, and then replace the original at the end:
proc insertDataSafely {filename dataToInsert insertionPoint {firstAfterByte ""}} {
    set f_in [open $filename "rb"]
    set f_out [open ${filename}.tmp "wb"]
    try {
        chan copy $f_in $f_out -size $insertionPoint
        puts -nonewline $f_out $dataToInsert
        if {$firstAfterByte ne ""} {
            chan seek $f_in $firstAfterByte
        }
        chan copy $f_in $f_out
        chan close $f_in
        chan close $f_out
    } on ok {} {
        file rename -force ${filename}.tmp $filename
    } on error {msg opt} {
        file delete ${filename}.tmp
        # Reraise the error
        return -options $opt $msg
    }
}
Of course, not all files take kindly to this sort of thing being done in the first place, but the list of ways in which modifying an arbitrary file can make things go haywire is long and thoroughly out of scope for this question.

How to read a file from end to start (in reverse order) in Tcl?

I have a very large text file from which I have to extract some data. I read the file line by line and look for keywords, and I know that the keywords I am looking for are much closer to the end of the file than to the beginning.
I tried the tac command:
set fh [open "|tac filename"]
but I get the error: couldn't execute "tac": no such file or directory
The file is too big for me to store all the lines in memory and reverse them. Please suggest a solution.
tac is itself a fairly simple program -- you could just implement its algorithm in Tcl, at least if you're determined to literally read each line in reverse order. However, I think that constraint is not really necessary -- you said that the content you're looking for is more likely to be near the end than near the beginning, not that you had to scan the lines in reverse order. That means you can do something a little bit simpler. Roughly speaking:
Seek to an offset near the end of the file.
Read line-by-line as normal, until you hit data you've already processed.
Seek to an offset a bit further back from the end of the file.
Read line-by-line as normal, until you hit data you've already processed.
etc.
This way you don't actually have to keep anything more in memory than the single line you're processing right now, and you'll process the data at the end of the file before data earlier in the file. Maybe you could eke out a tiny bit more performance by strictly processing the lines in reverse order but I doubt it will matter compared to the advantage you gain by not scanning from start to finish.
Here's some sample code that implements this algorithm. Note the bit of care taken to avoid processing a partial line:
set BLOCKSIZE 16384
set offset     [file size $filename]
set lastOffset [file size $filename]
set f [open $filename r]
while { 1 } {
    seek $f $offset
    if { $offset > 0 } {
        # We may have accidentally read a partial line, because we don't
        # know where the line boundaries are. Skip to the end of whatever
        # line we're in, and discard the content. We'll get it instead
        # at the end of the _next_ block.
        gets $f
        set offset [tell $f]
    }
    while { [tell $f] < $lastOffset } {
        set line [gets $f]
        ### Do whatever you're going to do with the line here
        puts $line
    }
    set lastOffset $offset
    if { $lastOffset == 0 } {
        # All done, we just processed the start of the file.
        break
    }
    set offset [expr {$offset - $BLOCKSIZE}]
    if { $offset < 0 } {
        set offset 0
    }
}
close $f
The cost of reversing a file is actually fairly high. The best option I can think of is to construct a list of file offsets of the starts of lines, and then to use a seek;gets pattern to go over that list.
set f [open $filename]
# Construct the list of indices
set indices {}
while {![eof $f]} {
    lappend indices [tell $f]
    gets $f
}
# Iterate backwards
foreach idx [lreverse $indices] {
    seek $f $idx
    set line [gets $f]
    DoStuffWithALine $line
}
close $f
The cost of this approach is non-trivial (even if you happened to have a cache of the indices, you'd still have issues with it) as it doesn't work well with how the OS pre-fetches disk data.
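If you did want to experiment with caching those indices between runs anyway, a trivial sketch follows; the .idx sidecar file name is just an assumption for illustration:
# Save the offsets once...
set c [open "$filename.idx" w]
puts $c $indices
close $c
# ...and on later runs read them back instead of rescanning the file
set c [open "$filename.idx" r]
set indices [read -nonewline $c]
close $c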

Processing large files using Tcl

I have some information in two large files.
One of them (file1.txt, ~4 million lines) contains all object names (which are unique) and their types.
The other (file2.txt, ~2 million lines) contains some object names (which can be duplicated) and some values assigned to them.
So, I have something like below in file1.txt:
objName1 objType1
objName2 objType2
objName3 objType3
...
And in file2.txt I have:
objName3 val3_1
objName3 val3_2
objName4 val4
...
For all the objects in file2.txt I need to output the object names, their types and the values assigned to them in a single file, like below:
objType3 val3_1 "objName3"
objType3 val3_2 "objName3"
objType4 val4 "objName4"
...
Previously the object names in file2.txt were supposed to be unique, so I implemented a solution where I read all the data from both files, save it to Tcl arrays, and then iterate over the larger array, checking whether an object with the same name exists in the smaller array; if so, I write the needed information to a separate file. But this runs too long (> 10 hours and it hasn't completed yet).
How can I improve my solution, or is there another way to do this?
EDIT:
Actually I don't have file1.txt; I find that data with a procedure and write it into a Tcl array. I run a procedure to get the object types and save them to a Tcl array, then I read file2.txt and save its data to another Tcl array. Then I iterate over the items in the first (object types) array, and if an object name matches some object in the second (object values) array, I write the info to the output file and erase that element from the second array. Here is a piece of the code that I'm running:
set outFileName "output.txt"
if {[catch {open $outFileName "w"} fid]} {
    puts "ERROR: Failed to open file '$outFileName', no write permission"
    exit 1
}
# get object types
set TIME_start [clock clicks -milliseconds]
array set objTypeMap [list]
# here is some proc that fills up objTypeMap
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Object types are found. Elapsed time $TIME_taken"
# read file2.txt
set TIME_start [clock clicks -milliseconds]
set file2 [lindex $argv 5]
if {[catch { set fp [open $file2 r] } errMsg]} {
    puts "ERROR: Failed to open file '$file2' for reading"
    exit 1
}
set objValData [read $fp]
close $fp
# tcl list containing lines of file2.txt
set objValData [split $objValData "\n"]
# remove last empty line
set objValData [lreplace $objValData end end]
array set objValMap [list]
foreach item $objValData {
    set objName  [string range $item 0 [expr {[string first " " $item] - 1}]]
    set objValue [string range $item [expr {[string first " " $item] + 1}] end]
    set objValMap($objName) $objValue
}
# clear objValData
unset objValData
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Object value data is read and processed. Elapsed time $TIME_taken"
# write to file
set TIME_start [clock clicks -milliseconds]
foreach { objName objType } [array get objTypeMap] {
    if { [array size objValMap] == 0 } {
        break
    }
    if { [info exists objValMap($objName)] } {
        set objValue $objValMap($objName)
        puts $fid "$objType $objValue \"$objName\""
        unset objValMap($objName)
    }
}
if { [array size objValMap] != 0 } {
    foreach { objName objVal } [array get objValMap] {
        puts "WARNING: Cannot find the type of object $objName, skipped..."
    }
}
close $fid
set TIME_taken [expr [clock clicks -milliseconds] - $TIME_start]
puts "Info: Output is cretaed. Elapsed time $TIME_taken"
It seems that for the last step (writing to the file) there are ~8 * 10^12 iterations to do, which is not realistic to complete in a reasonable time: I tried running a for loop that just printed the iteration index, and ~850 * 10^6 iterations took ~30 minutes (so the whole loop would finish in ~11 hours).
So there should be another solution.
EDIT:
It seems the reason was poor hashing behaviour for the file2.txt map: after I shuffled the lines in file2.txt, I got the results in about 3 minutes.
Write the data to file1, and let an external tool do all the hard work (it's bound to be much more optimized for the task than home-spun Tcl code)
exec bash -c {join -o 0,1.2,2.2 <(sort file1.txt) <(sort file2.txt)} > result.txt
So… file1.txt is describing a mapping and file2.txt is the list of things to process and annotate? The right thing is to load the mapping into an array or dictionary where the key is the part that you will look things up by, and then to go through the other file line by line. That keeps the amount of data in memory down; the whole mapping is worth holding in memory anyway.
# We're doing many iterations, so it's worth doing proper bytecode compilation
apply {{filename1 filename2 filenameOut} {
    # Load the mapping; uses memory proportional to the file size
    set f [open $filename1]
    while {[gets $f line] >= 0} {
        regexp {^(\S+)\s+(.*)} $line -> name type
        set types($name) $type
    }
    close $f
    # Now do the streaming transform; uses a small fixed amount of memory
    set fin  [open $filename2]
    set fout [open $filenameOut "w"]
    while {[gets $fin line] >= 0} {
        # Assume that the mapping is probably total; if a line fails, we print
        # it as it was before. You might have a different preferred strategy here.
        catch {
            regexp {^(\S+)\s+(.*)} $line -> name info
            set line [format "%s %s \"%s\"" $types($name) $info $name]
        }
        puts $fout $line
    }
    close $fin
    close $fout
    # All memory will be collected at this point
}} "file1.txt" "file2.txt" "fileProcessed.txt"
Now, if the mapping is very large, so much that it doesn't fit in memory, then you might be better doing it via building file indices and stuff like that, but frankly then you're actually better off getting familiar with SQLite or some other database.
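For example, here is a rough sketch of the SQLite route using the sqlite3 package that ships with Tcl; the database file name and table layout are illustrative assumptions, not part of the original answer:
package require sqlite3

# On-disk database, so the mapping does not have to fit in memory
sqlite3 db ./mapping.db
db eval {CREATE TABLE IF NOT EXISTS types (name TEXT PRIMARY KEY, type TEXT)}

# Load the mapping once, inside a single transaction for speed
set f [open "file1.txt"]
db transaction {
    while {[gets $f line] >= 0} {
        regexp {^(\S+)\s+(.*)} $line -> name type
        db eval {INSERT OR REPLACE INTO types VALUES($name, $type)}
    }
}
close $f

# Stream file2.txt through, looking each name up in the indexed table
set fin  [open "file2.txt"]
set fout [open "fileProcessed.txt" "w"]
while {[gets $fin line] >= 0} {
    if {[regexp {^(\S+)\s+(.*)} $line -> name info]} {
        db eval {SELECT type FROM types WHERE name = $name} row {
            set line [format "%s %s \"%s\"" $row(type) $info $name]
        }
    }
    puts $fout $line
}
close $fin
close $fout
db close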
A pure-Tcl variant of Glenn Jackman's code would be
package require fileutil
package require struct::list
set data1 [lsort -index 0 [split [string trim [fileutil::cat file1.txt]] \n]]
set data2 [lsort -index 0 [split [string trim [fileutil::cat file2.txt]] \n]]
fileutil::writeFile result.txt [struct::list dbJoin -full 0 $data1 0 $data2]
But in this case each row will have four columns, not three: the two columns from file1.txt and the two columns from file2.txt. If that is a problem, reducing the number of columns to three is trivial.
The file join in the example is also full, i.e. all rows from both files will occur in the result, padded by empty strings if the other file has no corresponding data. To solve the OP's problem, an inner join is probably better (only rows that correspond are retained).
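A hedged sketch of both adjustments, assuming (as described above) that each joined row carries the two file1.txt columns followed by the two file2.txt columns, i.e. name, type, name, value:
# Inner join: keep only rows whose object name appears in both files
set joined [struct::list dbJoin -inner 0 $data1 0 $data2]
# Reduce each four-column row to the desired "type value \"name\"" layout
set outLines {}
foreach row $joined {
    lassign $row name type _ value
    lappend outLines [format "%s %s \"%s\"" $type $value $name]
}
fileutil::writeFile result.txt [join $outLines \n]\n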
fileutil::cat reads the contents of a file, string trim removes leading and trailing whitespace from the contents, to avoid empty lines in the beginning or end, split ... \n creates a list where every row becomes an item, lsort -index 0 sorts that list based on the first word in every item.
The code is verified to work with Tcl 8.6 and fileutil 1.14.8. The fileutil package is a part of the Tcllib companion library for Tcl: the package can be individually upgraded to the current version by downloading the Tcl source and copying it to the relevant location in the Tcl installation's lib tree (C:\Tcl\lib\teapot\package\tcl\teapot\tcl8\8.2 in my case).
Quick-and-dirty install: download fileutil.tcl from here (use the Download button) and copy the file to where your other sources are. In your source code, call source fileutil.tcl and then package require fileutil. (There may still be compatibility problems with Tcl or with e.g. the cmdline package. Reading the source may suggest workarounds for such.) Remember to check the license conditions for conflicts.
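That is, roughly:
# assuming fileutil.tcl has been copied next to your own script
source fileutil.tcl
package require fileutil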
Documentation: fileutil package, lsort, package, set, split, string, struct::list package

How to look for the difference between two large files in Tcl?

I have two files (say A.txt and B.txt), and some of their contents may be common to both.
Both files are sorted.
I need to get the difference of A.txt and B.txt, i.e. a file C.txt which has the contents of A except the contents common to both.
I used the typical search-and-print algorithm: take a line from A.txt and search for it in B.txt; if found, print nothing to C.txt, else print that line to C.txt.
But I am dealing with files with a huge number of lines, and so it throws an error: failed to load too many files. (It works fine for smaller files.)
Can anybody suggest a more efficient way of getting C.txt?
Script to be used: Tcl only!
First off, the too many files error is an indication that you're not closing a channel, probably in the B.txt scanner. Fixing that is probably your first goal. If you've got Tcl 8.6, try this helper procedure:
proc scanForLine {searchLine filename} {
    set f [open $filename]
    try {
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                return true
            }
        }
        return false
    } finally {
        close $f
    }
}
However, if one of the files is small enough to fit into memory reasonably, you'd be far better reading it into a hash table (e.g., a dictionary or array):
set f [open B.txt]
while {[gets $f line] >= 0} {
    set B($line) "any dummy value; we'll ignore it"
}
close $f
set in  [open A.txt]
set out [open C.txt w]
while {[gets $in line] >= 0} {
    if {![info exists B($line)]} {
        puts $out $line
    }
}
close $in
close $out
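For reference, the same lookup can be phrased with a dict instead of an array; a minimal sketch along the same lines:
set f [open B.txt]
set B [dict create]
while {[gets $f line] >= 0} {
    dict set B $line 1
}
close $f
set in  [open A.txt]
set out [open C.txt w]
while {[gets $in line] >= 0} {
    if {![dict exists $B $line]} {
        puts $out $line
    }
}
close $in
close $out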
This is much more efficient, but depends on B.txt being small enough.
If both A.txt and B.txt are too large for that, you are probably best doing some sort of processing by stages, writing things out to disk in-between. This is getting rather more complex!
set filter [open B.txt]
set fromFile A.txt
for {set tmp 0} {![eof $filter]} {incr tmp} {
    # Filter by a million lines at a time; that'll probably fit OK
    for {set i 0} {$i < 1000000} {incr i} {
        if {[gets $filter line] < 0} break
        set B($line) "dummy"
    }
    # Do the filtering
    if {$tmp} {set fromFile $toFile}
    set from [open $fromFile]
    set to   [open [set toFile /tmp/[pid]_$tmp.txt] w]
    while {[gets $from line] >= 0} {
        if {![info exists B($line)]} {
            puts $to $line
        }
    }
    close $from
    close $to
    # Keep control of temporary files and data
    if {$tmp} {file delete $fromFile}
    unset B
}
close $filter
file rename $toFile C.txt
Warning! I've not tested this code…

Perl - Open large txt file on server and create / save into smaller files of 100 lines each

I am trying to do this:
I FTP a large file of single words (~144,000 words, one per line).
I need to open the uploaded file and create files of at most 100 lines each, one word per line (01.txt, 02.txt etc.).
I would like the processed 100 lines to be REMOVED from the original file AFTER the file of 100 is created.
The server is shared, but I can install modules if needed.
Now, my code below is very crude, as my knowledge is VERY limited. One problem is opening the whole file into an array: I assume the shared server does not have enough memory to read such a large file into memory all at once, and I only want the first 100 lines. The code below just opens a file that is small enough to be loaded and gets 100 lines into an array, nothing else. I typed it quickly, so it probably has several issues, but it shows my limited knowledge and my need for help.
use vars qw($Word @Words $IN);
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
my $cnt = '0';
open $IN, '<', "$PathToFile" or die $!;
while (<$IN>) {
    chomp;
    $Word = $_;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    if ($cnt <= 99) {
        push(@Words, $Word);
    }
    $cnt++;
}
close $IN;
Thanks so much.
Okay, I am trying to implement the code below:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
    unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
print CGI::header();
my $WORD_LIST='/home/username/public/wordlists/Big-File-Of-Words.txt';
sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
print 'Done';
1;
But I get:
syntax error at split-up-big-file.pl line 12, near "sed 's/ *//g'"
Can't find string terminator "'" anywhere before EOF at split-up-big-file.pl line 12.
FINALLY:
Well I figured out a quick solution that works. Not pretty:
#!/usr/bin/perl -w
BEGIN {
    my $b__dir = (-d '/home/username/perl'?'/home/username/perl':( getpwuid($>) )[7].'/perl');
    unshift @INC,$b__dir.'5/lib/perl5',$b__dir.'5/lib/perl5/x86_64-linux',map { $b__dir . $_ } @INC;
}
use strict;
use warnings;
use CGI;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
use diagnostics;
print CGI::header();
my $sourcefile = '/home/username/public_html/test/bigfile.txt';
my $rowlimit = 100;
my $cnt = '1';
open(IN, $sourcefile) or die "Failed to open $sourcefile";
my $outrecno = 1;
while(<IN>) {
    if($outrecno == 1) {
        my $filename = $cnt.'.txt';
        open OUT, ">$filename" or die "Failed to create $filename";
        $cnt++;
    }
    print OUT $_;
    if($outrecno++ == $rowlimit) {
        $outrecno = 1;
        close OUT;
    }
}
close OUT;
I found enough info here to get me going. Thanks...
Here is a solution based on a slight modification of your code that should work approximately the way you want it.
It loops through all the lines of the input file and for every 100th line it will write the word list of the words encountered since the last write (or the beginning). The eof($IN) check is to catch the remaining lines if they are less than 100.
use strict;
use warnings;
my $PathToFile = '/home/username/public/wordlists/Big-File-Of-Words.txt';
open my $IN, '<', "$PathToFile" or die $!;
my $cnt = 0;
my $cnt_file = 0;
my @Words;
while ( my $Word = <$IN> ) {
    chomp $Word;
    $Word =~ s/\s//g;
    $Word = lc($Word);
    ######
    push(@Words, $Word);
    if ( !(++$cnt % 100) || eof($IN) ) {
        $cnt_file++;
        open my $out_100, '>', "file_$cnt_file.txt" or die $!;
        print $out_100 join("\n", @Words), "\n";
        close $out_100;
        @Words = ();
    }
}
There's a non-Perl solution that you might find interesting...
$ split -l 100 -a6 /home/username/public/wordlists/Big-File-Of-Words.txt words.
This will split your big file of words into a bunch of files with no more than 100 lines each. The file name will start with words., and the suffix will range from aaaaaa to zzzzzz. Thus, you'll have words.aaaaaa, words.aaaaab, words.aaaaac, etc. You can then recombine all of these files back into your word list like this:
$ cat words.* > reconstituted_word_list.txt
Of course, you want to eliminate spaces, and lowercase the words all at the same time:
$ WORD_LIST=/home/username/public/wordlists/Big-File-Of-Words.txt
$ sed 's/ *//g' $WORD_LIST | tr '[A-Z]' '[a-z]' | split -l 100 -a6 - words.
The tr command translates characters, and here changes all uppercase letters to lowercase. The split command splits the file into pieces, and sed removes the spaces.
One of Unix's big strengths was its file handling ability. Splitting up big files into smaller pieces and reconstituting them was a common task. Maybe you had a big file, but a bunch of floppy disks that couldn't hold more than 100K per floppy. Maybe you were trying to use UUCP to copy these files over to another computer and there was a 10K limit on file transfer sizes. Maybe you were doing FTP by email, and the system couldn't handle files larger than 5K.
Anyway, I brought it up because it's probably an easier solution in your case than writing a Perl script. I am a big writer of Perl, and many times Perl can handle a task better and faster than shell scripts can. However, in this case, this is an easy task to handle in shell.
Here's a pure Perl solution. The problem is that you want to create files after every 100 lines.
To solve this, I have two loops. One is an infinite loop, and the other loops 100 times. Before I enter the inner loop, I create a file for writing, and write one word per line. When that inner loop ends, I close the file, increment my $output_file_num and then open another file for output.
A few changes:
I use use warnings; and use strict (which is included when you specify that you want Perl version 5.12.0 or greater).
Don't use use vars;. This is obsolete. If you have to use package variables, declare the variable with our instead of my. When should you use package variables? If you have to ask that question, you probably don't need package variables. 99.999% of the time, simply use my to declare a variable.
I use constant to define your word file. This makes it easy to move the file when needed.
My s/../../ not only removes beginning and ending spaces, but also lowercases my word for me. The ^\s*(.*?)\s*$ matches the entire line, but captures the word sans spaces at the beginning and end of the word. The .*? is like .*, but is non-greedy. It will match the minimum possible (which in this case does not include spaces at the end of the word).
Note I define a label INPUT_WORD_LIST. I use this to force my inner last to exit the outer loop.
I take advantage of the fact that $output_word_list_fh is defined only in the loop. Once I leave the loop, the file is automatically closed for me since $output_word_list_fh goes out of scope.
And the program:
#!/usr/bin/env perl
use 5.12.0;
use warnings;
use autodie;
use constant WORD_FILE => "/home/username/public/wordlists/Big-File-Of-Words.txt";
open my $input_word_list_fh, "<", WORD_FILE;
my $output_file_num = 0;
INPUT_WORD_LIST:
for (;;) {
    open my $output_word_list_fh, ">", sprintf "%05d.txt", $output_file_num;
    for my $line (1..100) {
        my $word;
        if ( not $word = <$input_word_list_fh> ) {
            last INPUT_WORD_LIST;
        }
        chomp $word;
        $word =~ s/^\s*(.*?)\s*$/\L$1\E/;
        say {$output_word_list_fh} "$word";
    }
    close $output_word_list_fh;
    $output_file_num += 1;
}
close $input_word_list_fh;
