Fast way to add line/row number to text file - sql-server

I have a file which has about 12 million lines; each line looks like this:
0701648016480002020000002030300000200907242058CRLF
What I'm trying to accomplish is adding a row number before the data; the numbers should have a fixed length.
The idea behind this is to be able to do a bulk insert of this file into a SQL Server table, and then perform certain operations with it that require each line to have a unique identifier. I've tried doing this on the database side but I haven't been able to get good performance (under 4 minutes at least, and under 1 minute would be ideal).
Right now I'm trying a solution in Python that looks something like this:
file=open('file.cas', 'r')
lines=file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas","w")
output.writelines(str("".join(text)))
output.close()
I don't know if this will work, but it will give me an idea of how it performs and what side effects to expect before I keep trying new things. I also thought about doing it in C so I have better control over memory.
Would it help to do it in a low-level language? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.
Thanks.

oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:
file = open('file.cas', 'r')
try:
    output = open('output.cas', 'w')
    try:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))
    finally:
        output.close()
finally:
    file.close()
That uses a generator expression which runs through the file processing one line at a time.
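Since the question asks for fixed-length numbers and %d produces variable-width ones, here is a minimal variation of the same streaming approach that zero-pads the counter (the width of 8, which comfortably covers 12 million lines, is an assumption):

with open('file.cas', 'r') as infile, open('output.cas', 'w') as outfile:
    # enumerate yields (index, line) pairs one at a time; %08d zero-pads the counter to a fixed width
    outfile.writelines('%08d %s' % (i, line) for i, line in enumerate(infile))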

Why don't you try cat -n ?

Stefano is right:
$ time cat -n file.cas > output.cas
Use time just so you can see how fast it is. It'll be faster than Python, since cat is pure C code.
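If the numbers have to be fixed-width and zero-padded (cat -n pads with spaces), the nl utility can do that as well; the width of 8 and the single-space separator here are assumptions:

$ time nl -ba -nrz -w8 -s' ' file.cas > output.cas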

Related

Looping through a set of variables for R package analysis

Here's a novice question... being new to R, this has got to be one.
I am trying to run an R package that analyzes "csv" data using the following R scripts:
library(agricolae)
LXTOUTPUT2<-with(RLINXTES2, lineXtester(Replication, Lines, Tester, Y))
All elements analyzed by the function "lineXtester" are numeric.
Analyzing 1 variable is fine. However, I have several variables to supply as "Y" and would like to run them all in one go.
I tried a "for loop" but couldn't find the right script that would cycle through all the variables.
Instead of a "for loop", is there a better, faster option? I read about "vectorizing", but R is still strange stuff to me.
Would greatly appreciate your help.
Thank you.
My sincere apologies. I was finally able to figure out my problem by reading and learning more about "vectorization", applying it to my data frame, and accessing the elements using [[ ]] indexing.
Indeed, it is much simpler and faster than using the "for loop".
Please disregard my request for help.
Thank you just the same.
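For anyone landing here later, a minimal sketch of the lapply/[[ ]] approach described above (the response column names Y1, Y2, Y3 are made up for illustration; substitute your own):

library(agricolae)

# Assumed names of the response columns in RLINXTES2
response_vars <- c("Y1", "Y2", "Y3")

# Run lineXtester once per response column, pulling each one out with [[ ]]
results <- lapply(response_vars, function(v) {
  lineXtester(RLINXTES2$Replication, RLINXTES2$Lines, RLINXTES2$Tester, RLINXTES2[[v]])
})
names(results) <- response_vars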

Advice on reading multiple text files into an array with Ruby

I'm currently writing out a program in Ruby, which I'm fairly new at, and it requires multiple text files to be pushed into an array line by line.
I am currently unable to actually test my code since I'm at work and this is for personal use, but I'm seeking advice to see if my code is correct. I know how to read a file and push it to the array. If possible, can someone check it over and advise whether I have the correct idea? I'm self-taught regarding Ruby and have no one to check my work.
I understand if this isn't the right place for trying to get this sort of advice and it's deleted/locked. Apologies if so.
contentsArray = []
Dir.glob('filepath').each do |filename|
  next if File.directory?(filename)
  r = File.open("#{path}#{filename}")
  r.each_line { |line| contentsArray.push line }
end
I'm hoping this snippet will take the lines from multiple files in the same directory and stick them in the array so I can later splice what's in there.
Thank you for the question.
First let's assume that 'filepath' is something like the target pattern you want to glob in Dir.glob('filepath') (I used Dir.glob('src/*.h').each do |filename| in my test).
After that, File.open("#{path}#{filename}") prepends another path to the already complete path you'll have in filename.
And lastly, although this is probably not the problem, the code opens each file and never closes it. The IO class provides a readlines method that takes care of opening and closing the file for you.
Here's some working code that you can adapt:
contentsArray = []
Dir.glob('filepath').each do |filename|
  next if File.directory?(filename)
  lines = IO.readlines(filename)
  contentsArray.concat(lines)
end
puts "#{contentsArray.length} LINES"
Here are references to the Ruby docs for the IO::readlines and Array::concat methods used:
https://ruby-doc.org/core-2.5.5/IO.html#method-i-readlines
https://ruby-doc.org/core-2.5.5/Array.html#method-i-concat
As an alternative to using next to skip directories, the code could execute conditionally only on files, like this:
if File.file?(filename)
  lines = IO.readlines(filename)
  contentsArray.concat(lines)
end
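For what it's worth, the same loop can also be written without the explicit accumulator, using flat_map ('filepath' still stands for whatever glob pattern you use; this is just a sketch):

contentsArray = Dir.glob('filepath')
                   .reject { |f| File.directory?(f) }
                   .flat_map { |f| IO.readlines(f) }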

How to run same code on multiple files, or all files in directory

So I am very new to coding and recently wrote a little program that involved R and sox. It looked like this:
file <- "test.mp3"
testSox = paste("sox ",file," -n spectrogram -o ",file,".png stats",sep='')
sox = system(testSox, intern = TRUE)
print(sox)
Now, instead of assigning the one file manually within the code, I would just like to have this code read through all the mp3s in a folder automatically. Is this possible? Any help would be greatly appreciated. Thanks!
EDIT: Actually, I should add that I tried list.files, but when it comes to running the system() command, I get
"Error in system(command, as.integer(flag), f, stdout, stderr) :
character string expected as first argument"
Here's the list.files code I tried:
> temp = list.files(path = ".", pattern=".mp3")
>
> file <- temp
>
> firstSox = paste("sox ",file," -n spectrogram -o ",file,".png stats",sep='')
> sox = system(firstSox, intern = TRUE)
Error in system(command, as.integer(flag), f, stdout, stderr) :
character string expected as first argument
> print(sox)
I'm guessing this is not the correct route to go? Because I basically need to replace 'file' in the firstSox line with each mp3 that's in the temp array. So instead of running:
file <- "test.mp3"
...I would just like to have it re-assigned each time for every file in the folder, so it runs through test.mp3, then 1.mp3, then 2.mp3, then 3.mp3, etc.
I've scoured the net, and just feel like I've hit a brick wall. As stated in the comments, I've read up on loops, but for some reason I can't wrap my head around how to incorporate it into what I have written. I feel like I just need someone to show me at least the way, or maybe even write me an example so I can wrap my head around it. Would greatly appreciate help and any tips on what I'm doing wrong and could correct. Thanks.
Try the below code. I am using dir() instead of list.files, just because I find it easier. Remember there are many ways to do the same thing in R.
files <- dir(path = ".", pattern = ".mp3") # Get all the mp3 files
for (f in files) { # Loop over the mp3 files one at a time
  firstSox = paste("sox ", f, " -n spectrogram -o ", f, ".png stats", sep = '')
  sox = system(firstSox, intern = TRUE)
  print(sox)
}
Your firstSox variable will be a vector of commands to run (paste will generate a vector, one string for each element of file). So now you just need to run each command through system.
One way to do this and capture the output is to use the lapply or sapply function:
sox <- lapply( firstSox, function(x) system(x, intern=TRUE) )
In this code, lapply will run the function for each element of firstSox one at a time; the function just takes the current element (in x) and passes it to system. Then lapply gathers all the outputs together and combines them into a list that it puts into sox.
If the results of each run give the same shape of results (single number or vector of same length) then you can use sapply instead and it will simplify the return into a vector or matrix.
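For completeness, the sapply form is the same call with sapply in place of lapply (same assumptions as above):

sox <- sapply(firstSox, function(x) system(x, intern = TRUE))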

Shell script vs C performance

I was wondering how bad the performance impact would be for a program migrated from C to shell script.
I have intensive I/O operations.
For example, in C, I have a loop reading from a filesystem file and writing into another one. I'm taking parts of each line without any consistent relation. I'm doing this using pointers. A really simple program.
In the Shell script, to move through a line, I'm using ${var:(char):(num_bytes)}. After I finish processing each line I just concatenate it to another file.
"$out" >> "$filename"
The program does something like:
while read line; do
out="$out${line:10:16}.${line:45:2}"
out="$out${line:106:61}"
out="$out${line:189:3}"
out="$out${line:215:15}"
...
echo "$out" >> "outFileName"
done < "$fileName"
The problem is, C takes like half a minute to process a 400MB file and the shell script takes 15 minutes.
I don't know if I'm doing something wrong or not using the right operator in the shell script.
Edit: I cannot use awk since there is no pattern to process the line.
I tried commenting out the echo "$out" >> "$outFileName" line, but it doesn't get much better. I think the problem is the ${line:106:61} operation. Any suggestions?
Thanks for your help.
I suspect, based on your description, that you're spawning off new processes in your shell script. If that's the case, then that's where your time is going. It takes a lot of OS resources to fork/exec a new process.
As donitor and Dietrich suggested, I did a little research about the AWK language and, again, as they said, it was a total success. Here is a little example of the AWK program:
#!/bin/awk -f
{
    option = substr($0, 5, 9);
    if (option == "SOMETHING") {
        type = substr($0, 80, 1)
        if (type == "A") {
            type = "01";
        } else if (type == "B") {
            type = "02";
        } else if (type == "C") {
            type = "03";
        }
        print substr($0, 7, 3) substr($0, 49, 8) substr($0, 86, 8) type \
            substr($0, 568, 30) >> ARGV[2]
    }
}
And it works like a charm. It takes barely 1 minute to process a 500 MB file.
What's wrong with the C program? Is it broken? Too hard to maintain? Too inflexible? You are more of a Shell than a C expert?
If it ain't broke, don't fix it.
A look at Perl might be an option, too. Easier than C to modify and still speedy I/O; and it's much harder to create useless forks in Perl than in the shell.
If you told us exactly what the C program does, maybe there's a simple and faster-than-light solution with sed, grep, awk or other gizmos in the Unix tool box. In other words, tell us what you actually want to achieve, don't ask us to solve some random problem you ran into while pursuing what you think is a step towards your actual goal.
Alright, one problem with your shell script is the repeated open in echo "$out" >> "outFileName". Use this instead:
while read line; do
  echo "${line:10:16}.${line:45:2}${line:106:61}${line:189:3}${line:215:15}..."
done < "$fileName" > "$outFileName"
As an alternative, simply use the cut utility (but note that it doesn't insert the dot after the first part):
cut -c 10-26,45-46,106-166 "$fileName" > "$outFileName"
You get the idea?
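And since the "no pattern" worry came up: awk doesn't need a pattern, an action block alone runs on every line. Here is a sketch of the same extraction in a single awk call, including the dot (the offsets are the 0-based bash slices converted to awk's 1-based substr):

awk '{ print substr($0,11,16) "." substr($0,46,2) substr($0,107,61) substr($0,190,3) substr($0,216,15) }' "$fileName" > "$outFileName"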

Prolog: how to append to an existing file

How do I save to an existing file after adding new data?
add_a_link(X,Y) :-
    tell('alink.txt'),
    write(X),
    write('.'),
    write(Y),
    write('.'),
    put(10),
    told,
    write('data written'),
    nl.
This code only rewrites the text file from scratch.
Use open/3 and stream-oriented I/O:
open(file, append, S), write(S, info(X,Y)), put_char(S, '.'), nl(S), close(S).
Using tell/1 and told is extremely unreliable. It easily happens that the output is written to another file accidentally.
Edit: Here is an example to illustrate the extremely unreliable properties of tell/1 and told.
Say, you write tell(file), X > 3, write(biggervalue), told. This works fine as long as X > 3. But with a smaller value this query fails and nothing is written. That might have been your intention. However, the next output somewhere else in your program will now go into the file. That's something you never want to happen. For this reason ISO-Prolog does not have tell/1 and told but rather open/3 and close/1.
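Putting that together, a sketch of add_a_link/2 rewritten with open/3 in append mode (the file name alink.txt and the term layout are taken from the question above; everything else is an assumption):

add_a_link(X, Y) :-
    open('alink.txt', append, S),
    write(S, X), write(S, '.'),
    write(S, Y), write(S, '.'),
    nl(S),
    close(S),
    write('data written'), nl.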
