We have a ~200 GB .sql file and we are grepping it for some tables; it is taking around an hour and a half. Is there any method to reduce the time, or a more efficient way to filter for specific tables? Any help will be appreciated.
The GNU parallel program can split the input among multiple child processes, each of which runs grep over its own part of the input. By using multiple processes (presumably you have enough CPU cores to devote to this work), the job can finish faster by running in parallel.
cat 200-gb-table.sql | parallel --pipe grep '<pattern>'
But if you need to know the context of where the pattern occurs (e.g. the line number in the input), this might not be what you need.
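If the table filters are plain strings, a tuned variant of the same command may help further. This is only a sketch: table_foo and table_bar stand in for your real table names, and the block size is just a starting point.
# --block sets how much input each grep child receives, -j the number of parallel jobs;
# LC_ALL=C plus -F (fixed-string matching) usually makes grep itself noticeably faster
parallel --pipe --block 100M -j "$(nproc)" "LC_ALL=C grep -F -e table_foo -e table_bar" < 200-gb-table.sql > filtered.sql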
Related
For years I have been using variations of the du command below to produce a report of the largest files in a specific location, and most of the time it worked well.
du -L -ch /var/log | sort -rh | head -n 10 &> log-size.txt
This proved to get stuck in several cases, in a way that prevented stopping it even with the timeout -s KILL 5m ... approach.
A few years back this was caused by stalled NFS mounts, but more recently I ran into it on VMs where I didn't use NFS at all. Apparently there is a ~1:30 chance of hitting this on OpenStack builds.
I read that following symbolic links (-L) can block du in some cases if there are loops, but my tests failed to reproduce the problem, even when I created a loop.
I cannot avoid following the symlinks because that's how the files are organized.
What would be a safer alternative for generating this report, one that does not block, or at least, if it does, can be constrained to a maximum running duration? It is essential to limit the execution of this command to a number of minutes; if I can also get a partial result (or some debugging info) on timeouts, even better.
If you don't care about sparse files and can make do with the apparent size (rather than the on-disk size), then ls should work just fine:
ls -L --sort=size | head -n 10 > log-size.txt
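If the report should also show the sizes, a variant of the same idea (a sketch, reusing /var/log from the question and assuming GNU ls) is:
# -l adds the size column, -L follows the symlinks, -S sorts largest first;
# head -n 11 keeps the top 10 entries plus ls's "total" line
ls -lLS /var/log | head -n 11 > log-size.txt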
I have a 6 GB application log file. The log lines have the following format (shortened):
[...]
timestamp;hostname;sessionid-ABC;type=m
timestamp;hostname;sessionid-ABC;set_to_TRUE
[...]
timestamp;hostname;sessionid-HHH;type=m
timestamp;hostname;sessionid-HHH;set_to_FALSE
[...]
timestamp;hostname;sessionid-ZZZ;type=m
timestamp;hostname;sessionid-ZZZ;set_to_FALSE
[...]
timestamp;hostname;sessionid-WWW;type=s
timestamp;hostname;sessionid-WWW;set_to_TRUE
I have a lot of sessions with more than these two lines.
I need to find all sessions with type=m and set_to_TRUE.
My first attempt was to grep all session IDs with type=m and write them to a file, then loop over every line of that file (one session ID per line) and grep the big log file for sessionID;set_to_TRUE.
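Roughly like this (the file names are just placeholders):
# pass 1: collect the session IDs of all type=m lines (field 3 of the ;-separated records)
grep ';type=m' application.log | cut -d';' -f3 > m-sessions.txt
# pass 2: one grep over the whole 6 GB file per session ID
while IFS= read -r sid
do
    grep "${sid};set_to_TRUE" application.log
done < m-sessions.txt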
This method takes a loooot of time. Can anyone give me a hint on how to solve this in a much better and faster way?
Thanks a lot!
I have two files on a Unix box, both with around 10 million rows.
File1 (Only one column)
ASD123
AFG234
File2 (Only one column)
ASD456
AFG234
Now I want to compare the records from File1 with File2 and output those that also appear in File2. How can I achieve this?
I have tried a while loop with grep, but it is way too slow; any ideas will be appreciated.
If you want to find all the rows from file A which are also in file B, you can use grep's inbuilt -f option:
grep -Ff fileA.txt fileB.txt
This should be faster than putting it inside any kind of loop (although given the size of your files, it may still take some time).
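If the rows have to match as whole lines rather than as substrings, adding -x restricts grep to exact line matches. A sketch with the file names from the question (the output file name is arbitrary):
# -F: treat patterns as fixed strings, -x: whole-line matches only, -f: read the patterns from File1
grep -Fxf File1 File2 > in-both.txt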
I just wanted to use grep with option -f FILE. This should make grep use every line of FILE as a pattern and search for it.
Run:
grep -f patternfile searchfile
The pattern file I used is 400 MB. The file I want to search through is 7 GB.
After 3 minutes the process had consumed 70 GB of RAM and was no longer responding.
Is this normal? Am I doing something wrong? Is grep not capable of working at such a large scale?
Thank you for any ideas.
If the lines in the pattern file are literal strings, using the "-F" option will make it much faster.
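That is, the same command as above with -F added (just a sketch of the flag in context):
# -F disables regex interpretation, so every pattern line is matched as a plain string
grep -F -f patternfile searchfile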
You could try breaking the task up such that the grep process ends on each pass of the file. I'm not sure how useful this will be, however, given the sheer size of the file you're searching.
# read one pattern per line; -e protects patterns that begin with a dash
while IFS= read -r pattern
do
    grep -e "$pattern" searchFile
done < patternFile
I have to say that this is the first time I've ever heard of anyone using a 400 MB pattern file - I'm not surprised it ate up so much memory.
If you have time, I would suggest either breaking the pattern file up into sections and processing each section one at a time (sketched below), or even just processing the 7 GB file one regex at a time. If you can fit the whole 7 GB file in memory and aren't worried about how long it takes, then that might be the most reliable solution.
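A sketch of the splitting idea, assuming the patterns are literal strings; the chunk size and file names are only illustrative:
# break the 400 MB pattern file into 100000-line chunks so each grep holds far fewer patterns
split -l 100000 patternfile pattern_chunk_
for chunk in pattern_chunk_*
do
    # note: a line matching patterns from several chunks will be printed once per chunk
    grep -Ff "$chunk" searchfile
done > matches.txt
rm -f pattern_chunk_*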
Is there any nice GNU way to measure the average (and worst-case, best-case) execution time of a command-line program? I have an image filter and an unspecified number of pictures, and I filter them using a for loop in bash. So far I am using time, but I can't find a way to get any statistics.
You can send the output of time to some file, and then "work" that file
echo "some info" >> timefile.txt
{ time ./yourprog parm1 parm2 ; } 2>> timefile.txt   # the braces make the redirection catch the time report (written to stderr) as well
There's an interesting Perl program called dumbbench that's essentially a wrapper around the time command. It runs your program a number of times, throws away outliers, then calculates some statistics.
The author has a couple of articles (here and here) outlining a) why benchmarking sucks, and b) what kind of pretty graphs you can make to make your benchmarking numbers suck a little less.
You're on the right track with time. It's what I use to perform small code-execution analyses.
I then use python to collect the statistics by reading the output of time. In order to increase accuracy, I typically do the trial 10 - 1000 times, depending on how long each process takes.
I'm not familiar with any pre-installed GNU application that does this sort of analysis.
#!/bin/bash
for i in {1..100}
do
env time --append -o time_output.txt ./test_program --arguments-to-test-program
done
exit
If you find that the {1..100} syntax doesn't work for you then you should have a look at the seq command.
I used env time to execute the time program rather than the shell's built-in command, which does not take all of the arguments that the time program takes. The time program also takes other arguments to alter the format of its output, which you will probably want to use to make the data easier to process by another program. The -p (--portability) argument makes it output in the POSIX format (like bash's builtin time does), but using the -f option you can get more control. See man 1 time for more info.
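For instance (a sketch reusing the loop above), -f '%e' reduces each run's record to a single number:
# %e = elapsed wall-clock time in seconds; one value per line in time_output.txt
for i in {1..100}
do
    env time -f '%e' --append -o time_output.txt ./test_program --arguments-to-test-program
done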
After you have gathered your data a simple perl or python script can easily parse and analyze your timing data.
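If time_output.txt ends up with one elapsed-time value per line (as in the -f '%e' variant above), even a quick awk pass gives the basic statistics. A sketch:
# min, max, and mean of one-number-per-line timing data (values in seconds)
awk '
    NR == 1 || $1 < min { min = $1 }
    $1 > max            { max = $1 }
                        { sum += $1 }
    END { if (NR) printf "runs=%d  min=%.3f  max=%.3f  mean=%.3f\n", NR, min, max, sum / NR }
' time_output.txt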
You should consider whether to time the outer loop and divide by the repetitions rather than timing each iteration separately. If you're worried about discarding the high and low, just do a few more iterations to drown them out.
time for i in {1..1000}
do
something
done
You can capture the output from time in a variable:
TIMEFORMAT=%R       # make bash's time keyword print only the elapsed seconds
exec 3>&1 4>&2      # save stdout and stderr so the timed block can still write to them
foo=$( { time {
echo "stdout test message demo"
for i in {1..30}
do
something
done
echo "stderr test message demo" >&2
} 1>&3 2>&4; } 2>&1 )
exec 3>&- 4>&-      # close the saved descriptors
and do some fake math:
foo=${foo/.} # "divide" by ...
echo "0.00${foo/#0}" # ... 1000
Or just use bc:
echo "scale=8; $foo/1000" | bc