I have a 6 GB application log file. The log lines have the following format (shortened):
[...]
timestamp;hostname;sessionid-ABC;type=m
timestamp;hostname;sessionid-ABC;set_to_TRUE
[...]
timestamp;hostname;sessionid-HHH;type=m
timestamp;hostname;sessionid-HHH;set_to_FALSE
[...]
timestamp;hostname;sessionid-ZZZ;type=m
timestamp;hostname;sessionid-ZZZ;set_to_FALSE
[...]
timestamp;hostname;sessionid-WWW;type=s
timestamp;hostname;sessionid-WWW;set_to_TRUE
I have a lot of sessions with more than these two lines.
I need to find all sessions with type=m and set_to_TRUE.
My first attempt was to grep all session IDs with type=m and write them to a file, then loop over every line of that file (one session ID per line), grepping the big log file for sessionID;set_to_TRUE.
This method takes a very long time. Can anyone give me a hint for solving this in a much better and faster way?
Thanks a lot!
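For illustration only (this is not from the thread): one way to avoid re-scanning the 6 GB file once per session is a single pass that remembers which sessions have a type=m line and checks every set_to_TRUE line against that set. A minimal Python sketch, assuming the semicolon-separated layout shown above, that the type=m line precedes set_to_TRUE within a session, and an illustrative file name:
# Hypothetical sketch: 'application.log' and the field positions are
# assumed from the shortened example above.
sessions_type_m = set()   # session IDs that have a type=m line
matches = set()           # session IDs with both type=m and set_to_TRUE

with open('application.log') as log:
    for line in log:
        fields = line.rstrip('\n').split(';')
        if len(fields) < 4:
            continue                      # skip lines that don't match the format
        session, flag = fields[2], fields[3]
        if flag == 'type=m':
            sessions_type_m.add(session)
        elif flag == 'set_to_TRUE' and session in sessions_type_m:
            matches.add(session)

for session in sorted(matches):
    print(session)
This reads the log exactly once instead of once per session ID, which is what makes the looping-grep approach slow.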
Related
We have an approximately 200 GB .sql file that we are grepping for some tables, and it is taking around an hour and a half. Is there any method to reduce that time? Is there any other efficient way to filter for some tables? Any help will be appreciated.
The GNU parallel program can split the input across multiple child processes, each of which runs grep over its respective part of the input. By using multiple processes (presuming you have enough CPU cores to devote to this work), it can finish faster by running in parallel.
cat 200-gb-table.sql | parallel --pipe grep '<pattern>'
But if you need to know the context of where the pattern occurs (e.g. the line number in the input), this might not be what you need.
We are trying to read a very large CSV file (which cannot be fully loaded into memory) in batches (say, 100 lines per batch) using Apache Camel. Any assistance that can be provided will be greatly appreciated.
Use the splitter EIP in streaming mode: http://camel.apache.org/splitter
Read the link and see the section about grouping N lines together. This allows you to read and process the file 100 lines at a time.
You can use the throttler to limit the number of files loaded at a time.
Use split with groups, e.g.:
from(CSV).split().tokenize("\n", 100).streaming()
where each Exchange body will be a String containing a group of 100 lines.
I'm looking for the best way to read data from a stdin pipe in C.
Problem: I need to seek within this data, i.e. I need to read data from the start of the stream after reading some data at the end of the same stream.
Small use case: gunzip -c 4GbDataFile.gz | myprogram
Another one:
On the local host: nc -l -p 1234 | myprogram
On the remote host: gunzip -c 4GbDataFile.gz | nc -q 0 theotherhost 1234
I know that data from a FIFO can only be read once. So, at the moment:
I slurp everything from stdin to memory and work from this allocated memory.
It's ugly, but it works. An obvious issue is that if someone sends a huge (or continuous) stream to my app, I'll end up with a big allocated memory chunk or I'll run out of memory (think of an 8 GB file).
What I thought of next:
I set a size limit (maybe user-defined) on that memory chunk. Once I've read that much data from stdin, either:
I stop there: "Errr. Out of memory, bazinga. Forget it." style.
Or I start dumping what I am reading to a file and work from that file once all the data has been read.
But then, what's the point? I cannot find out the origin of the data I am reading. If it's a local 8 GB file, I'll just be dumping it to another 8 GB file on the same system.
So, my question is:
How do you efficiently read a lot of data from a stdin pipe when you have to seek back and forth in it?
Thanks in advance for your answers.
Edit:
My program needs to read metadata somewhere in the given file (depending on the file format), which may be at the end of the stream. Then it may read other data back at the start of the stream, then at another place, etc. In short: it needs access to any byte of the data.
An example would be reading data from an archive file without knowing the archive format before starting to read from stdin: I need to check the archive metadata, find the archived files' names and offsets, etc.
So I'll make a local copy of the stdin content and work from that. Thanks everyone for your input ;)
You need to get your requirements clear. If you need to seek(), then obviously you can't take input from stdin; you should take an input file name as an argument instead.
The data structure in your 4GbDataFile just doesn't lend itself to what you want to do. Think outside the box. Don't hammer your program into something it shouldn't even attempt. Try to fix the input format where it is generated so you don't need to seek back 4 GB.
In case you do like hammering: 4 GB of in-core memory is pretty expensive. Instead, save the data read from stdin to a file, then open that file (or mmap it) and seek to your heart's content.
I think you should read the infamous Useless Use of Cat Award.
TL;DR: change cat 4gbfile | yourprogram to yourprogram < 4gbfile.
If you really insist on having it work with data from a pipe, you'll have to store it in a temporary file at startup, then replace file descriptor 0 with a copy of the temp file's descriptor using dup2.
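A minimal sketch of that idea, shown in Python only for brevity (the calls used - a temporary file, a copy loop, lseek and dup2 - map one-to-one onto their C counterparts; all names below are illustrative):
import os
import shutil
import sys
import tempfile

# Hypothetical sketch: spool the whole pipe into an anonymous temporary
# file, then make fd 0 refer to that seekable copy instead of the
# exhausted pipe.
spool = tempfile.TemporaryFile()
shutil.copyfileobj(sys.stdin.buffer, spool)
spool.flush()

os.dup2(spool.fileno(), 0)
os.lseek(0, 0, os.SEEK_SET)   # "stdin" (fd 0) is now seekable

# Example: read 16 bytes of trailing metadata, then go back to the start.
os.lseek(0, -16, os.SEEK_END)
trailer = os.read(0, 16)
os.lseek(0, 0, os.SEEK_SET)
header = os.read(0, 16)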
I just wanted to use grep with option -f FILE. This should make grep use every line of FILE as a pattern and search for it.
Run:
grep -f patternfile searchfile
The pattern file I used is 400 MB. The file I want to search through is 7 GB.
After 3 minutes the process had grown to 70 GB of RAM and showed no reaction.
Is this normal? Am I doing something wrong? Is grep not capable of working at such a large scale?
Thank you for ideas.
If the lines in the pattern file are literal strings, using the -F option will make it much faster.
You could try breaking the task up so that each grep process exits after a single pass over the file. I'm not sure how useful this will be, however, given the sheer size of the file you're searching.
while IFS= read -r pattern
do
grep "$pattern" searchFile
done < patternFile
I have to say that this is the first time I've ever heard of anyone using a 400 MB pattern file - I'm not surprised it ate up so much memory.
If you have time, I would suggest either breaking the pattern file up into sections and processing each section one at a time, or just processing the 7 GB file one regex at a time. If you can fit the whole 7 GB file in memory and aren't worried about how long it takes, that might be the most reliable solution.
I have a file which has about 12 million lines; each line looks like this:
0701648016480002020000002030300000200907242058CRLF
What I'm trying to accomplish is adding a row number before the data on each line; the numbers should have a fixed length.
The idea behind this is to be able to do a bulk insert of this file into a SQL Server table and then perform certain operations on it that require each line to have a unique identifier. I've tried doing this on the database side but haven't been able to get good performance (under 4 minutes at least; under 1 minute would be ideal).
Right now I'm trying a solution in Python that looks something like this:
file=open('file.cas', 'r')
lines=file.readlines()
file.close()
text = ['%d %s' % (i, line) for i, line in enumerate(lines)]
output = open("output.cas","w")
output.write("".join(text))
output.close()
I don't know if this will work, but it'll help me get an idea of how it will perform and what the side effects are before I keep trying new things. I also thought of doing it in C so I'd have better control over the memory.
Would it help to do it in a low-level language? Does anyone know a better way to do this? I'm pretty sure it has been done before, but I haven't been able to find anything.
Thanks
Oh god no, don't read all 12 million lines in at once! If you're going to use Python, at least do it this way:
file = open('file.cas', 'r')
try:
    output = open('output.cas', 'w')
    try:
        output.writelines('%d %s' % tpl for tpl in enumerate(file))
    finally:
        output.close()
finally:
    file.close()
That uses a generator expression which runs through the file processing one line at a time.
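The question also asked for fixed-length numbers; a small variation of the same streaming approach would zero-pad them (the width of 9, which is enough for 12 million lines, is just an illustrative choice):
# Same idea, but with zero-padded, fixed-width line numbers.
with open('file.cas') as infile, open('output.cas', 'w') as outfile:
    outfile.writelines('%09d %s' % (i, line) for i, line in enumerate(infile))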
Why don't you try cat -n?
Stefano is right:
$ time cat -n file.cas > output.cas
Use time just so you can see how fast it is. It'll be faster than Python since cat is pure C code.