I am using a Linux-based cluster system and I am working on a whole chromosome, so I am trying to generate 1 billion blocks. I have a command that includes the region -i 1:1-1000000. Is there any way/loop to change the index (-i) until a specific number, for example chr 1 until 2.6 million? Thanks.
I am a fresher in this field :(
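A minimal sketch of one way to do such a loop, assuming the command takes the region as -i CHR:START-END and you want consecutive 1,000,000 bp windows on chromosome 1 up to position 2,600,000 (your_command is a placeholder for the actual tool):
step=1000000           # window size
end=2600000            # last position to cover
for start in $(seq 1 "$step" "$end"); do
  stop=$(( start + step - 1 ))
  (( stop > end )) && stop=$end          # clamp the last window to the end position
  your_command -i "1:${start}-${stop}"   # placeholder: substitute the real command here
done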
I am using trace_printk() to print some numbers (there are around a million of them). However, when I check /sys/kernel/debug/tracing/trace ... only a part of the full range is printed.
Can anyone suggest how to increase the buffer size, or any other way to print the full range?
*Note: I don't care about the other output of ftrace.
*Note2: I am a beginner at using ftrace and kernel functions.
This is on Ubuntu 18.04.
The buffer_size_kb file exists in the /sys/kernel/debug/tracing directory.
You can change it through echo:
$ echo 4096 > buffer_size_kb
buffer_total_size_kb (buffer_size_kb multiplied by the CPU core count) is calculated and stored automatically.
This will increase the amount of data kept in the trace file.
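For example (a sketch; it assumes debugfs is mounted at /sys/kernel/debug and the commands are run as root):
cd /sys/kernel/debug/tracing
echo 4096 > buffer_size_kb      # per-CPU ring buffer size, in KB
cat buffer_total_size_kb        # buffer_size_kb multiplied by the number of CPUs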
The overwrite file exists in the /sys/kernel/debug/tracing/options directory.
It can also be changed with echo:
$ echo 0 > options/overwrite
The default value is 1, which throws away the oldest events (the first part) when the buffer is full.
Conversely, if it is 0, the most recent events (the back) are discarded.
In this case the amount of trace data does not increase; you simply choose whether to keep the first part or the last part.
This question already has answers here:
Shell command to find lines common in two files
(12 answers)
grep a large list against a large file
(4 answers)
Closed 5 years ago.
I really need help finding the fastest way to display the number of times each element of a 500,000-element one-dimensional array occurs inside the DATA file. So it is essentially a word count of every element in the huge array.
I need all of the lines in the ArrayDataFile searched for in the DATA file. I generate the array with declare -a and readarray, and then search the DATA file in my Documents folder with a for loop. Is this the best code for this job? The array elements are searched for in the DATA file, but it is a 500,000-item list. The following is an exact replica of the DATA file contents up to line number 40 or so; the real file to be used is 600,000 lines long. I need to optimize this search so as to be able to search the DATA file as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (all of its elements are unique strings) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
#!/bin/bash
declare -a Array
readarray -t Array < ArrayDataFile   # -t strips the trailing newline from each element
for each in "${Array[@]}"
do
LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
In order to quickly search a 500,000-element one-dimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize this code for search speed? I need the output displayed on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is fastest, be it SED, AWK, GREP or any other method.
Is BASH the best way at all? I have heard it is slow.
The DATA file is just a huge string of numbers, 21313541321 etc., as I have now clarified, and the ArrayDataFile is the one that has 500,000 items. These are read into an array by readarray in order to search the DATA file one by one and then write the results, one per line, into a new file. My specific question is about a search against a LARGE STRING, not an indexed file or a line-sorted file, nor do I want the lines in which my array elements from the ArrayDataFile were found or anything like that. What I want is to search a large string of data for every occurrence of every array element (taken from the ArrayDataFile) and print the counts on the same lines as the elements appear in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file using the code provided in this post. I could not use those solutions for my query, and my issue is not resolved by those answers; at least I have not been able to extrapolate a working solution for my sample code from those specific posts.
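For what it is worth, one way to avoid launching 500,000 separate fgrep processes is a single awk pass that reads the DATA string once and counts non-overlapping occurrences of each pattern itself. This is only a sketch, assuming DATA is one long string and ArrayDataFile has one non-empty pattern per line (the paths are the ones used above):
awk 'NR == FNR { data = data $0; next }        # first file: slurp the DATA string
     {                                         # second file: one pattern per line
         count = 0; pos = 1
         while ((i = index(substr(data, pos), $0)) > 0) {
             count++
             pos += i + length($0) - 1         # continue after the match (non-overlapping, like grep -o)
         }
         print count                           # one count per line, same order as ArrayDataFile
     }' /home/USER/Documents/DATA ArrayDataFile > GrepResultsOutputFile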
We have around a 200 GB .sql file and we are grepping it for some tables; it is taking around an hour and a half. Is there any method to reduce the time, or any other efficient method to filter for some tables? Any help will be appreciated.
The GNU parallel program can split input across multiple child processes, each of which will run grep over its respective part of the input. By using multiple processes (presumably you have enough CPU cores to apply to this work), it can finish faster by running in parallel.
cat 200-gb-table.sql | parallel --pipe grep '<pattern>'
But if you need to know the context of where the pattern occurs (e.g. the line number of the input) this might not be what you need.
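If the data is in a regular file rather than a stream, GNU parallel's --pipepart is usually much faster than --pipe, because each worker reads its own part of the file directly instead of everything being funneled through one pipe. A sketch (the 100M block size is an arbitrary choice):
parallel --pipepart -a 200-gb-table.sql --block 100M grep '<pattern>'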
I have 50 GB of structured key/value data like this, stored in a text file (input.txt); the keys and values are 63-bit unsigned integers:
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
I need to sort this data by key in increasing order, keeping the minimum value for each key. The result must be written to another text file (data.out) in under 30 minutes. For example, for the sample above the result must be:
2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
I decided on the following:
I will create a BST with the keys from input.txt and their minimum values, but this tree would be more than 50 GB; I have both time and memory limits at this point.
So I will use another text file (tree.txt) and serialize the BST into that file.
After that, I will traverse the tree in order and write the sorted data into the data.out file.
My problem is mostly with the serialization and deserialization part. How can I serialize this kind of data, and how can I use an INSERT operation on the serialized data? Because my data is bigger than memory, I can't do this in memory; essentially I want to use text files as memory.
By the way, I am very new to this kind of thing. If there is a flaw in my algorithm steps, please warn me. Any thoughts, techniques, and code samples would be helpful.
OS: Linux
Language: C
RAM: 6 GB
Note: I am not allowed to use built-in functions like sort and merge.
Considering that your file seems to have a constant line size of around 40 characters, giving around 1,250,000,000 lines in total, I'd split the input file into smaller pieces with:
split -l 2500000 biginput.txt
then I'd sort each of them
for f in x*; do sort -n $f > s$f; done
and finally I'd merge them by
sort -m sx* > bigoutput.txt
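One caveat, since the requirement is the minimum value per key: sorting alone keeps duplicate keys. A sketch of a final step that keeps only the smallest value for each key, assuming the chunks and the merge are ordered by key and then by value (e.g. sorted with -k1,1n -k2,2n rather than a bare -n):
sort -m -k1,1n -k2,2n sx* | awk '$1 != prev { print; prev = $1 }' > data.out
# prints only the first (smallest-value) line of each key group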
Hi: Is there a way to create a file whose contents are generated dynamically when it is read?
I wanted to create 3 versions of the same file (one with 10 lines, one with 100 lines, one with all of the lines). Thus, I don't see any need for these to be static; rather, it would be best if they were proxies for a head/tail/cat command.
The purpose of this is unit testing - I want a unit test to run on a small portion of the full input file used in production. However, since the code only runs on full files (it's actually a Hadoop map/reduce application), I want to provide a truncated version of the whole data set without duplicating information.
UPDATE: An Example
more myActualFile.txt
1
2
3
4
5
more myProxyFile2.txt
1
2
more myProxyFile4.txt
1
2
3
4
etc.... So the proxy files are DIFFERENTLY named files whose content is provided dynamically by simply taking the first n lines of the main file.
This is hacky, but... One way is to use named pipes, plus a looping shell script to generate the content (one script per named pipe). The script would look like:
while true; do
  (
    # linenr: how many lines this pipe should serve
    for i in $(seq "$linenr"); do echo something; done
  ) > thenamedpipe
done
Your script would then read from that named pipe.
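To tie this to the example above, here is a sketch that serves the first 2 lines of myActualFile.txt through a pipe named myProxyFile2.txt (head replaces the echo loop; the writer blocks until a reader opens the pipe, and the reader sees EOF when head finishes):
mkfifo myProxyFile2.txt
while true; do
  head -n 2 myActualFile.txt > myProxyFile2.txt   # blocks until something reads the pipe
done &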
Another solution, if you are ready to dig into low level stuff, is FUSE.