Fastest way to search a 500 thousand part array in BASH? [duplicate]

This question already has answers here:
Shell command to find lines common in two files
(12 answers)
grep a large list against a large file
(4 answers)
Closed 5 years ago.
Really need help finding the fastest way to display the number of times each element of a 500,000 element unidimensional array occurs inside the DATA file. In other words, it is a word count of every element of the huge array.
I need every line of the ArrayDataFile searched for in the DATA file. I generate the array with declare -a and readarray from the ArrayDataFile, and then grep the DATA file in my Documents folder in a for loop. Is this the best code for the job? Every element of the array is searched for in the DATA file, but it is a 500,000 item list. The following is an exact replica of the file contents up to around line 40; the real file is 600,000 lines long. I need to optimize this search so that the DATA file can be searched as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (all of its elements are unique strings) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
#!/bin/bash
declare -a Array
readarray -t Array < ArrayDataFile   # -t strips the trailing newline from each element
for each in "${Array[@]}"
do
  LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
In order to quickly search a 500,000 element unidimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize this code for speed? I need the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is fastest will do, be it sed, awk, grep or anything else.
Is Bash even the best tool for this at all? I have heard it is slow.
The DATA file is just one huge string of digits (21313541321 etc.), as I have now clarified, and the ArrayDataFile is the one that has the 500,000 items. These are read into an array with readarray so that the DATA file can be searched for each of them one by one, with the results going line by line into a new file. My specific question is about a search against a LARGE STRING, not an indexed file or a per-line sorted file, nor do I want the lines in which my array elements from the ArrayDataFile were found or anything like that. What I want is to search one large string of data for the number of times every array element (taken from the ArrayDataFile) occurs, and to print the results on the same lines as those elements occupy in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file with the code provided in this post. I could not use the solutions in the linked questions for my query, and my issue is not resolved by those answers; at least I have not been able to adapt a working solution for my sample code from those specific posts.
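One direction that might be faster than starting a new fgrep process and re-reading DATA for every element is to do everything in a single awk invocation over both files. This is only a sketch, not a guaranteed win on old hardware: it assumes the patterns are plain digit strings (no regex metacharacters), it counts non-overlapping matches just like fgrep -o | wc -l does, and it still scans the in-memory copy of DATA once per pattern.
LC_ALL=C awk '
    NR == FNR { big = big $0; next }   # first file (DATA): concatenate into one big string
    {                                  # second file (ArrayDataFile): one pattern per line
        tmp = big                      # work on a copy so each count starts from the full string
        print gsub($0, "", tmp)        # gsub returns the number of non-overlapping matches
    }
' /home/USER/Documents/DATA ArrayDataFile > GrepResultsOutputFile
The counts are printed in the same order as the lines of the ArrayDataFile, so line n of GrepResultsOutputFile still corresponds to line n of the ArrayDataFile.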

Related

UNIX - I want to find in a text file the records which have more columns than expected and their line numbers

I have a pipe-delimited file and one record has more columns than expected.
For example:
File NPS.txt
1|a|10
2|b|20
3|c|30
4|d|40|old
The last record has more columns than expected, and I want to know its line number to understand what the problem is.
I found this command:
awk -F'|' '{print NF}' NPS.txt | sort | uniq -c
With this command I know that one record has an extra column, but I do not know which one it is.
I would use a bash script:
a) Define a counter variable, starting at 0,
b) iterate over each line in your file, adding 1 to the counter at the beginning of each loop iteration,
c) split each line into an array based on the "|" delimiter, logging the counter value if the array contains more than 3 elements. You can log to the console or write to a file.
It's been a while since I've scripted in Linux, but these references might help:
Intro:
https://www.guru99.com/introduction-to-shell-scripting.html
For Loops:
https://www.cyberciti.biz/faq/bash-for-loop/
Bash String Splitting
How do I split a string on a delimiter in Bash?
Making Scripts Executable
https://www.andrewcbancroft.com/blog/musings/make-bash-script-executable/
There may be a good one-liner out there, but it's not a difficult script to write.
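For what it's worth, here is a minimal sketch of the loop described above, assuming the file is called NPS.txt and the expected field count is 3 (both taken from the question); an equivalent awk one-liner is included as a comment.
#!/bin/bash
# Report the line number and content of every record with more than 3 fields.
counter=0
while IFS= read -r line; do
    counter=$((counter + 1))
    IFS='|' read -r -a fields <<< "$line"       # split the line on "|"
    if [ "${#fields[@]}" -gt 3 ]; then
        echo "line $counter has ${#fields[@]} columns: $line"
    fi
done < NPS.txt

# Roughly equivalent one-liner:
# awk -F'|' 'NF > 3 { print NR, $0 }' NPS.txt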

Expanding an array within a for loop (file list)

I am running into problems when iterating over files using a for loop. For simplicity, I created a small loop which should illustrate the problem I have at the moment.
Starting point: files in a folder which have a file-specific one to three digit number at a defined position in their filename.
Goal: Iterate over some of these files (not all) using a for-loop.
Problem: I created an array containing these one to three digit numbers specific for each file. The files are called at the beginning of the for-loop and I would like to use the array to reference to the specific files. But: The array is not expanding correctly.
Hope someone can help!
(There might be several good alternative ways to do this. Maybe some of them do not need an array, but I would be interested in knowing the solution to my specific problem, since I think this might be a fundamental misunderstanding of how to expand a variable as part of a filename at the beginning of a for loop.)
This is the code:
declare -a SOME_SAMPLES=(37 132 253 642 242 42)
for d in prmrp_*_${SOME_SAMPLES[@]}_S*_L00?_R1_001.fastq.gz; do
INPUT_FILE1=$(echo $d | sed 's/_L00._R1_001.fastq.gz//')
echo ${INPUT_FILE1}
done
Again, this is just example code. The problem is the ${SOME_SAMPLES[@]} part, which is not expanding correctly, so the loop fails.
Thanks!
I think the problem is that in
prmrp_*_${SOME_SAMPLES[@]}_S*_L00?_R1_001.fastq.gz
it doesn't duplicate the entire expression for each element of the array; it just blindly inserts the array's elements in the middle, giving the equivalent of this:
prmrp_*_37 132 253 642 242 42_S*_L00?_R1_001.fastq.gz
... which is a bunch of separate items (prmrp_*_37 as a wildcard expression, followed by 132 as a simple string, followed by 253 etc). AIUI you want to expand the array's contents, and then for each element use a wildcard expression to get all matching files. The best way to do this is to use two loops, one to expand the array, and another to find matching files:
for sample in "${SOME_SAMPLES[@]}"; do
for d in prmrp_*_"${sample}"_S*_L00?_R1_001.fastq.gz; do
...
BTW, I'd also recommend using lowercase or mixed-case variable names (e.g. sample above) to avoid possible conflicts with the many all-caps variables with special meanings/functions. Also, I'd use a parameter expansion to remove the filename's suffix (instead of sed):
input_file1=${d%_L00?_R1_001.fastq.gz}
Also, you should generally put double-quotes around variable references (e.g. echo "${input_file1}" instead of echo ${input_file1}). (Assignments like input_file1=${d... are an exception, although double-quotes don't hurt there; they just aren't needed.) Note that in the for loop above, I put double-quotes around the array and variable references, but not around the wildcards; this means the shell will expand the wildcards (as you want) but not mess with the variable's contents.
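Putting those pieces together, a minimal sketch of the complete loop (combining the nested loops, the parameter expansion, and the quoting advice above; the existence test is an extra precaution for sample numbers that match no files) might look like:
for sample in "${SOME_SAMPLES[@]}"; do
    for d in prmrp_*_"${sample}"_S*_L00?_R1_001.fastq.gz; do
        [ -e "$d" ] || continue                      # skip patterns that matched nothing
        input_file1=${d%_L00?_R1_001.fastq.gz}       # strip the suffix without sed
        echo "${input_file1}"
    done
done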
Try:
array=( 37 132 253 642 242 42 );
for d in ${array[@]}; do
INPUT_FILE1="prmrp_*_"$d"_S*_L00?_R1_001.fastq.gz";
echo ${INPUT_FILE1}
done

Sorting huge volumed data using Serialized Binary Search Tree

I have 50 GB of structured key/value data like the following, stored in a text file (input.txt); keys and values are 63-bit unsigned integers:
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
2436941118228099529 7438724021973510085
3370171830426105971 6928935600176631582
3370171830426105971 5928936601176631564
I need to sort this data by key in increasing order, keeping only the minimum value for each key. The result must be written to another text file (data.out) in under 30 minutes. For example, the result for the sample above must be:
2436941118228099529 7438724021973510085
3370171830426105971 5928936601176631564
3633223656935182015 2473242774832902432
8472954724347873710 8197031537762113360
I decided that:
I will create a BST with the keys from input.txt and their minimum values, but this tree would be more than 50 GB. I mean, I have both time and memory limitations at this point.
So I will use another text file (tree.txt) and serialize the BST into that file.
After that, I will traverse the tree with an in-order traversal and write the ordered data into the data.out file.
My problem is mostly with the serialization and deserialization part. How can I serialize this type of data? I also want to use the INSERT operation on the serialized data, because my data is bigger than memory and I can't do this in memory; essentially I want to use text files as memory.
By the way, I am very new to this kind of thing. If there is a flaw in my algorithm steps, please warn me. Any thoughts, techniques, and code samples would be helpful.
OS: Linux
Language: C
RAM: 6 GB
Note: I am not allowed to use built-in functions like sort and merge.
Considering that your file seems to have a fixed line size of around 40 characters, giving roughly 1,250,000,000 lines in total, I'd split the input file into smaller pieces with:
split -l 2500000 biginput.txt
then I'd sort each of them
for f in x*; do sort -n $f > s$f; done
and finally I'd merge them by
sort -m -n sx* > bigoutput.txt
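This is not part of the original answer, but if the duplicate keys also need to be collapsed to their minimum value, one possible variant of the same split/sort/merge pipeline (assuming GNU coreutils, whose -n option compares digit strings exactly) is to sort each chunk by key and then by value, merge with the same keys, and keep only the first line of each key group:
# Sort each chunk by key (field 1) then value (field 2), merge, and keep
# only the first line per key, which holds that key's minimum value.
for f in x*; do sort -k1,1n -k2,2n "$f" > "s$f"; done
sort -m -k1,1n -k2,2n sx* | awk '($1 "") != prev { print; prev = $1 "" }' > data.out
# The "" forces awk to compare the keys as strings, since 63-bit integers
# exceed the precision of awk's double-precision numbers.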

Read a specific line from text file without reading whole file in C [duplicate]

This question already has answers here:
How to fgets() a specific line from a file in C?
(5 answers)
Closed 7 years ago.
I want to read a specific line from a text file without reading the whole file line by line. For example, if I have 10 lines in a text file and I have to read the 6th line, I will not read the first 5 lines but will directly read the 6th one. Can anyone help me?
This question is answered here
Quoting from above,
Unless you know something more about the file, you can't access specific lines at random. New lines are delimited by the presence of line end characters and they can, in general, occur anywhere. Text files do not come with a map or index that would allow you to skip to the nth line.
If you knew that, say, every line in the file was the same length, then you could use random access to jump to a particular line. Without extra knowledge of this sort you simply have no choice but to iterate through the entire file until you reach your desired line.
Credits: the quoted answer was by David Heffernan.
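As a quick shell illustration of the fixed-length case described above (the file name and line length here are made up; the same offset arithmetic would apply to fseek in C), a line can be read directly once you know every line is exactly LINE_LEN bytes:
# Hypothetical example: read line 6 of a file whose lines are all exactly
# 80 bytes long (79 characters plus the newline).
LINE_LEN=80
n=6
dd if=file.txt bs="$LINE_LEN" skip=$((n - 1)) count=1 2>/dev/null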
You could 'index' the file. Please note that this is only worth the effort if your text file:
is big
is frequently read and rarely written
The easiest (and probably most efficient) way is to use a database engine. Just store your file in a table, one row for each line.
Alternatively, you could make your own indexing mechanism. Basically, this means:
create a new file (the index)
scan the entire text file once, storing the offset of each line in the index file
repeat the above each time the text file changes
Finding line n in the text file requires two seeks:
read the nth offset from the index
read a line from the text file, starting at the offset found in the index
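The same two-step lookup can be sketched in the shell, which may help make the mechanism concrete before writing it in C (the file names are hypothetical, and it assumes single-byte characters and Unix line endings):
# Build the index: the byte offset of the start of every line in bigfile.txt.
LC_ALL=C awk 'BEGIN { off = 0 } { print off; off += length($0) + 1 }' bigfile.txt > bigfile.idx

# Look up line n: read its offset from the index, then seek into the text file.
n=6
off=$(awk -v n="$n" 'NR == n { print; exit }' bigfile.idx)
tail -c +"$((off + 1))" bigfile.txt | head -n 1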

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has the exact same structure, which is 7 compound (int, double, double) datasets and a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets with unlimited maximum size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent resizing the file, as I was resizing at each step, so I am now first going through all my files to get their lengths (they are variable).
Then I create the global h5 file, setting the total length to the sum of all the files.
Only after this phase do I fill the h5 file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, while before it was taking much more.
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make text file listing the files to concatenate in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5
inputfilepath = ARGS[1]
outputfilepath = ARGS[2]
f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    r = strip(line, ['\n'])
    print(r, "\n")
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(4, data, datai)  # in this case concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
