How to replace arrays in a file using bash?

I have a file like this:
# [...]
ver=1.0
source_x86_64=("file-$ver.tar.gz" "file-$ver.tar.gz.sig")
source_i386=("file-$ver.tar.gz")
sha256sums_x86_64=("54842def55267131a12494cd6d23243df1fa3eee60dd418f0a84737f1feafc28"
"SKIP")
sha256sums_i386=("c453479cbada5139cc3f494f9e72f192451780c7d936d7aa15be99c48deb4237")
# [...]
I need to update the hashes each time a new version is released. I can run a program that downloads the three files and generates the sha256sum for each of them. For instance, makepkg -gA with "ver=1.1" outputs:
sha256sums_x86_64=('82cfaef1f73748a0f814dd06eee3af1fd34232d6dc49421a13176255fed24845'
'SKIP')
sha256sums_i386=('7324bed84c99eb51aa7d639d7c087154291f27e9f494f3cc9315adabc974354c')
How can I overwrite the arrays in the original file, preferably keeping them in the same position, without appending the program output to the end and deleting the original arrays?
Note that the first array spans multiple lines and that arrays in the file may differ in size from the ones in the program output (e.g. if a file has been added to the source array and the hash array in the file has not been updated accordingly).
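One possible approach (not from the original thread, just a minimal sketch): capture the makepkg -gA output and let awk splice it into the file in place of the old arrays. It assumes each sha256sums_* array starts at the beginning of a line and ends on the first line containing a closing parenthesis; the file names PKGBUILD and newsums.txt are placeholders.
makepkg -gA > newsums.txt        # the freshly generated arrays, possibly spanning several lines
awk '
    NR == FNR { new = new $0 "\n"; next }        # slurp the replacement arrays
    /^sha256sums/ {                              # an old array starts here
        if (!printed) { printf "%s", new; printed = 1 }   # emit the new arrays at the first old position
        while ($0 !~ /\)/ && (getline) > 0) ;    # skip the remaining lines of the old array
        next
    }
    { print }                                    # everything else is copied unchanged
' newsums.txt PKGBUILD > PKGBUILD.new && mv PKGBUILD.new PKGBUILD
This keeps the arrays at roughly their original position even when the old and new arrays differ in size.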

Related

UNIX - I want to find in a text file the records which have more columns than expected and their line numbers

I have a pipe-delimited file and one record has more columns than expected.
For example:
File NPS.txt
1|a|10
2|b|20
3|c|30
4|d|40|old
The last record has more columns than expected, and I want to know its line number to understand what the problem is.
I found this command:
awk -F'|' '{print NF}' NPS.txt | sort | uniq -c
With this command I know that one record has an extra column, but I do not know which one it is.
I would use a bash script:
a) define a counter variable, starting at 0,
b) iterate over each line in your file, adding 1 to the counter at the beginning of each loop,
c) split each line into an array based on the "|" delimiter and log the counter value if the array contains more than 3 elements. You can log to the console or write to a file (a minimal sketch follows after the links below).
It's been a while since I've scripted in Linux, but these references might help:
Intro:
https://www.guru99.com/introduction-to-shell-scripting.html
For Loops:
https://www.cyberciti.biz/faq/bash-for-loop/
Bash String Splitting:
How do I split a string on a delimiter in Bash?
Making Scripts Executable:
https://www.andrewcbancroft.com/blog/musings/make-bash-script-executable/
There may be a good one-liner out there, but it's not a difficult script to write.
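A minimal sketch of the approach above (the file name NPS.txt and the expected field count of 3 are taken from the question; adjust both as needed):
#!/bin/bash
# Report the line number of every record in NPS.txt that does not have
# exactly 3 pipe-separated fields.
lineno=0
while IFS= read -r line; do
    lineno=$((lineno + 1))
    IFS='|' read -ra fields <<< "$line"           # split the record on '|'
    if [ "${#fields[@]}" -ne 3 ]; then
        echo "line $lineno has ${#fields[@]} fields: $line"
    fi
done < NPS.txt
As for the one-liner: awk -F'|' 'NF != 3 {print NR": "$0}' NPS.txt prints the same information in a single pass.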

Fastest way to search a 500 thousand part array in BASH? [duplicate]

This question already has answers here:
Shell command to find lines common in two files (12 answers)
grep a large list against a large file (4 answers)
Closed 5 years ago.
Really need help trying to find the fastest way to display the number of times each part of a 500000-part unidimensional array occurs inside the DATA file. So it's a word count of every element in the huge array.
I need all of the lines of the ArrayDataFile searched for in the DATA file. I generate the array with declare -a and readarray, and then go through it with a for loop, searching the DATA file in my Documents folder. Is this the best code for this job? The array elements are searched for in the DATA file, but it is a 500000-item list. The following is an exact replica of the DATA file contents up to line 40 or so; the real file to be used is 600000 lines long. I need to optimize this search so as to be able to search the DATA file as fast as possible on my outdated hardware:
DATA file is
1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!
The ArrayDataFile (all of its elements are unique strings) is
1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items
BASH Script code that I have been able to put together to accomplish this:
#!/bin/bash
declare -a Array
readarray Array < ArrayDataFile
for each in "${Array[@]}"
do
LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done
In order to quickly search for the 500,000 elements of the unidimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the code for speed? I need to display the output on a per-line basis in the GrepResultsOutputFile. It does not have to be the same code; any method that is the fastest, be it sed, awk, grep or any other method.
Is bash the best way at all? I've heard it's slow.
The data file is just a huge string of numbers, 21313541321 etc., as I have now clarified. Also, the ArrayDataFile is the one that has 500000 items. These are taken into an array by readarray in order to search the DATA file one by one and then get the results, on a per-line basis, into a new file. My specific question is about a search against a LARGE STRING, not an indexed file or a per-line sorted file, nor do I want the lines in which my array elements from the ArrayDataFile were found, or anything like that. What I want is to search a large string of data for every occurrence of every array element (taken from the ArrayDataFile) and print the results on the same line as they are located in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file using the code provided in this post. I could not use those solutions for my query, and my issue is not resolved by those answers; at least I have not been able to extrapolate a working solution for my sample code from those specific posts.
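Not an answer from the thread, just a hedged sketch of one way to avoid starting a new fgrep process for every pattern, which is usually the dominant cost of the loop above: a single awk pass that loads the whole DATA string once and counts non-overlapping matches of each pattern with gsub, printing one count per line in the same order as the ArrayDataFile (the file paths are the ones from the question):
LC_ALL=C awk '
    NR == FNR { data = data $0; next }          # first file: slurp the whole DATA string
    { tmp = data; print gsub($0, "", tmp) }     # second file: count non-overlapping matches of this pattern
' /home/USER/Documents/DATA ArrayDataFile > GrepResultsOutputFile
The patterns are plain digit strings, so using them as regular expressions here is safe; whether this actually beats the original loop depends on how large the DATA string is.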

Read matrices from multiple .csv files and print matrices in .csv files

So I have to write a C program to read data from .csv files supplied to me by multiple users into matrices, on which I will perform some operations (like matrix addition, or multiplication with the necessary conditions on dimensions, etc.) and print these matrices (or the output data) into .csv files again.
I also need to dynamically allocate memory to my matrices.
Now, I have zero background in dealing with .csv files. I do not know the code required to read a .csv file or write into one. I have searched the Internet at length but, surprisingly, I have not found any program that teaches how to deal with .csv files at an elementary level.
I am lost on this and need a lot of guidance, maybe a complete, well-written sample C program, as I need a comprehensive example to begin with.
A CSV file is just a plain ASCII text file that contains a grid of values. Think of the file as a set of rows in a database table, where each line in the file represents one record and the order of the data in each line is identical. Each item of data is separated using a comma character (hence the name). So to read the file:
open file
until the end of the file
read line into a string
split the string into sub strings where ',' is the delimiter
parse each sub string
Since there is no formatting information in a CSV file, if a value is a string, what do you do when the value itself contains a comma? For reading numbers that is not a problem for you.
You could read the file in several passes, the first to determine the amount of data there is (number of columns, number of rows, etc) and the second to actually read the data.
Writing the CSV is quite simple:
open file
for each record to write
for each element to write
write element
if not last element
write a comma
write a new line

Powershell - Getting an item in an array, and all items after that match

I have a bunch of files, some ending with .a and some with .b. I have already created an array with all these elements, and when I echo them out I get:
1.a
111.b
112.b
113.b
114.b
2.a
111.b
112.b
3.a
111.b
112.b
113.b
etc.
These will always be sorted in the correct order, with the oldest entries appearing at the start of the array, and the newest at the bottom.
How could I get the latest '.a' file, and all '.b' files since then?
In the above case, I need to return
3.a
111.b
112.b
113.b
Thanks!
In a process {} block, keep an array of entries ending in ".b", adding to it every time one shows up, and empty it whenever an ".a" entry comes along. Then, in the end {} block, dump the last ".a" entry seen along with the accumulated ".b" array.
Basically a fairly standard accumulate-and-release pattern.

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets of unlimited maximum size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours;
would you have a suggestion for improving this?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.
I found that most of the time was spent resizing the file, as I was resizing at each step, so I now first go through all my files and get their lengths (they are variable).
Then I create the global h5file, setting the total length to the sum of all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so it should take less than 2 hours, whereas before it was taking much more.
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make text file listing the files to concatenate in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a Julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5
inputfilepath = ARGS[1]    # text file listing the HDF5 files to concatenate
outputfilepath = ARGS[2]   # destination HDF5 file
firstit = true
data = []
for line in eachline(inputfilepath)
    global data, firstit
    r = strip(line)
    println(r)
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(data, datai; dims=4)   # in this case, concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
