Merging two text files in C

Say I have 2 text files:
A.txt:
Param1: Value1
Param2: Value2
...
ParamM: ValueM

B.txt:
Param1: Value1
Param2: Value2
...
ParamN: ValueN
The number of parameters in A.txt (M) can be greater than, less than, or equal to the number of parameters in B.txt (N).
Values for the same parameter need not be the same in A.txt and B.txt.
M and N can probably reach a maximum value of about 200.
The parameter names are arbitrary and contain no numbers; the above was just an illustration.
My objective is to merge A.txt and B.txt. If any conflicts occur, I have a file/in-memory store which dictates which one takes precedence.
For example,
A.txt could look like:
Path: C:\Program\Files
Data: ImportantInfo.dat
Version: 1.2.3
Useless: UselessParameter.txt
and
B.txt could look like:
Path: C:\ProgramFiles
Data: NotSoImportant.dat
Version: 1.0.0
Useful: UsefulParameter.txt
The final text file should look like:
Path: C:\ProgramFiles
Data: ImportantInfo.dat
Version: 1.2.3
Useful: UsefulParameter.txt
Now my approach to thinking about this is:
get a line from A.txt
get a line from B.txt
tokenize both by ":"
compare param names
if same
    write A.txt's value to Result.txt
else if different
    write A.txt's line into Result.txt
    write B.txt's line into Result.txt    /* Order doesn't really matter */
repeat above steps until end of both text files
This approach doesn't handle the case where A.txt has a certain parameter and B.txt doesn't. It would be extremely helpful if some lightweight library/API exists to do this. I am not looking for command-line tools (because of the overhead of system() calls).
Thanks!

This would be so much easier in a language that has a Map datatype.
I would do it like this:
Read all of B and store the key-value strings in memory
Read A and for each key either overwrite the value (if it exists) or add it (if it doesn't)
Output the key-value pairs to a file
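For concreteness, here is a minimal sketch of that approach in C. It assumes fixed file names (A.txt, B.txt, Result.txt), a "Name: Value" line format, a cap of a few hundred parameters, and the simplification that A.txt always wins a conflict; plug your own precedence lookup in where marked.

/* Minimal sketch: merge B.txt and A.txt into Result.txt; A wins on conflict. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PARAMS 512
#define MAX_LINE   1024

struct entry { char name[MAX_LINE]; char value[MAX_LINE]; };

static struct entry table[MAX_PARAMS];
static int count = 0;

/* Linear lookup by parameter name; returns NULL if not present. */
static struct entry *find(const char *name)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Read "Name: Value" lines; 'overwrite' decides what happens on a clash. */
static void load(const char *path, int overwrite)
{
    char line[MAX_LINE];
    FILE *fp = fopen(path, "r");
    if (!fp) { perror(path); exit(EXIT_FAILURE); }

    while (fgets(line, sizeof line, fp)) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip trailing newline */
        char *sep = strchr(line, ':');
        if (!sep) continue;                   /* skip lines with no ':' */
        *sep = '\0';
        const char *name  = line;
        const char *value = sep + 1;
        while (*value == ' ') value++;        /* skip space after ':' */

        struct entry *e = find(name);
        if (e) {
            if (overwrite)                    /* precedence decision goes here */
                snprintf(e->value, sizeof e->value, "%s", value);
        } else if (count < MAX_PARAMS) {
            snprintf(table[count].name,  sizeof table[count].name,  "%s", name);
            snprintf(table[count].value, sizeof table[count].value, "%s", value);
            count++;
        }
    }
    fclose(fp);
}

int main(void)
{
    load("B.txt", 0);   /* B first: its unique parameters are kept */
    load("A.txt", 1);   /* A second: on a conflict, A's value replaces B's */

    FILE *out = fopen("Result.txt", "w");
    if (!out) { perror("Result.txt"); return EXIT_FAILURE; }
    for (int i = 0; i < count; i++)
        fprintf(out, "%s: %s\n", table[i].name, table[i].value);
    fclose(out);
    return 0;
}

A linear scan over at most a couple hundred entries is more than fast enough here; if the files ever grow much larger, the find() helper could be swapped for a real hash table.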

Related

Appending 0's to file names according to the maximum number of digits

I am trying to make a file sorter. In the current directory I have files named like this:
info-0.jpg
info-12.jpg
info-40.jpg
info-5.jpg
info-100.jpg
I want it to become
info-000.jpg
info-012.jpg
info-040.jpg
info-005.jpg
info-100.jpg
That is, append 0's so that the number of digits is equal to 3, because the max number was 100 and had 3 digits.
I would like to use cut and wc in a loop over each of the file names (if $1 is "info", then for i in $1-*.jpg), but how? Thanks
I did this to start, but I get a syntax error:
wcount=0
for i in $filename-*.jpg; do
    wcount=$((echo $i | wc -c))
done
for f in info*.jpg ; do
    numPart=${f%.*}        #dbg echo numPart1=$numPart
    numPart=${numPart#*-}  #dbg echo numPart2=$numPart
    newFilename="${f%-*}"-$(printf '%03d' "$numPart")."${f##*.}"
    echo /bin/mv "$f" "$newFilename"
done
The key is using printf with a format that forces the width to 3 digits and includes 0 padding: the printf '%03d' "$numPart" portion of the script.
Also, the syntax ${f%.*} is one of a set of features offered by modern shells to remove parts of a variable's value: % removes the shortest match from the right side of the value, and the # in ${numPart#*-} removes the shortest match from the left side. There are also %% (longest match from the right) and ## (longest match from the left). Experiment with a variable on your command line to get comfortable with this.
Triple-check the output of this code in your environment, and only when you are sure all the mv commands look correct, remove the echo in front of /bin/mv.
If you get an error message like Can't find /bin/mv, then enter type mv and replace /bin/ with whatever path is returned for mv.
IHTH

Echoing an array containing elements with spaces as an argument to another command

I am writing a little script that outputs a list of duplicate files in the directory, i.e. pairs of XXX.jpg and XXX (1).jpg. I want to use the output of this script as an argument to a command, namely ql (Quick Look), so I can look through all such images (to verify they are indeed duplicate images and not just similar filenames). For instance, I can do ql *\(*, which lets me look through all the 'XXX (1).jpg' files; but I want that list to also include the original 'XXX.jpg' files.
Here is my script so far:
dups=()
for file in *\(*; do
    dups+=( "${file}" )
    breakdown=( $file )
    dupfile="${breakdown[0]}.jpg"
    if [ -e "$dupfile" ]; then
        dups+=( "$dupfile" )
    fi
done
echo ${dups[@]}
As far as building an array of the required filenames goes, it works. But when it comes to invoking something like ql $(./printdups.sh), the command gets confused by the filenames with spaces. It will attempt to open 'XXX' as a file, and then '(1).jpg' as another file. So the question is, how can I echo this array such that filenames with spaces are recognised as such by the command I pass it to?
I have tried changing line 3 to:
dups+=( "'$file'" )
And:
dups+=( "${file/ /\ }" )
Both to no avail.
You can't pass arrays from one process to another. All you are doing is writing a space-separated sequence of file names to standard output, and the unquoted command substitution in ql $(./printdups.sh) fails for the same reason you need an array in the first place: word-splitting does not distinguish between spaces in file names and spaces between file names.
I would recommend defining a function, rather than a script, and have that function populate a global array that you can access directly after the function has been called.
get_dups () {
    dups=()
    for file in *\(*; do
        dups+=( "$file" )
        read -a breakdown <<< "$file"   # safer way to split the name into parts
        dupfile="${breakdown[0]}.jpg"
        if [ -e "$dupfile" ]; then
            dups+=( "$dupfile" )
        fi
    done
}
get_dups
ql "${dups[@]}"

How to create multiple files of the same size from a variable?

I have a shell script with a variable from which I create an output file, as follows:
Variable >> file.txt
Result:
file.txt 20 kilobytes
Then I have to split that output file into several files of the same size using the split command.
Result:
file01.txt 10 kilobytes
file02.txt 10 kilobytes
My question is:
Is there any way to apply the equivalent of the split command while creating the output file? This is what I expect:
Variable >> file.txt   // Adding here the code needed to do the split
Result:
file01.txt 10 kilobytes
file02.txt 10 kilobytes
An example:
echo $var | split -b 10240
You can specify the output file prefix like this:
echo $var | split -b 10240 - dir1/mysplits
which produces filenames dir1/mysplitsaa, dir1/mysplitsab, dir1/mysplitsac, ... You can also rename these files after split of course.
You can chain any number of commands together by putting && between them; this basically tells the shell to run the second command only if the first one succeeded.
Alternatively, you can "pipe" data from one command to the next. This is done with |, and essentially takes the output of the first command and passes it as input to the second command.

How to find duplicate lines across 2 different files in Unix?

From the Unix terminal, we can use diff file1 file2 to find the difference between two files. Is there a similar command to show the similarities across 2 files (many pipes allowed if necessary)?
Each file contains lines with string sentences; they are sorted and have had duplicate lines removed with sort file1 | uniq.
file1: http://pastebin.com/taRcegVn
file2: http://pastebin.com/2fXeMrHQ
And the output should contain the lines that appear in both files.
output: http://pastebin.com/FnjXFshs
I am able to do it in Python as follows, but I think it's a little too much to put into the terminal:
x = set([i.strip() for i in open('wn-rb.dic')])
y = set([i.strip() for i in open('wn-s.dic')])
z = x.intersection(y)
outfile = open('reverse-diff.out', 'w')  # open for writing
for i in z:
    print>>outfile, i
If you want to get a list of repeated lines without resorting to AWK, you can use the -d flag to uniq:
sort file1 file2 | uniq -d
As @tjameson mentioned, it may be solved in another thread.
Just would like to post another solution:
sort file1 file2 | awk 'dup[$0]++ == 1'
Refer to an awk guide to get some awk basics: when the pattern of a line evaluates to true, that line is printed.
dup[$0] is a hash table in which each key is a line of the input. The value starts at 0 and is incremented each time that line occurs; on the second occurrence dup[$0]++ evaluates to 1, so dup[$0]++ == 1 is true and the line is printed.
Note that this only works when there are no duplicates within either file, as was specified in the question.

MD5 checksum of the whole file is different from checksum of content

I created a file a.txt containing one word, 'dog'.
Here is its MD5 checksum:
$ md5sum a.txt
c52605f607459b2b80e0395a8976234d a.txt
Here is the MD5 checksum of the word 'dog':
$ perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
Why are checksums different?
Thank you,
Martin
Presumably you have a newline at the end of the file. Try using echo -n:
$ perl -e "use Digest::MD5 qw(md5_base64 md5_hex); print(md5_hex('dog'));"
06d80eb0c50b49a509b49f2424e8c805
$ echo 'dog' >a.txt
$ md5sum a.txt
362842c5bb3847ec3fbdecb7a84a8692 a.txt
$ echo -n 'dog' >a.txt
$ md5sum a.txt
06d80eb0c50b49a509b49f2424e8c805 a.txt
This is quite a common question:
Why does Perl and /bin/sha1 give different results?
Why PHP's md5 is different from OpenSSL's md5?
Why is the same input returning two different MD5 hashes?
md5_base64 is just part of the import declaration:
use Digest::MD5 qw(md5_base64 md5_hex)
means that I can use the functions md5_base64() or md5_hex() from the Digest::MD5 module.
Basically, you can use tools other than Perl to compute the MD5 hash of the word...
I'm wondering why the checksum of the file (using md5sum) is different from the checksum of the content itself...
Does md5sum append some information about the file to the content before computing the MD5?
Or is there some "end of file" character?
Thank you for your time...
When you perform an MD5 calculation on a file (a .txt file in your situation), the whole content of the file is taken into consideration, including control characters such as LF and CR. They are non-printable characters, but they have hex values that change the resulting MD5, which makes it different from the result of just passing the bare string to the MD5 function.
