How to split a large file into small ones by line number - file

I am trying to split my large file into small pieces by line number. For example, my file has 30,000,000 lines and I would like to divide it into small files, each of which has 10,000 lines (equivalent to 3,000 small files).
I used split in Unix, but it seems to be limited to only 100 files.
Is there a way of overcoming this limitation of 100 files?
If there is another way of doing this, please advise as well.
Thanks.

Using GNU awk:
gawk '
BEGIN {
    i = 1
}
{
    print $0 > ("small" i ".txt")
}
NR%10==0 {
    close("small" i ".txt"); i++
}' bigfile.txt
Test:
[jaypal:~/temp] seq 100 > bigfile.txt
[jaypal:~/temp] gawk 'BEGIN {i=1} {print $0 > ("small" i ".txt") } NR%10==0 { close("small" i ".txt"); i++ }' bigfile.txt
[jaypal:~/temp] ls small*
small1.txt small10.txt small2.txt small3.txt small4.txt small5.txt small6.txt small7.txt small8.txt small9.txt
[jaypal:~/temp] cat small1.txt
1
2
3
4
5
6
7
8
9
10
[jaypal:~/temp] cat small10.txt
91
92
93
94
95
96
97
98
99
100
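The 100-file limit mentioned in the question is typically just split's default two-character numeric suffix width; GNU split can widen it. A rough sketch (assumes GNU coreutils split; the -d numeric suffixes, the -a 4 suffix width and the prefix small are example choices):
# produces small0000, small0001, ... each holding 10,000 lines
split -l 10000 -d -a 4 bigfile.txt small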

Not an answer, just adding a way to do the renaming part as requested in a comment:
$ touch 000{1..5}.txt
$ ls
0001.txt 0002.txt 0003.txt 0004.txt 0005.txt
$ rename 's/^0*//' *.txt
$ ls
1.txt 2.txt 3.txt 4.txt 5.txt
I also tried the above with 3000 files without any problems.
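The rename used above is the Perl rename. If only the util-linux rename is available, a plain shell loop is a rough equivalent (a sketch; -n just avoids clobbering an existing target):
for f in 0*.txt; do
    new=$f
    while [[ $new == 0* ]]; do new=${new#0}; done   # strip leading zeros
    mv -n -- "$f" "$new"
done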

Related

Complete file2 with data from file1

I have two files with fields separated by tabs:
File1 has 13 columns and 90 million lines (~5 GB). The number of lines in file1 is always smaller than the number of lines in file2.
1 1 27 0 2 0 0 1 0 0 0 1 false
1 2 33 0 3 0 0 0 0 0 0 1 false
1 5 84 3 0 0 0 0 0 0 0 2 false
1 6 41 0 1 0 0 0 0 0 0 1 false
1 7 8 4 0 0 0 0 0 0 0 1 false
File2 has 2 columns and 100 million lines (1.3 GB):
1 1
1 2
1 3
1 4
1 5
What I want to achieve:
When the pair of columns $1/$2 in file2 is identical to the pair $1/$2 in file1, I would like to print $1 and $2 from file2 and $3 from file1 into an output file. In addition, if the pair $1/$2 from file2 does not have a match in file1, print $1/$2 in the output and leave the 3rd column empty. Thus, the output keeps the same structure (number of lines) as file2.
If relevant: the pairs $1/$2 are unique within both file1 and file2, and both files are sorted by $1 first and then by $2.
Output file:
1 1 27
1 2 33
1 3 45
1 4
1 5 84
What I have done so far:
awk -F"\t" 'NR == FNR {a[$1 "\t" $2] = $3; next } { print $0 "\t" a[$1 "\t" $2] }' file1 file2 > output
The command runs for a few minutes and unexpectedly stops without any additional information. When I open the output file, the first 5 to 6 million lines have been correctly processed (I can see the 3rd column was correctly added), but the rest of the output file does not have a 3rd column. I am running this command on a 3.2 GHz Intel Core i5 with 32 GB 1600 MHz DDR3. Any ideas why the command stops? Thanks for your help.
You are close.
I would do something like this:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
key in seen {print key, seen[key]}
' file1 file2
Or, since file1 is bigger, reverse which file is held in memory:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]; next}
key in seen {print key, $3}
' file2 file1
You could also use join, which will likely handle files much larger than memory. This is BSD join, which can use multiple fields for the join:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2
join requires the files to be sorted, as your examples are. If they are not sorted, you could do:
join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 <(sort -n file1) <(sort -n file2)
Or, if your join can only use a single field, you can temporarily use ' ' as the field separator between fields 2 and 3 of file1 (so the tab-separated $1/$2 pair acts as a single join field) and tell join to use the space as its delimiter:
join -1 1 -2 1 -t $' ' -o 1.1,2.2 <(sort -k1n -k2n file2) <(awk '{printf("%s\t%s %s\n",$1,$2,$3)}' file1 | sort -k1n -k2n) | sed 's/[ ]/\t/'
Either awk or join prints:
1 1 27
1 2 33
1 3 45
1 4 7
1 5 84
Your comment:
After additional investigation, the suggested solutions did not work because my question was not properly asked (my mistake). The suggested solutions printed lines only when matches between pairs ($1/$2) were found between files 1 and 2. Thus, the resulting output file always has the number of lines of file1 (which is always smaller than file2). I want the output file to keep the same structure as file2, that is, the same number of lines (for further comparison). The question has been refined accordingly.
If your computer can handle the file sizes:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' file1 file2
Otherwise you can filter file1 so that only the matches are fed to awk from file1, and then file2 dictates the final output structure:
awk 'BEGIN{FS=OFS="\t"}
{key=$1 FS $2}
NR==FNR{seen[key]=$3; next}
{if (key in seen)
print key, seen[key]
else
print key
}
' <(join -1 1 -1 2 -2 1 -2 2 -t $'\t' -o 1.1,1.2,1.3 file1 file2) file2
If you still need something more memory-efficient, I would break out Ruby for a line-by-line solution:
ruby -e 'f1=File.open(ARGV[0]); f2=File.open(ARGV[1])
l1=f1.gets                                      # prime the first lookup line from file1
f2.each { |l2|                                  # stream file2 line by line
  l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)    # first three columns of the file1 line, as integers
  l2a=l2.chomp.split(/\t/).map(&:to_i)          # the two key columns of the file2 line
  # advance through file1 while its key still sorts before the current file2 key
  while((tst=l1a[0..1]<=>l2a)<0 && !f1.eof?)
    l1=f1.gets
    l1a=l1.chomp.split(/\t/)[0..2].map(&:to_i)
  end
  if tst==0                                     # keys match: append file1 column 3
    l2a << l1a[2]
  end
  puts l2a.join("\t")
}
' file1 file2
Issues with OP's current awk code:
Testing shows that loading file1 into memory (a[$1 "\t" $2] = $3) requires ~290 bytes per entry; for 90 million rows this works out to ~26 GBytes. This amount of memory usage should not be an issue on OP's system (max of 32 GBytes) ... assuming all other processes are not consuming 6+ GBytes; having said that ...
In the 2nd half of OP's script (ie, file2 processing) the print / a[$1 "\t" $2] will actually create a new array entry if one doesn't already exist (ie, if a file2 key is not found in file1 then a new array entry is created); since we know this situation can occur, we have to take into consideration the amount of memory required to store entries from file2 in the a[] array ...
Testing shows that loading file2 into memory (a[$1 "\t" $2] = $3) requires ~190 bytes per entry; for 100 million rows this works out to ~19 GBytes. Of course we won't be loading all of file2 into the a[] array, so the total additional memory will be less than 19 GBytes; then again, we only need to load about 26 million rows (from file2) into the a[] array, using up another ~5 GBytes (26 million * 190 bytes), to run out of memory.
OP has mentioned that processing 'stops' after generating 5-6 million rows of output; this symptom (a 'stopped' process) can occur when the system runs out of memory and/or goes into heavy swapping (also a memory issue). With 100 million rows in file2 and only 5-6 million rows in the output, that leaves 94-95 million rows from file2 unaccounted for, which in turn is considerably more than the 26 million rows it would take to use up the remaining ~5 GBytes of memory ...
Net result: OP's current awk script is likely hanging/stopped due to running out of memory.
Net result: we need to look at solutions that keep us from running out of memory; better would be solutions that use considerably less memory than the current awk code; even better would be solutions that use little (effectively 'no') memory at all ...
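If you want to sanity-check those per-entry estimates on your own machine, one rough approach (assuming GNU time is installed as /usr/bin/time; on macOS the flag is -l rather than -v) is to load a sample into an awk array and read off the peak resident set size:
head -n 1000000 file1 > sample1
/usr/bin/time -v awk -F'\t' '{a[$1 "\t" $2] = $3} END {print NR}' sample1
# "Maximum resident set size" divided by the row count approximates bytes per entry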
Assumptions/understandings:
both input files have already been sorted by the 1st and 2nd columns
within a given file the combination of the 1st and 2nd columns represents a unique key (ie, there are no duplicate keys in a file)
all lines from file2 are to be written to stdout while an optional 3rd column will be derived from file1 (when the key exists in both files)
General approach for a 'merge join' operation:
read from both files in parallel
compare the 1st/2nd columns from both files to determine what to print
memory usage should be minimal since we never have more than 2 lines (one from each file) loaded in memory
One awk idea for implementing a 'merge join' operation:
awk -v lookup="file1" ' # assign awk variable "lookup" the name of the file where we will obtain the optional 3rd column from
function get_lookup() { # function to read a new line from the "lookup" file
rc=getline line < lookup # read next line from "lookup" file into variable "line"; rc==1 if successful; rc==0 if reached end of file
if (rc) split(line,f) # if successful (rc==1) then split "line" into array f[]
}
BEGIN { FS=OFS="\t" # define input/output field delimiters
get_lookup() # read first line from "lookup" file
}
{ c3="" # set optional 3rd column to blank
while (rc) { # while we have a valid line from the "lookup" file look for a matching line ...
if ($1< f[1] ) { break }
else if ($1==f[1] && $2< f[2]) { break }
else if ($1==f[1] && $2==f[2]) { c3= OFS f[3]; get_lookup(); break }
# else if ($1==f[1] && $2> f[2]) { get_lookup() }
# else if ($1> f[1] ) { get_lookup() }
else { get_lookup() }
}
print $0 c3 # print current line plus optional 3rd column
}
' file2
This generates:
1 1 27
1 2 33
1 3
1 4
1 5 84
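If you would rather lean on join while still keeping every file2 line, a composite key works around the single-field limitation. A sketch assuming GNU join/sort and bash process substitution; the ':' key glue and the final numeric re-sort are my own choices, since join wants the join field in lexicographic order:
join -t $'\t' -a 1 -e '' -o 1.2,1.3,2.2 \
    <(awk 'BEGIN{FS=OFS="\t"} {print $1":"$2, $1, $2}' file2 | sort -t $'\t' -k1,1) \
    <(awk 'BEGIN{FS=OFS="\t"} {print $1":"$2, $3}' file1 | sort -t $'\t' -k1,1) |
sort -t $'\t' -k1,1n -k2,2n
Unmatched file2 keys come out with an empty (trailing-tab) third column, matching the awk versions above in content if not byte-for-byte.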

Bash Indented Output for Multiple Variables

I have a script that loops over every text file in a directory and stores the content in variables. The content can be anywhere from 1 to 50 characters long. The number of text files is unknown. I would like to print the content in such a way that each variable falls into a clean column.
for file in $LIBPATH/*.txt; do
name=$( awk 'FNR == 1 {print $0}' $file )
height=$( awk 'FNR == 2 {print $0}' $file )
weight=$( awk 'FNR == 3 {print $0}' $file )
echo $name $height $weight
done
This code produces the output:
Avril Stewart 99 54
Sally Kinghorn 170 60
John Young 195 120
While the desired output is:
Avril Stewart    99  54
Sally Kinghorn  170  60
John Young      195 120
Thanks!
Use printf:
printf '%-20s %3s %3s\n' "$name" "$height" "$weight"
%3s ensures that each field is printed at least three characters wide (right-aligned); %-20s does the same for 20 characters, but the - in front makes the output left-aligned.
If you want to limit the output to e.g. 20 characters, you can use
printf '%-20.20s %3s %3s\n' "$name" "$height" "$weight"
This gives you a left-aligned field with both a minimum and a maximum width of 20 characters; in other words, it ensures that the field always occupies exactly 20 characters.
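For completeness, here is the loop from the question rewritten around that printf. A sketch (it assumes each file really does hold name, height and weight on its first three lines) that also reads each file once instead of running awk three times:
for file in "$LIBPATH"/*.txt; do
    { IFS= read -r name; IFS= read -r height; IFS= read -r weight; } < "$file"
    printf '%-20s %3s %3s\n' "$name" "$height" "$weight"
done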

grep Trim txt file by certain line number

I have a txt file containing, let's say, 1000 lines. I would like to trim it, obtaining a file with 100 lines composed of lines 10, 20, 30, etc. of the original file.
Is that possible with grep or something? Thanks.
It can easily be done with an awk or sed one-liner:
awk
awk '!(NR%10)' file
sed
sed -n '0~10p' file
or
sed '0~10!d' file
See the example below (the sed one-liners give the same output).
Print the first 10 lines of the result:
kent$ seq 1000|awk '!(NR%10)'|head -10
10
20
30
40
50
60
70
80
90
100
total lines:
kent$ seq 1000|awk '!(NR%10)'|wc -l
100
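If the step size ever needs to change, the awk variant takes it as a variable; a small sketch (the variable name n is arbitrary):
n=10
awk -v n="$n" 'NR % n == 0' file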

Delete lines in a file containing argument passed on command line

I'm trying to delete specific lines based on the argument passed in.
My data.txt file contains
Cpu 500 64 6
Monitor 22 42 50
Game 32 64 128
My del.sh contains
myvar=$1
sed '/$myvar/d' data.txt > temp.txt
mv temp.txt data.txt
but it just writes every line to temp.txt and then on to data.txt (nothing gets deleted). However,
sed '/64/d' data.txt > temp.txt
will do the correct data transfer (but I don't want to hardcode 64). I feel like there's some kind of syntax error with the argument. Any input, please?
It's because of the single quotes; change them to double quotes. Variables inside single quotes are not interpolated, so you are sending the literal string $myvar to sed instead of the value of $myvar.
Change:
sed '/$myvar/d' data.txt
to:
sed "/$myvar/d" data.txt
Note: you will run into issues when $myvar contains regular-expression metacharacters or forward slashes, as pointed out in this response from Ed Morton. If you are not in complete control of your input, you will need to find another avenue to accomplish this.
Assuming this is undesirable behavior:
$ cat file
Cpu 500 64 6
Monitor 22 42 50
Game 32 64 128
$ myvar=6
$ sed "/$myvar/d" file
Monitor 22 42 50
$ myvar=/
$ sed "/$myvar/d" file
sed: -e expression #1, char 3: unknown command: `/'
$ myvar=.
$ sed "/$myvar/d" file
$
Try this instead:
$ myvar=6
$ awk -v myvar="$myvar" '{for (i=1; i<=NF;i++) if ($i == myvar) next }1' file
Monitor 22 42 50
Game 32 64 128
$ myvar=/
$ awk -v myvar="$myvar" '{for (i=1; i<=NF;i++) if ($i == myvar) next }1' file
Cpu 500 64 6
Monitor 22 42 50
Game 32 64 128
$ myvar=.
$ awk -v myvar="$myvar" '{for (i=1; i<=NF;i++) if ($i == myvar) next }1' file
Cpu 500 64 6
Monitor 22 42 50
Game 32 64 128
And if you think you can just escape the /s and use sed, you can't, because you might be adding a 2nd backslash to one that is already present:
$ foo='\/'
$ myvar=${foo//\//\\\/}
$ sed "/$myvar/d" file
sed: -e expression #1, char 5: unknown command: `/'
$ awk -v myvar="$myvar" '{for (i=1; i<=NF;i++) if ($i == myvar) next }1' file
Cpu 500 64 6
Monitor 22 42 50
Game 32 64 128
This is simply NOT a job you can, in general, do with sed due to its syntax and its restriction of only allowing REs in its search.
You can also use awk to do the same,
awk '!/'$myvar'/' data.txt > temp.txt && mv temp.txt data.txt
Use the -i option in addition to what @SeanBright proposed. Then you won't need > temp.txt and mv temp.txt data.txt.
sed -i "/$myvar/d" data.txt
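If you go with the awk approach instead (to get exact field matching) but still want in-place editing, GNU awk 4.1+ ships an inplace extension; a sketch assuming gawk specifically, not any awk:
myvar=64
gawk -i inplace -v myvar="$myvar" '{for (i=1; i<=NF; i++) if ($i == myvar) next} 1' data.txt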

Merge multiple files by common field - Unix

I have hundreds of files, each with two columns.
For example:
file1.txt
ID Value1
1 40
2 30
3 70
file2.txt
ID Value2
1 50
2 70
3 20
And so on, till
file150.txt
ID Value150
1 98
2 52
3 71
How do I merge these files based on the first column (which is common)? My output should be:
ID Value1 Value2...........Value150
1 40 50 98
2 30 70 52
3 70 20 71
Thank you.
Use a combination of cut and paste to solve the file-merging problem for three or more files. cd to the folder that contains only file1, file2, file3, ... file150:
i=0
cut -f 1 file1 > delim            ## keep the first (common ID) column as the leading column
for file in file*
do
    i=$(($i+1))                   ## add a counter to distinguish the new files from the originals
    cut -f 2 $file > ${file}__${i}.temp
done
paste -d\\t delim file*__*.temp > output
Another solution is to use join, merging two files at a time, step by step:
join -j 1 test1 test2 | join -j 1 test3 - | join -j 1 test4 -
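With 150 files that pipeline gets unwieldy to type, so a loop can build it up pairwise. A rough sketch assuming the files are named as in the question (file1.txt ... file150.txt), every ID appears in every file, and all files are sorted the same way on the ID column; note the glob expands lexically (file1.txt, file10.txt, file100.txt, ...), so reorder the list if the column order matters:
out=$(mktemp)
tmp=$(mktemp)
files=(file*.txt)
cp "${files[0]}" "$out"
for f in "${files[@]:1}"; do
    join -j 1 "$out" "$f" > "$tmp"
    mv "$tmp" "$out"
done
cat "$out"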
