Split file in dynamic name in unix - file

I have a file of 500 million records that I need to split into files of 1 million records each. The output files should be named dynamically with a numeric suffix. I tried:
split -dl 1000000 myInputFile.txt output_
But after creating the files output_00 through output_99, I got the error:
split: output file suffixes exhausted
Any suggestions?

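man split says to use
-a 3
for suffixes of length 3 (three digits when combined with -d), so the command becomes:
split -a 3 -dl 1000000 myInputFile.txt output_
Three digits allow up to 1000 output files, which covers the 500 needed here.
If all else fails, read the manual.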
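As a small-scale sketch of the suffix-length fix, the following runs the same kind of split on a hypothetical 250-line sample file with 2 lines per chunk, so that more than 100 output files are needed. The scratch directory and the sample sizes are assumptions made for the demonstration; the -d flag for numeric suffixes is assumed to be available, as in GNU split.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: 250 lines stand in for the 500-million-record file.
seq 1 250 > myInputFile.txt

# -a 3: suffixes of length 3, -d: numeric suffixes, -l 2: two lines per chunk
# (the real job would use -l 1000000).
split -a 3 -d -l 2 myInputFile.txt output_

# Count the chunks that were created (output_000 ... output_124).
ls | grep -c '^output_'
```

With two-character suffixes the same command would run out of names after output_99; the third character leaves room for up to 1000 chunks.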


unix file utility: magic syntax

I would like to create a custom magic file for the file utility, but I'm having a really hard time understanding the syntax described in man magic.
I need to test several places, each of which can contain several strings. Only if all the tests succeed would it print a file type.
To summarize, I would like a test similar to this if it were fields in an SQL database:
( byte_0 = "A" OR byte_0 = "B" OR byte_0 = "C" )
AND
( byte_1_to_3 = "DEF" OR byte_1_to_3 = "GHI" OR byte_1_to_3 = "JKL" )
Or in Perl regexp syntax:
m/^
[ABC]
(DEF|GHI|JKL)
/x
file has its own syntax, with hundreds of examples. If the documentation is unclear, you should start by reading examples which are close to your intended changes. That's what I did with ncurses for example, in the terminfo magic-file, to describe the Solaris xcurses header as a sequence of strings:
# Rather than SVr4, Solaris "xcurses" writes this header:
0 regex \^MAX=[0-9]+,[0-9]+$
>1 regex \^BEG=[0-9]+,[0-9]+$
>2 regex \^SCROLL=[0-9]+,[0-9]+$
>3 regex \^VMIN=[0-9]+$
>4 regex \^VTIME=[0-9]+$
>5 regex \^FLAGS=0x[[:xdigit:]]+$
>6 regex \^FG=[0-9],[0-9]+$
>7 regex \^BG=[0-9]+,[0-9]+, Solaris xcurses screen image
#
but without the insight gained by reading this example,
0 string \032\001
# 5th character of terminal name list, but not Targa image pixel size (15 16 24 32)
>16 ubyte >32
# namelist, if more than 1 separated by "|" like "st|stterm| simpleterm 0.4.1"
>>12 regex \^[a-zA-Z0-9][a-zA-Z0-9.][^|]* Compiled terminfo entry "%-s"
the manual page was not (as you report) clear enough that file processes a numbered series of steps in sequence.

grep results in array

I have a document which contains several names of files over which I want to use grep to gather all files with the xsd extension. When I use grep with my regex pattern, I get the correct results, about 18 of them. Now I want to store these results in an array. I used the following bash code :
targets=($(grep -i "AppointmentManagementService[\.]" AppointmentManagementService\?wsdl))
Then I print the array size :
echo ${#targets[@]}
which turns out to be 80 instead of 18 since it stored only a part of one result into an array cell. How do I make sure only one result goes into one array cell?
The results probably get split over multiple cells because a character (most likely space) is interpreted as an internal field separator.
Try executing it like this:
IFS=$'\n' targets=($(grep -i "AppointmentManagementService[\.]" AppointmentManagementService\?wsdl))
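Another way to get one result per array cell, sketched here under the assumption that bash 4 or later is available, is the mapfile builtin, which reads one line per element and leaves IFS untouched. The sample input file and its contents below are hypothetical stand-ins for the document in the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample input standing in for the WSDL document from the question.
input=$(mktemp)
printf '%s\n' \
  '<xsd:import schemaLocation="AppointmentManagementService.xsd1.xsd" namespace="a b"/>' \
  '<xsd:import schemaLocation="AppointmentManagementService.xsd2.xsd" namespace="c d"/>' \
  '<unrelated line/>' > "$input"

# mapfile -t stores each matching line in its own array element
# and strips the trailing newline; spaces inside a line are preserved.
mapfile -t targets < <(grep -i 'AppointmentManagementService[.]' "$input")

# Print the number of elements: one per matching line.
echo "${#targets[@]}"
```

Because the lines are never word-split, spaces inside a match no longer spread one result over several cells.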

Batch Script to Remove Special Characters

I am trying to create a batch file, but I have been unable to find a way to do it.
My requirements are below.
1) I have a group of text files, such as Text file 1, Text file 2, Text file 3.
2) Each text file contains some special characters.
3) I want to remove those special characters from all the text files.
4) Some of the special characters are ones we can type in Notepad.
5) So I need a batch file that can search for special characters by ASCII value and remove them.
Please let us know; I would be grateful.
Below is the text file format:
81
2016-03-13 00:13:05 2016-03-14 00:51:39 �# 81
101
2016-03-13 00:13:05 2016-03-14 03:02:48 xuyou �#
2016-03-14 03:16:06 2016-03-14 08:16:13 =M 100
2016-03-14 03:16:06 2016-03-14 08:16:13
2016-03-14 03:16:06 2016-03-14 08:16:41 Search : ��~ 100
dhfcjchjchjcdhj �
huge files are not ready f okay
~
fd
Looks like binary data. Maybe the easiest way will be to use Strings.exe:
strings -n 1 -a -q nottextfile >purified
and see if the purified file contains what you want.

How to create multiple files of the same size from a variable?

I have a shell script with a variable from which I create an output file as follows:
Variable >> file.txt
Result:
file.txt 20 kilobytes
Then I have to split that output file into several files of the same size using the split command.
Result:
file01.txt 10 kilobytes
file02.txt 10 kilobytes
My question is:
Is there any way to apply the equivalent of split instruction while creating the output file? This is the expected output:
Variable >> file.txt / / Adding here the code needed to do the split
Result:
file01.txt 10 kilobytes
file02.txt 10 kilobytes
For example:
echo "$var" | split -b 10240
You can specify the output file prefix like this (quote the variable so its whitespace and newlines are preserved):
echo "$var" | split -b 10240 - dir1/mysplits
which produces filenames dir1/mysplitsaa, dir1/mysplitsab, dir1/mysplitsac, ... You can also rename these files after split of course.
You can chain any number of commands together by putting && between them - this basically tells the shell that if the first command succeeded, then perform the second command.
Alternatively, you can "pipe" data from one command to the next. This is done with |, and essentially takes the output of the first command and passes it as input to the second command.
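As a rough end-to-end sketch of the pipe approach, the following builds a hypothetical 20 KiB variable and splits it into 10 KiB pieces without writing an intermediate file. The scratch directory, the payload, and the prefix "file" are assumptions chosen for the demonstration.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical 20 KiB payload standing in for the script's variable.
var=$(head -c 20480 /dev/zero | tr '\0' 'x')

# printf '%s' writes the variable without appending a newline;
# "-" tells split to read standard input, "file" is the output prefix.
printf '%s' "$var" | split -b 10240 - file

# Show the size of each piece (fileaa and fileab).
wc -c fileaa fileab
```

printf is used instead of echo here so that the byte count is not thrown off by the trailing newline echo would add.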

How to find duplicate lines across 2 different files? Unix

From the Unix terminal, we can use diff file1 file2 to find the difference between two files. Is there a similar command to show the similarity across 2 files? (Many pipes allowed if necessary.)
Each file contains one sentence string per line; the files are sorted and duplicate lines have been removed with sort file1 | uniq.
file1: http://pastebin.com/taRcegVn
file2: http://pastebin.com/2fXeMrHQ
And the output should contain the lines that appear in both files.
output: http://pastebin.com/FnjXFshs
I am able to use Python to do it as shown here, but I think it's a little too much to put into the terminal:
x = set(line.strip() for line in open('wn-rb.dic'))
y = set(line.strip() for line in open('wn-s.dic'))
z = x.intersection(y)
# Open the output file for writing; the default mode is read-only.
with open('reverse-diff.out', 'w') as outfile:
    for line in z:
        print(line, file=outfile)
If you want to get a list of repeated lines without resorting to AWK, you can use the -d flag of uniq:
sort file1 file2 | uniq -d
As @tjameson mentioned, this may already be solved in another thread.
I would just like to post another solution:
sort file1 file2 | awk 'dup[$0]++ == 1'
Refer to an awk guide for the basics: when the pattern for a line evaluates to true, that line is printed.
dup[$0] is a hash table whose keys are the input lines. Each value starts at 0 and is incremented every time its line occurs, so when a line occurs for the second time the value is 1 and dup[$0]++ == 1 is true.
That line is then printed.
Note that this only works when there are not duplicates in either file, as was specified in the question.
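To make the behaviour concrete, here is a minimal sketch that runs both pipelines on two hypothetical three-line files. The scratch directory, the file contents, and the output file names are invented for the demonstration; each file is free of internal duplicates, as the question specifies.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical inputs: sorted, with no duplicates inside either file.
printf '%s\n' apple banana cherry > file1
printf '%s\n' banana cherry date > file2

# uniq -d keeps one copy of every line that occurs more than once.
sort file1 file2 | uniq -d > common_uniq.txt

# awk prints a line the second time it is seen.
sort file1 file2 | awk 'dup[$0]++ == 1' > common_awk.txt

# Both result files hold the lines shared by file1 and file2.
cat common_uniq.txt
```

Both approaches report banana and cherry here, the two lines that appear in both inputs.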
