I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all the dicts to be in the same list. I tried to do this in Python, but unfortunately I don't have the memory for this operation.
So I thought maybe I could tackle this with Unix commands, by replacing, in the first file, the ] with , (note that there is a space after the comma) and erasing the [ from the second file. Then I would join the two files with the Unix cat command.
Is there a way for me to edit only the last 10 characters in Unix?
I tried to use echo and tr but I might be doing something wrong with the syntax.
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
In any case, rewriting both files in place wouldn't help you much: you would at least need to rewrite the content of the second file to append it to the first file.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ] character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ] and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c +2 file_2.json >>file_1.json
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c +2 file_2.json >>concatenated.json
If you're more comfortable with Python, you can do all of this in Python. Just don't read the whole file in one go, i.e. don't call read() or readlines(), which read everything at once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
import io
import re

with open('concatenated.json', 'wb') as out:
    with open('file_1.json', 'rb') as inp:
        # Inspect the last 1 KiB to locate the final ] (possibly followed by whitespace).
        buf = bytearray(1024)
        size = inp.seek(-len(buf), io.SEEK_END)
        n = inp.readinto(buf)
        m = re.search(rb']\s*\Z', buf[:n])
        stop_at = size + m.start()  # absolute offset of the final ]
        # Copy everything up to, but not including, the final ].
        inp.seek(0, io.SEEK_SET)
        remaining = stop_at
        while remaining > 0:
            n = inp.readinto(buf)
            if n == 0:
                break
            out.write(buf[:min(n, remaining)])
            remaining -= n
    out.write(b',')
    with open('file_2.json', 'rb') as inp:
        # Copy the second file, replacing its leading [ with a newline.
        buf = bytearray(1024)
        n = inp.readinto(buf)
        assert buf[:1] == b'['
        buf[0:1] = b'\n'
        while n > 0:
            out.write(buf[:n])
            n = inp.readinto(buf)
What is the best way to append line(s) to a file?
Currently I am using the following script:
/**
 * Append the `line` to the file given at the `path`.
 *
 * @param path
 *      The absolute or relative path to the file with
 *      extension
 * @param line
 *      The line to append
 * @param [max_lines=10000]
 *      The maximum number of lines to allow for a file
 *      to prevent an infinite loop
 */
void append(string path, string line, number max_lines){
    number f = OpenFileForReadingAndWriting(path);

    // go through file until the end is reached to set the
    // internal pointer to this position
    number line_counter = 0;
    string file_content = "";
    string file_line;
    while(ReadFileLine(f, file_line) && line_counter < max_lines){
        line_counter++;
        // file_content += file_line;
    }
    // result("file content: \n" + file_content + "{EOF}");

    // append the line
    WriteFile(f, line + "\n");

    CloseFile(f);
}

void append(string path, string line){
    append(path, line, 10000);
}
string path = "path/to/file.txt";
append(path, "Appended line");
For me it seems a little odd to read the whole file content just to append one line. If the file is very big, this is probably very slow¹. So I guess there is a better solution for this. Does anyone know it?
Some background
My application is written in Python but executed in Digital Micrograph. My Python application logs its steps. Sometimes I am executing dm-script from Python, and there I have no way to see what is going on. Since there is a bug, I need something to find out what is happening, so I want to add logging to dm-script too.
This also explains why I want to open and close the file every single time. This takes more time, but I don't care about execution speed while debugging. The logs will either be removed or switched off for the normal version, as usual. On the other hand, I am executing dm-script and Python alternately, so I have to prevent Python from blocking the file for dm-script and the other way around.
¹ As written in the background, I am not really interested in speed, so the current script is enough for me. Still, I am interested in how to do this better, just for learning's and curiosity's sake.
The best way to deal with any files in DM-script (binary or text) is to use the streaming object. The following example should answer your question:
void writeText()
{
    string path
    if ( !SaveAsDialog( "Save text as" , path , path ) ) return
    number fileID = CreateFileForWriting( path )
    object fStream = NewStreamFromFileReference( fileID , 1 )  // 1 = auto-close file when out of scope

    // Write some text
    number encoding = 0  // 0 = system default
    fStream.StreamWriteAsText( encoding , "The quick brown dog jumps over the lazy fox" )

    // Replace last 'fox' by 'dog'
    fStream.StreamSetPos( 1 , -3 )  // 3 bytes before current position
    fStream.StreamWriteAsText( encoding, "dog" )

    // Replace first 'dog' by 'fox'
    fStream.StreamSetPos( 0 , 16 )  // 16 bytes after start
    fStream.StreamWriteAsText( encoding, "fox" )

    // Append at end
    fStream.StreamSetPos( 2 , 0 )   // end position (0 bytes from end)
    fStream.StreamWriteAsText( encoding, "." )
}
writeText()
I am trying to split files evenly into a number of chunks. This is my code:
awk '/*/ { delim++ } { file = sprintf("splits/audio%s.txt", int(delim /2)); print >> file; }' < input_file
My files look like this:
"*/audio1.lab"
0 6200000 a
6200000 7600000 b
7600000 8200000 c
.
"*/audio2.lab"
0 6300000 a
6300000 8300000 w
8300000 8600000 e
8600000 10600000 d
.
It is giving me an error: awk: line 1: syntax error at or near *
I do not know enough about awk to understand this error. I tried escaping characters but still haven't been able to figure it out. I could write a script in python but I would like to learn how to do this in awk. Any awkers know what I am doing wrong?
Edit: I have 14021 files. I gave the first two as an example.
For one thing, your regular expression is illegal; '*' says to match the previous character 0 or more times, but there is no previous character.
It's not entirely clear what you're trying to do, but it looks like when you encounter a line with an asterisk you want to bump the file number. To match an asterisk, you'll need to escape it:
awk '/\*/ { close(file); delim++ } { file = sprintf("splits/audio%d.txt", int(delim /2)); print >> file; }' < input_file
Also note %d is the correct format character for decimal output from an int.
idk what all the other stuff around this question is about, but to just split your input file into separate output files, all you need is:
awk '/\*/{close(out); out="splits/audio"++c".txt"} {print > out}' file
Since "repetition" metacharacters like * or ? or + can take on a literal meaning when they are the first character in a regexp, the regexp /*/ will work just fine in some (e.g. gawk) but not all awks and since you apparently have a problem with having too many files open you must not be using gawk (which manages files for you) so you probably need to escape the * and close() each output file when you're done writing to it. No harm doing that and it makes the script portable to all awks.
I'm trying to open this file (final.txt) and read the contents:
c0001
f260
L
D11
H30
R0000
C0040
1X1100000100010B300300003003
181100202900027Part No
181100202900097[PRTNUM]
1e5504002400030B
1X1100002300010L300003
191100202000030Quantity
191100202000080[QUANTY]
1e5504001500040B
1X1100001400010L300003
1X1100001400150L003090
191100202000170P.O.No
191100202000220[PONUMB]
1e5504001500180B
191100201200030Supplier
1e3304000700030B
1X1100000600010L300003
181100200300030Serial
181100200300090[SERIAL]
171100300900190Rev
171100300300190[REV]
171100300900240Units
171100300300240[UNITS]
1X1100000100180L003130
Q0001
E
from which I am reading only [PRTNUM], [QUANTY], [PONUMB], [SERIAL], [UNITS].
I've written the following C program:
char* cStart = strchr(cString, '[');
if (cStart)
{
    // open bracket found
    *cStart++ = '\0';  // split the string at [
    char* cEnd = strchr(cStart, ']');
    // you could check here for the close bracket being found
    // and throw an exception if not
    *cEnd = '\0';  // terminate the keyword
    printf("Key: %s, Value: %s", cString, cStart);
}
// continue the loop
but now I want to replace these placeholders with data from the 2nd file:
132424235
004342
L1000
DZ12
234235
234235
I want to replace [PRTNUM] (from the 1st file) with 132424235 and so on... In the end my file should be updated with all this data. Can you tell me what function I should use in the above program?
If you don't mind an alternative approach, here's an algorithm that does the work in an elegant way:

1. Create one (large enough) temporary buffer. Also create (open) one output file, which will be the modified version.
2. Read a line from the input file into the buffer using fgets().
3. Search the buffer for the particular "keyword" using strstr().
4. If a match is found:
   4.1. Open the other input file.
   4.2. Read the corresponding data (line) using fgets().
   4.3. Replace the placeholder in the temporary buffer with the newly read value.
   4.4. Write the modified data to the output file.
5. If no match is found, write the original data to the output file. Then go to step 2.
6. Continue until fgets() returns NULL (which indicates that the file content has been exhausted).

Finally, the output file will have the data from the first file with those particular "placeholders" substituted with the values read from the second file.

Obviously, you need to polish the algorithm a little to make it work with multiple "placeholder" strings.
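For illustration, here is a minimal C sketch of this line-by-line approach. It is simplified: it keeps the values file open for the whole run instead of re-opening it per match, it treats any [...] token as a placeholder, and it assumes the replacement values appear in the second file in the same order as the placeholders in the template. The file names data.txt and output.txt are made up for illustration.

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *tpl = fopen("final.txt", "r");   /* template containing [PRTNUM] etc. */
    FILE *val = fopen("data.txt", "r");    /* one replacement value per line (name assumed) */
    FILE *out = fopen("output.txt", "w");  /* modified copy (name assumed) */
    char line[512], value[128];

    if (!tpl || !val || !out)
        return 1;

    while (fgets(line, sizeof line, tpl)) {
        char *start = strchr(line, '[');
        char *end = start ? strchr(start, ']') : NULL;
        if (start && end && fgets(value, sizeof value, val)) {
            value[strcspn(value, "\r\n")] = '\0';          /* strip the newline from the value */
            *start = '\0';                                 /* cut the line just before [ */
            fprintf(out, "%s%s%s", line, value, end + 1);  /* prefix + value + rest of the line */
        } else {
            fputs(line, out);                              /* no placeholder: copy unchanged */
        }
    }

    fclose(tpl);
    fclose(val);
    fclose(out);
    return 0;
}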
Keep an extra string (name it copy) large enough to hold file 1 plus some extra room to manage the replacement of [PRTNUM] with 132424235.
Start reading the first string that holds file 1 and keep copying it into the second string (copy); as soon as you encounter [PRTNUM], append 132424235 to the copy instead of the placeholder, and do the same for all the others.
And finally replace file1.txt with this second (copy) string.
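As a rough sketch of that copy-and-substitute step, using strstr() to find each occurrence of a placeholder (the buffer size below is a placeholder, and real code would need bounds checking against the size of the copy):

#include <stdio.h>
#include <string.h>

/* Copy src into dst, replacing every occurrence of key with value.
   dst must be large enough to hold the result. */
static void replace_all(char *dst, const char *src, const char *key, const char *value)
{
    size_t keylen = strlen(key);
    const char *hit;

    while ((hit = strstr(src, key)) != NULL) {
        size_t prefix = (size_t)(hit - src);
        memcpy(dst, src, prefix);   /* text before the placeholder */
        dst += prefix;
        strcpy(dst, value);         /* the replacement value */
        dst += strlen(value);
        src = hit + keylen;         /* continue after the placeholder */
    }
    strcpy(dst, src);               /* whatever is left */
}

int main(void)
{
    char copy[256];
    replace_all(copy, "181100202900097[PRTNUM]\n", "[PRTNUM]", "132424235");
    fputs(copy, stdout);            /* prints: 181100202900097132424235 */
    return 0;
}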
Is there any way to figure out the length of a .dat file (in terms of rows) without loading the file into the workspace?
Row Counter -- only loads one character per row:
Nrows = numel(textread('mydata.txt','%1c%*[^\n]'))
or file length in bytes (MATLAB):
datfileh = fopen(fullfile(path, filename));
fseek(datfileh, 0,'eof');
filelength = ftell(datfileh);
fclose(datfileh);
I'm assuming you are working with text files, since you mentioned finding the number of rows.
Here's one solution:
fid = fopen('your_file.dat','rt');
nLines = 0;
while (fgets(fid) ~= -1),
  nLines = nLines+1;
end
fclose(fid);
This uses FGETS to read each line, counting the number of lines it reads. Note that the data from the file is never saved to the workspace; it is simply used in the conditional check for the while loop.
It's also worth bearing in mind that you can use your operating system's built-in commands, so on Linux you could use the command
[s,w] = system('wc -l your_file.dat');
and then get the number of lines from the returned text (which is stored in w). (I don't think there's an equivalent command under Windows.)
I have a program that generates text files that can be up to 20 MB in size. Sometimes I only care about the last line in the file; is there a way to read just that line without wasting memory reading the rest of the file?
I could be mistaken, but without resorting to some trickery, it seems that you can't.
However, if you have a rough estimate of the length of the lines, you can open the file and then seek back from the end by, say, 1 KB.
local f = io.open([[c:\test_file.txt]], "r")
local len = f:seek("end")
f:seek("set", len - 1024)
local text = f:read("*a")
print(string.match(text, "[^%c]*$"))
f:close()
Hope this helps. Take into account that the pattern needs some refinement; it currently assumes that no control characters appear on a line. If your line contains, e.g., tabs, then it will capture from there to the end of the file.
I'm not sure how this will behave with files containing really big lines, since it will make lots of f:read(1) calls. I've also added an optional parameter to read more than just a single line.
-- assumes the file ends with a line break
function lastline(path, how_many)
    how_many = how_many or 1
    how_many = how_many + 1
    local f = assert(io.open(path))
    local new_lines_found = 0
    -- needs to find at least a pair of \n
    local len = f:seek("end")
    for back_by = 1, len do
        f:seek("end", -back_by)
        if f:read(1) == '\n' then
            new_lines_found = new_lines_found + 1
            if new_lines_found == how_many then
                local last_lines = f:read("*a")
                f:close()
                io.write(last_lines)
                return
            end
        end
    end
    f:close()
end
This is the cleanest solution I could come up with.
local function readLastLine(filePath)
    local file = io.open(filePath, "rb")
    local eof = file:seek("end")
    for i = 1, eof do
        file:seek("set", eof - i)
        if i == eof then break end
        if file:read(1) == '\n' then break end
    end
    local lastLine = file:read("*a")
    file:close()
    return lastLine
end