I have two huge json files (20GB each) and I need to join them. The files have the following content:
file_1.json = [{"key": "value"}, {...}]
file_2.json = [{"key": "value"}, {...}]
The main problem, however, is that I need all the dicts to be in the same list. I tried to do this in Python, but unfortunately I don't have enough memory for the operation.
So I thought maybe I could tackle this with Unix commands: replace the ] at the end of the first file with , (a comma followed by a space) and erase the [ from the second file. Then I would join the two files with the cat command.
Is there a way to edit only the last 10 characters of a file in Unix?
I tried to use echo and tr, but I might be doing something wrong with the syntax.
You can very easily append to a file in place, i.e. add characters at the end without rewriting the data that's already there. With the right tools (truncate if your system has it), you can truncate a file in place, i.e. remove characters at the end without rewriting the data that's staying. With the right tools (dd, if you're feeling adventurous), you can replace a part of a file by a string of the same length, without rewriting the unchanged parts. On the other hand, you can't remove characters from the beginning or middle of a file without rewriting the file (with a few exceptions that aren't relevant here).
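As a rough illustration of those three in-place operations, here is a minimal Python sketch. The helper names (append_in_place, truncate_in_place, overwrite_in_place) are made up for this example; they rely only on the standard file methods seek, write and truncate, and none of them rewrites the bytes that stay unchanged.

```python
import os

def append_in_place(path, data):
    """Add bytes at the end; the existing data is not rewritten."""
    with open(path, 'ab') as f:
        f.write(data)

def truncate_in_place(path, count):
    """Remove the last `count` bytes; the remaining data is not rewritten."""
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        f.truncate(max(0, size - count))

def overwrite_in_place(path, offset, data):
    """Replace len(data) bytes starting at `offset` with `data`."""
    with open(path, 'r+b') as f:
        f.seek(offset)
        f.write(data)
```

There is deliberately no helper for removing bytes from the beginning or the middle: that is the case that requires rewriting everything after the removed part.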
But anyway rewriting both files in place wouldn't help you that much. You will need to at least rewrite the content of the second file to append it to the first file.
If you don't need to keep the split files around, you can append the second file to the first file in place, after taking care of the middle punctuation. Remove the last ] character from the first file, as well as any following spaces and line breaks. Assuming that the first file ends in ] and a newline and you have GNU core utilities (e.g. non-embedded Linux):
truncate -s -2 file_1.json
Now you can add a comma and optionally a line break to the first file, and append the data from the second file without its first character.
echo , >>file_1.json
tail -c +2 file_2.json >>file_1.json
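For comparison, the same in-place join could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function name join_in_place and the tail_size parameter are invented here, both files are assumed to hold non-empty JSON arrays, and the closing ] of the first file is assumed to sit within its last tail_size bytes.

```python
import io
import re
import shutil

def join_in_place(first_path, second_path, tail_size=1024):
    """Append the array in second_path to the array in first_path, in place."""
    with open(first_path, 'r+b') as first:
        # Inspect only the tail of the first file to find its closing bracket.
        size = first.seek(0, io.SEEK_END)
        tail_start = max(0, size - tail_size)
        first.seek(tail_start)
        match = re.search(rb'\]\s*\Z', first.read())
        if match is None:
            raise ValueError('first file does not end with ]')
        # Cut the file just before the bracket, then add the separator.
        first.truncate(tail_start + match.start())
        first.seek(0, io.SEEK_END)
        first.write(b',\n')
        with open(second_path, 'rb') as second:
            if second.read(1) != b'[':
                raise ValueError('second file does not start with [')
            # Copy the rest in 1 MiB blocks, never loading the whole file.
            shutil.copyfileobj(second, first, 1024 * 1024)
```

Unlike the fixed truncate -s -2, the regular expression tolerates any amount of trailing whitespace after the closing bracket.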
If you want to keep the original files unmodified, you can make a copy of the first file and truncate it. Or you can directly make a truncated copy of the first file (still assuming GNU coreutils):
head -c -2 file_1.json >concatenated.json
echo , >>concatenated.json
tail -c +2 file_2.json >>concatenated.json
If you're more comfortable with Python, you can do all of this in Python. Just don't read the whole file in one go, i.e. don't call read() or use readline() in a way that reads all the lines at once. Instead, read and process a single line at a time (if the lines are short) or a single block of data. Untested code:
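import io
import re

BLOCK = 1024 * 1024  # copy in 1 MiB blocks

with open('concatenated.json', 'wb') as out:
    with open('file_1.json', 'rb') as inp:
        # Find the closing ] by looking only at the last 1024 bytes.
        size = inp.seek(0, io.SEEK_END)
        tail_start = max(0, size - 1024)
        inp.seek(tail_start, io.SEEK_SET)
        m = re.search(rb'\]\s*\Z', inp.read())
        assert m is not None, 'file_1.json does not end with ]'
        stop_at = tail_start + m.start()
        # Copy everything that comes before the closing ].
        inp.seek(0, io.SEEK_SET)
        remaining = stop_at
        while remaining > 0:
            chunk = inp.read(min(BLOCK, remaining))
            if not chunk:
                break
            out.write(chunk)
            remaining -= len(chunk)
    out.write(b',')
    with open('file_2.json', 'rb') as inp:
        # Replace the opening [ with a line break, then copy the rest.
        assert inp.read(1) == b'['
        out.write(b'\n')
        while True:
            chunk = inp.read(BLOCK)
            if not chunk:
                break
            out.write(chunk)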
I am writing an R package that contains C and Rcpp. The goal is to call the C function from R and within Rcpp, eventually performing most of the analysis in Rcpp and only returning to R for minimal tasks. My package compiles and calling my function from R works fine.
#generate some matrix. Numeric is fine too. Must have column names, no row names
myMat <- matrix(data = 1:100, nrow = 10, ncol = 10,
dimnames = list(NULL, LETTERS[1:10]))
#This works. Put in full path, no expansion. It returns null to the console.
MinimalExample::WriteMat(mat = myMat, file = "Full_Path_Please/IWork.csv",
sep = "," ,eol = "\n", dec = ".", buffMB = 8L)
However, attempting the same thing in Rcpp produces a SIGSEGV error. I think the problem is how I am passing arguments to the function, but I can't figure out the proper way.
#include <Rcpp.h>
using namespace Rcpp;
extern "C"{
  #include "fwrite.h"
}
//' @export
// [[Rcpp::export]]
void WriteMatCpp(String& fileName, NumericMatrix& testMat){
  Rcpp::Rcout<<"I did start!"<<std::endl;
  String patchName = fileName;
  int whichRow = 1;
  std::string newString = std::string(3 - toString(whichRow).length(), '0')
                          + toString(whichRow);
  patchName.replace_last(".csv", newString+".csv");
  //Set objects to pass to print function
  String comma = ",";
  String eol = "\n";
  String dot = ".";
  int buffMem = 8;
  //This is where I crash, giving a SIGSEGV error
  fwriteMain(testMat, (SEXP)&patchName, (SEXP)&comma, (SEXP)&eol,
             (SEXP)&dot, (SEXP)&buffMem);
}
Here is a link to the GitHub repository with the package. https://github.com/GilChrist19/MinimalExample
Your call from C++ to C is wrong. You can't just write (SEXP)& in front of an arbitrary data structure and hope for it to become a SEXP.
Fix
Convert each C++ object to the SEXP your C function expects by calling Rcpp::wrap() on it, with a line such as this:
//Wrap every argument so the C function receives real SEXPs
fwriteMain(wrap(testMat), wrap(patchName), wrap(comma),
           wrap(eol), wrap(dot), wrap(buffMem));
Demo
edd@brad:/tmp/MinimalExample/MinEx(master)$ Rscript RunMe.R
I did start!
edd@brad:/tmp/MinimalExample/MinEx(master)$ cat /tmp/IDoNotWork.csv
A,B,C,D,E,F,G,H,I,J
1,11,21,31,41,51,61,71,81,91
2,12,22,32,42,52,62,72,82,92
3,13,23,33,43,53,63,73,83,93
4,14,24,34,44,54,64,74,84,94
5,15,25,35,45,55,65,75,85,95
6,16,26,36,46,56,66,76,86,96
7,17,27,37,47,57,67,77,87,97
8,18,28,38,48,58,68,78,88,98
9,19,29,39,49,59,69,79,89,99
10,20,30,40,50,60,70,80,90,100
edd@brad:/tmp/MinimalExample/MinEx(master)$
See https://github.com/GilChrist19/MinimalExample/tree/master/MinEx for a complete example.
I've written a program to move particles around in a 3-D field based on a 3-D velocity field. However, I get a segmentation fault at the line where I update the particle positions, and I have no idea why! I previously wrote this program as a single file, and it worked fine. But now that all the functions/subroutines are in a module, I get the segmentation fault.
Edit: I implemented the suggestions below, and now the segmentation fault has moved from the update particle line to the line where I call writeResults. Any help is still appreciated!
Main Program:
program hw4Fortran
use hw4_module
implicit none
!Define types
integer::i ,j, k, num_ts, num_particles, field_size_x, field_size_y, &
field_size_z, num_arguments
type(vector),allocatable::vfield(:,:,:)
type(vector),allocatable::parray(:)
character(30)::out_file_basename, vel_file, part_file, filename, string_num_ts
!Read command line
num_arguments = NARGS()
if (num_arguments > 1) then
call GETARG(1, string_num_ts)
read(string_num_ts, *) num_ts
else
num_ts = 50
end if
if (num_arguments > 2) then
call GETARG(2, out_file_basename)
else
out_file_basename = "results"
end if
if (num_arguments > 3) then
call GETARG(3, vel_file)
else
end if
if (num_arguments > 4) then
call GETARG(4, part_file)
else
part_file = "particles.dat"
end if
!Open files
open(unit=1, file=vel_file)
open(unit=2, file=part_file)
!Read number of particles
num_particles = readNumParticles(2)
!Adjust for zero index
num_particles = num_particles - 1
!Allocate and read particle array
parray = readParticles(2, num_particles)
!Read field size
field_size_x = readFieldSize(1)
field_size_y = readFieldSize(1)
field_size_z = readFieldSize(1)
!Adjust for zero index
field_size_x = field_size_x - 1
field_size_y = field_size_y - 1
field_size_z = field_size_z - 1
!Allocate and read vector field
vfield = readVectorField(1, field_size_x, field_size_y, field_size_z)
!Move particles and write results
do i=0,num_ts
if (mod(i,10) == 0) then
write(filename, fmt = "(2A, I0.4, A)") trim(out_file_basename), "_", i, ".dat"
open(unit = 3, file=filename)
end if
do j=0, num_particles
if (i > 0) then
parray(j) = updateParticle(vfield(INT(FLOOR(parray(j)%x)),INT(FLOOR(parray(j)%y)),INT(FLOOR(parray(j)%z))), parray(j))
end if
if (mod(i,10) == 0) then
call writeResults(3, parray(j))
end if
end do
if (mod(i,10) == 0) then
close(3)
end if
end do
!Close files
close(1)
close(2)
!Deallocate arrays
deallocate(vfield)
deallocate(parray)
end program hw4Fortran
Module:
module hw4_module
implicit none
type vector
real::x,y,z
end type
contains
function readNumParticles(fp) result(num_particles)
integer::fp, num_particles
read(fp, *) num_particles
end function
function readParticles(fp, num_particles) result(parray)
integer::fp, num_particles, i
type(vector),allocatable::parray(:)
allocate(parray(0:num_particles))
do i=0, num_particles
read(fp, *) parray(i)
end do
end function
function readFieldSize(fp) result(field_size)
integer::fp, field_size
read(fp, *) field_size
end function
function readVectorField(fp, field_size_x, field_size_y, &
field_size_z) result(vfield)
integer::fp, field_size_x, field_size_y, field_size_z, i, j
type(vector),allocatable::vfield(:,:,:)
allocate(vfield(0:field_size_x,0:field_size_y,0:field_size_z))
do i=0, field_size_x
do j=0, field_size_y
read(fp, *) vfield(i,j,:)
end do
end do
end function
function updateParticle(velocity, old_particle) result(new_particle)
type(vector)::new_particle,old_particle,velocity
new_particle%x = old_particle%x + velocity%x
new_particle%y = old_particle%y + velocity%y
new_particle%z = old_particle%z + velocity%z
end function
subroutine writeResults(fp, particle)
integer::fp
type(vector)::particle
write(fp, *) particle%x, " ", particle%y, " ", particle%z
end subroutine
end module
This function
function readParticles(fp, num_particles) result(parray)
integer::fp, num_particles, i
type(vector),allocatable::parray(:)
allocate(parray(0:num_particles))
do i=0, num_particles
read(fp, *) parray(i)
end do
end function
allocates parray with index values 0:num_particles. Unfortunately, and this trips up many a newcomer to Fortran (some oldcomers too), those array bounds are not passed out to the calling code which will blithely assume an index range starting at 1. And then the code goes on to access parray(0) ... and the problem that John B warns of arises.
Fortran's capability of indexing arrays from an arbitrary integer value is never quite as useful as it seems. You can pass the bounds into and out of procedures, but who can be bothered? It is easier just to pretend that Fortran arrays index from 1 and apply that consistently throughout a program.
Here is a simplified version of what the OP is doing with allocate:
module altest
contains
function setarray(n) result(x)
implicit none
integer, intent(in) :: n
integer , allocatable :: x(:)
allocate(x(n))
x(1)=1
end function
end module
program Console6
use altest
implicit none
integer,allocatable :: m(:)
m=setarray(2)
write(*,*)'m1',m(1)
end program Console6
It "appears" to be allocating an array x in the function and assigning that to an allocatable array m in the calling program. This compiles but throws a subscript-out-of-bounds error on the write. (Note that this would likely be a segmentation fault if bounds checking were not enabled.)
This can be fixed by separately allocating the array in the calling routine, or by passing the allocatable array as an argument:
module altest
contains
subroutine setarray(n,x)
implicit none
integer, intent(in) :: n
integer , allocatable :: x(:)
allocate(x(n))
x(1)=1
end subroutine
end module
program Console6
use altest
implicit none
integer,allocatable :: m(:)
call setarray(2,m)
write(*,*)'m1',m(1)
end program Console6
Edit: somewhat to my surprise, the second case also works fine if we allocate with a zero lower bound in the subroutine, allocate(x(0:n)); the calling routine 'knows' that the subscript starts at zero. (This works with Intel Fortran v13; I have no idea whether it is a safe thing to do.)
A segmentation fault normally indicates that your program is trying to access memory that does not belong to it.
When you say the error occurs "when I update the particle positions", I take it you mean this line:
updateParticle(vfield(INT(FLOOR(parray(j)%x)),INT(FLOOR(parray(j)%y)),INT(FLOOR(parray(j)%z))), parray(j))
An array-bounds violation in that statement seems entirely plausible, as I don't see anything in your code that would prevent the array indexes INT(FLOOR(parray(j)%x)) et al from falling outside the allocated dimensions of array vfield. Even if they are all in bounds at the initial step of the simulation, they may go out of bounds as the simulation proceeds.
Whether such a result in fact occurs appears to be data-dependent, not related to whether your functions appear in a module.
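To make that failure mode concrete, here is a small Python sketch (not the Fortran program itself) of the same index computation with an explicit bounds check. The function name cell_index and the field shape used below are assumptions made for illustration; the field is assumed to be indexed from 0 on every axis.

```python
import math

def cell_index(position, shape):
    """Return the integer cell containing `position`, or None if it lies
    outside a field whose valid indices are 0 .. shape[k]-1 on each axis."""
    idx = tuple(math.floor(c) for c in position)
    for i, n in zip(idx, shape):
        if i < 0 or i >= n:
            return None
    return idx
```

With a 4 x 4 x 4 field, a particle at (1.5, 2.0, 0.2) maps to a valid cell, while one that has drifted to (4.4, 2.0, 0.2) or (-0.1, 0.0, 0.0) does not. A check of this kind before the velocity lookup would turn a silent out-of-bounds access into something the program can handle.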
It looks like you have a C background. Could this be an off-by-one error? When looping in Fortran, the loop index goes all the way to the upper bound. Your Fortran loop:
do j=0, num_particles
! ...
end do
is equivalent to this C loop:
for (int j = 0; j <= num_particles; j++)
{
// ...
}
Note the <= sign, instead of <.
You may want to change your Fortran upper bound to num_particles - 1.
I have a small Fortran program:
integer :: ok
real :: x
character(len=80) :: name
name="test.txt"
open(1,file=name,status='old',iostat=ok)
print *,ok
read(1,*) x
close(1)
print *,x
and it works as shown.
But when I do something like this (of course I put my file in this directory)
name="/data/test.txt"
or this
name = "../data/input.txt"
I get the error:
At line 8 of file test.f90 (unit = 1, file = 'fort.1')
Fortran runtime error: End of file
I use the gfortran compiler.
UPDATE:
Absolute path like "/Users/name/Documents/fortran/data/input.txt" works too!
I have a very odd problem: it seems that somehow some of my reals are getting changed.
I've got a module:
c\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
module Koordinaten
implicit none
save
real(kind=8),allocatable,dimension(:) :: xi, yi, zi
integer,allocatable,dimension(:) :: setnodes, st
real(8),allocatable, dimension(:) :: sx,sy,sz
Integer :: setdaten
end
c/////////////////////////////////////////////////////////
which is used in most of the subroutines and in the main subroutine (this subroutine is called at the end of each simulation increment and does nothing but run my code). There, in the main subroutine, all the arrays are allocated.
SUBROUTINE UEDINC (INC,INCSUB)
use Koordinaten
implicit none
c ** Start of generated type statements **
include 'dimen'
include 'jname'
include 'statistics'
include 'spaceset'
integer inc, incsub
integer :: i, nsets,k
character(265), dimension(ndset) :: setname
c ** End of generated type statements **
write(0,*)"NUMNP: ",NUMNP
allocate(xi(NUMNP))
allocate(yi(NUMNP))
allocate(zi(NUMNP))
allocate(setnodes(NUMNP))
allocate(st(NUMNP))
allocate(sx(NUMNP))
allocate(sy(NUMNP))
allocate(sz(NUMNP))
allocate(ri(NUMNP))
allocate(delta_r(NUMNP))
allocate(dummy(NUMNP))
Here NUMNP comes from 'dimen'. (Declaring the arrays in the module with NUMNP as their size doesn't work; I don't know why, but that's not my problem for now.)
Next, a subroutine is called that does this:
c#########################################################
subroutine einlesen ()
use Koordinaten
use zyl_para
implicit none
c ** Start of generated type statements **
include 'dimen'
integer :: i, j
integer :: token1, token2
real(8), dimension(3) :: nodein, nodedis
c ** End of generated type statements **
write(0,*)"--- lese Koordinaten ein ---"
write(0,*)"Anzahl der Datenpunkte: ", NUMNP
do i=1,NUMNP
call NODVAR(0,i,nodein,token1,token2)
call NOdVAR(1,i,nodedis,token1,token2)
xi(i)=nodein(1)+nodedis(1)
yi(i)=nodein(2)+nodedis(2)
zi(i)=nodein(3)+nodedis(3)
end do
write(0,*)"--- Koordinaten eingelesen ---"
do i=1, NUMNP
write(0,*)xi(i),yi(i),zi(i)
end do
write(0,*)"§§§§§§§§§§§§§§§§§"
write(0,*)xi(i),yi(i),zi(i)
return
end subroutine einlesen
c#########################################################
Here is the strange part: the subroutine 'NODVAR' returns the coordinates and the displacement of a node; calling it works just fine, and the values are stored correctly in nodein(1:3) and nodedis(1:3).
But
write(0,*)xi,yi,zi
prints 3 columns of the values stored in xi, so basically yi and zi seem to have the values of xi.
Update
The values are not exactly equal; they differ a bit:
....
-20.0000765815728 -20.0000760759377 -20.0000751985753
-20.0000726178150 -20.0000671623785 -20.0000576162833
-20.0000427864046 -20.0000214978710 -19.9999932382105
-19.9999590389013 -18.9999215100902 -18.9998779514709
-18.9998277903388 -18.9997725557635 -18.9997146854248
-18.9996577298267 -18.9996059540004 -18.9995633069003
-18.9995325241422 -18.9995144999731 -18.9995087694121
-18.9995144999742 -18.9995325241444 -18.9995633069036
-18.9996059540045 -18.9996577298314 -18.9997146854297
-18.9997725557682 -18.9998277903431 -18.9998779514747
-18.9999215100934 -18.9999598955851 -18.9999939247953
-19.0000218363084 -19.0000426285757 -19.0000570432278
-19.0000664612509 -19.0000719811992 -19.0000746027515
-19.0000754299370 -19.0000747701169 -19.0000754299373
-19.0000746027519 -19.0000719811998 -19.0000664612514
-19.0000570432280 -19.0000426285755 -19.0000218363074
-18.9999939247935 -18.9999598955826 -17.9999226880232
-17.9998792166917 -17.9998290553161 -17.9997737084839
-17.9997156002768 -17.9996582203842 -17.9996058186853
....
END update
do i=1, NUMNP
write(0,*)xi(i),yi(i),zi(i)
end do
prints the values for xi, yi, and zi.
I do not deallocate the arrays until the end of the main subroutine.
The printing is not the problem; the problem is that the next subroutines use these coordinates and seem to have the same problem.
The subroutine worked fine when I passed xi, yi and zi as arguments in the call, but now I have to work with subroutines where I cannot pass them as arguments.
So, why does this happen?
Thank you for your time... and sorry for my errors.
UPDATE
I use the subroutine 'UEDINC' as the equivalent of a main program. It works like an API to the FEM program I use. This subroutine is called at the end of each increment, and all my code and my subroutines are either inside this subroutine or called from it.
'NODVAR' is provided by the FEM program and is documented. It is called for each node i and returns the values in an array of dimension 3, here nodein and nodedis. The 0/1 indicates what is returned (the coordinates or their displacement); token1 and token2 return some information I do not need.
I verified that the values returned from 'NODVAR' are the ones I expect by printing them out. I also printed the values stored in my arrays inside the loop where they are copied from 'NODVAR'; there they were right as well.
I know that kind=8 isn't portable, but it works for ifort, and the code doesn't have to be portable at all.
Further investigation
I modified my code a bit; I now have the following subroutine:
c##########################################################
implicit none
c ** Start of generated type statements **
integer :: ndaten, i, j
integer :: token1, token2
real(8), dimension(3) :: nodein, nodedis
real(8), dimension(ndaten) :: x,y,z
c ** End of generated type statements **
write(0,*)"--- lese Koordinaten ein ---"
write(0,*)"Anzahl der Datenpunkte: ", ndaten
do i=1,ndaten
call NODVAR(0,i,nodein,token1,token2)
call NOdVAR(1,i,nodedis,token1,token2)
x(i)=nodein(1)+nodedis(1)
y(i)=nodein(2)+nodedis(2)
z(i)=nodein(3)+nodedis(3)
write(0,*)x(i),y(i),z(i) ***(1)
end do
write(0,*)"*****************"
write(0,*)x,y,z ***(2)
write(0,*)"--- Koordinaten eingelesen ---"
return
end subroutine einlesen
c#########################################################
The arrays x, y, z have dimension NUMNP and are basically empty; I do nothing with them before calling this subroutine. ndaten = NUMNP.
(1) gives me, as I expect:
-19.9999205042055 4.174743870006559E-005 -2.49993530375797
-19.9998768725013 0.362341804670311 -2.47354036781631
-19.9998267169371 0.734574978337486 -2.38959111446343
-19.9997716931358 1.10321804323537 -2.24337882624597
-19.9997141644151 1.45282900896610 -2.03451636821160
-19.9996575908584 1.76783665097058 -1.76773205553564
-19.9996061583064 2.03464970008098 -1.45274943026036
-19.9995638755175 2.24353899096506 -1.10315640708085
-19.9995334705205 2.38977079851914 -0.734524030614783
-19.9995156926493 2.47372965346901 -0.362296534411106
-19.9995100448173 2.50012385767524 4.865608618193709E-010
....
(2) gives me:
-19.9999205042055 -19.9998768725013 -19.9998267169371
-19.9997716931358 -19.9997141644151 -19.9996575908584
-19.9996061583064 -19.9995638755175 -19.9995334705205
-19.9995156926493 -19.9995100448173 -19.9995156926504
-19.9995334705227 -19.9995638755208 -19.9996061583105
-19.9996575908630 -19.9997141644199 -19.9997716931404
-19.9998267169414 -19.9998768725051 -19.9999205042086
-19.9999590389038 -19.9999932382123 -20.0000214978720
-20.0000427864049 -20.0000576162831 -20.0000671623780
-20.0000726178145 -20.0000751985748 -20.0000760759375
-20.0000765815728 -20.0000760759378 -20.0000751985753
-20.0000726178150 -20.0000671623785 -20.0000576162833
-20.0000427864046 -20.0000214978710 -19.9999932382105
-19.9999590389013 -18.9999215100902 -18.9998779514709
-18.9998277903388 -18.9997725557635 -18.9997146854248
-18.9996577298267 -18.9996059540004 -18.9995633069003
-18.9995325241422 -18.9995144999731 -18.9995087694121
-18.9995144999742 -18.9995325241444 -18.9995633069036
-18.9996059540045 -18.9996577298314 -18.9997146854297
-18.9997725557682 -18.9998277903431 -18.9998779514747
-18.9999215100934 -18.9999598955851 -18.9999939247953
-19.0000218363084 -19.0000426285757 -19.0000570432278
...
['(1)' and '(2)' are obviously not in the code I compile; they are only markers for demonstration.]
In your second output read the values across then down, in your first read the values down then across and you will find that they are the same numbers. This statement
write(0,*)x,y,z
writes vector x, then vector y, then vector z. The format specifier (i.e. the *) tells the compiler to write the numbers as it sees fit. As it happens, it has chosen to write 3 values on each line, in the order x(1),x(2),x(3),newLine,x(4),...,y(1),y(2),... This has tricked you into thinking that it is writing (incorrectly) x(i),y(i),z(i), but it is your thinking that is incorrect here, not the program.
If you want the values written x(1),y(1),z(1),newLine,x(2),... you have to write the statements to do that, as your first output statement does.
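The difference between the two output orders can be mimicked in a few lines of Python. This is only an illustration: the function names are invented here, and the choice of 3 values per line is an assumption taken from the output shown in the question (with list-directed output the compiler decides where to break lines).

```python
def list_directed_lines(x, y, z, per_line=3):
    """Mimic write(0,*) x, y, z: all of x, then all of y, then all of z,
    broken into lines of `per_line` values."""
    flat = list(x) + list(y) + list(z)
    return [flat[i:i + per_line] for i in range(0, len(flat), per_line)]

def per_point_lines(x, y, z):
    """Mimic the loop with write(0,*) x(i), y(i), z(i): one point per line."""
    return [list(point) for point in zip(x, y, z)]
```

For x = [1, 2, 3, 4], y = [10, 20, 30, 40], z = [100, 200, 300, 400], the first function starts with the line [1, 2, 3] (three x values), whereas the second starts with [1, 10, 100] (one point). With long arrays the first many lines of the first form contain only x values, which is what made yi and zi look like copies of xi.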
I find your question rather confusing. Are you saying that you find the values in the arrays xi, yi, zi to be unexpected? What is your evidence that the values have changed?
If the values of a variable are changing outside of your expectations, in Fortran there are two likely errors that cause such a problem: 1) array subscripts out of bounds, or 2) disagreement between actual and dummy procedure arguments. The easiest first step in hunting down such errors is turning on all the error and warning options of your compiler, especially run-time subscript bounds checking. Also be sure to place all procedures (subroutines and functions) in modules and use them, so that the compiler can check argument consistency.
What compiler are you using?
P.S. real (kind=8) is not guaranteed to be an 8-byte real. Numeric values of kinds are not portable and differ between compilers.