Is it possible to speed up batch processing of documents with CoreNLP from command line so that models load only one time? I would like to trim any unnecessarily repeated steps from the process.
I have 320,000 text files that I am trying to process with CoreNLP. The desired result is 320,000 finished XML files, one per input.
To get from one text file to one XML file, I use the CoreNLP jar file from command line:
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory MyOutput -outputExtension .xml -replaceExtension
This loads models and does a variety of machine learning magic. The problem is that when I loop over every text file in a directory, the process will, by my estimate, take 44 days to complete. I have literally had a command prompt looping on my desktop for the last 7 days and I'm nowhere near finished. The loop I run from a batch script:
for %%f in (Data\*.txt) do (
    java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -props config.properties -file %%~f -outputDirectory Output -outputExtension .xml -replaceExtension
)
I am using these annotators, specified in config.properties:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, sentiment
I know nothing about Stanford CoreNLP, so I googled it (you didn't include a link) and on this page I found this description (below "Parsing a file and saving the output as XML"):
If you want to process a list of files use the following command line:
java -cp stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR CONFIGURATION FILE ] -filelist A FILE CONTAINING YOUR LIST OF FILES
where the -filelist parameter points to a file whose content lists all files to be processed (one per line).
So I guess that you may process your files faster if you store a list of all your text files in a list file:
dir /B *.txt > list.lst
... and then pass that list with the -filelist list.lst parameter in a single execution of Stanford CoreNLP, so the models are loaded once rather than once per file.
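For illustration, here is a minimal Python sketch of the same idea: write the list file, then build one CoreNLP command line for the whole batch. The function names, the `Output` directory default and the choice to write absolute paths are my own assumptions; the flags are only the ones already shown in the question and in the quoted documentation.

```python
from pathlib import Path


def write_filelist(data_dir, list_path):
    """Write one absolute .txt path per line and return how many were written."""
    files = sorted(Path(data_dir).resolve().glob("*.txt"))
    Path(list_path).write_text("".join(f"{p}\n" for p in files), encoding="utf-8")
    return len(files)


def build_corenlp_command(list_path, props="config.properties", out_dir="Output"):
    """Build a single CoreNLP invocation that covers every file in the list."""
    return [
        "java", "-cp", "*", "-Xmx2g",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", props,
        "-filelist", str(list_path),   # one JVM start for the whole batch
        "-outputDirectory", out_dir,
        "-outputExtension", ".xml",
        "-replaceExtension",
    ]
```

You would then launch it once with something like `subprocess.run(build_corenlp_command("list.lst"), check=True)` from the folder that holds the CoreNLP jars. I have not timed this on a 320,000-file set, so treat it as a starting point.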
Related
I am trying to extract a .pst file from a windows backup. In order to do this I need to copy each ‘partial’ file from the backup zips and then combine them together to make the one file. I have a command that will copy them out and combine them from this post but the problem I have is that cmd is not doing it in numerical order, therefore the file is not complete. I am using this script to put the files in order:
Echo y | for /F "tokens=*" %A in (filenamesinorder.txt) do copy /b %A "c:\pstcombiner\combined.pst"
But all this does is copy each individual file and overwrites it. I get that that’s what the command does but I need it to combine all the files into one. What am I doing wrong?
From the Microsoft documentation for the copy command:
To append files, specify a single file for destination, but multiple files for source (use wildcard characters or file1+file2+file3 format).
You’ll need to construct the source text for the copy in your for loop and do the copy after; I’ll see if I can provide an example when I am at my Windows system.
Instead of concatenating, you could try merging using the type command:
create an empty target file
copy nul target.ext > nul
then loop the type command to merge the files to the end of the target file
type fileN.ext>>target.ext
where fileN.ext is file 1, 2,3 ... n
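If a script is an option, the same "create an empty target, then append each part in order" idea can be sketched in Python. This is only an illustration: `combine_parts` is a hypothetical helper, and it assumes the list file (like filenamesinorder.txt in the question) holds one path per line, already in the order you want.

```python
import shutil
from pathlib import Path


def combine_parts(list_file, target):
    """Append each file named in list_file, in listed order, to target."""
    lines = Path(list_file).read_text().splitlines()
    names = [ln.strip() for ln in lines if ln.strip()]
    # Opening with "wb" creates the target empty once, like `copy nul target.ext`.
    with open(target, "wb") as out:
        for name in names:
            # Binary mode, so the bytes of each part are appended unchanged.
            with open(name, "rb") as part:
                shutil.copyfileobj(part, out)
    return len(names)
```

Because the order comes from the list file and not from directory sorting, part2 lands before part10 as long as the list says so.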
I am using this command in cmd with cpdf:
cpdf -split a.pdf -o page%%%.pdf
but I want to apply it to every PDF in a directory.
That is, I need a batch script that runs over all the PDF files in the directory and applies the cpdf split command to each one, producing one file per page.
For example, transform the files:
a.pdf
b.pdf
c.pdf
and more ...
into several files, one per page of the original, each named after the original:
a1.pdf
a2.pdf
a3.pdf
b1.pdf
b2.pdf
b3.pdf
c1.pdf
c2.pdf
c3.pdf
and more ...
Can anyone help?
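Not a batch answer, but the per-file logic could be sketched in Python like this. It only builds the command lines, one per PDF, reusing the `-split` and `-o` options from the question. The helper name is hypothetical, and I am assuming cpdf replaces `%%%` in the output name with a zero-padded page number (so you would get a001.pdf and not a1.pdf); check the cpdf manual for the exact numbering.

```python
from pathlib import Path


def cpdf_split_commands(directory):
    """Return one `cpdf -split` argument list per PDF in directory, sorted by name."""
    commands = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        # Output pattern keeps the original name as prefix, e.g. a%%%.pdf for a.pdf.
        out_pattern = str(pdf.with_name(pdf.stem + "%%%.pdf"))
        commands.append(["cpdf", "-split", str(pdf), "-o", out_pattern])
    return commands
```

Each entry can be handed to `subprocess.run(cmd, check=True)`. Since the pages are written next to the originals, a second run would pick them up as inputs too, so writing to a separate output folder may be safer.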
I want to back up important files to my USB drive every day. The USB drive is always plugged into the computer, so I don't need to worry about drive letters. I know how to create a batch file to simply copy and paste them into the drive, but I was wondering if there is a way to create a batch file that makes a zip file of all the folders I want (using winzip or winrar), and then has them sent to the drive. That way I can archive them instead of just copying and replacing them.
Thank you.
See: How can you zip or unzip from the command prompt using ONLY Windows' built-in capabilities?
PowerShell can do this:
Add-Type -A System.IO.Compression.FileSystem
[IO.Compression.ZipFile]::CreateFromDirectory('C:\foo\', 'D:\foo.zip')
You can incorporate it into a batch file by calling powershell.exe like so:
powershell.exe -nologo -noprofile -command "& { Add-Type -A 'System.IO.Compression.FileSystem'; [IO.Compression.ZipFile]::CreateFromDirectory('C:\foo\', 'D:\foo.zip'); }"
I would recommend running it from a scheduled task so you don't have to start it manually or worry about it running all the time. Also plan on disconnecting, replacing or archiving the drive itself occasionally: locally attached storage is not a backup.
Start WinRAR, click in menu Help on first menu item Help topics, expand on help tab Contents the list item Command line mode and
read the help page Command line syntax,
expand the list item Commands and read the help page Alphabetic commands list and
expand the list item Switches and read the help page Alphabetic switches list.
Then you know how I created this single command line below and what all those switches mean.
"%ProgramFiles%\WinRAR\WinRAR.exe" a -ac -ao -afzip -agYYYY-MM-DD_NN -cfg- -ed -ep1 -ibck -inul -m5 -r -y -- "D:\Backup Folder\DataBackup_.zip" "C:\Path To Folder With Files To Backup\"
See also answer on Simply compress 1 folder in batch with WinRAR command line?
WinRAR adds to ZIP archive only files with archive attribute set because of switch -ao and clears the archive attribute after compressing the file into the archive because of -ac. The archive attribute is automatically set by Windows when a file is modified, renamed or moved. So this command line avoids compressing the same unmodified files again and again into ZIP archives.
The ZIP file is named automatically with the current date in its name, plus an automatically incremented number in case this command line is executed multiple times on one day.
I have a Java program that I start from cmd. My problem is as follows: I would like to read all the files in a directory and run the program once for each file in that folder. The required inputs are the path to the data file and the name of the file where the results will be written. How can I write a simple batch file that does this for me?
For example:
I have a list of files in my folder
file_1.xls
file_2.xls
file_3.xls
I want to run a loop that, for each file, executes the line below:
java -jar -Xmx1000M Program.jar pathToInputFile PathToOutputfile
For example, for file_1.xls I want to write the result to a file with the same name but a different extension, with a constant prefix added at the beginning. For file_1.xls the result should be written to Output_file_1.txt
for file_2.xls -> Output_file_2.txt
for file_3.xls -> Output_file_3.txt
and so on...
Can anyone help me?
pushd "c:\excel_files"
for %%F in (*.xls) do (
java -jar -Xmx1000M Program.jar "%%~nxF" "Output_%%~nF.txt"
)
That said, I'd recommend using -classpath and calling the entry-point class directly instead of launching the .jar.
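For comparison, the same name mapping (what `Output_%%~nF.txt` does in the loop above) can be sketched in Python. The helper names are made up for this example, and I put the JVM option before `-jar`, which is the conventional ordering; everything else mirrors the command from the question.

```python
from pathlib import Path


def output_name(input_name, prefix="Output_", ext=".txt"):
    """Map file_1.xls to Output_file_1.txt: prefix + base name + new extension."""
    return prefix + Path(input_name).stem + ext


def build_jobs(folder, jar="Program.jar"):
    """Return one java command (as an argument list) per .xls file in folder."""
    jobs = []
    for xls in sorted(Path(folder).glob("*.xls")):
        out = xls.with_name(output_name(xls.name))
        jobs.append(["java", "-Xmx1000M", "-jar", jar, str(xls), str(out)])
    return jobs
```

Each job can be run with `subprocess.run(job, check=True)`, which stops at the first file the program fails on.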
(by DOS I mean windows cmd.exe - I don't want to enforce powershell or similar on the end user)
I want to run a command line file that prints output to CON / the screen.
I want to capture that output and compare it to an expected output.
... in a .bat / .cmd file?
Specifically, the identify command of ImageMagick, and I want to run this over about 300 files and compare the actual sizes to expected sizes.
example output:
$ identify rose.jpg
rose.jpg JPEG 640x480 sRGB 87kb 0.050u 0:01
If I understand the question correctly, you want to run the identify command on all the jpg files in a directory and capture the output of that command into a text file for later comparison. The comparison however is not part of the spec?
Something like the line below should do that job. Just run it from the folder where the jpg files are located (the file name is quoted so that paths containing spaces work):
for /R %%X in (*.jpg) do identify "%%X" >> PicInfo.txt
This will capture the rose.jpg JPEG ... line for every .jpg file you have in the directory (and subdirectories thanks to '/R') that you run the command in and append it to the file PicInfo.txt.
You can call your identify program with a symbol that redirects console output to a file, which is the > character. Something like:
identify rose.jpg > myoutput.txt
Additionally, the >> will append output to what is already in the file. So using
identify rose.jpg >> myoutput.txt
...should create one file with all your output.
You can then use the DOS COMP command, which compares the contents of two files. The syntax is:
COMP [data1] [data2] [/D] [/A] [/L] [/N=number] [/C] [/OFF[LINE]]
Which you could also redirect to an output file using the > symbol.
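COMP only tells you whether two files differ. If you want to compare just the dimensions against expected values, the captured lines have to be parsed. The question rules out extra tooling for end users, so take this Python sketch only as an illustration of the parsing step. It assumes each line looks like the example output in the question (file name first, then a WIDTHxHEIGHT field) and that file names contain no spaces; both function names are hypothetical.

```python
import re

SIZE = re.compile(r"^(\d+)x(\d+)$")


def parse_identify_line(line):
    """Return (file name, (width, height)) from one line of identify output."""
    fields = line.split()
    for field in fields[1:]:
        match = SIZE.match(field)
        if match:
            return fields[0], (int(match.group(1)), int(match.group(2)))
    raise ValueError(f"no WIDTHxHEIGHT field in: {line!r}")


def find_mismatches(lines, expected):
    """Return {name: (actual, expected)} for files whose size differs from expected."""
    mismatches = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in the captured output
        name, size = parse_identify_line(line)
        if name in expected and size != expected[name]:
            mismatches[name] = (size, expected[name])
    return mismatches
```

You would feed it the lines of the captured file (for example PicInfo.txt or myoutput.txt) and a dictionary of expected sizes keyed by the name exactly as identify prints it.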