How to convert multiple documents using the Document Conversion service in a bash script? - ibm-watson

How can I convert more than one document using the Document Conversion service?
I have between 50 and 100 MS Word and PDF documents that I want to convert using the convert_document API method.
For example, can you supply multiple *.pdf or *.doc files like this:
curl -u "username":"password" -X POST
-F "config={\"conversion_target\":\"ANSWER_UNITS\"};type=application/json"
-F "file=#\*.doc;type=application/msword"
"https://gateway.watsonplatform.net/document-conversion-experimental/api/v1/convert_document"
That gives an error unfortunately: curl: (26) couldn't open file "*.doc".
I have also tried "file=@file1.doc,file2.doc,file3.doc" but that gives errors as well.

The service only accepts one file at a time, but you can call it multiple times. For example:
#!/bin/bash
USERNAME="<service-username>"
PASSWORD="<service-password>"
URL="https://gateway.watsonplatform.net/document-conversion-experimental/api/v1/convert_document"
DIRECTORY="/path/to/documents"

for doc in "$DIRECTORY"/*.doc
do
  echo "Converting - $doc"
  curl -u "$USERNAME:$PASSWORD" \
    -F 'config={"conversion_target":"ANSWER_UNITS"};type=application/json' \
    -F "file=@$doc;type=application/msword" "$URL"
done
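The question mentions PDF files as well; a minimal sketch of the same loop extended to both extensions, assuming application/pdf is the right type to send for PDF input:
for doc in "$DIRECTORY"/*.doc "$DIRECTORY"/*.pdf
do
  # Pick the MIME type from the file extension (application/pdf is assumed here).
  case "$doc" in
    *.pdf) TYPE="application/pdf" ;;
    *)     TYPE="application/msword" ;;
  esac
  echo "Converting - $doc"
  curl -u "$USERNAME:$PASSWORD" \
    -F 'config={"conversion_target":"ANSWER_UNITS"};type=application/json' \
    -F "file=@$doc;type=$TYPE" "$URL"
done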
See the Document Conversion documentation and API Reference.

Related

Any way to download latest GitHub release w/ a batch script?

So, I'm trying to download the latest release from GitHub using a Windows batch script. I can get a long list of URLs by running curl -s https://api.github.com/repos/ActualMandM/cemu_graphic_packs/releases/latest, but I can't figure out how to pass the "browser_download_url": "https://github.com/ActualMandM/cemu_graphic_packs/releases/download/Github828/graphicPacks828.zip" it outputs to curl. I've looked online, but everything I found was for PowerShell and most of them used wget.
If you really want to use batch for this, you'll have to search the output JSON for the value you're looking for and then process that string. If the JSON had appeared all on one line, you'd need to take a different approach, but you got lucky.
for /f "tokens=1,* delims=:" %%A in ('curl -ks https://api.github.com/repos/ActualMandM/cemu_graphic_packs/releases/latest ^| find "browser_download_url"') do (
curl -kOL %%B
)
I've added the -k flag because my computer requires it for some reason (so other people's might as well).
-O sets the name of the output file to the remote file name.
-L follows redirects, which is required for downloading from GitHub.
The GitHub API URL you're accessing returns JSON, so you're going to need a JSON parser.
I can highly recommend xidel. xidel can open and download URLs itself, so you won't need curl or a batch script.
To query the "browser_download_url"-attribute:
xidel.exe -s "https://api.github.com/repos/ActualMandM/cemu_graphic_packs/releases/latest" -e "$json//browser_download_url"
https://github.com/ActualMandM/cemu_graphic_packs/releases/download/Github874/graphicPacks874.zip
(or -e "$json/(assets)()/browser_download_url" in full)
To download 'graphicPacks874.zip' in the current dir:
xidel.exe ^
-s "https://api.github.com/repos/ActualMandM/cemu_graphic_packs/releases/latest" ^
-f "$json//browser_download_url" ^
--download "{substring-after($headers[starts-with(.,'Content-Disposition')],'filename=')}"
With r8389 or newer (because of this commit) you can just use --download . (a plain dot).
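Assuming such a build, the earlier command reduces to this sketch:
xidel.exe ^
-s "https://api.github.com/repos/ActualMandM/cemu_graphic_packs/releases/latest" ^
-f "$json//browser_download_url" ^
--download .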

Use curl to post file from pipe

How might i take the output from a pipe and use curl to post that as a file?
E.g. the following works:
curl -F 'file=@data/test.csv' -F 'filename=test.csv' https://mydomain#apikey=secret
I'd like to get the file contents from a pipe instead, but I can't quite figure out how to specify it as a file input. My first guess is -F 'file=@-', but that's not quite right:
cat data/test.csv | curl -F 'file=@-' -F 'filename=test.csv' https://mydomain#apikey=secret
(Here cat is just a substitute for a more complex sequence of events that would get the data)
Update
The following works:
cat test/data/test.csv | curl -XPOST -H 'Content-Type:multipart/form-data' --form 'file=@-;filename=test.csv' $url
If you add --trace-ascii - to the command line you'll see that curl already uses that Content-Type by default (and -XPOST doesn't help either). It was rather your fixed -F option that did the trick!
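In other words, the working command can be trimmed to a minimal form (assuming the same $url):
cat test/data/test.csv | curl --form 'file=@-;filename=test.csv' $url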

Why does quoting password-variable in curl lead to authorization failure? (Bash)

I have a very specific problem with bash and curl.
What we do is:
read a password from Jenkins and paste it into a config file (I don't have access to the password)
read parameters (host, user, password, etc.) from the config file in bash and store them in variables
post something with curl to a database and store the result in a variable
Recently we added shellcheck to our deploy scripts, and therefore we need to put the variables in quotes.
That's the request we want to send (shellcheck-approved):
result=$(curl -s -XPOST "${dbURL}" --header "Authorization: Basic $(echo -n "${dbUser}:${dbPwd}" | base64)" --data-binary "blabla")
And here's the error message we get in return:
{"error":"authorization failed"}
It does work when I unquote the password variable ("${dbUser}":${dbPwd}), but then shellcheck complains that I need to put all variables in quotes. It also works on another machine with a different password (which I don't have access to either).
It is the same when I use --user username:password, so it seems like the problem lies within the password.
Googling and testing the procedure (without the curl call) with different special characters couldn't solve it either.
Has anyone experienced something like this?
Edit1:
This is an extract from jenkins-deploy-file ..
stage('config files') {
  withCredentials([string(credentialsId: "${env_params.db_password}", variable: 'db_pw')]) {
    sshagent(credentials: ["${env_params.user}"]) {
      sh "echo \"dbPwd=${db_pw}\" >> environment_variables/config.properties"
This is how the shell script stores the password ..
dbPwd=$(grep ^"$dbPwd" <PATH>/config.properties | cut -d "=" -f2)
thanks for your support.
It turned out there was trailing whitespace in the stored password.
I removed it using sed and now it works.
dbPwd=$(grep ^"$dbPwd" <PATH>/config.properties | cut -d "=" -f2 | sed -e 's/[[:space:]]*$//')
You can just set the password in another file and use the content of the file as your password variable.
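A minimal sketch of that approach, assuming the password sits alone in a hypothetical file /path/to/db_pw.secret, so no key=value parsing can pick up stray whitespace:
# Command substitution strips the trailing newline from the file contents.
dbPwd=$(<"/path/to/db_pw.secret")
result=$(curl -s -XPOST "${dbURL}" --header "Authorization: Basic $(echo -n "${dbUser}:${dbPwd}" | base64)" --data-binary "blabla")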

Solr Server Posting Error

How to post 5000 files to Solr server?
When posting with the command "java -jar post.jar dir/*.xml", the shell reports that the argument list is too long.
The quickest solution would be using a bash script like the following:
for i in *.xml; do
  cat "$i" | curl -X POST -H 'Content-Type: text/xml' --data-binary @- http://localhost:8080/solr/update
  echo "item: $i"
done
which adds to Solr, using curl, all the xml files within the current directory.
Otherwise you can write a Java main similar to the one included in post.jar, which adds all the xml files within a directory instead of having to pass all of them as arguments.
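The "Argument list too long" error itself comes from the shell expanding dir/*.xml into one huge command line, so another option is to let find feed curl one file at a time; a sketch, assuming the same update URL:
# Post each XML file under dir/ without building one giant argument list.
find dir -name '*.xml' -print0 |
  xargs -0 -I{} curl -X POST -H 'Content-Type: text/xml' --data-binary '@{}' http://localhost:8080/solr/update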

How to boost a SOLR document when indexing with /solr/update

To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/about/core-team/&commit=false" \
-F "myfile=#/extra/www/docroot/about/core-team/index.html"
...and ends with:
curl -s http://localhost:8983/solr/update --data-binary \
'<commit/>' -H 'Content-type:text/xml; charset=utf-8'
This uploads all documents in my document root to Solr. I use Tika and the ExtractingRequestHandler to upload documents in various formats (primarily PDF and HTML) to Solr.
In the script that generates this shell script, I would like to boost certain documents based on whether their id field (a.k.a. URL) matches certain regular expressions.
Let's say that these are the boosting rules (pseudocode):
boost = 2 if url =~ /cool/
boost = 3 if url =~ /verycool/
# otherwise we do not specify a boost
What's the simplest way to add that index-time boost to my http request?
I tried:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=#/extra/www/docroot/verycool/core-team/index.html" \
-F boost=3
and:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=#/extra/www/docroot/verycool/core-team/index.html" \
-F boost.id=3
Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in search results, regardless of what the user searched for (provided of course that the document contains their query).
I understand that if I POST in XML format I can specify the boost value for either the entire document or a specific field. But if I do that, it isn't clear how to specify a file as the document contents. Actually, the Tika page provides a partial example:
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
--data-binary @tutorial.html -H 'Content-type:text/html'
But again it isn't clear where/how to specify my boost. I tried:
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost=3"\
--data-binary @mydoc.html -H 'Content-type:text/html'
and
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost.id=3"\
--data-binary @mydoc.html -H 'Content-type:text/html'
Neither of which altered search results.
Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps:
1) Upload/index document as I have been doing
2) Specify boost for certain documents
To index a document in Solr, you have to POST it to the /update handler. The documents to index are put in the body of the POST request. In general, you have to use Solr's XML update format. Using that XML, you can add a boost value to a specific field or to a whole document.
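A minimal sketch of such an update, with an illustrative id and field values; the boost attribute on <doc> applies to the whole document, while a boost attribute on a <field> element would apply to just that field:
curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
  --data-binary '<add>
  <doc boost="3">
    <field name="id">/verycool/core-team/</field>
    <field name="text">... extracted document contents ...</field>
  </doc>
</add>'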
