To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/about/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/about/core-team/index.html"
...and ends with:
curl -s http://localhost:8983/solr/update --data-binary \
'<commit/>' -H 'Content-type:text/xml; charset=utf-8'
This uploads all documents in my document root to Solr. I use Tika and the ExtractingRequestHandler to handle documents in various formats (primarily PDF and HTML).
In the script that generates this shell script, I would like to boost certain documents based on whether their id field (a/k/a url) matches certain regular expressions.
Let's say that these are the boosting rules (pseudocode):
boost = 2 if url =~ /cool/
boost = 3 if url =~ /verycool/
# otherwise we do not specify a boost
What's the simplest way to add that index-time boost to my http request?
I tried:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
-F boost=3
and:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
-F boost.id=3
Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in search results, regardless of what the user searched for (provided of course that the document contains their query).
I understand that if I POST in XML format I can specify the boost value for either the entire document or a specific field. But if I do that, it isn't clear how to specify a file as the document contents. Actually, the Tika page provides a partial example:
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
--data-binary @tutorial.html -H 'Content-type:text/html'
But again it isn't clear where/how to specify my boost. I tried:
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost=3" \
--data-binary @mydoc.html -H 'Content-type:text/html'
and
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost.id=3" \
--data-binary @mydoc.html -H 'Content-type:text/html'
Neither of which altered search results.
Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps:
1) Upload/index document as I have been doing
2) Specify boost for certain documents
To index a document in Solr, you have to POST it to the /update handler. The documents to index are put in the body of the POST request. In general, you have to use Solr's XML format. Using that XML, you can add a boost value to a specific field or to a whole document.
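As a rough sketch of the XML route, the two helpers below (hypothetical names, not part of Solr) map a URL to a boost using the rules from the question and print an add message with a document-level boost attribute. This assumes a Solr version that still honors index-time boosts, a schema with fields named id and text, and body text that has already been extracted and XML-escaped, because this path does not go through Tika.

```shell
# Hypothetical helper: map a URL to a boost using the rules from the question.
# Prints nothing when no rule matches. "verycool" is checked first because it
# also contains "cool".
boost_for_url() {
  case "$1" in
    *verycool*) echo 3 ;;
    *cool*)     echo 2 ;;
  esac
}

# Hypothetical helper: print a Solr XML add message for one document.
# The field names "id" and "text" are assumptions; match them to your schema.
solr_add_xml() {
  # $1 = document id (URL), $2 = plain-text body, already XML-escaped
  doc_boost=$(boost_for_url "$1")
  if [ -n "$doc_boost" ]; then
    printf '<add><doc boost="%s">' "$doc_boost"
  else
    printf '<add><doc>'
  fi
  printf '<field name="id">%s</field>' "$1"
  printf '<field name="text">%s</field>' "$2"
  printf '</doc></add>\n'
}

# Usage (not run here): pipe the message to the update handler.
# solr_add_xml /verycool/core-team/ "Meet the team" |
#   curl -s http://localhost:8983/solr/update \
#     -H 'Content-type:text/xml; charset=utf-8' --data-binary @-
```

A commit is still needed afterwards, as in the final command of the generated script in the question.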
Related
I'm trying to post a list of identifiers as a form value from a file using curl. I've tried many different formats but each time the filename is posted rather than the actual data.
The examples I have tried are below:
curl http://localhost:<port>/test --data "userId=User" --data "uploadIds=<test.txt"
curl http://localhost:<port>/test --data "userId=User" --data "uploadIds=@test.txt"
Each of the above results in the filename being posted.
The file contains a comma-separated list of numbers.
Found the solution. The syntax I was looking for is:
curl http://localhost:<port>/test --data "userId=User" --data-urlencode uploadIds@test.txt
Please forgive me for the potentially basic question, but I am a z/OS person trying to learn cURL and Cloudant. I have gotten the following example to work to add a record to a database (using DOS from Windows):
curl -X POST -b /tmp/cloudant.cookie -H "Content-Type: application/json" -d "{\"_id\":\"2\",\"empName\":\"John Doe\",\"phone\":\"646-598-4133\",\"age\":\"28\"}" --url https://xxxxxxxxxx-bluemix.cloudant.com/rcdb
Now I would like to add the file image1.jpg to that record as an attachment.
Could anyone please tell me what the syntax on Windows would be? I have tried a few combinations, but nothing has worked so far.
To add an attachment follow the instructions in the Cloudant documentation at https://docs.cloudant.com/attachments.html
Example:
Assuming you have already created a document with ID "2" and revision number "1-954695fb9642f02975d76b959d0b5e98" in database rcdb, run the following command:
curl -X PUT -H "Content-Type: image/jpeg" --data-binary "@image1.jpg" --url https://xxxxxxxxxx-bluemix.cloudant.com/$DATABASE/$DOCUMENT_ID/$ATTACHMENT?rev=$REV
replacing $DATABASE with rcdb, $DOCUMENT_ID with 2, $REV with 1-954695fb9642f02975d76b959d0b5e98 and $ATTACHMENT with the desired attachment property name, e.g. mypic.
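To make the substitution concrete, here is a small sketch of a helper that assembles the attachment URL from its parts. The function name is hypothetical, and it assumes a POSIX-style shell (for example Git Bash on Windows) rather than the DOS prompt used in the question. The account host is passed in as-is.

```shell
# Hypothetical helper: build the attachment URL from its parts.
attachment_url() {
  # $1 = account base URL, $2 = database, $3 = document ID,
  # $4 = attachment name, $5 = current revision of the document
  printf '%s/%s/%s/%s?rev=%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Usage (not run here), with the values from the example above:
# curl -X PUT -b /tmp/cloudant.cookie -H "Content-Type: image/jpeg" \
#   --data-binary "@image1.jpg" \
#   --url "$(attachment_url https://xxxxxxxxxx-bluemix.cloudant.com rcdb 2 mypic 1-954695fb9642f02975d76b959d0b5e98)"
```

Quoting the URL keeps the shell from interpreting the question mark in the query string.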
How can I convert more than one document using the Document Conversion service?
I have between 50 and 100 MS Word and PDF documents that I want to convert using the convert_document API method.
For example, can you supply multiple *.pdf or *.doc files like this?
curl -u "username":"password" -X POST \
-F "config={\"conversion_target\":\"ANSWER_UNITS\"};type=application/json" \
-F "file=@*.doc;type=application/msword" \
"https://gateway.watsonplatform.net/document-conversion-experimental/api/v1/convert_document"
That gives an error unfortunately: curl: (26) couldn't open file "*.doc".
I have also tried "file=@file1.doc,file2.doc,file3.doc" but that gives errors as well.
The service only accepts one file at a time, but you can call it multiple times.
#!/bin/bash
USERNAME="<service-username>"
PASSWORD="<service-password>"
URL="https://gateway.watsonplatform.net/document-conversion-experimental/api/v1/convert_document"
DIRECTORY="/path/to/documents"
# Send each Word document in DIRECTORY to the service, one request per file
for doc in "$DIRECTORY"/*.doc
do
echo "Converting - $doc"
curl -u "$USERNAME:$PASSWORD" \
-F 'config={"conversion_target":"ANSWER_UNITS"};type=application/json' \
-F "file=@$doc;type=application/msword" "$URL"
done
See the Document Conversion documentation and API Reference.
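Since the question mentions both Word and PDF files, the content type could be chosen per file. The sketch below is one way to do that; mime_for and convert_dir are hypothetical names, and the variables USERNAME, PASSWORD and URL are assumed to be set as in the script above.

```shell
# Hypothetical helper: pick a content type from the file extension.
# Returns non-zero for extensions it does not know.
mime_for() {
  case "$1" in
    *.pdf) echo "application/pdf" ;;
    *.doc) echo "application/msword" ;;
    *)     return 1 ;;
  esac
}

# Hypothetical helper: convert every .doc and .pdf file in a directory,
# one request per file. Assumes USERNAME, PASSWORD and URL are already set.
convert_dir() {
  for doc in "$1"/*.doc "$1"/*.pdf; do
    [ -e "$doc" ] || continue               # skip patterns that matched nothing
    ctype=$(mime_for "$doc") || continue    # skip unknown extensions
    echo "Converting - $doc"
    curl -u "$USERNAME:$PASSWORD" \
      -F 'config={"conversion_target":"ANSWER_UNITS"};type=application/json' \
      -F "file=@$doc;type=$ctype" "$URL"
  done
}

# Usage (not run here):
# convert_dir /path/to/documents
```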
How can I include a file in a curl request from my working directory?
Below I've got a POST request that includes data for "first_name" and for "last_name", but now I need to add an input for a file. There are examples out there where someone is ONLY sending a file, but I'm trying to send one or more files along with other data.
curl \
-H "Content-Type: application/json" \
-d '{ first_name: "Donny", last_name: "P", my_file: ???? }' \
https://sender.blockspring.com/api/blocks/319bfef4aad7f3477745048a2da3ae6a?api_key=2e0ef0c216078d60630d1321e67b243a
This can only be done with a multipart request.
Manually building a multipart may be complex, so curl has a built-in -F option.
curl localhost:8000 -F "my_file=@file.ext" -F "first_name=Donny" -F "last_name=P" -v
From man curl:
-F, --form
(HTTP) This lets curl emulate a filled-in form in which a user has pressed the submit button. This causes curl to POST
data using the Content-Type multipart/form-data according to RFC 2388. This enables uploading of binary files etc. To
force the 'content' part to be a file, prefix the file name with an @ sign. To just get the content part from a file,
prefix the file name with the symbol <. The difference between @ and < is then that @ makes a file get attached in the
post as a file upload, while the < makes a text field and just get the contents for that text field from a file.
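A minimal sketch of the difference between @ and < in one request: the wrapper below attaches one file as an upload and fills a text field from the contents of another. The function name and the notes field are hypothetical; first_name and last_name come from the question.

```shell
# Hypothetical wrapper: one multipart POST mixing a file upload, a text
# field read from a file, and two ordinary text fields.
post_with_notes() {
  # $1 = URL, $2 = file to attach, $3 = text file whose contents fill "notes"
  curl "$1" \
    -F "my_file=@$2" \
    -F "notes=<$3" \
    -F "first_name=Donny" \
    -F "last_name=P"
}

# Usage (not run here):
# post_with_notes http://localhost:8000 file.ext notes.txt
```

Repeating -F with further @ entries would attach additional files to the same request.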
How to post 5000 files to Solr server?
When posting with the command "java -jar post.jar dir/*.xml", the command fails with "Argument list is too long".
The quickest solution would be using a bash script like the following:
for i in *.xml; do
curl -X POST -H 'Content-Type: text/xml' --data-binary "@$i" http://localhost:8080/solr/update
echo "item: $i"
done
which adds to Solr, using curl, all the xml files within the current directory.
Otherwise you can write a Java main similar to the one included in post.jar, which adds all the xml files within a directory instead of having to pass all of them as arguments.
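If the files are spread over subdirectories, a variant along these lines might help. It is only a sketch: post_all_xml is a hypothetical name, the update URL is passed in by the caller, and file names containing newlines are not handled. It streams file names from find, so no long argument list is ever built, and it sends a single commit at the end.

```shell
# Hypothetical helper: post every *.xml file under a directory, then commit once.
post_all_xml() {
  # $1 = directory to scan, $2 = Solr update URL
  find "$1" -type f -name '*.xml' | while IFS= read -r f; do
    curl -s -X POST -H 'Content-Type: text/xml' --data-binary "@$f" "$2"
    echo "item: $f"
  done
  # One commit at the end, instead of one per file
  curl -s "$2" -H 'Content-Type: text/xml' --data-binary '<commit/>'
}

# Usage (not run here):
# post_all_xml . http://localhost:8080/solr/update
```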