solr-index from multiple folders - solr

I am trying to build a web search application with Solr, but I have a problem. In the example that I followed, all the files are in the same folder, whereas I want to index files from different directories (i.e. point it at a root folder and have it index all the XML files in all subdirectories). Is that possible?

Try the SimplePostTool recursive option:
java -Dauto -Drecursive -jar post.jar
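For example, assuming the default update URL and your XML files under /path/to/rootfolder (the exact -D options differ slightly between Solr versions), something like this should walk the whole tree:
java -Dauto -Drecursive -Durl=http://localhost:8983/solr/update -jar post.jar /path/to/rootfolder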

Try this in a shell script (untested):
#!/bin/sh
FILES=$(find . -iname "*.xml")
URL=http://localhost:8983/solr/update

for f in $FILES; do
    echo "Posting $f"
    curl $URL --data-binary @$f -H 'Content-type:application/xml'
    echo
done

# Send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:application/xml'
echo

Place it in the root folder where you have the XML files.
(I assume you are on Linux and that the 'post.sh' script is the example you followed.)
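One caveat: the for f in $FILES loop splits on whitespace, so it breaks on file names containing spaces. A null-delimited variant of the same idea (untested sketch, same update URL assumed) avoids that:
#!/bin/bash
URL=http://localhost:8983/solr/update
# -print0 and read -d '' keep file names with spaces intact
find . -iname "*.xml" -print0 | while IFS= read -r -d '' f; do
    echo "Posting $f"
    curl "$URL" --data-binary @"$f" -H 'Content-type:application/xml'
    echo
done
# commit once at the end so all changes become visible
curl "$URL" --data-binary '<commit/>' -H 'Content-type:application/xml'
echo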

Related

SOLR POST files with no extension

I am using Solr 5 and I want to scan documents that have no file extension. Unfortunately, renaming the files to add extensions is not an option in my case.
The command I am using is simply:
$ bin/post -c mycore ../foldertobescaned -type application/pdf
The command works fine for documents that do have an extension, but otherwise I am only getting:
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
If renaming the files is not an option, you can use the following script as a workaround until Solr improves its post method. It is a simple bash for loop that submits each file individually, so it works regardless of the file extension. Note that this script will be slower than running post on the whole folder, because each individual file transfer has to be initialized.
Save the script below as postFolderToSolr.sh inside your Solr folder (so that Solr's bin/ folder is a subdirectory), make it executable with chmod +x postFolderToSolr.sh and then use it as follows: ./postFolderToSolr.sh mycore /home/user1/foldertobescaned/ application/pdf
Calling it with no arguments or the wrong number of arguments prints a short usage message as help.
#!/bin/bash
set -o nounset

if [ "$#" -ne 3 ]
then
    echo "Post contents of a folder to Solr."
    echo
    echo "Usage: postFolderToSolr.sh <collectionName> </path/to/folder> <MIME>"
    echo
    exit 1
fi

collection=$1
inputPath=${2%/}   # remove trailing / if it exists
mime=$3

for element in "$inputPath"/*; do
    bin/post -c "$collection" -type "$mime" "$element"
done
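Note that the glob in the loop only picks up files directly inside the given folder. If you also need files in subdirectories, a find-based variant of the same loop (untested sketch, same three arguments) would be:
# -type f recurses into subfolders and skips the directories themselves
find "$inputPath" -type f | while read -r element; do
    bin/post -c "$collection" -type "$mime" "$element"
done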

Downloading file from online database using bash script

I want to download some files from an online database, but it does not allow me to download all the files at once. Instead, it offers to download one file per searched keyword. Because I have more than 20000 keywords, this is not feasible for me.
For example, I want to download all the information about miRNA-mRNA interactions from SarBase, but it does not offer an option to download all of them at once.
How can I download it by writing a script? Can anybody help me?
Make a file called getdb.sh.
#!/bin/bash
echo "Download keywords in kw.txt."
for kw in $(cat kw.txt)
do
    # quote the URL so the shell does not try to expand the '?' itself
    curl "http://www.mirbase.org/cgi-bin/get_seq.pl?acc=$kw" > "$kw.txt"
done
Create another file called kw.txt:
MI0000342
MI0000343
MI0000344
Then run this
$ chmod +x getdb.sh
$ ./getdb.sh
Download keywords in kw.txt.
$ ls -1 *.txt
kw.txt
MI0000342.txt
MI0000343.txt
MI0000344.txt
Another way:
cat kw.txt | xargs -i curl -o {}.txt "http://www.mirbase.org/cgi-bin/get_seq.pl?acc={}"
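With more than 20000 keywords, the sequential loop can take a long time. GNU xargs can run several curl processes in parallel; the -P value below is an arbitrary choice:
# 8 downloads at a time; adjust -P to whatever the server tolerates
xargs -P 8 -I {} curl -o {}.txt "http://www.mirbase.org/cgi-bin/get_seq.pl?acc={}" < kw.txt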

Script to change ownership of folders in Plesk vhosts

I'm looking for some help in creating a shell script in Linux to perform a batch ownership change for certain folders in a Plesk environment where the owner:group is apache:apache.
I want to change the owner:group to :psacln.
The FTP user can be ascertained by looking at the owner of the httpdocs folder.
^this is the section I'm having trouble with.
If I were to set all owners to the same user, I could do it in one line:
find /var/www/vhosts/*/httpdocs -user apache -group apache -exec chown user:psacln {} \;
Can anyone help plug the user in to this command?
Thanks
Figured it out... for those who may want to use it in the future:
for dir in /var/www/vhosts/*
do
    dir=${dir%*/}
    # the FTP user is the owner of this vhost's httpdocs folder
    permissions=`stat -c '%U' "$dir/httpdocs"`
    find "$dir/httpdocs" -user apache -group apache -exec chown "$permissions":psacln {} \;
done
Since stat doesn't work the same way on all Unix flavors, I thought I would share my script to set the ownership of all websites to the correct owners in Plesk (tested on Plesk 11, 11.5, 12 and 12.5):
cd /var/www/vhosts/
for f in *; do
    if [[ -d "$f" && ! -L "$f" ]]; then
        # Get necessary variables
        FOLDERROOT="/var/www/vhosts/"
        FOLDERPATH="/var/www/vhosts/$f/"
        FTPUSER="$(ls -ld /var/www/vhosts/$f/ | awk '{print $3}')"
        # Set correct rights for current website, if website has hosting!
        cd $FOLDERPATH
        if [ -d "$FOLDERPATH/httpdocs" ]; then
            chown -R $FTPUSER:psacln httpdocs
            chmod -R g+w httpdocs
            find httpdocs -type d -exec chmod g+s {} \;
            # Print success message
            echo "Done... $FTPUSER is now correct owner of $FOLDERPATH."
        fi
        # Make sure we are back at the root, so we can continue looping
        cd $FOLDERROOT
    fi
done
Explanation of the code:
Go to the vhosts folder.
Loop through the websites.
Store the vhosts path, because we are using cd in the loop.
If an httpdocs folder exists for the current website, then set the correct rights on httpdocs and all underlying folders.
Show a success message.
cd back to the vhosts folder, so we can continue looping.
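To spot-check the result afterwards, you can list the owner and group of every httpdocs folder; this is read-only and changes nothing:
# print owner:group and path for each vhost's httpdocs
for d in /var/www/vhosts/*/httpdocs; do
    ls -ld "$d" | awk '{print $3 ":" $4, $9}'
done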

Getting specific files from server

Using Terminal and shell/Bash commands, is there a way to retrieve specific files from a web directory? I.e.
Directory: www.site.com/samples/
Copy all files ending in ".h" into a folder.
The folder contains text files and other associated files that are of no use.
Thanks :)
There are multiple ways of achieving this recursively:
1. Using find
1.1 Recreate the directory tree with mkdir -p, so nested backup folders are created without errors
cd path;
mkdir backup
find www.site.com/samples/ -type d -exec mkdir -p {} backup/{} \;
1.2 Find the specific files and copy them to the backup folder (-p preserves permissions)
find www.site.com/samples/ -name \*.h -exec cp -p {} backup/{} \;
2. Using tar, which actually works the other way around, i.e. it excludes specific files; the part of the question about the unwanted text files matches this answer better.
You can add as many excludes as you like:
tar --exclude=*.txt --exclude=*.filetype2 --exclude=*.filetype3 -cvzf site-backup.tar.gz www.site.com
mv www.site.com www.site.com.1
tar -xvzf site-backup.tar.gz
You can use wget for that, but only if there are links to the files. If they exist but are not referenced from any HTML page, then brute force is the only option.
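If the .h files are linked from an index page, a recursive wget restricted to that extension is usually enough (untested sketch; the local target folder name is made up):
# -r recursive, -np don't ascend to the parent, -nd don't recreate remote dirs, -A keep only *.h
wget -r -np -nd -A '*.h' -P headers http://www.site.com/samples/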
cp -aiv /www.site.com/samples/*.h /somefolder/
http://linux.die.net/man/1/cp

Solr Server Posting Error

How do I post 5000 files to a Solr server?
When posting them with the command "java -jar post.jar dir/*.xml", the shell reports "Argument list too long".
The quickest solution would be using a bash script like the following:
for i in $(ls *.xml); do
    cat $i | curl -X POST -H 'Content-Type: text/xml' -d @- http://localhost:8080/solr/update
    echo item: $i
done
which adds to Solr, using curl, all the xml files within the current directory.
Otherwise you can write a Java main similar to the one included in post.jar, which adds all the xml files within a directory instead of having to pass all of them as arguments.
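Another shell-only workaround for the "Argument list too long" error is to let xargs split the file list into batches for post.jar (the batch size of 500 is an arbitrary choice, untested):
# post the files 500 at a time so no single command line gets too long
find dir -name '*.xml' -print0 | xargs -0 -n 500 java -jar post.jar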
