JDBC feeder for Elasticsearch - sql-server

I am trying to create a JBDC feeder to load data from SQL Server into elasticsearch. I am using the guide here: https://github.com/jprante/elasticsearch-river-jdbc (search for heading 'How to run a standalone JDBC feeder').
I have successfully downloaded and installed elasticsearch and have it up and running. I have downloaded the JDBC driver for SQL server and moved it into the ./plugins/jdbc folder.
I am up to the part that involves creating a bash script. Before today, I have never even looked at a bash script and I'm having trouble getting it to work since I don't yet know half the syntax.
The elasticsearch directory is c:\elasticsearch-1.4.0
and here is my bash script:
#!/bin/sh
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
# ES_HOME required to detect elasticsearch jars
export ES_HOME= C:\elasticsearch-1.4.0
echo '
{
"elasticsearch" : {
"cluster" : "elasticsearch",
"host" : "localhost",
"port" : 9200
},
"type" : "jdbc",
"jdbc" : {
"url" : "jdbc:sqlserver://localhost;databaseName=MyDatabase",
"user" : "MyUser",
"password" : "MyPassword",
"sql" : "select * From MyTable",
"treat_binary_as_string" : true,
"index" : "MyFirstESIndex"
}
}
' | java \
-cp "${DIR}/*" \
org.xbib.elasticsearch.plugin.jdbc.feeder.Runner \
org.xbib.elasticsearch.plugin.jdbc.feeder.JDBCFeeder
What do I need to update in this script? Is it something in this line of the script:
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
The reason I am doing this is because I'm looking for the best method to insert potentially tens of millions of records into elasticsearch from sql server in one go i.e a bulk insert.
Our first iteration of this involved getting each row of data in a table, converting it to a JSON document, and inserting into ES. This took about 10hrs to get all the data in there.
Thanks in advance for any advice.

Instead of this:
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
Do this, if it's important for you to be in that working directory:
DIR="$( dirname "${BASH_SOURCE[0]}" )"
cd $DIR
Also, if it's not a typo, remove the space after ES_HOME=, and, when in doubt, use quotes:
export ES_HOME="C:\elasticsearch-1.4.0"
Additionally, with the Java -cp argument, if you want to include all the jar files (which I'm assuming), don't use quotes so the globbing works:
# Since you are already in the directory, you don't need DIR
JARS=$(echo ./*jar)
...
# On the cp line, substitute spaces with : to build the classpath
-cp "${JARS// /:}" \
I hope this helps. If you can give more details about how the script is failing, I could help more. Are you getting particular error messages?

Related

psql Batch File - Escaping "Not Equal" Operator

I'm working on a batch file that will import data into the PostgreSQL database I use for testing. The batch file drops all of the databases, then recreates/reloads them from a previous dump file made from our production database. However, I sometimes run into a problem if I've accidentally left a connection open to that server/database. The "drop" portion fails because there are still users connected (me).
I've been trying to "tweak" my batch file with a command to disconnect all users from the database(s) prior to issuing the command to drop them, but I can't get that part (disconnection) to work. I've taken the disconnect code from another SO question How to drop a PostgreSQL database if there are active connections to it?, and I've been looking at other questions like How to execute postgres' sql queries from batch file? for help with the syntax.
I've also seen the "alternate" syntax for a not equal operator on the 9.2. Comparison Functions and Operators page of the official PostgreSQL documentation, but that seems to also be using "special" characters that would require escaping, so I'm not sure how to proceed.
At this point, the batch file looks like this:
#Echo OFF
SET PGPASSWORD=PASSWORD
cd /D "C:\PostgreSQL\bin"
psql.exe -h localhost -p 5432 -d postgres -U username -c 'SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = ''betadb'' AND pid \<\> pg_backend_pid();'
dropdb.exe -h localhost -p 5432 -U username betadb
psql.exe -h localhost -p 5432 -d postgres -U username < "C:\PostgresSQL\prodserverdump.sql"
Everything else works except for the pg_terminate_backend query. Every time I run that, I get strange errors indicating a problem with a path, or a file, or something else like that. I believe I've narrowed the problem down to the "not equal" operator (<>) in the query, but I can't seem to find the correct way to escape this so it doesn't try to pipe in data from a file that's not being defined.
I've tried using single backslashes (\) and double backslashes (\\), in front of one or both of the characters in the operator, but that doesn't appear to work. Is there a special way to escape the "greater than" and "less than" characters for the -c command line option in psql?
Using a combination of suggestions and "trial & error", I believe I found the correct syntax for executing this particular SQL command through a batch file.
Trying the "alternative" not equal operator (!=), I was still getting errors. They were different errors (it was giving me some nonsense about too many parameters), but it still wouldn't execute.
Using #Compo's suggestion from the comments, I then tried to enclose the entire SELECT statement in double quotes instead of single quotes. Still not quite there.
Finally, I removed the "extra" single quotes I was using around the database names from before. The query appears to have executed properly.
The final result looks like this:
#Echo OFF
SET PGPASSWORD=PASSWORD
cd /D "C:\PostgreSQL\bin"
psql.exe -h localhost -p 5432 -d postgres -U username -c "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = 'betadb' AND pid != pg_backend_pid();"
dropdb.exe -h localhost -p 5432 -U username betadb
psql.exe -h localhost -p 5432 -d postgres -U username < "C:\PostgresSQL\prodserverdump.sql"
I suppose I had assumed that, because all of the examples I had found were using single quotes to surround the SQL statement, that's what I had to use. Apparently, that assumption was incorrect.
Regardless, it all seems to be working correctly now. Hope this helps someone else who's looking to accomplish something similar.

SOLR POST files with no extension

I am using SOLR 5 and I want to scan documents that have no extensions. Unfortunately changing the file to have extensions is not an option in my case.
the command I am using is simply:
$bin/post -c mycore ../foldertobescaned -type application/pdf
the command works fine for documents that do have extension but I am getting:
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
If renaming the files is not an option, you can use the following script as a workaround until Solr improves its post method. It is a simple bash for loop that submits each file individually and works regardless of the file extension. Note that this script will be slower than using post on the whole folder, because each individual file transfer needs to be initialized.
Save the script below as postFolderToSolr.sh inside your Solr folder (so that Solrs bin/ folder is a subdirectory), make it executable with chmod +x postFolderToSolr.sh and then use it as follows: ./postFolderToSolr.sh mycore /home/user1/foldertobescaned/ application/pdf
Using no arguments or the wrong number of arguments prints a short usage message as help.
#!/bin/bash
set -o nounset
if [ "$#" -ne 3 ]
then
echo "Post contents of a folder to Solr."
echo
echo "Usage: postFolderToSolr.sh <colletionName> </path/to/folder> <MIME>"
echo
exit 1
fi
collection=$1
inputPath=${2%/} # remove suffix / if it exists
mime=$3
for element in $inputPath"/"*; do
bin/post -c $collection -type $mime $element
done

mysqldump puts CREATE on the first line with mysqldump

Here's my full bash script:
#!/bin/bash
logs="$HOME/sitedb_backups/log"
mysql_user="user"
mysql_password="pass"
mysql=/usr/bin/mysql
mysqldump=/usr/bin/mysqldump
tbackups="$HOME/sitedb_backups/today"
ybackups="$HOME/sitedb_backups/yesterday"
echo "`date`" > $logs/backups.log
rm $ybackups/* >> $logs/backups.log
mv $tbackups/* $ybackups/ >> $logs/backups.log
databases=`$mysql --user=$mysql_user -p$mysql_password -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema)"`
for db in $databases ; do
$mysqldump --force --opt --user=$mysql_user -p$mysql_password --databases $db | gzip > "$tbackups/$db.gz"
echo -e "\r\nBackup of $db successfull" >> $logs/backups.log
done
mail -s "Your DB backups is ready!" yourmail#gmail.com <<< "Today: "`date`"
DB backups of every site is ready."
exit 0
Problem is when i try to import it with mysql i am gettint error 1044 error connecting to oldname_db. When i opened sql file i have noticed on the first line CREATE command so it tries to create that database with the old name. How can i solve that problem?
SOLVED.
Using --databases parameter in my case is not necessary and because of --databases it was generating CREATE and USE action in the beginning of the sql file, hope it helps somebody else.
Use the --no-create-db option of mysqldump.
From man mysqldump:
--no-create-db, -n
This option suppresses the CREATE DATABASE statements that are
otherwise included in the output if the --databases or --all-databases
option is given.

finding a file in unix using wildcards in file name

I have few files in a folder with name pattern in which one of the section is variable.
file1.abc.12.xyz
file2.abc.14.xyz
file3.abc.98.xyz
So the third section (numeric) in above three file names changes everyday.
Now, I have a script which does some tasks on the file data. However, before doing the work, I want to check whether the file exists or not and then do the task:
if(file exist) then
//do this
fi
I wrote the below code using wildcard '*' in numeric section:
export mydir=/myprog/mydata
if[find $mydir/file1.abc.*.xyz]; then
# my tasks here
fi
However, it is not working and giving below error:
[find: not found [No such file or directory]
Using -f instead of find does not work as well:
if[-f $mydir/file1.abc.*.xyz]; then
# my tasks here
fi
What am I doing wrong here ? I am using korn shell.
Thanks for reading!
for i in file1.abc.*.xyz ; do
# use $i here ...
done
I was not using spaces before the unix keywords...
For e.g. "if[-f" should actually be " if [ -f" with spaces before and after the bracket.

Solr Server Posting Error

How to post 5000 files to Solr server?
While posting by using command "java -jar post.jar dir/*.xml", command tool tells Argument list is too long.
The quickest solution would be using a bash script like the following:
for i in $( ls *.xml); do
cat $i | curl -X POST -H 'Content-Type: text/xml' -d #- http://localhost:8080/solr/update
echo item: $i
done
which adds to Solr, using curl, all the xml files within the current directory.
Otherwise you can write a Java main similar to the one included in post.jar, which adds all the xml files within a directory instead of having to pass all of them as arguments.

Resources