Read data in 100-day chunks until we get the complete data into the Hive database

I am copying data from prod to test in Hive using a bash script, for testing purposes. While doing so for one table, I ran into a memory heap issue.
To solve this, I am planning to read the data in 100-day chunks, going from the run date (the day I execute the script) back to the earliest day for which data is available. Can you please let me know how to achieve this using bash, and also whether there is any other approach besides increasing the memory settings?

You basically need to run a HiveQL (.hql) script from the shell.
Create a .hql script with your query that pulls only the last 100 days of data.
example.hql
select * from my_database.my_table
where insert_date BETWEEN '2018-07-01' AND '2018-10-01';
Now you can run this script from the shell:
hive -f example.hql
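If you don't want to hard-code the dates in the .hql file, they can be passed in as Hive variables instead, assuming your Hive version supports --hivevar (the names start_date and end_date are just chosen here for illustration):
hive --hivevar start_date='2018-07-01' --hivevar end_date='2018-10-01' -f example.hql
with example.hql referencing the variables:
select * from my_database.my_table
where insert_date BETWEEN '${hivevar:start_date}' AND '${hivevar:end_date}';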
Or you can create a shell script and execute your query in it.
run.sh
#!/bin/bash
hive -e "select * from my_database.my_table
where insert_date BETWEEN '2018-07-01' AND '2018-10-01'" > select.txt
result=$?
if [ $result -ne 0 ]; then
echo "Error!!!!"
echo "Hive error number is: $result"
exit 1
else
echo "no error, do your stuffs"
fi
Then execute your shell script with sh run.sh.
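To actually walk backwards from the run date in 100-day windows until you reach the oldest data, you can compute the window boundaries in the shell and call the parameterised example.hql once per window. This is only a sketch: it assumes GNU date, an insert_date column in YYYY-MM-DD format, and that you know (or can query) the earliest date that holds data; EARLIEST below is a placeholder you would have to set.
#!/bin/bash
# Sketch: copy the table in 100-day windows, newest first.
export TZ=UTC                                   # keep day arithmetic at exactly 86400 seconds
EARLIEST='2018-01-01'                           # placeholder: oldest insert_date you need

end_epoch=$(date +%s)                           # run date
earliest_epoch=$(date -d "$EARLIEST" +%s)

while (( end_epoch >= earliest_epoch )); do
    start_epoch=$(( end_epoch - 100*24*3600 ))
    (( start_epoch < earliest_epoch )) && start_epoch=$earliest_epoch
    start=$(date -d "@$start_epoch" +%Y-%m-%d)
    end=$(date -d "@$end_epoch" +%Y-%m-%d)
    echo "Copying window $start .. $end"
    hive --hivevar start_date="$start" --hivevar end_date="$end" -f example.hql || exit 1
    end_epoch=$(( start_epoch - 24*3600 ))      # continue with the day before this window
done
Each window is inclusive on both ends and the next window ends the day before the previous one started, so the ranges do not overlap.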

Bash manipulate and sort file content with arrays via loop

Purpose
Create a bash script which loops through certain commands and saves the output of each command (they print only numbers) into a file (I guess the best way is to save them in a file?), with the date (Unix time) next to each output, so that we can use these stored values the next time we run the script and loop through again, to see whether any command's output has not changed within the last hour.
Example output
# ./script
command1 123123
command2 123123
Important notes
There are around 200 commands which the script will loop through.
There'll be new commands in the future, so the script will have to check if the command exists in the saved file. If it is already present, only compare it within the last hour to see if the number has changed since the file was last saved. If it doesn't exist, save it into the file so we can use it for comparison next time.
The order of the commands which the script runs might differ as commands are added, removed, or changed. So if it's only like this for now:
# ./script
command1 123123
command2 123123
and you add a 3rd command in the future, the order might change (it is also not certain what pattern it follows), for example:
# ./script
command1 123123
command3 123123
command2 123123
so we can't, for example, just read it line by line; in this case, I believe the best way is to compare entries by their command* names.
Structure for stored values
My presumed structure for stored values is like this (we don't have to stick with this one, though):
command1 123123 unixtime
command2 123123 unixtime
About the said commands
The things I call commands are basically applications which live in /usr/local/bin/ and can be run directly by name in the shell, like command1 getnumber, which will print the number.
Since the commands are located in /usr/local/bin/ and follow a similar naming pattern, I'm first looping through /usr/local/bin/ for command*. See below.
commands=`find /usr/local/bin/ -name 'command*'`
for i in $commands; do
echo "$i" "`$i getnumber`"
done
so this will loop through all files that start with command and run command* getnumber for each one, which will print out the numbers we need.
Now we need to store these values in a file to compare them next time we run the command.
Catch:
We may even run the script every few minutes, but we only need to report if the values (numbers) haven't changed in the last hour.
The script will list the numbers every time you run it, and we may add styling to those that haven't changed in the last hour to make them stand out, maybe by coloring them red?
Attempt #1
So this is my first attempt at building this script. Here's what it looks like:
#!/bin/bash
commands=`find /usr/local/bin/ -name 'command*'`
date=`date +%s`
while read -r command number unixtime; do
for i in $commands; do
current_block_count=`$i getnumber`
if [[ $command = $i ]]; then
echo "$i exists in the file, checking the number changes within last hour" # just for debugging, will be removed in production
if (( ($date-$unixtime)/60000 > 60 )); then
if (( $number >= $current_number_count )); then
echo "There isn't a change within the last hour, this is a problem!" # just for debugging, will be removed in production
echo -e "$i" "`$i getnumber`" "/" "$number" "\e[31m< No change within last hour."
else
echo "$i" "`$i getnumber`"
echo "There's a change within the last hour, we're good." # just for debugging, will be removed in production
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
new_output=`$i getnumber`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i" "`$i getnumber`"
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
output_check="$i getnumber; date +%s"
new_output=`eval ${output_check}`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i does not exists in the file, adding it now" # just for debugging, will be removed in production
echo "$i" "`$i getnumber`" "`date +%s`" >> outputs.log
fi
done
done < outputs.log
Which was quite the disaster; in the end, it did nothing when I ran it.
Attempt #2
This time, I tried another approach, putting the for loop outside of the while loop.
#!/bin/bash
commands=`find /usr/local/bin/ -name 'command*'`
date=`date +%s`
for i in $commands; do
echo "${i}" "`$i getnumber`"
name=${i}
number=`$i getnumber`
unixtime=$date
echo "$name" "$number" "$unixtime" # just for debugging, will be removed in production
while read -r command number unixtime; do
if ! [ -z ${name+x} ]; then
echo "$name" "$number" "$unix" >> outputs.log
else
if [[ $name = $i ]]; then
if (( ($date-$unixtime)/60000 > 60 )); then
if (( $number >= $current_number_count )); then
echo "There isn't a change within the last hour, this is a problem!" # just for debugging, will be removed in production
echo -e "$i" "`$i getnumber`" "/" "$number" "\e[31m< No change within last hour."
else
echo "$i" "`$i getnumber`"
echo "There's a change within the last hour, we're good." # just for debugging, will be removed in production
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
new_output=`$i getnumber`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i" "`$i getnumber`"
# find the line number of $i so we can change it with the new output
line_number=`grep -Fn '$i' outputs.log`
output_check="$i getnumber; date +%s"
new_output=`eval ${output_check}`
sed -i "$line_numbers/.*/$new_output/" outputs.log
fi
else
echo "$i does not exists in the file, adding it now" # just for debugging, will be removed in production
echo "$i" "`$i getnumber`" "`date +%s`" >> outputs.log
fi
fi
done < outputs.log
done
Unfortunately, no luck for me, again.
Can someone give me a helping hand?
Additional notes #2
So basically: you run the script for the first time, outputs.log is empty, so you write the outputs of the commands into outputs.log.
Then, 10 minutes later, you run the script again; since only 10 minutes have passed and not more than an hour, the script won't check whether the numbers have changed. It will not touch the stored values, but it will still display the outputs of the commands every time you run it (their present outputs, not the stored values).
In this 10-minute timeframe, for example, new commands might have been added, so the script checks on every run whether each command's output is already stored, just to deal with new commands.
Now, let's say 1.2 hours have passed and you decide to run the script again; this time the script checks whether the numbers have changed in more than an hour and reports: Hey! More than an hour has passed and those numbers still haven't changed, there might be a problem!
Simple explanation
You have 100 commands to run; your script will loop through each of them and do the following for each:
Run the script whenever you want
On each run, check if outputs.log contains the command
If outputs.log contains the command, check its last stored date ($unixtime).
If the last stored date is more than an hour old, compare the number from the current run with the stored value.
If the number hasn't changed for more than an hour, print the command in red text.
If the number has changed, print the command as usual without any warning.
If the last stored date is less than an hour old, print the command as usual.
If outputs.log doesn't contain the command, simply store it in the file so it can be used for checks on the next runs.
The following uses a sqlite database to store results, instead of a flat file, which makes querying the history of previous runs easy:
#!/bin/sh
database=tracker.db
if [ ! -e "$database" ]; then
sqlite3 -batch "$database" <<EOF
CREATE TABLE IF NOT EXISTS outputs(command TEXT
, output INTEGER
, ts INTEGER NOT NULL DEFAULT (strftime('%s', 'now')));
CREATE INDEX IF NOT EXISTS outputs_idx ON outputs(command, ts);
EOF
fi
for cmd in /usr/local/bin/command*; do
f=$(basename "$cmd")
o=$("$cmd")
echo "$f $o"
sqlite3 -batch "$database" <<EOF
INSERT INTO outputs(command, output) VALUES ('$f', $o);
SELECT command || ' has unchanged output!'
FROM outputs
WHERE command = '$f' AND ts >= strftime('%s', 'now', '-1 hour')
GROUP BY command
HAVING count(DISTINCT output) = 1 AND count(*) > 1;
EOF
done
It lists commands that have had every run in the last hour produce the same output (and skips commands that have only run once). If instead you're interested in cases where the most recent output of each command is the same as the previous run in that hour timeframe, replace the sqlite3 invocation in the loop with:
sqlite3 -batch "$database" <<EOF
INSERT INTO outputs(command, output) VALUES ('$f', $o);
WITH prevs AS
(SELECT command
, output
, row_number() OVER w AS rn
, lead(output, 1) OVER w AS prev
FROM outputs
WHERE command = '$f' AND ts >= strftime('%s', 'now', '-1 hour')
WINDOW w AS (ORDER BY ts DESC))
SELECT command || ' has unchanged output!'
FROM prevs
WHERE output = prev AND rn = 1;
EOF
(This requires the sqlite3 shell from release 3.25 or newer because it uses features introduced then.)
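If you would rather keep the flat outputs.log file from the question instead of sqlite, here is a minimal sketch of the same idea using a bash 4+ associative array. It assumes the command VALUE UNIXTIME layout proposed in the question and that each command prints a single number; treat it as a starting point rather than a finished script.
#!/bin/bash
log=outputs.log
now=$(date +%s)
touch "$log"

# load the stored values into associative arrays keyed by command path
declare -A stored_value stored_time
while read -r cmd value ts; do
    stored_value[$cmd]=$value
    stored_time[$cmd]=$ts
done < "$log"

for cmd in /usr/local/bin/command*; do
    current=$("$cmd" getnumber)
    if [[ -z ${stored_time[$cmd]} ]]; then
        # first time we see this command: print it and remember it
        echo "$cmd $current"
        stored_value[$cmd]=$current
        stored_time[$cmd]=$now
    elif (( now - ${stored_time[$cmd]} >= 3600 )); then
        if [[ $current == "${stored_value[$cmd]}" ]]; then
            # no change for more than an hour: flag it in red, keep the old entry
            echo -e "$cmd $current \e[31m< no change within last hour\e[0m"
        else
            # changed: print as usual and start a new one-hour window
            echo "$cmd $current"
            stored_value[$cmd]=$current
            stored_time[$cmd]=$now
        fi
    else
        # stored less than an hour ago: just print, leave the stored entry alone
        echo "$cmd $current"
    fi
done

# rewrite the log from the in-memory state
: > "$log"
for cmd in "${!stored_value[@]}"; do
    echo "$cmd ${stored_value[$cmd]} ${stored_time[$cmd]}" >> "$log"
done
Because the state is keyed by command name rather than line position, the order in which commands appear does not matter, which was one of the stated constraints.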

How do I output the results of a HiveQL query to CSV using a shell script?

I would like to run multiple Hive queries, preferably in parallel rather than sequentially, and store the output of each query into a csv file. For example, query1 output in csv1, query2 output in csv2, etc. I would be running these queries after leaving work with the goal of having output to analyze during the next business day. I am interested in using a bash shell script because then I'd be able to set up a cron task to run it at a specific time of day.
I know how to store the results of a HiveQL query in a CSV file, one query at a time. I do that with something like the following:
hive -e "SELECT * FROM db.table;" | tr "\t" "," > example.csv
The problem with the above is that I have to monitor when the process finishes and manually start the next query. I also know how to run multiple queries, in sequence, like so:
hive -f hivequeries.hql
Is there a way to combine these two methods? Is there a smarter way to achieve my goals?
Code answers are preferred since I do not know bash well enough to write it from scratch.
This question is a variant of another question: How do I output the results of a HiveQL query to CSV?
You can run and monitor parallel jobs in a shell script:
#!/bin/bash
#Run parallel processes and wait for their completion
#Add loop here or add more calls
hive -e "SELECT * FROM db.table1;" | tr "\t" "," > example1.csv &
hive -e "SELECT * FROM db.table2;" | tr "\t" "," > example2.csv &
hive -e "SELECT * FROM db.table3;" | tr "\t" "," > example3.csv &
#Note the ampersand in above commands says to create parallel process
#You can wrap the hive call in a function and do some logging in it, etc
#And call the function as a parallel process in the same way
#Modify this script to fit your needs
#Now wait for all processes to complete
#Failed processes count
FAILED=0
for job in `jobs -p`
do
echo "job=$job"
wait $job || let "FAILED+=1"
done
#Final status check
if [ "$FAILED" != "0" ]; then
echo "Execution FAILED! ($FAILED)"
#Do something here, log or send a message, etc
exit 1
fi
#Normal exit
#Do something else here
exit 0
There are other ways (using xargs, GNU parallel) to run parallel processes in the shell, and a lot of resources on it. Read also https://www.slashroot.in/how-run-multiple-commands-parallel-linux and https://thoughtsimproved.wordpress.com/2015/05/18/parellel-processing-in-bash/
With GNU Parallel it looks like this:
doit() {
id="$1"
hive -e "SELECT * FROM db.table$id;" | tr "\t" "," > example"$id".csv
}
export -f doit
parallel --bar doit ::: 1 2 3 4
If your queries do not share the same template you can do:
queries.txt:
SELECT * FROM db.table1;
SELECT id,name FROM db.person;
... other queries ...
cat queries.txt | parallel --bar 'hive -e {} | tr "\t" "," > example{#}.csv'
Spend 15 minutes reading chapters 1+2 of https://doi.org/10.5281/zenodo.1146014 to learn the basics, and chapter 7 to learn more about how to run more jobs in parallel.
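Either variant can then be started automatically after work hours with cron, which was the stated goal. For example, a crontab entry that runs a wrapper script every weekday at 19:00 and keeps a log (run_queries.sh and the paths are hypothetical names; substitute whichever script above you end up using):
0 19 * * 1-5 /path/to/run_queries.sh >> /path/to/run_queries.log 2>&1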

Retrieve and save SQLPLUS query result into a variable bash

I am trying to save the response of a query made through SQLPLUS into a local variable, but when I execute the following code, I get the path as output instead of the value of the query. Could you please help me? I don't know what I am doing wrong:
#!/bin/bash
SQLPLUS="<Path to sqlplus> -s user/passwd"
X=$SQLPLUS<<EOF_SQL_1
set heading off;
select table1 from table 2 where parameter ='Properties';
exit;
EOF_SQL_1
echo $X
The result of this script is " -s user/passwd" when it should be the result of the query I made.
Please tell me what I am doing wrong :S
Using a heredoc and command substitution in the same command is not recommended; the easier way is to use a function:
fun_sql() {
sqlplus user/passwd@tnsname <<EOF
...
EOF
}
X=$(fun_sql)
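For example, adapting the function to the query from the question (a sketch: user/passwd and the column names are copied from the question, table2 is assumed to be the intended table name, -s suppresses the banner, and the heading/feedback settings keep the captured value clean; add @tnsname to the connect string if you need it):
fun_sql() {
sqlplus -s user/passwd <<EOF
set heading off
set feedback off
select table1 from table2 where parameter = 'Properties';
exit;
EOF
}
X=$(fun_sql)
echo "Query result: $X"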

Script for Deleting old data from sqlite3 databases

Currently I'm manually deleting old data (older than 90 days) from sqlite databases; below are the steps I'm currently following for that. Is it possible to do this job using a bash script?
1. cd /opt/db (my database location)
2. ls -lSh | head -n30 (sort the .db files by size, largest first, and note all the .db names)
3. sqlite3 test1.db (open the database)
4. delete from tbl_outbox where time<='2016-02-10 00:00:00'; (delete data older than 90 days)
5. vacuum;
There are more than 20 .db files, so I repeat steps 3 to 5 above one by one for each, like below:
sqlite3 test2.db
delete from tbl_outbox where time<='2016-02-10 00:00:00';
vacuum;
Can someone help me create a bash script to do this task?
Thanks
The shell can loop ... supposing you use bash or ksh, you can use something like the example below:
cd /opt/db
DATESTRING=$(date "+%Y-%m-%d 00:00:00" -d "now -90 day")
for DBFILE in *.db
do
echo "delete from tbl_outbox where time<='$DATESTRING'; vacuum;" | sqlite3 $DBFILE
done
If you prefer to have it run on a specific list of databases, substitute *.db with a space-separated list of your db file names ... if it's OK to have this run against /opt/db/*.db, it will need no editing when you add or remove a database.
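If you only want to touch the biggest databases, as in step 2 of the question, the glob can be replaced with a size-sorted listing. A sketch (assumes the .db file names contain no spaces):
cd /opt/db
DATESTRING=$(date "+%Y-%m-%d 00:00:00" -d "now -90 day")
for DBFILE in $(ls -S *.db | head -n 30)    # the 30 largest .db files
do
    echo "delete from tbl_outbox where time<='$DATESTRING'; vacuum;" | sqlite3 "$DBFILE"
done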

SGE array jobs and R

I currently have an R script written to perform a population genetic simulation, then write a table with my results to a text file. I would like to somehow run multiple instances of this script in parallel using an array job (my University's cluster uses SGE), and when it's all done I will have generated results files corresponding to each job (Results_1.txt, Results_2.txt, etc.).
Spent the better part of the afternoon reading and trying to figure out how to do this, but haven't really found anything along the lines of what I am trying to do. I was wondering if someone could provide an example or perhaps point me in the direction of something I could read to help with this.
To boil down mithrado's answer to the bare essentials:
Create a job script, pop_gen.bash, that (optionally) takes the SGE task id as an input argument and stores results in a specific file identified by the same task id:
#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
Submit this script as a job array, e.g. 1000 jobs:
qsub -t 1-1000 pop_gen.bash
Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.
Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R you can use it to write to the output file:
args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)
HTH
I am not used to doing this in R, but I've been using the same approach in Python. Imagine that you have a script genetic_simulation.r and it has 3 parameters:
--gene_id, --khmer_len and --output_file.
You will have one csv file, genetic_sim_parms.csv, with n rows:
first_gene,10,/result/first_gene.txt
...
nth_gene,6,/result/nth_gene.txt
An important detail is the first line of your genetic_simulation.r. It needs to tell the cluster which interpreter to use. You might need to tweak its parameters as well; depending on your setup, it will look like:
#!/path/to/Rscript --vanilla
And finally, you will need a array-job bash script:
#!/bin/bash
#$ -t 1-N   <-- replace N with the number of rows in genetic_sim_parms.csv
#$ -N genetic_simulation.r
echo "Starting on : $(date)"
echo "Running on node : $(hostname)"
echo "Current directory : $(pwd)"
echo "Current job ID : $JOB_ID"
echo "Current job name : $JOB_NAME"
echo "Task index number : $SGE_TASK_ID"
ID=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $1}' genetic_sim_parms.csv)
LEN=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $2}' genetic_sim_parms.csv)
OUTPUT=$(awk -F, -v "line=$SGE_TASK_ID" 'NR==line {print $3}' genetic_sim_parms.csv)
echo "id is: $ID"
Rscript genetic_simulation.r --gene_id $ID --khmer_len $LEN --output_file $OUTPUT
echo "Finished on : $(date)"
Hope this helps!
