I have a database with one table that holds hundreds of records. I need to write a for loop in a Groovy script that compares the first record with the second, the second with the third, and so on. I need to compare the change in length between consecutive records and print every change greater than 30. Example: first record 30m, second record 40m, third record 100m. It should print the second-third pair.
I don't know the number of records in the table, so I don't know how to write the for loop. Any suggestions?
The records also have an ip. Each ip can appear multiple times, and I need to compare all records within each ip.
record 1:
port_nbr | 1
pair | pairA
length | 30.00
add_date | 2020-06-16 00:01:13.237164
record 2:
port_nbr | 1
pair | pairA
length | 65.00
add_date | 2020-06-16 00:02:13.237164
record 3:
port_nbr | 2
pair | pairc
length | 65.00
add_date | 2020-06-16 00:02:13.237164
I expect the for loop to check whether the current record's port_nbr is the same as the next record's. If it is, it checks whether pair is the same, and if so, it checks whether length changed by more than 30m. In this case it would report a 30+m change between records 1 and 2. After that, it would compare the second and third records. They don't share the same port_nbr and pair, so I expect it to start over, comparing the records with port_nbr 2 against the records that follow.
There could even be 10 records with port_nbr 1 but with different pairs. I need to check pair as well, and only then compare lengths.
My code at this moment:
import java.sql.*;
import groovy.sql.Sql
class Main{
static void main(String[] args) {
def dst_db1 = Sql.newInstance('connection.........')
dst_db1.getConnection().setAutoCommit(false)
def sql = (" select d.* from (select d.*, lead((case when length <> 'N/A' then length else length_to_fault end)::float) over (partition by port_nbr, pair order by port_nbr, pair, d.add_date) as lengthh from diags d)d limit 10")
def lastRow = [id:-1, port_nbr:-1, pair:'', lengthh:-1.0]
dst_db1.eachRow( sql ) {row ->
if( row.port_nbr == lastRow.port_nbr && row.pair == lastRow.pair){
BigDecimal lengthChange =
new BigDecimal(row.lengthh ? row.lengthh : 0 ) - new BigDecimal(lastRow.lengthh ? lastRow.lengthh :0 )
if( lengthChange > 30.0){
print "Port ${row.port_nbr}, ${row.pair} length change: $lengthChange"
println "\tbetween row ID ${lastRow.id} and ${row.id}"
}
lastRow = row
}else{
println "Key Changed"
lastRow = row
}
}
}
}
The following code will report length changes > 30 within the same port_nbr and pair.
def sql = 'Your SQL here.' // Should include "order by pair, port_nbr, date"
def lastRow = [id:-1, port_nbr:-1, pair:'', length:-1.0]
dst_db1.eachRow( sql ) { row ->
if ( row.port_nbr == lastRow.port_nbr && row.pair == lastRow.pair ) {
BigDecimal lengthChange =
new BigDecimal( row.length ) - new BigDecimal( lastRow.length )
if ( lengthChange > 30.0 ) {
print "Port ${row.port_nbr}, ${row.pair} length change: $lengthChange"
println "\tbetween row ID ${lastRow.id} and ${row.id}"
}
lastRow = row
} else {
println "Key changed"
lastRow = row
}
}
To run the above code without a database I prefixed it with this test code:
class DstDb1 {
def eachRow ( sql, closure ) {
rows.each( closure )
}
def rows = [
[id: 1, port_nbr: 1, pair: 'pairA', length: 30.00 ],
[id: 2, port_nbr: 1, pair: 'pairA', length: 65.00 ],
[id: 3, port_nbr: 1, pair: 'pairA', length: 70.00 ],
[id: 4, port_nbr: 1, pair: 'pairA', length: 75.00 ],
[id: 5, port_nbr: 1, pair: 'pairB', length: 130.00 ],
[id: 6, port_nbr: 1, pair: 'pairB', length: 165.00 ],
[id: 7, port_nbr: 1, pair: 'pairB', length: 170.00 ],
[id: 8, port_nbr: 1, pair: 'pairB', length: 175.00 ],
[id: 9, port_nbr: 2, pair: 'pairC', length: 230.00 ],
[id:10, port_nbr: 2, pair: 'pairC', length: 265.00 ],
[id:11, port_nbr: 2, pair: 'pairC', length: 270.00 ],
[id:12, port_nbr: 2, pair: 'pairC', length: 350.00 ]
]
}
DstDb1 dst_db1 = new DstDb1()
Running the test gives this result:
Key changed
Port 1, pairA length change: 35 between row ID 1 and 2
Key changed
Port 1, pairB length change: 35 between row ID 5 and 6
Key changed
Port 2, pairC length change: 35 between row ID 9 and 10
Port 2, pairC length change: 80 between row ID 11 and 12
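The same adjacent-row comparison can be sketched outside Groovy. The following is a minimal Python sketch, not the answer's code: the sample rows, the function name large_changes and the threshold parameter are assumptions made for illustration. It groups rows by (port_nbr, pair) and compares each row with the one before it inside the group, assuming the rows arrive in add_date order.

```python
from itertools import groupby

# Hypothetical sample rows, modelled on the test data above.
rows = [
    {"id": 1, "port_nbr": 1, "pair": "pairA", "length": 30.0},
    {"id": 2, "port_nbr": 1, "pair": "pairA", "length": 65.0},
    {"id": 3, "port_nbr": 1, "pair": "pairA", "length": 70.0},
    {"id": 4, "port_nbr": 2, "pair": "pairC", "length": 270.0},
    {"id": 5, "port_nbr": 2, "pair": "pairC", "length": 350.0},
]


def large_changes(rows, threshold=30.0):
    """Return (port_nbr, pair, previous_id, current_id, change) for each jump above threshold."""
    def key(row):
        return (row["port_nbr"], row["pair"])

    found = []
    # sorted() is stable, so rows keep their incoming order inside each group.
    for (port, pair), group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        # zip(group, group[1:]) yields each row together with its successor.
        for prev, cur in zip(group, group[1:]):
            change = cur["length"] - prev["length"]
            if change > threshold:
                found.append((port, pair, prev["id"], cur["id"], change))
    return found


for port, pair, prev_id, cur_id, change in large_changes(rows):
    print(f"Port {port}, {pair} length change: {change} between row ID {prev_id} and {cur_id}")
```

Like the Groovy version, this only reports increases. Wrapping the difference in abs() would also catch decreases, if that is what "length change" is meant to cover.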
This is my custom function:
function scope(tbl, depth)
if (depth > 0) then
for k, v in pairs(tbl) do
if (type(v) ~= "table") then
print(v)
else
scope(v, depth - 1)
end
end
end
end
This is the usage. Given:
stuff = {
fruit = {
yellow = {
"Banana"
}, -- depth = 3
red = {
"Apple"
} -- depth = 3
},
city = {
"Toronto"
}, -- depth = 2
name = {
"Claudia"
} -- depth = 2
}
scope(stuff, 2) prints
Toronto
Claudia
Otherwise, scope(stuff, 3) prints
Banana
Apple
Toronto
Claudia
Any advice on how to improve it? For example, it could display nil when I specify a depth of 1 or a depth greater than 3 (the depth of the table).
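One possible direction, sketched here in Python rather than Lua: collect the values instead of printing them, so the caller can tell when nothing was found. The helper names leaves_within and max_depth are made up for this sketch, Lua tables are modelled as dicts, and None stands in for nil.

```python
def leaves_within(tbl, depth):
    """Collect non-dict values reachable through at most `depth` levels of nesting."""
    if depth <= 0:
        return []
    found = []
    for value in tbl.values():
        if isinstance(value, dict):
            found.extend(leaves_within(value, depth - 1))
        else:
            found.append(value)
    return found


def max_depth(tbl):
    """Number of nested levels in tbl (a flat dict has depth 1)."""
    nested = [max_depth(v) for v in tbl.values() if isinstance(v, dict)]
    return 1 + max(nested, default=0)


def scope(tbl, depth):
    """Return the collected values, or None if depth finds nothing or exceeds the table's depth."""
    if depth > max_depth(tbl):
        return None
    return leaves_within(tbl, depth) or None


stuff = {
    "fruit": {"yellow": {1: "Banana"}, "red": {1: "Apple"}},
    "city": {1: "Toronto"},
    "name": {1: "Claudia"},
}

print(scope(stuff, 1))  # None
print(scope(stuff, 2))  # ['Toronto', 'Claudia']
print(scope(stuff, 3))  # ['Banana', 'Apple', 'Toronto', 'Claudia']
print(scope(stuff, 4))  # None
```

Python dicts keep insertion order, whereas Lua's pairs gives no ordering guarantee, so a Lua port may list the values in a different order.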
How would I go about compiling values from a table using a string?
i.e.
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
If, for example, I were to request "1ABC3", how would I get it to output 1 1 2 3 3?
Greatly appreciate any response.
Try this:
s="1ABC3z9"
t=s:gsub(".",function (x)
local y=tonumber(x)
if y~=nil then
y=NumberDef[y]
else
y=TextDef[x:lower()]
end
return (y or x).." "
end)
print(t)
This may be simplified if you combine the two tables into one.
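To illustrate the combined-table idea, here is a minimal sketch in Python (not Lua). The table name COMBINED and the function translate are assumptions for this example. Digit keys are stored as strings, so every character goes through the same lookup.

```python
# One combined lookup table; digits are keyed as strings like the letters.
COMBINED = {"1": 1, "2": 2, "3": 3, "a": 1, "b": 2, "c": 3}


def translate(text):
    """Map each character through COMBINED, keeping unknown characters unchanged."""
    return " ".join(str(COMBINED.get(ch.lower(), ch)) for ch in text)


print(translate("1ABC3"))    # 1 1 2 3 3
print(translate("1ABC3z9"))  # 1 1 2 3 3 z 9
```

With a single table there is no need to test whether a character is a number before choosing which table to consult.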
You can access values in a Lua table like so:
TableName["IndexNameOrNumber"]
Using your example:
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
print(NumberDef[2])--Will print 2
print(TextDef["c"])--will print 3
If you wish to access all values of a Lua table, you can loop through them like so (similar to a foreach in C#):
for i,v in next, TextDef do
print(i, v)
end
--Output:
--c 3
--a 1
--b 2
So to answer your request, you would request those values like so:
print(NumberDef[1], TextDef["a"], TextDef["b"], TextDef["c"], NumberDef[3])--Will print 1 1 2 3 3
One more point: if you're interested in concatenating Lua strings, this can be accomplished like so:
string1 = string2 .. string3
Example:
local StringValue1 = "I"
local StringValue2 = "Love"
local StringValue3 = StringValue1 .. " " .. StringValue2 .. " Memes!"
print(StringValue3) -- Will print "I Love Memes!"
UPDATE
I whipped up a quick example you could use to handle what you're looking for. It goes through the input string and checks each of the two tables for the current character. If the value exists, it is appended to a string, and the final result is printed at the end.
local StringInput = "1abc3" -- Your request to find
local CombineString = "" --To combine the request
NumberDef = {
[1] = 1,
[2] = 2,
[3] = 3
}
TextDef = {
["a"] = 1,
["b"] = 2,
["c"] = 3
}
for i=1, #StringInput do --Foreach character in the string inputted do this
local CurrentCharacter = StringInput:sub(i,i); --get the current character from the loop position
local Num = tonumber(CurrentCharacter)--If possible convert to number
if TextDef[CurrentCharacter] then--if it exists in text def array then combine it
CombineString = CombineString .. TextDef[CurrentCharacter]
end
if NumberDef[Num] then --if it exists in number def array then combine it
CombineString = CombineString .. NumberDef[Num]
end
end
print("Combined: ", CombineString) --print the final product.
I'll try my best to explain the situation.
I have the following db columns:
oid - task - start - end - realstart - realend
My requirement is to have an output like the following:
oid1 - task1 - start1 - end1
oid2 - task2 - start2 - end2
where task1 is task, task2 is task + "real", start1 is start, start2 is realstart, end1 is end, end2 is realend
BUT
the first row should always be created (those start/end fields are never empty) the second row should only be created if realstart and realend exist which may not be true.
Inputs are 6 arrays (one for each column), Outputs must be 4 arrays, something like this:
#input oid,task,start,end,realstart,realend
#output oid,task,start,end
I was thinking about using something like oid.each but I don't know how to add nodes after the current one. Order is important in the requirement.
For any explanation please ask, thanks!
After your comment and understanding that you don't want (or cannot) change the input/output data format, here's another solution that does what you've asked using classes to group the data and make it easier to manage:
import groovy.transform.Canonical
@Canonical
class Input {
String[] oids = [ 'oid1', 'oid2' ]
String[] tasks = [ 'task1', 'task2' ]
Integer[] starts = [ 10, 30 ]
Integer[] ends = [ 20, 42 ]
Integer[] realstarts = [ 12, null ]
Integer[] realends = [ 21, null ]
List<Object[]> getEntries() {
// ensure all entries have the same size
def entries = [ oids, tasks, starts, ends, realstarts, realends ]
assert entries.collect { it.size() }.unique().size() == 1,
'The input arrays do not all have the same size'
return entries
}
int getSize() {
oids.size() // any field would do, they have the same length
}
}
@Canonical
class Output {
List oids = [ ]
List tasks = [ ]
List starts = [ ]
List ends = [ ]
void add( oid, task, start, end, realstart, realend ) {
oids << oid; tasks << task; starts << start; ends << end
if ( realstart != null && realend != null ) {
oids << oid; tasks << task + 'real'; starts << realstart; ends << realend
}
}
}
def input = new Input()
def entries = input.entries
def output = new Output()
for ( int i = 0; i < input.size; i++ ) {
def entry = entries.collect { it[ i ] }
output.add( *entry )
}
println output
Responsibility of arranging the data is on the Input class, while the responsibility of knowing how to organize the output data is in the Output class.
Running this code prints:
Output([oid1, oid1, oid2], [task1, task1real, task2], [10, 12, 30], [20, 21, 42])
You can get the arrays (Lists, actually, but you can call toArray() on a List to get an array) from the output object with output.oids, output.tasks, output.starts and output.ends.
The @Canonical annotation just makes the class implement equals, hashCode, toString and so on.
If you don't understand something, ask in the comments.
If you need an "array" whose size you don't know from the start, you should use a List instead. In Groovy, Lists are very easy to use.
Here's an example:
final int OID = 0
final int TASK = 1
final int START = 2
final int END = 3
final int R_START = 4
final int R_END = 5
List<Object[]> input = [
//oid, task, start, end, realstart, realend
[ 'oid1', 'task1', 10, 20, 12, 21 ],
[ 'oid2', 'task2', 30, 42, null, null ]
]
List<List> output = [ ]
input.each { row ->
output << [ row[ OID ], row[ TASK ], row[ START ], row[ END ] ]
if ( row[ R_START ] != null && row[ R_END ] != null ) { // explicit null checks, so a value of 0 still counts
output << [ row[ OID ], row[ TASK ] + 'real', row[ R_START ], row[ R_END ] ]
}
}
println output
Which outputs:
[[oid1, task1, 10, 20], [oid1, task1real, 12, 21], [oid2, task2, 30, 42]]
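For comparison, the six-arrays-in, four-arrays-out requirement from the question can be sketched in Python. This is an illustration under assumptions, not the Groovy answer: the function name expand is made up, and None plays the role of a missing realstart/realend.

```python
def expand(oids, tasks, starts, ends, realstarts, realends):
    """Build four output lists; add a '<task>real' row only when both real values exist."""
    out_oids, out_tasks, out_starts, out_ends = [], [], [], []
    # zip walks the six input lists in parallel, one record per iteration.
    # Note: zip stops at the shortest list, so the inputs should have equal length.
    for oid, task, start, end, rstart, rend in zip(
        oids, tasks, starts, ends, realstarts, realends
    ):
        # The planned row is always emitted.
        out_oids.append(oid)
        out_tasks.append(task)
        out_starts.append(start)
        out_ends.append(end)
        # The "real" row directly follows its planned row, preserving order.
        if rstart is not None and rend is not None:
            out_oids.append(oid)
            out_tasks.append(task + "real")
            out_starts.append(rstart)
            out_ends.append(rend)
    return out_oids, out_tasks, out_starts, out_ends


result = expand(
    ["oid1", "oid2"],
    ["task1", "task2"],
    [10, 30],
    [20, 42],
    [12, None],
    [21, None],
)
print(result)
```

Appending to fresh output lists sidesteps the problem of inserting nodes after the current one while iterating.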
I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop (see below), but I'm wondering if there is a shorter way that makes better use of Spark and Scala (i.e. can I decrease computation time?).
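// Split each line into trimmed fields and skip the header row (assumed to start with "Year").
val names = sc.textFile("names.csv").map(l => l.split(",").map(_.trim)).filter(x => x(0) != "Year")
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (county <- uniqueCounty) {
  // Number is column index 3; convert it to Int so reduceByKey sums instead of concatenating.
  val eachCounty = names.filter(x => x(2) == county).map(l => (l(1), l(3).toInt)).reduceByKey((a, b) => a + b).sortBy(-_._2)
  println("County:" + county + eachCounty.first)
}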
Here is a solution using RDDs. I am assuming you need the top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50),(2000, "BOB", "KINGS", 40),(2000, "MARY", "NASSAU", 60),(2001, "JOHN", "KINGS", 14),(2001, "JANE", "KINGS", 30),(2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
//Reduce the uniq values for county/name as combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3,x._2),x._4)).reduceByKey(_+_)
// Group names per county.
val countyNameRdd = uniqNamePerCountyRdd.map(x=>(x._1._1,(x._1._2,x._2))).groupByKey()
// Sort descending by total and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
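As a cross-check of the aggregation logic (sum Number per county/name combination, then keep the name with the largest total in each county), here is a small plain-Python sketch. It does not use Spark, and the function name top_name_per_county is an assumption for illustration.

```python
from collections import defaultdict

data = [
    (2000, "JOHN", "KINGS", 50),
    (2000, "BOB", "KINGS", 40),
    (2000, "MARY", "NASSAU", 60),
    (2001, "JOHN", "KINGS", 14),
    (2001, "JANE", "KINGS", 30),
    (2001, "BOB", "NASSAU", 45),
]


def top_name_per_county(records):
    """Return {county: (name, total)} holding the name with the largest summed Number."""
    # Step 1: sum Number per (county, name), the composite key.
    totals = defaultdict(int)
    for _year, name, county, number in records:
        totals[(county, name)] += number
    # Step 2: keep the largest total seen for each county.
    best = {}
    for (county, name), total in totals.items():
        if county not in best or total > best[county][1]:
            best[county] = (name, total)
    return best


print(top_name_per_county(data))
# {'KINGS': ('JOHN', 64), 'NASSAU': ('MARY', 60)}
```

On the sample data, JOHN totals 50 + 14 = 64 in KINGS and MARY totals 60 in NASSAU.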
You could use spark-csv and the DataFrame API. If you are using the newer Spark 2.0, it is slightly different: Spark 2.0 has a native CSV data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS| 50|
|2000| BOB| KINGS| 40|
|2000|MARY|NASSAU| 60|
|2001|JOHN| KINGS| 14|
|2001|JANE| KINGS| 30|
|2001| BOB|NASSAU| 45|
+----+----+------+------+
DataFrames provide a set of operations for structured data manipulation. You could use some basic operations to get your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU| 60|
| KINGS| 50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the agg() function.
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County,count(1) as Occurrence FROM names GROUP BY Name, County ORDER BY " +
"count(1) DESC) n").groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN| 2|
|NASSAU|MARY| 1|
+------+----+---------------+
I'm querying a database containing entries like those in the example below. Each entry contains the following values:
_id: unique id of the overallitem and of each placed_items element
name: the name of the overallitem
loc: location of the overallitem and placed_items
time_id: time the overallitem was stored
placed_items: array containing placed_items (can range from zero, placed_items : [], to an unlimited number)
category_id: the category of the placed_items
full_id: the full id of the placed_items
I want to extract the name, full_id and category_id on a per placed_items level, given a time_id and loc constraint.
Example data:
{
"_id" : "5040",
"name" : "entry1",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [],
}
{
"_id" : "5041",
"name" : "entry2",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [
{
"_id" : "5043",
"category_id" : 101,
"full_id" : 901,
},
{
"_id" : "5044",
"category_id" : 102,
"full_id" : 902,
}
],
}
{
"_id" : "5042",
"name" : "entry3",
"loc" : 1,
"time_id" : 20121001,
"placed_items" : [
{
"_id" : "5045",
"category_id" : 101,
"full_id" : 903,
},
],
}
The expected outcome for this example would be:
"name" "full_id" "category_id"
"entry2" 901 101
"entry2" 902 102
"entry3" 903 101
So if placed_items is empty, do not put the entry in the dataframe, and if placed_items contains n entries, put n entries in the dataframe.
I tried to adapt an R-bloggers example to create the desired dataframe.
#Set up database
mongo <- mongo.create()
#Set up condition
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "loc", 1)
mongo.bson.buffer.start.object(buf, "time_id")
mongo.bson.buffer.append(buf, "$gte", 20120930)
mongo.bson.buffer.append(buf, "$lte", 20121002)
mongo.bson.buffer.finish.object(buf)
query <- mongo.bson.from.buffer(buf)
#Count
count <- mongo.count(mongo, "items_test.overallitem", query)
#Note that these counts don't work, since the count should be based on
#the number of placed_items in the array, and not the number of entries.
#Setup Cursor
cursor <- mongo.find(mongo, "items_test.overallitem", query)
#Create vectors, which will be filled by the while loop
name <- vector("character", count)
full_id<- vector("character", count)
category_id<- vector("character", count)
i <- 1
#Fill vectors
while (mongo.cursor.next(cursor)) {
b <- mongo.cursor.value(cursor)
name[i] <- mongo.bson.value(b, "name")
full_id[i] <- mongo.bson.value(b, "placed_items.full_id")
category_id[i] <- mongo.bson.value(b, "placed_items.category_id")
i <- i + 1
}
#Convert to dataframe
results <- as.data.frame(list(name=name, full_id=full_id, category_id=category_id))
The conditions work, and the code works if I want to extract values on an overallitem level (i.e. _id or name), but it fails to gather the information on a placed_items level. Furthermore, the dotted call for extracting full_id and category_id does not seem to work. Can anyone help?
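The shape of the desired result can be sketched independently of the R driver. The following minimal Python sketch assumes the matching documents have already been fetched (filtered by loc and time_id) and are available as plain dicts. The function name flatten_placed_items is made up for this illustration and no MongoDB library call is involved.

```python
# Hypothetical documents, mirroring the example data above.
docs = [
    {"_id": "5040", "name": "entry1", "loc": 1, "time_id": 20121001, "placed_items": []},
    {
        "_id": "5041", "name": "entry2", "loc": 1, "time_id": 20121001,
        "placed_items": [
            {"_id": "5043", "category_id": 101, "full_id": 901},
            {"_id": "5044", "category_id": 102, "full_id": 902},
        ],
    },
    {
        "_id": "5042", "name": "entry3", "loc": 1, "time_id": 20121001,
        "placed_items": [{"_id": "5045", "category_id": 101, "full_id": 903}],
    },
]


def flatten_placed_items(documents):
    """Return one (name, full_id, category_id) row per element of placed_items."""
    rows = []
    for doc in documents:
        # An empty placed_items array contributes no rows at all.
        for item in doc.get("placed_items", []):
            rows.append((doc["name"], item["full_id"], item["category_id"]))
    return rows


for row in flatten_placed_items(docs):
    print(row)
# ('entry2', 901, 101)
# ('entry2', 902, 102)
# ('entry3', 903, 101)
```

The key point is the nested loop: the number of output rows depends on the total number of placed_items elements, not on the number of documents, so the result vectors cannot be pre-sized from the document count.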