Count unique rows of column in Array of object - arrays

Using
Ruby 1.9.3-p194
Rails 3.2.8
Here's what I need.
Count the distinct human resources (human_resource_id) and divide the total number of assignments (assignment_id) by that count.
So, the answer for the dummy-data as given below should be:
1.5 assignments per human resource
But I just don't know where to go anymore.
Here's what I tried:
Table name: Assignments
id | human_resource_id | assignment_id | assignment_start_date | assignment_expected_end_date
80101780 | 20200132 | 80101780 | 2012-10-25 | 2012-10-31
80101300 | 20200132 | 80101300 | 2012-07-07 | 2012-07-31
80101308 | 21100066 | 80101308 | 2012-07-09 | 2012-07-17
First I need to select the period I want to look at. This always goes back at most one year.
a = Assignment.find(:all, :conditions => { :assignment_expected_end_date => (DateTime.now - 1.year)..DateTime.now })
=> [
#<Assignment id: 80101780, human_resource_id: "20200132", assignment_id: "80101780", assignment_start_date: "2012-10-25", assignment_expected_end_date: "2012-10-31">,
#<Assignment id: 80101300, human_resource_id: "20200132", assignment_id: "80101300", assignment_start_date: "2012-07-07", assignment_expected_end_date: "2012-07-31">,
#<Assignment id: 80101308, human_resource_id: "21100066", assignment_id: "80101308", assignment_start_date: "2012-07-09", assignment_expected_end_date: "2012-07-17">
]
foo = a.group_by(&:human_resource_id)
Now I have a Hash that maps each human_resource_id to an array of Assignment objects, and I just don't know what to do next.
Can someone help me?

You can try to run the query directly in SQL (the * 1.0 avoids integer division):
ActiveRecord::Base.connection.select_value('SELECT COUNT(DISTINCT assignment_id) * 1.0 / COUNT(DISTINCT human_resource_id) AS ratio FROM assignments')

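You could do something like this, where assignments is the array selected above:
human_resource_count = assignments.collect { |a| a.human_resource_id }.uniq.count
assignment_count = assignments.collect { |a| a.assignment_id }.uniq.count
# assignments per human resource; to_f avoids integer division
result = assignment_count.to_f / human_resource_count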
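The arithmetic behind the expected 1.5 can be checked with a minimal Python sketch. It is not Rails code: the rows are copied from the dummy data in the question, and the function name assignments_per_resource is made up for illustration.

```python
# (human_resource_id, assignment_id) pairs copied from the question's dummy data.
rows = [
    ("20200132", "80101780"),
    ("20200132", "80101300"),
    ("21100066", "80101308"),
]

def assignments_per_resource(rows):
    """Distinct assignments divided by distinct human resources."""
    resources = {hr for hr, _ in rows}
    assignments = {a for _, a in rows}
    if not resources:
        return 0.0  # no data: avoid dividing by zero
    return len(assignments) / len(resources)

ratio = assignments_per_resource(rows)
print(ratio)  # 1.5
```

Note the direction of the division: 3 assignments over 2 human resources gives 1.5, while the inverse gives about 0.67 (or 0 with integer division).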


flink table API not processing records

I read JSON data from Kafka and try to process it with the Flink Table API.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
tEnv.executeSql(
"create table inputTable(" +
"`src_ip` STRING," +
"`src_port` STRING," +
"`bytes_from_src` BIGINT," +
"`pkts_from_src` BIGINT," +
"`ts` TIMESTAMP(2) METADATA FROM 'timestamp'," +
"WATERMARK FOR ts AS ts" +
") WITH (" +
"'connector' = 'kafka'," +
"'topic' = 'test'," +
"'properties.bootstrap.servers' = 'localhost:9092'," +
"'properties.group.id' = 'testGroup'," +
"'scan.startup.mode' = 'earliest-offset'," +
"'format' = 'json'," +
"'json.fail-on-missing-field' = 'true'," +
"'json.ignore-parse-errors' = 'false'" +
")");
Table inputTable = tEnv.from("inputTable");
inputTable.printSchema();
inputTable.execute().print();
Table windowedTable = inputTable
.window(Tumble.over(lit(5).seconds()).on($("ts")).as("w"))
.groupBy($("w"), $("src_ip"))
.select($("w").start().as("window_start"),
$("src_ip"),
$("src_ip").count().as("src_ip_count"),
$("bytes_from_src").avg().as("bytes_from_src_mean")
);
windowedTable.execute().print();
There are 4 records in Kafka. The Flink program prints the schema info and the contents of inputTable as follows:
Connected to the target VM, address: '127.0.0.1:62348', transport: 'socket'
(
`src_ip` STRING,
`src_port` STRING,
`bytes_from_src` BIGINT,
`pkts_from_src` BIGINT,
`ts` TIMESTAMP(2) *ROWTIME* METADATA FROM 'timestamp',
WATERMARK FOR `ts`: TIMESTAMP(2) AS `ts`
)
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| op | src_ip | src_port | bytes_from_src | pkts_from_src | ts |
+----+--------------------------------+--------------------------------+----------------------+----------------------+-------------------------+
| +I | 44.38.5.31 | 53159 | 120 | 3 | 2021-08-13 14:59:56.00 |
| +I | 44.38.132.51 | 39409 | 100 | 2 | 2021-08-13 14:58:11.00 |
| +I | 44.38.4.44 | 56758 | 336 | 6 | 2021-08-13 14:59:14.00 |
| +I | 44.38.5.34 | 40001 | 80 | 2 | 2021-08-13 14:57:04.00 |
After that, nothing is printed and the program does not exit. I am running Flink from within IDEA. At this point it is a black box: there is no output, and I do not know how to trace a Flink program.
If I comment out the line inputTable.execute().print();, the schema info is printed, but nothing after that, and the program still does not exit.
The Flink version is 1.14.2.
I believe those records are being processed, and are being added to the window. But event time windows are triggered by watermarks, and the watermark isn't becoming large enough to trigger the window. To get this to work you need to process an event with a timestamp past the end of the window -- i.e., 2021-08-13 15:00:00.00 or larger.
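To make the window arithmetic concrete, here is a minimal Python sketch (not Flink code) of the window that holds the latest record. It assumes windows are aligned to the epoch, takes the watermark to be the largest timestamp seen so far, and treats a window as closed once the watermark reaches its end. All three are simplifications, and the names window_bounds and window_closed are made up for illustration.

```python
from datetime import datetime, timedelta

WINDOW_SIZE = timedelta(seconds=5)
EPOCH = datetime(1970, 1, 1)

def window_bounds(ts, size=WINDOW_SIZE):
    """Return (start, end) of the epoch-aligned tumbling window containing ts."""
    start = ts - (ts - EPOCH) % size
    return start, start + size

# Timestamps of the four records printed in the question.
events = [
    datetime(2021, 8, 13, 14, 59, 56),
    datetime(2021, 8, 13, 14, 58, 11),
    datetime(2021, 8, 13, 14, 59, 14),
    datetime(2021, 8, 13, 14, 57, 4),
]

# Simplification: the watermark is the largest event timestamp seen so far.
watermark = max(events)
start, end = window_bounds(watermark)
window_closed = watermark >= end

print(start, end, window_closed)
# 2021-08-13 14:59:55 2021-08-13 15:00:00 False
```

Under these assumptions the window [14:59:55, 15:00:00) stays open until some event carries a timestamp of 15:00:00 or later.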
For debugging, the Flink web dashboard is helpful in situations like this. You can see whether events are being processed, examine the watermarks, etc. See "Flink webui when running from IDE" for help in setting it up.

Create a list of lists from two columns in a data frame - Scala

I have a data frame where passengerId and path are Strings. The path represents the flight path of the passenger, so passenger 10096 started in country CO and traveled to country BM. I need to find the longest run of flights each passenger has taken without traveling to the UK.
+-----------+--------------------+
|passengerId| path|
+-----------+--------------------+
| 10096| co,bm|
| 10351| pk,uk|
| 10436| co,co,cn,tj,us,ir|
| 1090| dk,tj,jo,jo,ch,cn|
| 11078| pk,no,fr,no|
| 11332|sg,cn,co,bm,sg,jo...|
| 11563|us,sg,th,cn,il,uk...|
| 1159| ca,cl,il,sg,il|
| 11722| dk,dk,pk,sg,cn|
| 11888|au,se,ca,tj,th,be...|
| 12394| dk,nl,th|
| 12529| no,be,au|
| 12847| cn,cg|
| 13192| cn,tk,cg,uk,uk|
| 13282| co,us,iq,iq|
| 13442| cn,pk,jo,us,ch,cg|
| 13610| be,ar,tj,no,ch,no|
| 13772| be,at,iq|
| 13865| be,th,cn,il|
| 14157| sg,dk|
+-----------+--------------------+
I need to get it like this.
val data = List(
(1,List("UK","IR","AT","UK","CH","PK")),
(2,List("CG","IR")),
(3,List("CG","IR","SG","BE","UK")),
(4,List("CG","IR","NO","UK","SG","UK","IR","TJ","AT")),
(5,List("CG","IR"))
)
I'm trying to use this solution but I can't make this list of lists. It also seems like the input used in the solution has each country code as a separate item in the list, while my path column has the country codes listed as a single element to describe the flight path.
If the goal is just to generate the list of destinations from a string, you can simply use split:
df.withColumn("path", split('path, ","))
If the goal is to compute the maximum number of steps without going to the UK, you could do something like this:
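import org.apache.spark.sql.functions._ // plus spark.implicits._ for the 'path syntax
df
// split the string on 'uk' and generate one row per sub journey
.withColumn("path", explode(split('path, ",?uk,?")))
// compute the size of each sub journey (an empty sub journey counts as 0, not 1)
.withColumn("path_size", when('path === "", 0).otherwise(size(split('path, ","))))
// retrieve the longest one
.groupBy("passengerId")
.agg(max('path_size) as "max_path_size")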
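The same idea can be checked on a few rows without Spark. This is a minimal Python sketch, assuming the path is a comma-separated string of country codes as in the question; the function name longest_run_without is made up for illustration.

```python
def longest_run_without(path, banned="uk"):
    """Length of the longest run of consecutive codes that are not `banned`."""
    longest = current = 0
    for code in path.split(","):
        code = code.strip().lower()
        if not code:
            continue  # ignore empty entries such as an empty path
        if code == banned:
            current = 0  # a UK stop ends the current run
        else:
            current += 1
            longest = max(longest, current)
    return longest

# Three rows taken from the data frame shown in the question.
paths = {"10096": "co,bm", "10351": "pk,uk", "13192": "cn,tk,cg,uk,uk"}
result = {pid: longest_run_without(p) for pid, p in paths.items()}
print(result)  # {'10096': 2, '10351': 1, '13192': 3}
```

Like the Spark version, this counts the countries in each UK-free stretch rather than the hops between them; subtract one per stretch if hops are what you need.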

Calculate total number of orders in different statuses

I want to create a simple dashboard that shows the number of orders in each status. The statuses can be New/Cancelled/Finished/etc.
Where should I implement these criteria? If I add a filter in the Cube Browser, it applies to the whole dashboard. Should I do that in a KPI? Or should I add a calculated column with 1/0 values?
My expected output is something like:
--------------------------------------
| Total | New | Finished | Cancelled |
--------------------------------------
| 1000 | 100 | 800 | 100 |
--------------------------------------
I'd use measures for that, something like:
CountTotal = COUNT('Orders'[OrderID])
CountNew = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "New")
CountFinished = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Finished")
CountCancelled = CALCULATE(COUNT('Orders'[OrderID]), 'Orders'[Status] = "Cancelled")

Import .csv into MongoDB creating arrays with objects

I am starting to familiarize myself with MongoDB and now I have to import a relatively large .csv file into a Mongo collection. So far I have been able to import the .csv file without big problems (either with mongoimport or through Studio 3T). However, I would like to create some arrays for each specific document. An example of my data may make this clearer.
worker_id | gender | birth | employer_id | start | end | type |
---------------------------------------------------------------------------
CAD1213 | M | 1990 | WRF1119 | 15jun2018 | 31dec2018 | 11.1.1 |
CAD1213 | M | 1990 | WAC1134 | 1jan2019 | 31dec2019 | 12.1.1 |
CAF1456 | F | 1972 | WAC1134 | 15aug1988 | 17sep2016 | 55.0.1 |
CAR5567 | F | 1991 | WER9280 | 15jun2018 | 31dec2018 | 1.1.1 |
CAR5578 | F | 1989 | WHI1176 | 15jun2011 | 1nov2011 | 12.2.5 |
CAR5578 | F | 1989 | WHI1176 | 2nov2011 | 31dec2012 | 12.2.5 |
This is just an example with a few fields (columns). I have data for job contracts where: worker_id is the identification of the worker, gender the sex, birth the birth year of the worker, employer_id the identification of each employer, start and end the beginning and the end of each specific work contract, respectively, and type the type of work contract. The whole database obviously has more fields.
With the classic .csv import, each line becomes a separate document with its own _id, and each column becomes a field. However, I would like to use the value of worker_id as '_id' (although not necessarily). As you can see, some worker_id values are repeated in the column (in a file of over 17 million rows you can imagine how many duplicates there are), so it is impossible to select this field as _id.
Here is what I am struggling to obtain. I would like to create an array (for example called contracts) that contains the information for each specific contract as objects, in this case the columns employer_id, start, end, and type. Some workers would have just one element in the array (CAF1456 and CAR5567 in the example), while others would have more than one (CAD1213 and CAR5578 in the example). A hypothetical document would therefore look like the following:
_id: "CAD1213"
gender: "M"
birth: "1990"
contracts: Array
0: Object
employer_id: "WRF1119"
start: "15jun2018"
end: "31dec2018"
type: "11.1.1"
1: Object
employer_id: "WAC1134"
start: "1jan2019"
end: "31dec2019"
type: "12.1.1"
_id: "CAF1456"
gender: "F"
birth: "1972"
contracts: Array
0: Object
employer_id: "WAC1134"
start: "15aug1988"
end: "17sep2016"
type: "55.0.1"
_id: "CAR5567"
gender: "F"
birth: "1991"
contracts: Array
0: Object
employer_id: "WER9280"
start: "15jun2018"
end: "31dec2018"
type: "1.1.1"
_id: "CAR5578"
gender: "F"
birth: "1989"
contracts: Array
0: Object
employer_id: "WHI1176"
start: "15jun2011"
end: "1nov2011"
type: "12.2.5"
1: Object
employer_id: "WHI1176"
start: "2nov2011"
end: "31dec2012"
type: "12.2.5"
As far as I know, this cannot be done directly with mongoimport, so it has to be done after importing the whole .csv file (I guess). I hope someone can provide some hints, suggestions or links where I can find help in achieving this structure. Moreover, if another document structure would be more suitable for my example, I am eager to know.
Thank you in advance.
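One possible approach is to group the rows by worker_id before they reach MongoDB. The sketch below uses only the Python standard library and assumes the file is comma-separated with the column names shown in the question. The names SAMPLE, CONTRACT_FIELDS and group_contracts are made up for illustration, and the actual insert into MongoDB (for example through a driver) is left out.

```python
import csv
import io

# Assumption: a comma-separated file with the columns from the question.
SAMPLE = """worker_id,gender,birth,employer_id,start,end,type
CAD1213,M,1990,WRF1119,15jun2018,31dec2018,11.1.1
CAD1213,M,1990,WAC1134,1jan2019,31dec2019,12.1.1
CAF1456,F,1972,WAC1134,15aug1988,17sep2016,55.0.1
CAR5567,F,1991,WER9280,15jun2018,31dec2018,1.1.1
CAR5578,F,1989,WHI1176,15jun2011,1nov2011,12.2.5
CAR5578,F,1989,WHI1176,2nov2011,31dec2012,12.2.5
"""

CONTRACT_FIELDS = ("employer_id", "start", "end", "type")

def group_contracts(rows):
    """Build one document per worker_id with a `contracts` array."""
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["worker_id"], {
            "_id": row["worker_id"],
            "gender": row["gender"],
            "birth": row["birth"],
            "contracts": [],
        })
        doc["contracts"].append({f: row[f] for f in CONTRACT_FIELDS})
    return list(docs.values())

docs = group_contracts(csv.DictReader(io.StringIO(SAMPLE)))
for doc in docs:
    print(doc["_id"], len(doc["contracts"]))
```

This keeps every worker in memory, which may not be practical for 17 million rows. If the file is sorted by worker_id, the same grouping can be done one worker at a time and written out in batches.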

Subset with loop over an array of strings

I have my code like this:
for (i in 1:b) {
carteraR[[i]]=subset(carteraR[[i]],RUN.FONDO=="8026" | RUN.FONDO=="8036" | RUN.FONDO=="8048" | RUN.FONDO=="8057" | RUN.FONDO=="8059" | RUN.FONDO=="8072" | RUN.FONDO=="8094" |
RUN.FONDO=="8107" | RUN.FONDO=="8110" | RUN.FONDO=="8115" | RUN.FONDO=="8130" | RUN.FONDO=="8230" | RUN.FONDO=="8248" | RUN.FONDO=="8257" | RUN.FONDO=="8319")
}
Where b=length(carteraR), and class(carteraR[[i]])=data.frame. RUN.FONDO is one of the column names of these data frames. This code works fine, but I want to shorten it.
What I want is something like:
for (i in 1:b) {
for (j in 1:length(A)){
carteraR[[i]]=subset(carteraR[[i]],RUN.FONDO==A[j])
}
}
And where A is a character vector: "8026" "8036" "8048" "8057" ... "8319"
What should the code look like?
Thanks
Like this:
carteraR <- lapply(carteraR, subset, RUN.FONDO %in% A)
Just be aware that using subset programmatically can be risky (see "Why is `[` better than `subset`?"). This usage is fine, though.
