Optimizing Query in Power Query M - sql-server

I am trying to get some statistics about my DB, and my code seems to work perfectly, but I have a really big DB, and when I run this script on it I always end up with a timeout failure. It doesn't matter whether I remove some unnecessary rows or not, I still get the same error. The script is the following:
let
    Source = Sql.Database("DBTEST", "DB_TST", [CreateNavigationProperties = false]),
    #"Filtered Rows" = Table.SelectRows(Source, each ([Kind] = "Table")),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Profile",
        each Table.Profile([Data])),
    #"Expanded Profile" = Table.ExpandTableColumn(#"Added Custom", "Profile",
        {"Column", "Min", "Max", "Average", "StandardDeviation", "Count",
         "NullCount", "DistinctCount"},
        {"Column", "Min", "Max", "Average", "StandardDeviation", "Count",
         "NullCount", "DistinctCount"}),
    #"Entfernte Spalten" = Table.RemoveColumns(#"Expanded Profile", {"Data"}),
    #"Gefilterte Zeilen" = Table.SelectRows(#"Entfernte Spalten", each true)
in
    #"Gefilterte Zeilen"

I do it this way, passing a native query together with an explicit CommandTimeout (#duration(0, 5, 0, 0) is five hours):
Sql.Database("Server", "Database", [Query="Select * From ...", CommandTimeout=#duration(0, 5, 0, 0)])


FQL Fauna Function - Query Indexed Document Data Given Conditions

I have a collection of shifts for employees; the data (trimmed of some details, but this is the structure for start/end times) looks like this:
{
"ref": Ref(Collection("shifts"), "123451234512345123"),
"ts": 1234567891012345,
"data": {
"id": 1,
"start": {
"time": 1659279600000
},
"end": {
"time": 1659283200000
},
"location": "12341234-abcd-1234-cdef-123412341234"
}
}
I have an index, shifts_by_location, that returns an array of shifts in this format: ["id", "startTime", "endTime"] ...
Now I want to create a user-defined function that filters these results so that the "start" and "end" times fall between given dayStart and dayEnd times, to get shifts by date. Hoping to get some FQL assistance here, thanks!
Here's my broken attempt:
Query(
Lambda(
["location_id", "dayStart", "dayEnd"], // example: ["124-abd-134", 165996000, 165922000]
Map(
Paginate(Match(Index("shifts_by_location"), Var("location_id"))),
Lambda(["id", "startTime", "endTime"],
If(
And(
GTE(Var("startTime"), Var("dayStart")), // GOAL -> shift starts after 8am on given day
LTE(Var("endTime"), Var("dayEnd")) // GOAL -> shift ends before 5pm on given day
),
Get(Var("shift")) // GOAL -> return shift for given day
)
)
)
)
)
I found a working solution with this query; the biggest fix was really just to use Filter instead of Map, which seems obvious in hindsight:
Query(
Lambda(
["location_id", "dayStart", "dayEnd"],
Filter(
Paginate(Match(Index("shifts_by_location"), Var("location_id"))),
Lambda(
["start", "end", "id"],
And(GTE(Var("start"), Var("dayStart")), LTE(Var("end"), Var("dayEnd")))
)
)
)
)
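Once this is saved as a UDF, it can be invoked with Call; a sketch (the function name here is hypothetical, and the arguments are the sample values from the comment in the broken attempt):
Call(Function("shifts_by_location_and_day"), "124-abd-134", 165996000, 165922000)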

In a Ruby on Rails app, I'm trying to loop through an array within a hash within an array. Why am I getting a "syntax error" message?

I have a Ruby on Rails application to enter results and create a league table for a football competition.
I'm trying to input some results by creating records in the database through Heroku, and I get error messages.
The application isn't perfectly designed: to enter the results, I have to create the fixtures and enter the score for each team. Then, independently I have to record each goal scorer, creating a record for each goal which is either associated with an existing player or requires me to firstly create a new player and then create the goal.
When I ran the code below in the Heroku console, I got this error:
syntax error, unexpected ':', expecting keyword_end
Maybe I'm missing something simple about looping through an array within a hash?
Thank you for any advice!
coalition = Team.find_by(name: "Coalition")
moscow_rebels = Team.find_by(name: "Moscow Rebels")
red_star = Team.find_by(name: "Red Star")
unsanctionables = Team.find_by(name: "The Unsanctionables")
cavalry = Team.find_by(name: "Cavalry")
galactics = Team.find_by(name: "The Galactics")
happy_sundays = Team.find_by(name: "Happy Sundays")
hardmen = Team.find_by(name: "Hardmen")
international = Team.find_by(name: "International")
evropa = Venue.find_by(name: "Evropa")
s28 = Season.find_by(number: 28)
start_time = DateTime.new(2020,9,6,11,0,0,'+03:00')
scheduled_matches_1 =
[
{team_1: cavalry, team_1_goals: 1, team_1_scorers: ["Minaev"], team_2_goals: 6, team_2_scorers: ["Kovalev", "Kovalev", "Kovalev", "Thomas", "Thomas", "Grivachev"], team_2: coalition, time: start_time, venue: evropa, season: s28},
{team_1: hardmen, team_1_goals: 4, team_1_scorers: ["Jones", "Jones", "Jones", "Fusi"], team_2_goals: 2, team_2_scorers: ["Kazamula", "Ario"], team_2: galactics, time: start_time + 1.hour, venue: evropa, season: s28},
{team_1: international, team_1_goals: 9, team_1_scorers: ["Kimonnen", "Kimonnen", "Kimonnen", "Burya", "Burya", "Zakharyaev", "Zakharyaev", "Lavruk", "Rihter"], team_2_goals: 0, team_2_scorers: [], team_2: happy_sundays, time: start_time+2.hours, venue: evropa, season: s28}
]
scheduled_matches.each do |match|
new_fixture = Fixture.create(time: match[:time], venue: match[:venue], season: match[:season])
tf1 = TeamFixture.create(team: match[:team_1], fixture: new_fixture)
tf2 = TeamFixture.create(team: match[:team_2], fixture: new_fixture)
ts1 = TeamScore.create(team_fixture: tf1, total_goals: match{:team_1_goals})
ts2 = TeamScore.create(team_fixture: tf2, total_goals: match{:team_2_goals})
match[:team_1_scorers].each do |scorer|
if Player.exists?(team: tf1.team, last_name: scorer)
Goal.create(team_score: ts1, player: Player.find_by(last_name: scorer))
else
new_player = Player.create(team: tf1.team, last_name: scorer)
Goal.create(team_score: ts1, player: new_player)
end
end
match[:team_2_scorers].each do |scorer_2|
if Player.exists?(team: tf2.team, last_name: scorer_2)
Goal.create(team_score: ts2, player: Player.find_by(last_name: scorer_2))
else
new_player = Player.create(team: tf2.team, last_name: scorer_2)
Goal.create(team_score: ts2, player: new_player)
end
end
end
It looks like you are using braces when you meant to use brackets to access the hash. Below is one of the issues, but the same issue is in ts2.
ts1 = TeamScore.create(team_fixture: tf1, total_goals: match{:team_1_goals})
should be match[:team_1_goals]
ts1 = TeamScore.create(team_fixture: tf1, total_goals: match[:team_1_goals])
It may be because you have scheduled_matches_1 at the top and scheduled_matches.each do... further down.
But the real issue here is that your variable names match the data content, rather than being used to hold the content. If a new team joins your league, you have to change the code. Next week, you are going to have to change the hard-coded date value. Your scheduled_matches_1 data structure includes the ActiveRecord objects returned by the first set of Team.find_by calls. It would be easier to fetch these objects from the database inside your loops, and just hold the team name as a string in the hash.
There is some duplication too. Consider that each fixture has a home team and an away team. Each team has a name, and an array (possibly empty) of the players who scored. We don't need the number of goals; we can just count the number of players in the 'scorers' array. The other attributes, like the location and season belong to the fixture, not the team. So your hash might be better as
{
"fixtures": [
{
"home": {
"name": "Cavalry",
"scorers": [
"Minaev"
]
},
"away": {
"name": "Coalition",
"scorers": [
"Kovalev",
"Kovalev",
"Kovalev",
"Thomas",
"Thomas",
"Grivachev"
]
},
"venue": "Evropa",
"season": "s28"
}
]
}
because then you can create a reusable method to process each team. And maybe create a new method that returns the player (which it either finds or creates) which can be called by the loop that adds the goals.
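A minimal sketch of those two helpers, reusing the models above (find_or_create_by is standard ActiveRecord; the method names are only suggestions):
# Returns the existing player for this team and surname, or creates one.
def find_or_create_player(team, last_name)
  Player.find_or_create_by(team: team, last_name: last_name)
end

# Records one Goal per scorer against the given TeamScore.
def record_goals(team_score, team, scorers)
  scorers.each do |last_name|
    Goal.create(team_score: team_score, player: find_or_create_player(team, last_name))
  end
end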
Also, as it stands, I'm not sure the code can handle 'own goals', either. Perhaps something for a future iteration :)

How to flatten an array in a nested json in aws glue using pyspark?

I am trying to flatten a JSON file to be able to load it into PostgreSQL all in AWS Glue. I am using PySpark. Using a crawler I crawl the S3 JSON and produce a table. I then use an ETL Glue script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df2.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database, which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, so 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first, due to the structure of the data. Specifically, the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first, then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I also tried to upload a classifier so that the data would flatten in the crawl itself, but AWS confirmed this wouldn't work.
The structure of the original JSON file that I am trying to normalise is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    # Explode every array column first (explode_outer keeps rows whose array is null)
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col_name in array_cols:
        nested_df = nested_df.withColumn(col_name, F.explode_outer(nested_df[col_name]))
    # If no struct columns are left, we are done
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    # Promote each struct field to a top-level column named parent_child
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    # Recurse: exploding/flattening may have exposed new nested columns
    return flatten_df(flat_df)

df = flatten_df(df)
It will replace all dots with underscores. Note that it uses explode_outer and not explode, so that rows whose array is null are kept. This function is available in Spark v2.4+ only.
Also remember, exploding an array will add more rows and the overall row count will increase, while flattening structs will increase the column count. In short, your original df will explode horizontally and vertically. It may slow down processing the data later.
Therefore my recommendation would be to identify the feature-related data, store only that data in PostgreSQL, and keep the original JSON files in S3.
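As a tiny standalone illustration of the explode_outer behaviour (made-up sample data, not part of the Glue job):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "tags"])

# id=1 produces two rows; id=2 keeps one row with tags = null,
# whereas plain explode would drop that row entirely.
df.withColumn("tags", F.explode_outer("tags")).show()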
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example:
Nested JSON:
{
"player": {
"username": "user1",
"characteristics": {
"race": "Human",
"class": "Warlock",
"subclass": "Dawnblade",
"power": 300,
"playercountry": "USA"
},
"arsenal": {
"kinetic": {
"name": "Sweet Business",
"type": "Auto Rifle",
"power": 300,
"element": "Kinetic"
},
"energy": {
"name": "MIDA Mini-Tool",
"type": "Submachine Gun",
"power": 300,
"element": "Solar"
},
"power": {
"name": "Play of the Game",
"type": "Grenade Launcher",
"power": 300,
"element": "Arc"
}
},
"armor": {
"head": "Eye of Another World",
"arms": "Philomath Gloves",
"chest": "Philomath Robes",
"leg": "Philomath Boots",
"classitem": "Philomath Bond"
},
"location": {
"map": "Titan",
"waypoint": "The Rig"
}
}
}
Flattened JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus in your case, request.data is already a new column flattened out from the request column, and its type is interpreted as bigint by Spark.
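Because the flattened names contain literal periods, they must be backtick-escaped when referenced from Spark, as the question's script already does for request.data. A sketch, assuming flat_df holds the relationalized result converted to a dataframe and using the hypothetical columns above:
from pyspark.sql import functions as F

flat_df.select(F.col("`player.characteristics.power`"),
               F.col("`player.location.map`")).show()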
Reference: Simplify querying nested JSON with the AWS Glue Relationalize transform

Why can't I retrieve the master appointment of a series via `AppointmentCalendar.FindAppointmentsAsync`?

I'm retrieving multiple appointments via AppointmentCalendar.FindAppointmentsAsync. I'm evaluating the Recurrence.RecurrenceType and noticed an unexpected value of 1 for master appointments of a series. I expect the Recurrence.RecurrenceType to be 0 (Master) but instead it is 1 (Instance).
(Note: I added AppointmentProperties.Recurrence to the FindAppointmentsOptions.FetchProperties that is passed to FindAppointmentsAsync, so the Recurrence data should be fetched properly.)
To double-check, I retrieved the respective master appointment via GetAppointmentAsync (instead of FindAppointmentsAsync) using its LocalId, and here the RecurrenceType is correctly set to 0.
Here is demo output for a test appointment series:
Data gotten by FindAppointmentsAsync (Instance??):
"Recurrence": {
"Unit": 0,
"Occurrences": 16,
"Month": 1,
"Interval": 1,
"DaysOfWeek": 0,
"Day": 1,
"WeekOfMonth": 0,
"Until": "2016-09-29T02:00:00+02:00",
"TimeZone": "Europe/Budapest",
"RecurrenceType": 1,
"CalendarIdentifier": "GregorianCalendar"
},
"StartTime": "2016-09-14T19:00:00+02:00",
"OriginalStartTime": "2016-09-14T19:00:00+02:00",
Data gotten by GetAppointmentAsync for the same appointment (Master):
"Recurrence": {
"Unit": 0,
"Occurrences": 16,
"Month": 1,
"Interval": 1,
"DaysOfWeek": 0,
"Day": 1,
"WeekOfMonth": 0,
"Until": "2016-09-29T02:00:00+02:00",
"TimeZone": "Europe/Budapest",
"RecurrenceType": 0,
"CalendarIdentifier": "GregorianCalendar"
},
"StartTime": "2016-09-14T19:00:00+02:00",
"OriginalStartTime": null,
Notice the difference in RecurrenceType. Also note that OriginalStartTime is set to null for the master gotten by GetAppointmentAsync but has a value for the appointment gotten by FindAppointmentsAsync.
You can also see that the StartTime for the master appointment is the start time set for the alleged Instance (which in reality is the master).
Shouldn't FindAppointmentsAsync return a master as the first element of a series, instead of an instance?
(SDK: 10.0.14393.0, Anniversary)
Code to explicitly find such a master/instance situation for a given calendar:
var appointmentsCurrent = await calendar.FindAppointmentsAsync(DateTimeOffset.Now, TimeSpan.FromDays(365), findAppointmentOptions);
foreach(var a in appointmentsCurrent)
{
var a2 = await calendar.GetAppointmentAsync(a.LocalId);
if (a2.Recurrence?.RecurrenceType == RecurrenceType.Master &&
a2.StartTime == a.StartTime &&
a.Recurrence?.RecurrenceType == RecurrenceType.Instance &&
a.OriginalStartTime == a2.StartTime)
{
Debug.WriteLine("Gotcha!");
}
}
I tested the above code on my side. If you get the count of the appointments returned by FindAppointmentsAsync with var count = appointmentsCurrent.Count;, you will find that it returns the count of the appointment instances, not the count of master appointments. So the FindAppointmentsAsync method gets all instances of the appointments, not the master appointments. This is why the RecurrenceType is Instance.
It seems we can get a master appointment with the GetAppointmentAsync method, as you mentioned above, so I suppose this may not block you.
If you think this is not a good design for this API, or you require an API for finding all the master appointments in one calendar, you can submit your idea via the Windows 10 Feedback Hub or the UserVoice site.
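A possible workaround, sketched from the observation above: resolve each found instance to its series master via GetAppointmentAsync and de-duplicate by LocalId (the dictionary is just one way to collect them):
// Collect the distinct masters behind the instances found in the window.
var masters = new Dictionary<string, Appointment>();
foreach (var instance in appointmentsCurrent)
{
    if (instance.Recurrence == null)
        continue; // not part of a series

    // As noted above, GetAppointmentAsync(LocalId) returns the series master.
    var master = await calendar.GetAppointmentAsync(instance.LocalId);
    masters[master.LocalId] = master;
}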

Lucene Solr - how to know numCount of each word in query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from each query.
But this method takes too much time.
So, is there a way to get the numCount of each word with one query?
Any help appreciated!
In this case, functionQueries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example syntax: termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of totaltermfreq. Example syntax: ttf(text,'memory')
The following query, for instance:
q=*:*&fl=cntOnSummary:termfreq(summary,'hello') cntOnTitle:termfreq(title,'entry') cntOnSource:termfreq(source,'activities')&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
]
Please note that while this works well on single-valued fields, it seems that for multivalued fields the functions consider just the first entry; for instance, in the example above, termfreq(source,'activities') returns 1 instead of 2.
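Applied to the original question, a single request along these lines (a sketch only, using the name field from the question's queries) would return all five index-wide counts at once:
q=*:*&rows=1&fl=cat:ttf(name,'cat'),dog:ttf(name,'dog'),fish:ttf(name,'fish'),bird:ttf(name,'bird'),animals:ttf(name,'animals')&wt=json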
