All,
I am new to the graph database area and want to know whether this type of example is applicable to a graph database.
Say I am looking at a baseball game. When each player goes to bat, there are 3 possible outcomes: hit, strikeout, or walk.
For each batter, and across the whole baseball season, what I want to figure out is the counts of the sequences of outcomes.
For example, for batters that went to the plate n times, how many had a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit), and how many of those batters repeated the same sequence, indexed by time? To further explain, time would allow me to know whether a particular sequence (e.g., hit/walk/strikeout or hit/hit/hit/hit) occurred during the beginning of the season, in the middle, or in the latter half.
For a key-value type database, the raw data would look as follows:
Batter Time Game Event Bat
------- ----- ---- --------- ---
Charles April 1 Hit 1
Charles April 1 strikeout 2
Charles April 1 Walk 3
Doug April 1 Walk 1
Doug April 1 Hit 2
Doug April 1 strikeout 3
Charles April 2 strikeout 1
Charles April 2 strikeout 2
Doug May 5 Hit 1
Doug May 5 Hit 2
Doug May 5 Hit 3
Doug May 5 Hit 4
Hence, my output would appear as follows:
Sequence Freq Unique Batters Time
----------------------- ---- -------------- ------
hit 5000 600 April
walk/strikeout 3000 350 April
strikeout/strikeout/hit 2000 175 April
hit/hit/hit/hit/hit 1000 80 April
hit 6000 800 May
walk/strikeout 3500 425 May
strikeout/strikeout/hit 2750 225 May
hit/hit/hit/hit/hit 1250 120 May
. . . .
. . . .
. . . .
. . . .
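To make the aggregation concrete (not as a proposed solution, just to restate the computation), here is roughly what I mean in plain SQL; MySQL-style GROUP_CONCAT is assumed, and the table/column names below are made up:
-- Hypothetical raw table at_bats(batter, game_month, game_no, event, bat_no)
SELECT per_game.seq                    AS outcome_sequence,
       COUNT(*)                        AS freq,
       COUNT(DISTINCT per_game.batter) AS unique_batters,
       per_game.game_month             AS time_period
FROM (
    SELECT batter,
           game_month,
           game_no,
           GROUP_CONCAT(event ORDER BY bat_no SEPARATOR '/') AS seq
    FROM at_bats
    GROUP BY batter, game_month, game_no
) AS per_game
GROUP BY per_game.seq, per_game.game_month;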
If this is feasible for a graph database, would it also scale? What if instead of 3 possible outcomes for a batter, there were 10,000 potential outcomes with 10,000,000 batters?
Moreover, the 10,000 unique outcomes would be sequenced in a combinatorial setting (e.g., 10,000 choose 2, 10,000 choose 3, etc.).
My question then is: if a graph database is appropriate, how would you propose setting up a solution?
Much thanks in advance.
Activity       Employee  Week of May 17  Week of May 24
-------------  --------  --------------  --------------
Inbox          Alice     3               4
Inbox          Jane      5               8
Alpha Project  Alice     10              3
Beta Project   Francis   7               5
Chi Project    Jane      4               3
I've attempted to use conditional formatting, arrays, and VLOOKUPs, but I am unable to cleanly get the following end result.
The end result is to flag anybody working > 10 hours for a given week.
Table is above.
End result should change the color of a cell titled "Alice" outside of this table because Alice worked 13 hours during Week of May 17.
End result should change the color of a cell titled "Jane" outside of this table because Jane worked 11 hours during Week of May 24.
Francis worked 10 hours or below, so no action is needed.
Any help on this is much appreciated --
Create a condition with the following formula:
=SUMIF(B2:B6,F2,C2:C6)>10
Where B2:B6 is the column of Employee names, F2 is the cell you want coloured, and C2:C6 is the column of hours for the week of May 17 (for the week of May 24, use D2:D6 instead).
I got into databases and normalization. I am still trying to understand normalization and I am confused about its usage. I'll try to explain it with this example.
Every day I collect data which would look like this in a single table:
TABLE: CAR_ALL
ID   DATE        CAR       LOCATION   FUEL  FUEL_USAGE  MILES   BATTERY
---  ----------  --------  ---------  ----  ----------  ------  -------
123  01.01.2021  Toyota    New York   40.3  3.6         79321   78
520  01.01.2021  BMW       Frankfurt  34.2  4.3         123232  30
934  01.01.2021  Mercedes  London     12.7  4.7         4321    89
123  05.01.2021  Toyota    New York   34.5  3.3         79515   77
520  05.01.2021  BMW       Frankfurt  20.1  4.6         123489  29
934  05.01.2021  Mercedes  London     43.7  5.0         4400    89
In this example I get data for thousands of cars every day. ID, CAR, and LOCATION never change. All the other data can have different values daily. If I understood correctly, normalizing would make it look like this:
TABLE: CAR_CONSTANT
ID   CAR       LOCATION
---  --------  ---------
123  Toyota    New York
520  BMW       Frankfurt
934  Mercedes  London
TABLE: CAR_MEASUREMENT
GUID  ID   DATE        FUEL  FUEL_USAGE  MILES   BATTERY
----  ---  ----------  ----  ----------  ------  -------
1     123  01.01.2021  40.3  3.6         79321   78
2     520  01.01.2021  34.2  4.3         123232  30
3     934  01.01.2021  12.7  4.7         4321    89
4     123  05.01.2021  34.5  3.3         79515   77
5     520  05.01.2021  20.1  4.6         123489  29
6     934  05.01.2021  43.7  5.0         4400    89
I have two questions:
1) Does it make sense to create an extra table for DATE?
2) It is possible that new cars will be included in the collected data. For every row I insert into CAR_MEASUREMENT, I would have to check whether the ID is already in CAR_CONSTANT, and if it doesn't exist, I'd have to insert it. But that means I would have to check CAR_CONSTANT thousands of times every day. Wouldn't it be more efficient if I just inserted the whole data as one row into CAR_ALL? I wouldn't have to check CAR_CONSTANT every time.
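For context, I assume the existence check could also be pushed into the insert itself. A rough sketch, assuming PostgreSQL syntax and that ID is the primary key of CAR_CONSTANT (MySQL would use INSERT IGNORE instead):
-- Let the database skip rows whose ID already exists in CAR_CONSTANT.
INSERT INTO car_constant (id, car, location)
VALUES (123, 'Toyota', 'New York')
ON CONFLICT (id) DO NOTHING;
-- Then insert the daily measurement as usual (assuming GUID is auto-generated).
INSERT INTO car_measurement (id, date, fuel, fuel_usage, miles, battery)
VALUES (123, DATE '2021-01-01', 40.3, 3.6, 79321, 78);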
The benefits of normalization depend on your specific use case. I can see both pros and cons to normalizing your schema, but it's impossible to say which is better without more knowledge of your use case.
Pros:
With your schema, normalization could reduce the amount of data consumed by your DB since CAR_MEASUREMENT will probably be much larger than CAR_CONSTANT. This scales up if you are able to factor out additional data into CAR_CONSTANT.
Normalization could also improve data consistency if you ever begin tracking additional fixed data about a car, such as license plate number. You could simply update one row in CAR_CONSTANT instead of potentially thousands of rows in CAR_ALL.
A normalized data structure can make it easier to query data for a specific car. Using a LEFT JOIN, the DBMS can search through the CAR_MEASUREMENT table based on the integer ID column instead of having to compare two string columns (see the query sketch after the cons).
Cons:
As you noted, the normalized form requires an additional lookup and possible insert to CAR_CONSTANT for every addition to CAR_MEASUREMENT. Depending on how fast you are collecting this data, those extra queries could be too much overhead.
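As a sketch of that "query one specific car" case (column names taken from the example tables above):
-- All measurements for one car, matched on the integer ID instead of repeating the strings.
SELECT m.date, m.fuel, m.fuel_usage, m.miles, m.battery
FROM car_constant c
LEFT JOIN car_measurement m ON m.id = c.id
WHERE c.car = 'Toyota'
  AND c.location = 'New York';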
To answer your questions directly:
1) I would not create an extra table for just the date. The date is part of the CAR_MEASUREMENT data and should not be separated. The only exception I can think of is if you will eventually collect measurements that do not contain any car data. In that case, it would make sense to split CAR_MEASUREMENT into separate MEASUREMENT and CAR_DATA tables, with MEASUREMENT containing the date and CAR_DATA containing just the car-specific data.
2) See above. If you have a use case to query data for a specific car, then the normalized form can be more efficient. If not, then the additional INSERT overhead may not be worth it.
New to R and new to this forum; I tried searching, and I hope I don't embarrass myself by failing to identify previous answers.
So I have my data, and I intend to fit some kind of GLMMs in the end, but that's far away in the future; first I'm going to fit some simple GLMs/LMs to learn what I'm doing.
First, about my data:
I have data sampled from 2 "general areas" on opposite sides of the country.
In these general areas there are roughly 50 trakts placed (in a grid, with a random starting point).
Trakts have been revisited each year for a duration of 4 years.
A trakt contains 16 sample plots; I intend to work at the trakt level, so I use the means of the 16 sample plots for each trakt.
2 x 4 x 50 = 400 rows (the actual number is 373 rows after I removed trakts where not enough plots could be sampled due to terrain, etc.).
The data in my Excel file is currently organized like this:
rows = trakts
columns = the measured variables
I have 8-10 columns I want to use.
A short example of how the data looks now:
V1 - predictor, 4 different columns
V2 - response variable = proportional data, 1-4 columns depending on which hypothesis I end up testing
The GLMM in the end would look something like (V2 ~ V1 + V1 + V1, (area, year)).
Area Year Trakt V1 V2
A 2015 1 25.165651 0
A 2015 2 11.16894652 0.1
A 2015 3 18.231 0.16
A 2014 1 3.1222 N/A
A 2014 2 6.1651 0.98
A 2014 3 8.651 1
A 2013 1 6.16416 0.16
B 2015 1 9.12312 0.44
B 2015 2 22.2131 0.17
B 2015 3 12.213 0.76
B 2014 1 1.123132 0.66
B 2014 2 0.000 0.44
B 2014 3 5.213265 0.33
B 2013 1 2.1236 0.268
How should I get started on this?
8 different files?
Nested by trakts (do I start nesting now, or later when I'm doing the GLMMs)?
I load my data into R through the read.table function.
If I run sapply(dataframe,class), V1 and V2 are factors and everything else is integer.
If I run sapply(dataframe,mode), everything is numeric.
So, finally, to my actual problems: I have been trying to do normality tests (only tried Shapiro-Wilk so far), but I keep getting errors that imply my data is not numeric.
Also, when I run a normality test, do I run only one column and evaluate it before moving on to the next column, or should I run several columns? The entire dataset?
Should I, in my case, run independent normality tests for each of my areas and years?
I hope it didn't end up too cluttered.
Best regards
I'm using ReportBuilder 2.0 / SQL Server 2008.
I have a report that uses visibility settings on the row groups which results in some row group headings being hidden, which in turn makes report totals seem incorrect. I can't change the visibility settings (for business reasons); what I'm looking for is a way to test EITHER for hidden items, OR for apparently incorrect totals. Consider the following dataset:
ItemCode SubPhaseCode SubPhase BidItem XTDPrice
1 1 Water Utility 1 5000
2 1 Water Utility 2 4000
3 2 Electrical Utility 3 75,000
4 2 Electrical Utility 3 75,000
5 2 Electrical Utility 3 100000
6 2 Electrical Utility 4 2500
7 2 Electrical Utility 4 2500
8 2 Electrical Utility 4 5064
9 2 Electrical Utility 5 3000
10 2 Electrical Utility 5 3000
11 2 Electrical Utility 5 5796
12 3 Gas Utility 6 60000
13 3 Gas Utility 6 60000
14 3 Gas Utility 6 61547
15 4 Other Utility 7 6000
16 4 Other Utility 7 7000
There are 3 Row Groups on the report: one for SubPhaseCode ("Group1") and two for BidItem ("Group2" and "DetailsGroup"):
Link to Design View Screenshot
The Row Visibility property for Group1 (SubPhaseCode) is:
=IIF(Fields!SubPhaseCode.Value = 3, true, false)
This results in the heading for the SubPhase "Gas" being hidden. This means that, when the report is run, I get something like the following:
Total 475407
Water 9000
-Utility 1 5000
-Utility 2 4000
Electrical 271860
-Utility 3 250000
-Utility 4 10064
-Utility 5 11796
-Utility 6 181547
Other 13000
-Utility 7 13000
The fact that SubPhase 3 ("Gas") is hidden results in 2 apparent errors:
1) The sum for "Electrical" (271860) appears incorrect for the 4 items below it (because there should be another row heading above "Utility 6")
2) The total of 475407 appears incorrect for the 3 groups below it (9000 + 271860 + 13000).
What I am looking for is a way to change the formatting of the headings (especially the Group Headings) if the numbers below them apparently don't add up. I understand how to implement conditional formatting and have done this for the Total. I am unclear how this could be implemented for the Row Group.
I would basically need some kind of a test, for each Row Heading, to see if the following heading would be hidden, according to the rules. This sounds to me like a "NEXT" function, which I know doesn't exist.
Other searches have indicated that I might need to add the desired data to the dataset or modify the underlying SP. Just wondering if there are any simpler solutions.
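For reference, I assume the "add it to the dataset" route would just mean exposing the visibility rule as a column the report expressions can test, something like the following (the source table name is made up):
-- Hypothetical dataset query that exposes the visibility rule as a flag column.
SELECT ItemCode,
       SubPhaseCode,
       SubPhase,
       BidItem,
       XTDPrice,
       CASE WHEN SubPhaseCode = 3 THEN 1 ELSE 0 END AS IsHiddenSubPhase
FROM dbo.BidItems;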
Thanks much for the help!
I'd avoid summing the hidden SubPhase's values in the group SUM().
Try:
=SUM(IIF(Fields!SubPhaseCode.Value=3,0,Fields!XTDPrice.Value))
Let me know if this helps.
This question asks how to select a user's rank by his id.
id name points
1 john 4635
3 tom 7364
4 bob 234
6 harry 9857
The accepted answer is
SELECT uo.*,
(
SELECT COUNT(*)
FROM users ui
WHERE (ui.points, ui.id) >= (uo.points, uo.id)
) AS rank
FROM users uo
WHERE id = #id
which makes sense. I'd like to understand what the performance tradeoffs would be between this approach, modifying the db structure to store a calculated rank (I guess that would require massive changes every time there's a rank change), or any other approaches that I'm too newb to think of. I'm a db noob.
The performance tradeoff would basically be what you described:
If you modified the structure to store a rank, queries would be very, very simple and fast. However, this would require some overhead any time "points" changed, as you'd have to verify that the rank hasn't changed. If the ranking had changed, you'd have to do multiple updates.
This causes more work (with the potential for bugs) at every update/insert. The tradeoff is very fast reads. If your typical usage is very few modifications compared to millions of reads, AND you found this query to be a bottleneck, it might be worth considering reworking this. However, I would avoid the added complexity and maintainability headaches unless you truly found this to be a problem, since the current solution requires less storage and is very flexible.
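For illustration only, the reworked structure might look something like this (MySQL syntax assumed; the column name is made up, and the recompute step would have to run whenever points change):
-- Persist the rank in a new column.
ALTER TABLE users ADD COLUMN cached_rank INT;
-- Recompute ranks after points change (wrapping users in a derived table avoids
-- MySQL's restriction on updating and selecting from the same table).
UPDATE users u
JOIN (
    SELECT uo.id,
           (SELECT COUNT(*)
              FROM users ui
             WHERE (ui.points, ui.id) >= (uo.points, uo.id)) AS r
    FROM users uo
) ranked ON ranked.id = u.id
SET u.cached_rank = ranked.r;
-- Reads then become a trivial primary-key lookup.
SELECT cached_rank FROM users WHERE id = #id;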
The link you reference is a MySQL question. If the original database had been Oracle, the accepted answer would be to use an analytic function, which scales very nicely:
SQL> select id, name, points from users order by id
2 /
ID NAME POINTS
---------- ---------- ----------
1 john 4635
3 tom 7364
4 bob 234
6 harry 9857
8 algernon 1
9 sebastian 234
10 charles 888
7 rows selected.
SQL> select name, id, points, rank() over (order by points)
2 from users
3 /
NAME ID POINTS RANK()OVER(ORDERBYPOINTS)
---------- ---------- ---------- -------------------------
algernon 8 1 1
bob 4 234 2
sebastian 9 234 2
charles 10 888 4
john 1 4635 5
tom 3 7364 6
harry 6 9857 7
7 rows selected.
SQL> select name, id, points, dense_rank() over (order by points desc)
2 from users
3 /
NAME ID POINTS DENSE_RANK()OVER(ORDERBYPOINTSDESC)
---------- ---------- ---------- -----------------------------------
harry 6 9857 1
tom 3 7364 2
john 1 4635 3
charles 10 888 4
bob 4 234 5
sebastian 9 234 5
algernon 8 1 6
7 rows selected.
SQL>
Does not the 'where' portion of that query internally require reading the entire table? I understand about premature optimization. Academically, it seems that this wouldn't scale further than a few thousand rows.
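If it does, I assume the usual mitigation would be a composite index on the ranking columns, though I'm not sure every MySQL version can use it for the row-constructor comparison:
-- Hypothetical index so the correlated COUNT(*) can ideally be answered from the
-- index rather than a full table scan.
CREATE INDEX idx_users_points_id ON users (points, id);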