I have an assignment I'm having some problems with; I have a hard time knowing what to do with the data I've collected.
The task is to calculate the constant c in the Big-O bound, as well as n₀.
We have an unknown program that we run from the terminal. We can choose how many elements it should process; the more elements, the longer the program takes to run.
When the program finishes, it prints how long it took to complete.
Here is the collected data:
Input | time (s)
--------+----------
1000 | 0.0015
1000 | 0.0016
1000 | 0.0015
2000 | 0.0063
2000 | 0.0063
3000 | 0.0063
4000 | 0.0281
4500 | 0.0344
5000 | 0.0453
6000 | 0.0672
7000 | 0.0953
8000 | 0.1265
9000 | 0.1656
10000 | 0.2078
11000 | 0.2547
12000 | 0.3062
15000 | 0.4875
20000 | 0.8953
25000 | 1.4125
30000 | 2.0390
35000 | 2.8750
40000 | 3.6641
50000 | 5.7641
50000 | 5.7438
70000 | 11.4781
75000 | 13.7312
80000 | 15.0828
85000 | 17.1156
90000 | 19.8610
100000 | 23.2328
110000 | 28.8032
130000 | 40.6344
The thing is: how do I move on from here? My guess from looking at the chart is that the complexity is O(n²).
Are there any tips on how to take the next step and calculate c and n₀?
In general it's not a good idea to "compute" the complexity of a function by measuring the time it takes to run on different inputs.
E.g. let's say the function spends half its time on big-integer arithmetic and half on large strings, and you have a special compiler extension that speeds up the big-integer arithmetic massively. Then it is easy to miss how the runtime scales with the integer part of the input.
If you have to get an idea of the complexity without the source code of the function, you can treat the measured run time as a function f(n) of the input size n. To show that the function is in O(n²) you only have to give a g(n), a c and an n₀ such that f(n) ≤ c⋅g(n) holds for all n ≥ n₀; it does not have to be exact.
In your case it would be OK to choose g(n) = n² + 1, c = 1 and n₀ = 0.
You could also use g(n) = 1/4⋅10⁻⁸⋅n² + 1 with c = 1 and n₀ = 0,
or g(n) = 1/4⋅10⁻⁸⋅n² with c = 1 and n₀ = 40000.
But note: you cannot do this exactly. It is also possible that this result turns out to be wrong if you test 10¹⁰ as input. If you want the exact complexity you have to take a look at the code.
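To turn the measurements into concrete numbers, here is a minimal Python sketch (not part of the assignment, and using only a subset of the table above) that estimates c for g(n) = n² by taking the largest observed ratio time/n²:

# Subset of the measurements from the question: (input size n, time in seconds).
samples = [
    (1000, 0.0015), (2000, 0.0063), (5000, 0.0453), (10000, 0.2078),
    (20000, 0.8953), (40000, 3.6641), (80000, 15.0828), (130000, 40.6344),
]

# If f(n) is roughly c*n^2, the ratios time/n^2 should settle near c.
ratios = [t / n**2 for n, t in samples]
c = max(ratios)  # smallest c with f(n) <= c*n^2 on every measured point

print(["%.3g" % r for r in ratios])
print("c =", c)

# Sanity check of the bound f(n) <= c*g(n) with g(n) = n^2 and n0 = smallest sample.
assert all(t <= c * n**2 for n, t in samples)

The ratios creep up towards roughly 2.4⋅10⁻⁹, which is consistent with the factor 1/4⋅10⁻⁸ = 2.5⋅10⁻⁹ used above, but as noted this only bounds the data you happened to measure, not the function itself.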
Imagine a fact table with a summation of measures over a time period, say 1 hour.
Start Date | Measure 1 | Measure 2
-------------------------------------------
2018-09-08 00:00:00 | 5 | 10
2018-09-08 00:01:00 | 12 | 20
Ideally we want to maintain the grain such that each row is exactly 1 hour. However, each row references dimensions which might ‘break’ the grain. For instance:
Start Date | Measure 1 | Measure 2 | Dim 1
---------------------------------------------------
2018-09-08 00:00:00 | 5 | 10 | key 1
2018-09-08 00:01:00 | 12 | 20 | key 2
It is possible that the dimension value changes 30 minutes into the hour, in which case the above would be inaccurate and should be represented like this:
Start Date | Measure 1 | Measure 2 | Dim 1
---------------------------------------------------
2018-09-08 00:00:00 | 5 | 10 | val 1
2018-09-08 00:00:30 | 5 | 10 | val 2
2018-09-08 00:01:00 | 12 | 20 | val 2
In our scenario, the data needs to be sliced by at least 5 dimension keys with queries like:
sum(measure1) where dim1 = x and dim2 = y..
Is there a design pattern for this requirement? I have considered ‘periodic snapshots’ but I have not read anywhere about this kind of row splitting on dimension changes.
I can see only two options:
1. Store the dimension value that was most present in each row (e.g. if a dimension value was true for the majority of the hour, use that value). This would lead to some loss of accuracy.
2. Split each row on every dimension change. This is complex in the ETL, creates more data and breaks the granularity rule in the fact table.
Option 2 is the current solution and serves the purpose but is harder to maintain. Is there a better way to do this, or other options?
By way of a real example, this system records production data in a manufacturing environment so the data is something like:
Line | Date | Crew | Product | Running Time (mins)
-----------------------------------------------------------------------
Line 1 | 2018-09-08 00:00:00 | Crew A | Product A | 60
As noted, the crew, product or any of the other dimensions may change multiple times within the hour.
You shouldn't need to split the time portion of your fact table, since you clearly want to report hourly data, but you should have two records, one for each dimension value. If this is an aggregate of a transactional fact table, the process that loads the hourly table should group the records by each dimension key. So in your example above, you would have records for the hour like so:
Start Date | Measure 1 | Measure 2 | Dim 1
---------------------------------------------------
2018-09-08 00:00:00 | 5 | 10 | val 1
2018-09-08 00:01:00 | 5 | 10 | val 1
2018-09-08 00:01:00 | 12 | 10 | val 2
You will need to take into account the other measures as well and make sure they all go into the correct bucket (val 1 or val 2). I split them evenly in the example.
Now if you slice by hour 1 and by Dim 1 Value 2, you will only see 12 (measure 1), and if you slice on hour 1, dim 1 value 1, you will only see 5, and if you only slice on hour 1, you will see 17.
Remember, your grain is defined by the level of each dimension, not just the time dimension. HTH.
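To make the loading step concrete, here is a rough Python sketch of that grouping (the timestamps, dimension keys and measure values are made up for illustration, not taken from the answer):

from collections import defaultdict
from datetime import datetime

# Hypothetical transactional rows: (timestamp, dim 1 key, measure 1, measure 2).
transactions = [
    (datetime(2018, 9, 8, 0, 10), "key 1", 5, 10),
    (datetime(2018, 9, 8, 0, 40), "key 2", 12, 20),
    (datetime(2018, 9, 8, 1, 5),  "key 2", 7, 3),
]

# Group by (hour bucket, dimension key) and sum the measures, so a dimension
# change inside the hour simply yields an extra row for that hour.
hourly = defaultdict(lambda: [0, 0])
for ts, dim1, m1, m2 in transactions:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    hourly[(hour, dim1)][0] += m1
    hourly[(hour, dim1)][1] += m2

for (hour, dim1), (m1, m2) in sorted(hourly.items()):
    print(hour, dim1, m1, m2)

Slicing on the hour alone then sums across the dimension rows, as described above.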
Say that I gain +5 coins from every room I complete. What I'm trying to do is to make a formula in Excel that gets the total coins I've gotten from the first room to the 100th room.
With C++, I guess it would be something like:
int lastRoom = 100;   // the 100th room
int totalCoins = 0;

while (lastRoom > 0)
{
    totalCoins += lastRoom * 5;   // add this room's coins to the running total
    lastRoom--;
}
totalCoins could also be an array of per-room values that you sum at the end; accumulating into one variable gives the same total.
So, how do you put this logic into Excel and get it to work? Or is there any other way to get the total coins?
There are infinitely many solutions.
One is to build a table like this:
+---+----------+---------------+
| | A | B |
+---+----------+---------------+
| 1 | UserID | RoomCompleted |
| 2 | User 001 | Room 1 |
| 3 | User 002 | Room 1 |
| 4 | User 002 | Room 2 |
| 5 | User 002 | Room 3 |
+---+----------+---------------+
then pivot the spreadsheet to get the following:
+---+----------+-----------------------+
| | A | B |
+---+----------+-----------------------+
| 1 | User | Total Rooms completed |
| 2 | User 001 | 1 |
| 3 | User 002 | 3 |
+---+----------+-----------------------+
where you have the number of completed rooms for each user. You can now multiply that number by 5 with a simple formula or (better) with a calculated field in the pivot.
If I understand you correctly you shouldn't need any special code, just a formula:
=(C2-A2+1)*B2
Where C2 = Nth room, A2 = Previous Room, and B2 = coin reward. You can change A2, B2, or C2 and the formula in D2 will output the result.
You can use the formula for the sum of the integers less than n, (n - 1)*(n / 2), and multiply it by the coin count, which gives something like 5 * (n - 1)*(n / 2). Then you just hook it up to your table.
Hope it helps
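For what it's worth, here is a small Python check of the loop from the question against the closed form (just an illustration; like the C++ sketch, it assumes room k is worth k·5 coins):

def total_coins_loop(last_room, reward=5):
    # Mirror the loop from the question: add reward * room for each room.
    total = 0
    room = last_room
    while room > 0:
        total += room * reward
        room -= 1
    return total

def total_coins_formula(last_room, reward=5):
    # Closed form: reward * (1 + 2 + ... + last_room) = reward * n * (n + 1) / 2.
    return reward * last_room * (last_room + 1) // 2

print(total_coins_loop(100))     # 25250
print(total_coins_formula(100))  # 25250

Note that (n - 1)*(n / 2) is the sum of the integers up to n - 1, so it leaves out room n itself; if the last room should be included, use n*(n + 1)/2 instead.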
I have four columns of dates (A, B, C, D). I want Excel to check each "period 2" range against ALL of the "period 1" ranges in rows 1, 2, 3, ...
For example, 20–28/01/2016 intersects both 01–24/01/2016 and 25/01–03/02/2016; in that case the answer in column E must be "wrong".
I think this has to be an array formula, because a single cell must check an entire column. If it could be done without an array formula I would be very happy, because array formulas slow calculation down a lot on my computer.
  |     A      |     B      |     C      |     D      |    E    |
  |        period 1         |        period 2         |         |
1 | 01/01/2016 | 24/01/2016 | 20/01/2016 | 28/01/2016 | "wrong" |
2 | 25/01/2016 | 03/02/2016 | 04/02/2016 | 10/02/2016 |  "ok"   |
3 |            |            |            |            |         |
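This isn't an Excel formula, but a short Python sketch of the logic being asked for (reading the example as: flag "wrong" when a period-2 range overlaps more than one period-1 range; the dates come from the table above):

from datetime import date

# Period 1 ranges (columns A, B) and period 2 ranges (columns C, D) from the example.
period1 = [(date(2016, 1, 1), date(2016, 1, 24)), (date(2016, 1, 25), date(2016, 2, 3))]
period2 = [(date(2016, 1, 20), date(2016, 1, 28)), (date(2016, 2, 4), date(2016, 2, 10))]

def overlaps(a_start, a_end, b_start, b_end):
    # Two closed date ranges intersect iff each one starts on or before the other's end.
    return a_start <= b_end and b_start <= a_end

for start2, end2 in period2:
    hits = sum(overlaps(start2, end2, s1, e1) for s1, e1 in period1)
    print(start2, end2, "wrong" if hits > 1 else "ok")

Row 1 overlaps both period-1 ranges and prints "wrong"; row 2 overlaps neither and prints "ok", matching column E.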
I am using version 3.0.3, and running my queries in the shell.
I have ~58 million record nodes with 4 properties each, specifically an ID string, an epoch time integer, and lat/lon floats.
When I run a query like profile MATCH (r:record) RETURN count(r); I get a very quick response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
29 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
| +ProduceResults | 7644 | 1 | 0 | count(r) | count(r) |
| | +----------------+------+---------+-----------+--------------------------------+
| +NodeCountFromCountStore | 7644 | 1 | 0 | count(r) | count( (:record) ) AS count(r) |
+--------------------------+----------------+------+---------+-----------+--------------------------------+
Total database accesses: 0
The Total database accesses: 0 and NodeCountFromCountStore tell me that Neo4j uses a counting mechanism here that avoids iterating over all the nodes.
However, when I run profile MATCH (r:record) WHERE r.time < 10000000000 RETURN count(r);, I get a very slow response:
+----------+
| count(r) |
+----------+
| 58430739 |
+----------+
1 row
151278 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
| +ProduceResults | 1324 | 1 | 0 | count(r) | count(r) |
| | +----------------+----------+----------+-----------+------------------------------+
| +EagerAggregation | 1324 | 1 | 0 | count(r) | |
| | +----------------+----------+----------+-----------+------------------------------+
| +NodeIndexSeekByRange | 1752922 | 58430739 | 58430740 | r | :record(time) < { AUTOINT0} |
+-----------------------+----------------+----------+----------+-----------+------------------------------+
Total database accesses: 58430740
The count is correct, as I chose a time value larger than all of my records. What surprises me here is that Neo4j is accessing EVERY single record. The profiler states that Neo4j is using the NodeIndexSeekByRange as an alternative method here.
My question is, why does Neo4j access EVERY record when all it is returning is a count? Are there no intelligent mechanisms inside the system to count a range of values after seeking the boundary/threshold value within the index?
I use Apache Solr for the same data, and returning a count after searching an index is extremely fast (about 5 seconds). If I recall correctly, both platforms are built on top of Apache Lucene. While I don't know much about that software internally, I would assume that the index support is fairly similar for both Neo4j and Solr.
I am working on a proxy service that will deliver results in a paginated form (using the SKIP n LIMIT m technique) by first getting a count, and then iterating over results in chunks. This works really well for Solr, but I am afraid that Neo4j may not perform well in this scenario.
Any thoughts?
The latter query does a NodeIndexSeekByRange operation. This goes through all of your matched nodes with the record label, looks up the value of the node property time, and compares it against 10000000000.
The query actually has to touch every node and read data from it for the comparison, and that's the reason why it is so much slower.
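As an aside, the SKIP/LIMIT pagination described in the question might look roughly like this with the official Neo4j Python driver. This is only a sketch: the URI, credentials and property names are assumptions, and on a 3.0-era server the parameter syntax is {cutoff} rather than $cutoff (the profile output above shows the curly-brace form).

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

PAGE_QUERY = (
    "MATCH (r:record) WHERE r.time < $cutoff "
    "RETURN r.id AS id, r.time AS time "
    "ORDER BY r.time SKIP $skip LIMIT $limit"
)

def fetch_page(session, cutoff, page, page_size=1000):
    # Fetch one page of results, ordered by the indexed time property.
    result = session.run(PAGE_QUERY, cutoff=cutoff, skip=page * page_size, limit=page_size)
    return [record.data() for record in result]

with driver.session() as session:
    first_page = fetch_page(session, cutoff=10000000000, page=0)
    print(len(first_page))

Keep in mind that SKIP still walks past all the skipped rows, so deep pages get progressively slower; remembering the largest time value from the previous page and using it as the lower bound of the next query avoids that.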
Two functions that convert an RGB image to a grayscale image:
function rgb2gray_loop{T<:FloatingPoint}(A::Array{T,3})
    r,c = size(A)
    gray = similar(A,r,c)
    for i = 1:r
        for j = 1:c
            @inbounds gray[i,j] = 0.299*A[i,j,1] + 0.587*A[i,j,2] + 0.114*A[i,j,3]
        end
    end
    return gray
end
And:
function rgb2gray_vec{T<:FloatingPoint}(A::Array{T,3})
    gray = similar(A,size(A)[1:2]...)
    gray = 0.299*A[:,:,1] + 0.587*A[:,:,2] + 0.114*A[:,:,3]
    return gray
end
The first one is using loops, while the second one uses vectorization.
When benchmarking them (with the Benchmark package) I get the following results for different sized input images (f1 is the loop version, f2 the vectorized version):
A = rand(50,50,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|-------------|----------|--------------|
| 1 | "f1" | 3.23746e-5 | 1.0 | 1000 |
| 2 | "f2" | 0.000160214 | 4.94875 | 1000 |
A = rand(500,500,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|------------|----------|--------------|
| 1 | "f1" | 0.00783007 | 1.0 | 100 |
| 2 | "f2" | 0.0153099 | 1.95527 | 100 |
A = rand(5000,5000,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|----------|----------|--------------|
| 1 | "f1" | 1.60534 | 2.56553 | 10 |
| 2 | "f2" | 0.625734 | 1.0 | 10 |
I expected one function to be faster than the other (maybe f1, because of the @inbounds macro).
But I can't explain why the vectorized version gets faster for larger images.
Why is that?
The explanation for the results is that multidimensional arrays in Julia are stored in column-major order. See the Julia documentation on memory order.
Fixed loop version, respecting column-major order (inner and outer loop variables swapped):
function rgb2gray_loop{T<:FloatingPoint}(A::Array{T,3})
    r,c = size(A)
    gray = similar(A,r,c)
    for j = 1:c
        for i = 1:r
            @inbounds gray[i,j] = 0.299*A[i,j,1] + 0.587*A[i,j,2] + 0.114*A[i,j,3]
        end
    end
    return gray
end
New results for A = rand(5000,5000,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|----------|----------|--------------|
| 1 | "f1" | 0.107275 | 1.0 | 10 |
| 2 | "f2" | 0.646872 | 6.03004 | 10 |
And the results for smaller Arrays:
A = rand(500,500,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|------------|----------|--------------|
| 1 | "f1" | 0.00236405 | 1.0 | 100 |
| 2 | "f2" | 0.0207249 | 8.76671 | 100 |
A = rand(50,50,3):
| Row | Function | Average | Relative | Replications |
|-----|----------|-------------|----------|--------------|
| 1 | "f1" | 4.29321e-5 | 1.0 | 1000 |
| 2 | "f2" | 0.000224518 | 5.22961 | 1000 |
Just speculation, because I don't know Julia:
I think the statement gray = ... in the vectorized form creates a new array in which all the calculated values are stored, while the old array is scrapped. In f1 the values are overwritten in place, so no new memory allocation is needed. Memory allocation is quite expensive, so the loop version with in-place overwrites is faster for small sizes.
But memory allocation usually has a fairly fixed overhead (allocating twice as much doesn't take twice as long), and the vectorized version computes faster (maybe in parallel?), so if the arrays get big enough the faster computation matters more than the memory allocation.
I cannot reproduce your results.
See this IJulia notebook: http://nbviewer.ipython.org/urls/gist.githubusercontent.com/anonymous/24c17478ae0f5562c449/raw/8d5d32c13209a6443c6d72b31e2459d70607d21b/rgb2gray.ipynb
The numbers I get are:
In [5]:
@time rgb2gray_loop(rand(50,50,3));
@time rgb2gray_vec(rand(50,50,3));
elapsed time: 7.591e-5 seconds (80344 bytes allocated)
elapsed time: 0.000108785 seconds (241192 bytes allocated)
In [6]:
@time rgb2gray_loop(rand(500,500,3));
@time rgb2gray_vec(rand(500,500,3));
elapsed time: 0.021647914 seconds (8000344 bytes allocated)
elapsed time: 0.012364489 seconds (24001192 bytes allocated)
In [7]:
@time rgb2gray_loop(rand(5000,5000,3));
@time rgb2gray_vec(rand(5000,5000,3));
elapsed time: 0.902367223 seconds (800000440 bytes allocated)
elapsed time: 1.237281103 seconds (2400001592 bytes allocated, 7.61% gc time)
As expected, the looped version is faster for large inputs. Also note how the vectorized version allocated three times as much memory.
I also want to point out that the statement gray = similar(A,size(A)[1:2]...) is redundant and can be omitted.
Without this unnecessary allocation, the results for the largest problem are:
@time rgb2gray_loop(rand(5000,5000,3));
@time rgb2gray_vec(rand(5000,5000,3));
elapsed time: 0.953746863 seconds (800000488 bytes allocated, 3.06% gc time)
elapsed time: 1.203013639 seconds (2200001200 bytes allocated, 7.28% gc time)
So the memory usage went down, but the speed did not noticeably improve.