How to calculate index pages scanned and relation pages scanned - database

I have some homework problems that require that I calculate the total cost of queries. Some of the assumptions made for the queries are:
All costs are in terms of disk pages: the cost of scanning the index, plus the cost of reading matching tuples from the relation if needed.
If reading tuples from a relation, use the worst-case scenario: all tuples are in different disk pages.
Assume all indices have 3 levels (root, internal node, and leaf level), and the roots of all indices are in memory, so scanning the root incurs no cost. Most searches will touch one node from the internal level and a number of leaf nodes.
PostgreSQL returns the size of an index in terms of the number of pages using the query below. For the B-tree index, assume the numbers provided are the number of leaf nodes.
My professor also gives us a sample query to compute the cost for, but I don't know exactly how he arrived at his numbers. The database in question looks like:
create table series (
seriesid int primary key
, title varchar(400)
, yearreleased int
, contentrating varchar(40) -- age group the movie is intended for
, imdbrating float -- imdb rating
, rottentomatoes int -- rotten tomatoes rating
, description text
, seasons int -- how many seasons are available
, date_added date -- date series is added to Netflix
) ;
and the query to calculate the cost for looks like so:
qa: select director from seriesdirectors where seriesid <= 100;
The index referenced is seriesdirectors_pkey which has 2 relation pages, and the query returns 5 tuples in total.
Using these numbers my professor somehow got to the conclusion that the number of index pages scanned is 3 and the number of relation pages scanned is 0.
The reasoning being: "because there are 5 tuples matching this query (1 internal node, and 2 leaf nodes in the worst case), and the index contains all the necessary information (so zero tuples need to be read)." I have been trying to understand the concept of index pages and how exactly my professor got these numbers so I can work on the rest of the problems. If any more context is needed, let me know and I will provide it.
Edit: By "index page" here I mean a page of a standard database index, i.e. a structure that stores tuples from a database keyed by a unique identifier (more here: https://en.wikipedia.org/wiki/Database_index).
The big question I have is how can I calculate the total number of index pages queried and the number of relation pages queried. The example query above (qa) was using the index seriesdirectors_pkey which has 2 relation pages and returns a result of 5 rows. I don't know how we get from that to knowing that 3 index pages were queried and 0 relation pages were queried.
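Formalizing the quoted reasoning as a small Python sketch (my own restatement of the assumptions above, not from the course materials):

# Given numbers from the problem statement
leaf_pages_in_index = 2  # seriesdirectors_pkey has 2 pages, assumed to all be leaves
matching_tuples = 5      # the query returns 5 tuples

# Index pages scanned: the root is in memory (free), one internal node is
# read, and in the worst case the 5 matching entries span all the leaf pages.
internal_nodes_read = 1
leaf_pages_read = min(matching_tuples, leaf_pages_in_index)  # 5 entries span at most 2 leaves
index_pages = internal_nodes_read + leaf_pages_read          # 1 + 2 = 3

# Relation pages scanned: per the quoted reasoning, the index already contains
# everything the query needs (an index-only scan), so no relation pages are read.
relation_pages = 0

print(index_pages, relation_pages)  # 3 0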

Related

Excel: Need to Generate IDs based on multiple criteria with repeating IDs

Looking to create pricing groups based on multiple criteria. Each group could have multiple items within it. I'm struggling with auto-creating the name of each group. I estimate there should be about 6.5K pricing groups out of 14K items.
Below are the criteria:
QTY per case - is the number of bottles in a case
Size - size of the bottle
Family Brand - contains a group of like items
Code-CS1 - this is my unique code for each group, combining each of the above plus the lowest possible case price.
(screenshot of the sample data, including the desired "Thinking" column, omitted)
The "Thinking" column is how I want each group to look, but how do I do this with 14K items quickly?
If I understood correctly, your pricing group name consists of two parts: a simple combination of columns, and a "special" part that has to be counted.
Part 1 is simple: =C2&"-"&B2&"-"&A1&"-"
To make Part 2 easier, you could sort the data, using Part 1 and CODE-CS1 as the sort fields.
Having done this, you can use helper columns. If Part 1 is in column X and CODE-CS1 in column Y, you can use a formula for
Part 2 (column Z): ="T"&IF(X1=X2;IF(Y1=Y2;Z1;Z1+1);1)
That means: if Part 1 changes, your counter restarts at T1; if not, then if your CODE-CS1 changes, it counts up; if not, it keeps the last number.
The resulting code would be =X2&Z2
This is untested and I use German Excel, so the formulas may need some adaptation (in English Excel, replace the semicolons with commas), but in general it should work.
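For 14K rows you may find it quicker to script the same counter logic outside Excel. A rough Python sketch (my own, with hypothetical column names, assuming the data is exported to items.csv):

import csv

# Sort by the Part 1 key and then by CODE-CS1, mirroring the sort above.
with open("items.csv", newline="") as f:
    rows = sorted(csv.DictReader(f),
                  key=lambda r: (r["FamilyBrand"], r["Size"], r["QtyPerCase"], r["CodeCS1"]))

prev_part1, prev_code, counter = None, None, 0
for row in rows:
    part1 = "%s-%s-%s-" % (row["FamilyBrand"], row["Size"], row["QtyPerCase"])
    if part1 != prev_part1:            # Part 1 changed: restart the counter
        counter = 1
    elif row["CodeCS1"] != prev_code:  # same Part 1, new code: count up
        counter += 1
    row["PricingGroup"] = "%sT%d" % (part1, counter)  # e.g. "Brand-750-12-T1"
    prev_part1, prev_code = part1, row["CodeCS1"]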

Alternative 1 of index data entry

One of the three alternatives of what to store as a data entry in an index is a data entry k* which is an actual record with search key value k. My question is, if you're going to store the actual record in the index, then what's the point of creating the index?
This is an extreme case, because it does not really correspond to
having data entries separate from data records (the hashed file is an
example of this case).
(M. Lenzerini, R. Rosati, Database Management Systems: Access file manager and query evaluation, "La Sapienza" University of Rome, 2016)
Alternative 1 is often used for direct indexing, for instance in B-trees and hash indexes (see also Oracle, Building Domain Indexes)
Let's do a concrete example.
We have a relation R(a,b,c) and a clustered B+-tree using alternative 2 on search key a. Since the tree is clustered, the relation R must be sorted by a.
Now, let's suppose that a common query for the relation is:
SELECT *
FROM R
WHERE b > 25
so we want to build another index to efficiently support this kind of query.
Case 1: clustered tree with alt. 2
We know that clustered B+-trees with alternative 2 are efficient for range queries, because they just need to search for the first matching result (say the one with b=25), then do 1 page access to the relation page that this result points to, and finally scan that page (and possibly some other pages) for as long as the records fall within the given range.
To sum up:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Use the found pointer to go to a specific page. Cost: 1
Scan the page and possibly other pages. Cost: num. of relevant pages
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + 1 + #relevant-pages
where ƒ is the fan-out and ℓ the number of leaves.
Unfortunately, in our case a tree on search key b must be unclustered, because the relation is already sorted by a.
Case 2: unclustered tree with alt. 2 (or 3)
We also know that B+-trees are not so efficient for range queries when they are unclustered. In fact, with a tree using alternative 2 or 3, we'd store only pointers to the records in the tree, so for each result that falls in the range we'd have to do a page access to a potentially different page (because the relation is ordered differently from the index).
To sum up:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Continue scanning the leaf (and maybe other leaves), doing a separate page access for each tuple that falls in the range. Cost: num. of other relevant leaves + num. of relevant tuples
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves + #relevant-tuples
Notice that the number of tuples is much larger than the number of pages!
Case 3: unclustered tree with alt. 1
Using alternative 1, we have all the data in the tree, so for executing the query we:
Search for the first good result in the tree. Cost: logƒ(ℓ)
Continue scanning the leaf (and maybe other leaves). Cost: num. of other relevant leaves
The final cost (expressed in terms of page accesses) is
logƒ(ℓ) + #other-relevant-leaves
which is even smaller than (or at most equal to) the cost of case 1; and unlike case 1, this option is actually available here.
I hope I was clear enough.
N.B. The cost is expressed in terms of page accesses because the I/O operations from/to secondary storage are the most expensive ones in terms of time (we ignore the cost of scanning a whole page in main memory and consider just the cost of accessing it).
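To make the comparison concrete, here is a small Python sketch of the three cost formulas (the numbers plugged in are illustrative, not taken from the text above):

import math

def cost_clustered_alt2(f, leaves, relevant_pages):
    # log_f(leaves) to find the first match, 1 access to reach the relation,
    # then the remaining relevant relation pages
    return math.log(leaves, f) + 1 + relevant_pages

def cost_unclustered_alt2(f, leaves, other_relevant_leaves, relevant_tuples):
    # one relation-page access per matching tuple dominates the cost
    return math.log(leaves, f) + other_relevant_leaves + relevant_tuples

def cost_unclustered_alt1(f, leaves, other_relevant_leaves):
    # the records live in the leaves themselves, so no relation accesses at all
    return math.log(leaves, f) + other_relevant_leaves

# Example: fan-out 100, 1000 leaves, 50 matching tuples spread over 2 pages/leaves
print(cost_clustered_alt2(100, 1000, 2))        # 4.5
print(cost_unclustered_alt2(100, 1000, 1, 50))  # 52.5
print(cost_unclustered_alt1(100, 1000, 1))      # 2.5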

Google App Engine - Search API index growth

I would like to know how I can estimate the growth (how much the size increases in a period of time) of an index of the App Engine Search API (FTS) based on the number of entities inserted and the amount of information. For this I would like to know basically how the index size is calculated (what it depends on). Specifically:
When inserting new entities, is the growth (size) influenced by the number of previously existing entities (i.e. is the growth exponential)? For example, if I have 1000 entities and I insert 10, the index will grow by X bytes. But if I have 100000 entities and insert 10, will it increase by X or much more than X (exponentially, let's say 10*X)?
Does the number of fields (properties) influence the size exponentially? For example, if I have entity A with 2 fields and entity B with 4 fields (let's say identical in values, for mathematical simplicity), will the size increase, when adding entity B, be twice that of entity A or much more than that?
What other means can I use to find statistical information; do I have other tools in the cloud console of app engine, or can I do this programmatically ?
Thank you.
You can check the size of a given index by running the code below.
import logging
from google.appengine.api import search

# Log the current size of every index.
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)

# Pseudocode: insert a batch of documents, then measure again.
amount_of_items_to_add = 100
for _ in range(amount_of_items_to_add):
    search_api_insert(data)  # placeholder for your own insert routine

# Rerun the loop to see how much the size increased.
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s", index.storage_usage)
This code is obviously not a complete working example, but you should be able to build a simple method from it that takes some data, inserts it into the Search API, and returns how much the used storage increased.
I have run a number of tests with different numbers of entities and different numbers of indexed properties per entity, and it seems that the growth of the index size reported by the API is not exponential; it is linear.
But the most interesting fact to know is that although the reported size is almost real-time, after deleting documents from the index it may take 12, 24, or even 36 hours to update.

Regarding indexes: I tried a lot but didn't really understand it. What does the relation really look like? Here it is

Consider a table storing temperature readings taken by sensors:
Temps(sensorID, time, temp)
Assume the pair of attributes [sensorID,time] is a key. Consider the following query:
select * from Temps
where sensorID = 'sensor541'
and time = '05:11:02'
Consider the following scenarios:
A - No index is present on any attribute of Temps
B - An index is present on attribute sensorID only
C - An index is present on attribute time only
D - Separate indexes are present on attributes sensorID and time
E - A multi-attribute index is present on (sensorID,time)
Suppose table Temps has 50 unique sensorIDs, and each sensorID has exactly 20 readings. Furthermore, there are exactly 10 readings for every unique time in Temps.
For each scenario A-E, determine the maximum number of tuples that might be accessed to answer the query, assuming one "best" index is used whenever possible. (Don't count the number of index accesses.) Which of the following combinations of values is correct?
1) A:1000, C:1000, D:10
2) B:10, C:10, E:10
3) B:20, C:10, E:1
4) B:1000, C:10, D:10
Scenario A: Since there are no indexes, all tuples of the table may need to be accessed to look for 'sensor541' and '05:11:02'. The number of tuples in Temps is 50 (unique sensorIDs) * 20 (number of readings per sensor) = 1000.
Scenario B: Using the index on sensorID, 20 readings will match the given sensorID, and all 20 tuples may need to be accessed to look for a matching time.
Scenario C: Using the index on time, 10 readings will match the given time, and all 10 tuples may need to be accessed to look for a matching sensorID.
Scenario D: Using the time index (10 matching tuples) is better than using the sensorID index (20 matching tuples), so the time index is used and the result is the same as scenario C (10 tuples).
Scenario E: The index on [sensorID, time] will directly find the single matching tuple, if there is one.
Putting these together (A:1000, B:20, C:10, D:10, E:1), combination 3 is the correct one.
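As a quick sanity check, the counts follow directly from the given statistics (a sketch of the arithmetic, not part of the original answer):

sensors, readings_per_sensor, readings_per_time = 50, 20, 10

max_tuples = {
    "A": sensors * readings_per_sensor,                # full scan: 1000
    "B": readings_per_sensor,                          # one sensor's readings: 20
    "C": readings_per_time,                            # one time's readings: 10
    "D": min(readings_per_sensor, readings_per_time),  # best single index: 10
    "E": 1,                                            # key lookup on (sensorID, time)
}
print(max_tuples)  # {'A': 1000, 'B': 20, 'C': 10, 'D': 10, 'E': 1}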

Getting random entry from Objectify entity

How can I get a random element out of a Google App Engine datastore using Objectify? Should I fetch all of an entity's keys and choose randomly from them or is there a better way?
Assign a random number between 0 and 1 to each entity when you store it. To fetch a random record, generate another random number between 0 and 1, and query for the smallest entity with a random value greater than that.
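Objectify itself is Java, but to keep the code examples here in one language, this is the same idea sketched with App Engine's Python ndb API (the model and property names are mine); the Objectify version is analogous:

import random
from google.appengine.ext import ndb

class Item(ndb.Model):
    rand = ndb.FloatProperty()  # assigned once, at write time

def put_item():
    Item(rand=random.random()).put()

def get_random_item():
    r = random.random()
    # smallest stored value greater than (or equal to) the new random number...
    item = Item.query(Item.rand >= r).order(Item.rand).get()
    if item is None:
        # ...wrapping around in case r is above every stored value
        item = Item.query().order(Item.rand).get()
    return item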
You don't need to fetch all.
For example:
countall = ofy.query(X.class).count()
// http://groups.google.com/group/objectify-appengine/browse_frm/thread/3678cf34bb15d34d/82298e615691d6c5?lnk=gst&q=count#82298e615691d6c5
rnd = generate a random number in [1..countall]
ofy.query(X.class).order("-date").limit(rnd); // "-date" or some other chronologically indexed field
The last id returned is yours.
(On average you fetch 50% of the entities, or at least the first read is on average 50% cheaper.)
Improvements (to keep a smaller key table in the cache):
After the first read, remember every X-th element: cache those ids and their positions. Next time, start the query from the cached id just below your random target, so the limit (".limit(rnd%X)") will be at most X-1.
Random is just random: if it doesn't need to be close to 100% fair, you can guess a value for the chronological field (for example, if you have 1000 records over 10 days, then for random number 501 select the second element after the fifth day).
Another option, if you have a chronological field such as a date: fetch the elements older than a random date and younger than that random date + 1 (you need to know the first and the last date), then select randomly among the fetched records. If the query is empty, select greater than the date, etc.
Quoted from this post about selecting some random elements from an Objectified datastore:
If your ids are sequential, one way would be to randomly select 5
numbers from the id range known to be in use. Then use a query with an
"in" filter().
If you don't mind the 5 entries being adjacent, you can use count(),
limit(), and offset() to randomly find a block of 5 entries.
Otherwise, you'll probably need to use limit() and offset() to
randomly select one entry out at a time.
-- Josh
I pretty much adapted the algorithm provided by Matejc. However, 3 things:
Instead of using count() or the datastore service factory (DatastoreServiceFactory.getDatastoreService()), I have an entity that keeps track of the total count of the entities that I am interested in. The reason for this approach is that:
a. count() can be expensive when you are dealing with a lot of objects
b. You can't test the datastore service factory locally; testing in prod is just bad practice.
Generating the random number: ThreadLocalRandom.current().nextLong(1, maxRange)
Instead of using limit(), I use offset(), so I don't have to worry about "sorting."
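A rough sketch of that offset-based variant, again in Python ndb for consistency (the counter entity and all names are mine):

import random
from google.appengine.ext import ndb

class Counter(ndb.Model):
    total = ndb.IntegerProperty(default=0)  # maintained on every insert/delete

class Item(ndb.Model):
    pass

def get_random_item():
    counter = Counter.get_by_id("items")  # hypothetical singleton counter entity
    offset = random.randrange(counter.total)
    # Fetch exactly one entity at a random offset: no ordering and no stored
    # random property needed, at the price of datastore offset costs.
    results = Item.query().fetch(1, offset=offset)
    return results[0] if results else None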
