I have an existing table in Google Datastore, and I only want to update its index.
My question is: when I create/delete an index, is that free (not Datastore storage cost, I mean only the action of creating/deleting the index itself)?
I can't find anything about this on their pricing page https://cloud.google.com/datastore/pricing
Thx
The indexing operations themselves are free - normally they're absorbed in the entity write operation costs. But the storage space used by the indexes counts toward the overall storage costs (see the Summary in the Datastore Dashboard).
You shouldn't normally need to update indexes (they're automatically updated at entity writes) unless:
you add/delete composite indexes (you can't modify existing composite indexes, you can only add new ones and clean up the ones no longer used)
you change properties from non-indexed to indexed or vice versa - note that you'd need to re-write the existing entities to be included into/excluded from the indexes (with associated write ops costs), see How to query for newly created property in NDB?
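To illustrate that last point, here's a minimal NDB sketch (the Student model and its email property are hypothetical) of re-writing entities so a newly indexed property gets picked up:

    from google.appengine.ext import ndb

    # Hypothetical model whose `email` property was just changed to indexed=True
    class Student(ndb.Model):
        email = ndb.StringProperty(indexed=True)

    def reindex_students():
        # Re-putting each entity rebuilds its index rows (each put incurs
        # the normal write-op costs; batch with cursors for large kinds)
        ndb.put_multi(Student.query().fetch())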
Context on the problem statement.
Scroll to the bottom for questions.
Note: the tables are not relational; joins can be done at the application level.
Classes
Record
The most atomic unit of the database (each record has a key, value, and id).
Page
Each page is a file that can store multiple records. A page is a size-limited chunk (8 KB?), and it also stores, at the top, an offset for retrieving each id.
Index
A B-tree data structure that supports O(log n) lookups to find which id lives in which page.
We can also insert (id, page) pairs into this B-tree.
Table
Each Table is an abstraction over a directory that stores multiple pages.
Each Table also stores an Index.
Database
Database is an abstraction over a directory which includes all tables that are a part of that database.
Database Manager
Gives ability to switch between different databases, create new databases, and drop existing databases.
Communication In Main Process
Initiates the Database Manager as its own process.
When the process quits, it saves the indexes back to disk.
The process also flushes the indexes to disk at a fixed interval.
We interact with this DB process over HTTP.
Database Manager stores a reference to the current database being used.
The current database attribute stored in the Database Manager holds a reference to all Tables in a hashmap.
Each Table stores a reference to its index, which is read from the index page on disk and kept in memory.
Each Table exposes public methods to set and get key-value pairs.
The Get method navigates the B-tree to find the right page; on that page it finds the key-value pair via the offset stored at the top, and returns a Record.
The Set method adds a key-value pair to the database and then updates the index for that table (see the sketch below).
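A rough sketch of the Get/Set flow (all names are made up; a plain dict stands in for the on-disk B-tree, and for brevity the byte offsets live in the index itself rather than in a per-page header):

    import json

    class Table:
        # Sketch only: `index` stands in for the B-tree described above,
        # mapping record id -> (page file path, byte offset)
        def __init__(self):
            self.index = {}

        def get(self, record_id):
            page_path, offset = self.index[record_id]  # B-tree lookup, O(log n)
            with open(page_path, "rb") as page:
                page.seek(offset)
                return json.loads(page.readline())     # Record: {id, key, value}

        def set(self, record):
            page_path = self._pick_page()
            with open(page_path, "ab") as page:        # append mode: writes go to EOF
                offset = page.tell()
                page.write((json.dumps(record) + "\n").encode())
            self.index[record["id"]] = (page_path, offset)  # update the index

        def _pick_page(self):
            # Hypothetical: a real implementation would track free space and
            # allocate a new page file once the page-size limit is reached
            return "page_0.dat"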
Outstanding Questions:
Am I making any logical errors in my design above?
How should I go about figuring out what the data page size should be (not sure why relational DBs use 8 KB)?
How should I store the Index B-tree to disk?
Should the Database load all indexes for the table into memory at the very start?
A couple of notes off the top of my head:
How many records do you anticipate storing? What are the maximum key and value sizes? I ask because, with a file-per-page scheme, you might find yourself exhausting available file handles.
Are the database/table distinctions necessary? What does this separation gain you? Truly asking the question, not being socratic.
I would define page size in terms of multiples of your maximum key and value sizes so that you can get good read/write alignment and not have too much fragmentation. It might be worth having a naive, but space inefficient, implementation that aligns all writes.
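For instance, a sketch with made-up numbers, deriving the page size from a fixed record slot and rounding up to the filesystem block size:

    import math

    MAX_KEY_BYTES = 128      # assumed maximum key size
    MAX_VALUE_BYTES = 896    # assumed maximum value size
    SLOT_BYTES = MAX_KEY_BYTES + MAX_VALUE_BYTES  # one fixed-size record slot
    FS_BLOCK = 4096          # typical filesystem block size

    def page_size(slots_per_page):
        # Round a whole number of slots up to a multiple of the FS block,
        # so page writes stay block-aligned and fragmentation stays low
        raw = slots_per_page * SLOT_BYTES
        return int(math.ceil(raw / float(FS_BLOCK))) * FS_BLOCK

    print(page_size(8))  # 8192 -> an 8 KB page holds exactly 8 slots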
I would recommend starting with the simplest possible design (load all indices, align writes, flat schema) to launch your database, then layer on complexity and optimizations as they become needed, but not a moment before. Hope this helps!
Currently I am facing an issue where a MongoDB collection might have billions of records, containing documents logged from rapid events happening in the system.
Since we have 2-3 compound indexes on the same collection, searches definitely become slow.
The way out is that our customer has agreed that if we index only the last N months of data in MongoDB, read efficiency can improve, instead of having 2-3 years of data indexed while we perform read operations.
My thoughts on solution 1: we can use TTL indexes and set an expiry, after which the data gets deleted from the main collection. We can somehow back up the expired records. This way we keep only the required data in the main collection (sketch below).
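A TTL index along those lines might look like this in pymongo (the collection and field names are assumptions):

    from pymongo import MongoClient

    coll = MongoClient().mydb.events

    # Mongo's background TTL monitor removes documents once `createdAt`
    # is older than ~180 days (it runs roughly every 60 seconds)
    coll.create_index("createdAt", expireAfterSeconds=180 * 24 * 3600)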
My thoughts on solution 2: I can drop all the indexes and re-create them based on a time frame, for example, drop the current indexes and create new ones with the condition that only the past N months of data get indexed. This way I maintain a limited index. But I am not sure how feasible this is.
Question: I need more help on how I can achieve selective indexing. It must also be rolling: records get added every day, so the indexing must follow.
If you're on Mongo 3.2 or above, you should be able to use a partial index to create the "selective index" that you want -- https://docs.mongodb.com/manual/core/index-partial/#index-type-partial You'll just need to be sure that your queries share the same partial filter expression as the index.
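For example, in pymongo a partial index might look like the following (the collection/field names and the cutoff are assumptions; note the filter expression is fixed at creation time, so a rolling N-month window means periodically dropping and re-creating the index with a fresh cutoff):

    from datetime import datetime, timedelta
    from pymongo import ASCENDING, MongoClient

    coll = MongoClient().mydb.events
    cutoff = datetime.utcnow() - timedelta(days=90)  # assumed N = 3 months

    # Only documents with createdAt >= cutoff get index entries
    coll.create_index(
        [("createdAt", ASCENDING), ("eventType", ASCENDING)],
        partialFilterExpression={"createdAt": {"$gte": cutoff}},
        name="events_recent_partial",
    )

    # Queries must include a condition the planner can match against the
    # partialFilterExpression, or the partial index won't be used
    coll.find({"createdAt": {"$gte": cutoff}, "eventType": "error"})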
(I suspect that there might also be issues with the indexes you currently have, and that reducing index size won't necessarily have a huge impact on search duration. Mongo indexes are stored in a B-tree, so the time to navigate the tree to find a single item is going to scale relative to the log of the number of items. It might be worth examining the explain output for the queries that you have to see what mongo is actually doing.)
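To check that, something like this shows the winning plan (same assumed collection as above):

    from pymongo import MongoClient

    coll = MongoClient().mydb.events

    # Cursor.explain() returns the plan Mongo chose; look for IXSCAN
    # (index scan) vs COLLSCAN (full collection scan) in the winning plan
    plan = coll.find({"eventType": "error"}).explain()
    print(plan["queryPlanner"]["winningPlan"])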
I have a Users table in DynamoDB that has a unique hash key username. I want, however, to be able to find a specific user in the most efficient way possible by providing either just the username, or just the email (the email is also unique). I can make the email a global secondary index, but I have trouble estimating the additional cost of this approach. Will using the index to retrieve a user result in two read operations? Or how many operations exactly?
Also, I want the read and write throughput of the index to equal those of the table (and ideally, scale automatically). Can I do that by not providing specific throughput values when I create the index with the API, or do I have to provide them?
The number of read operations you will need to retrieve values from the index depends on which values you want to read (all of them vs. just a subset) and on the projection type of the index. If the projection is ALL, then it takes only 1 read, but it may cost more (the index duplicates every attribute). If the projection is KEYS_ONLY, you only get back the table's primary key, and you would then have to fetch from the table again by that key. That takes more than 1 read, but may be cheaper overall. It all depends on your use cases and usage patterns.
See "Attribute Projections" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
I think you need to provide the read capacity and write capacity for the index when it is created - it will not inherit any values from the parent table. Although if the table is using autoscaling, the autoscaling configuration can be automatically applied to the GSI. See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.Console.html#AutoScaling.Console.ExistingTable
In Google App Engine Datastore HRD in Java,
We can't do joins or query multiple tables using the Query object or GQL directly.
I just want to know whether my idea is a correct approach or not.
Suppose we build the index in hierarchical order, like Parent - Child - Grandchild, by node:
Node
- Key
- IndexedProperty
- Set
In case we want to collect all the sub-children & grandchildren, we can collect all the keys matching the hierarchy filter condition and return the resulting keys,
and in Memcache we can hold each key pointing to its DB entity; if the cache does not have an entity, we can still get all the records from the DB in a single query using the set of keys (sketch below).
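The question mentions Java, but as a sketch of that key-based collect step in Python NDB (the kinds and ids are made up); get_multi consults NDB's cache layers first and batches the rest into one Datastore call:

    from google.appengine.ext import ndb

    # Hypothetical keys collected from the hierarchy index
    keys = [ndb.Key("Child", 101), ndb.Key("GrandChild", 202)]

    # get_multi checks NDB's in-context cache and memcache first, then
    # fetches any remaining entities from the Datastore in a single batch
    entities = ndb.get_multi(keys)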
Pros
1) Fast retrieval - Google recommends getting entities by keys.
2) A single transaction is enough to collect data from multiple tables.
3) Memcache and the persistent Datastore will hold the data in the same form.
4) It will scan only the data related to the group, such as a user or parent node.
Cons
1) The metadata of the DB will grow, so the DB size increases.
2) If the index of a single parent grows beyond 1 MB, we have to split it and save it as a blob in the DB.
Is this structure a good approach or not?
If we have deep levels in the hierarchy, this avoids running a lot of query operations to collect all the items dependent on parents.
In case of multiple parents:
Collect all the indexes and get the keys related to the query.
Collect all the data in a single transaction using the list of keys.
If anyone finds more pros or cons, please add them, and weigh in on whether this approach is correct or not.
Many thanks
Krishnan
There are quite a few things going on here that are important to think about:
Datastore is not a relational database. You definitely should not be approaching your data storage from a tables and join perspective. It will lead to a messy and most likely inefficient setup.
It seems like you are trying to restructure your use of Datastore to provide complete transactional and strongly consistent use of your data. The reason Datastore cannot provide this natively is that it is too inefficient to provide these guarantees along with high availability.
With the Datastore, you want to be able to support many (thousands, hundreds of thousands, millions, etc.) writes per second to different entities. The reason the Datastore provides the notion of an entity group is that it allows the developer to specify a specific scope of consistency.
Consider an example todo tracking service. You might define a User and a Todo kind. You wouldn't want to provide strong consistency for all Todos, since every time a user adds a new note, the underlying system would have to ensure that it was put transactionally with all other users writing notes. On the other hand, using entity groups, you can say that a single User represents your unit of consistency. This means that when a user writes a new note, this has to be updated transactionally with any other modification to that user's notes. This is a much better unit of consistency since as your service scales to more users, they won't conflict with each other.
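In Python NDB, that todo example might be modeled like this (a sketch; the names are made up):

    from google.appengine.ext import ndb

    class User(ndb.Model):
        name = ndb.StringProperty()

    class Todo(ndb.Model):
        text = ndb.StringProperty()

    @ndb.transactional
    def add_todo(user_key, text):
        # parent=user_key puts the Todo in that user's entity group, so the
        # transaction only contends with writes to this user's other notes
        Todo(parent=user_key, text=text).put()

    add_todo(ndb.Key(User, "alice"), "buy milk")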
You are talking about creating and managing your own indexes. You almost certainly don't want to do this from an efficiency point of view. Further, you'd have to be very careful since it seems you would have a huge number of writes to a single entity / range of entities which represent your table. This is a known Datastore anti-pattern.
One of the hard parts about the Datastore is that each project may have very different requirements and thus data layout. There is definitely not one size fits all for how to structure your data, but here are some resources:
What actually happens when you do a write to Datastore
How Datastore stores data
Datastore Entity relationship modeling
Datastore transaction isolation
I have a Student entity which already has about 12 fields. Now I want to add 12 more fields (all related to academic details). Should I normalize (as one-to-one) and store them in a different entity, or should I keep adding the information to the Student entity only?
I am using gaesession to store the logged-in user in memory
session = get_current_session()
session['user'] = user
Will this affect the read and write performance/cost of the app? Is the cost of storing an entity in memcache (FE instance) related to the number of attributes stored in the entity?
Generally the costs of either writing two entities or fetching two entities will be greater than the cost of writing or fetching a single entity.
Write costs are associated with the number of indexed fields. If you're adding indexed fields, that would increase the write cost whenever those fields are modified. If an indexed field is not modified and the index doesn't need to be updated, you do not incur the cost of updating that index. You're also not charged for the size of the entity, so from a cost perspective, sticking with a single entity will be cheaper.
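For instance, with NDB the per-property indexing choice looks something like this (a hypothetical Student model):

    from google.appengine.ext import ndb

    class Student(ndb.Model):
        name = ndb.StringProperty()                 # indexed: extra write ops on change
        gpa = ndb.FloatProperty()                   # indexed: extra write ops on change
        bio = ndb.TextProperty()                    # TextProperty is never indexed
        notes = ndb.StringProperty(indexed=False)   # no index rows, cheaper writes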
Performance is a bit more complicated. Performance will be affected by 1) query overhead and 2) the size of the entities you are fetching.
If you have two entities, you're going to suffer double the query overhead, since you'll likely have to query/fetch the base student entity and then issue a second query/fetch for the second entity. There may be ways around this if you are able to fetch both entities by id asynchronously. If you need to query, though, your perf is likely going to suffer whenever you need to query for the 2nd entity.
On the flip side, perf scales negatively with entity size. Fetching 100 1 MB entities will take significantly longer than fetching 100 500-byte entities. If your extra data is large, and you typically query for many student entities at once, then storing the extra data in a separate entity, so that the basic student entity stays small, can increase performance significantly for the cases where you don't need the 2nd entity.
Overall, for performance, you should consider your data access patterns, and try to minimize extraneous data fetching for the common fetching situation. I.e., if you tend to fetch only one student at a time, and you almost always need all the data for that student, then loading all the data won't affect your cost.
However, if you generally pull lists of many students, and rarely use the full data for a single student, and the data is large, you may want to split the entities.
Also, that comment by @CarterMaslan is wrong. You can support transactional updates. It'll actually be more complicated to synchronize if you have parts of your data in separate entities. In that case you'll need to make sure the two entities share a common ancestor in order to update them in one transactional operation (sketch below).
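If you did split the data, a common-ancestor setup in NDB might look like this sketch (the model names are assumptions, and it assumes the details entity already exists):

    from google.appengine.ext import ndb

    class Student(ndb.Model):
        name = ndb.StringProperty()

    class AcademicDetails(ndb.Model):
        gpa = ndb.FloatProperty()

    @ndb.transactional
    def update_student(student_key, name, gpa):
        # The details entity has the student as parent, so both live in the
        # same entity group and one transaction can cover both updates
        student = student_key.get()
        student.name = name
        details = AcademicDetails.query(ancestor=student_key).get()
        details.gpa = gpa
        ndb.put_multi([student, details])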
It depends on how often these two "sets" of data need to be retrieved from the Datastore. As a general principle in GAE, you should de-normalize your data, and thus in your case store all properties in the same model. This will result in more write operations when you store an entity, but will reduce get and query operations.
Memcache is not billable, so you don't have to worry about memcache costs. Also, if you use ndb (and I recommend you do), caching in memcache is handled automatically.