I'm now using Google Datastore for my company's database.
Today I created an index and it is successfully listed under 'Indexes'.
But the size and entity count of the index I created are empty.
The Google Datastore documentation says the index is auto-generated, but it wasn't.
Is there any command or other step needed to generate the index?
The image below is a screenshot.
The upper index is the new one; the lower one is already in use.
As a matter of fact, existing entities will not be indexed automatically. You have to load and re-save all of your old entities (written before the index existed) in order to have the necessary index entries created for them.
Note, however, that changing a property from unindexed to indexed does not affect any existing entities that may have been created before the change. Queries filtering on the property will not return such existing entities, because the entities weren't written to the query's index when they were created. To make the entities accessible by future queries, you must rewrite them to the Datastore so that they will be entered in the appropriate indexes. That is, you must do the following for each such existing entity:

1. Retrieve (get) the entity from the Datastore.
2. Write (put) the entity back to the Datastore.

Similarly, changing a property from indexed to unindexed only affects entities subsequently written to the Datastore. The index entries for any existing entities with that property will continue to exist until the entities are updated or deleted. To avoid unwanted results, you must purge your code of all queries that filter or sort by the (now unindexed) property. (source)
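For a single entity, the get/put round-trip above is all that is needed. A minimal sketch, assuming a Python/ndb app; the kind MyEntity and the numeric id are made-up placeholders:

    from google.appengine.ext import ndb

    entity = ndb.Key("MyEntity", 1234).get()  # retrieve (get) the entity
    entity.put()                              # write (put) it back unchanged; this creates the missing index entries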
Note that the documentation doesn't explicitly say the same for composite indexes. When you deploy a new composite index, it will appear in the developers console as "building" until it reaches the "serving" state. I'm not sure what exactly it builds there; I usually re-saved all my entities and everything worked as it should.
auto-generated is a keyword that tells you whether you created this index manually or whether it was created by the dev server when you made a query that required it. It is in no way linked to how and when the index entries are created for your entities.
The <datastore-indexes> element has an autoGenerate attribute that controls whether this file should be considered along with automatically generated index configuration. See Using Automatic Index Configuration below. (source)
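For illustration, a hypothetical datastore-indexes.xml with autoGenerate enabled; the kind and property names are made up:

    <?xml version="1.0" encoding="utf-8"?>
    <datastore-indexes autoGenerate="true">
        <!-- a manually declared composite index, merged with auto-generated ones -->
        <datastore-index kind="Employee" ancestor="false">
            <property name="lastName" direction="asc" />
            <property name="hireDate" direction="desc" />
        </datastore-index>
    </datastore-indexes>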
When you create a new index and you want it to cover all your existing entities, I recommend writing a cursor query to handle this (sketched below). Usually I expose this query in an admin handler and run it repeatedly until there are no results anymore. Why expose it? If you have lots of entities, this job may run longer than the allowed 60 seconds on a frontend instance or 10 minutes on a backend. By exposing it I can use frontend instance time and don't have to worry about the time restrictions, since each request only processes one batch and hands back a cursor for the next.
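A minimal sketch of such a handler, assuming a Python/ndb app; the kind MyEntity, the batch size, and the function name are made-up placeholders:

    # Re-save entities in batches; each call stays well within request limits.
    from google.appengine.ext import ndb
    from google.appengine.datastore.datastore_query import Cursor

    class MyEntity(ndb.Model):          # hypothetical kind
        name = ndb.StringProperty()

    def reindex_batch(cursor_token=None, batch_size=200):
        start = Cursor(urlsafe=cursor_token) if cursor_token else None
        entities, next_cursor, more = MyEntity.query().fetch_page(
            batch_size, start_cursor=start)
        if entities:
            ndb.put_multi(entities)     # re-writing creates the new index entries
        # hand the cursor back to the admin handler; call again until it is None
        return next_cursor.urlsafe() if more and next_cursor else None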
Related
For example, search engines such as Sphinx and Lucene must merge their indexes periodically, but a database index can be updated dynamically. Why must a search engine's index be merged?
I don't know much about Sphinx, but I believe the answer to this question is not specific to it.
First, why don't databases need periodic reindexing? Because the database is usually the primary data store for an application. By this I mean: whenever you create, delete or update any data, that operation happens on a database record. You remove data there to get rid of it within the application, and you first read data from the database before updating it, since the old version is kept there. All this means the database is updated all the time and your data is always up-to-date there.
Why does a search engine's index need periodic reindexing? The index is basically the search engine's data store: you process your data, put it into the index and then retrieve it through your search system. That index is your secondary data resource. This does not hold for all applications, but most of the time you have the database as the primary resource that is kept in sync with your application as explained above, and then an index in which you do not reflect every change in real time. Over time the data in the index becomes slightly outdated compared to the database. The reindexing step is necessary to keep your data resources consistent.
As I said, this explanation does not hold for all applications, but it should give you the basic idea.
PS: Your question also contains the phrase "index of database", but that is a totally different topic.
I have a standalone Solr server (not SolrCloud) holding documents from a few different sources.
Routinely I need to update the documents for a source. Typically I do this by deleting all documents from that source/group and indexing the new documents for that source, but this creates a time gap in which I have no documents for that source, and that's not ideal.
Some of these documents will probably remain unchanged from one update to the next, some change and could be updated, but some may disappear and need to be deleted.
What's the best way to do this?
Is there a way to delete all documents from a source without committing, and in the same transaction index that source again and only then commit? (That would not create a time gap of no information for that source.)
Is using core swapping a solution? (Or am I over-complicating this?)
It seems like you need a live index that keeps serving queries while you update it, without any downtime. In a way you are partially re-indexing your data.
You can look into maintaining two indices, and interacting with them using ALIASES.
Check this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multiple-indices.html
Although it's on the Elasticsearch website, you can easily apply the concepts in Solr.
Here is another link on how to create/use ALIASES
http://blog.cloudera.com/blog/2013/10/collection-aliasing-near-real-time-search-for-really-big-data/
Collection aliases are also useful for re-indexing – especially when dealing with static indices. You can re-index in a new collection while serving from the existing collection. Once the re-index is complete, you simply swap in the new collection and then remove the first collection using your read side aliases.
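For a standalone server (where SolrCloud collection aliases are not available), the same idea can be approximated with two cores and a core swap. A rough sketch, assuming two pre-created cores named source_live and source_rebuild; the names, URL and function are hypothetical:

    # Index into the offline core, then atomically swap it with the live one,
    # so queries never see an empty index for this source.
    import requests

    SOLR = "http://localhost:8983/solr"

    def rebuild_and_swap(docs):
        # 1. clear the offline core and index the fresh documents into it
        requests.post(SOLR + "/source_rebuild/update", params={"commit": "true"},
                      json={"delete": {"query": "*:*"}}).raise_for_status()
        requests.post(SOLR + "/source_rebuild/update", params={"commit": "true"},
                      json=docs).raise_for_status()
        # 2. swap cores: the rebuilt core now answers queries under the live name
        requests.get(SOLR + "/admin/cores",
                     params={"action": "SWAP", "core": "source_live",
                             "other": "source_rebuild"}).raise_for_status()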
When I started off with my project, I thought there was no need to create indexes on certain fields of my entities, but to generate certain daily reports and statistics we now need indexes on some fields of existing entities.
As explained in the post Retroactive indexing in GAE Datastore, the only way is to first change these properties from unindexed to indexed and then retrieve and write all the entities again.
My question is: if I take a backup from Datastore Admin and restore it after changing the properties to indexed, will my project then have all the required properties indexed, or do I need to retrieve and rewrite the entities through a program?
PS: My project is a Java project on GAE.
Edit: The workaround I mentioned earlier does not work. The only way to change the field is to re-upload the entities. Sorry.
I have an application which requires very flexible searching functionality. As part of this, users will need to have the ability to do full-text searching of a number of text fields, but also to filter by a number of numeric fields that record data which is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text data, and I am aware that I can also add numeric values to these documents for indexing. However, given the dynamic nature of these numeric fields, I would be constantly updating (deleting and recreating) the documents in the Search API index. Even if I allowed the Search API to use older data for a period, it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly since the number of search queries will be considerably smaller than the number of updates to the data.
Is there a way to deal with this dynamic data that is more efficient than constantly revising the search documents?
My only thought so far is to implement a two-step process in which the results of a full-text search are then either used in a query against the NDB datastore or filtered manually in Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well-- either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
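A minimal sketch of that last idea, assuming Python with ndb and the Search API; the Item model, its price field, the index name and the filter are made-up placeholders:

    from google.appengine.api import search
    from google.appengine.ext import ndb

    class Item(ndb.Model):              # key id/name doubles as the search doc id
        name = ndb.StringProperty()
        price = ndb.IntegerProperty()   # the frequently updated numeric field

    ITEM_INDEX = search.Index(name="items")   # hypothetical index name

    def search_items(query_string, max_price=None):
        # full-text search only; numeric filtering happens against the datastore
        results = ITEM_INDEX.search(query_string)
        keys = [ndb.Key(Item, doc.doc_id) for doc in results]
        # get_multi is served from ndb's in-context/memcache caches when possible,
        # so fresh numeric values don't require rewriting the search documents
        items = ndb.get_multi(keys)
        if max_price is None:
            return items
        return [item for item in items if item and item.price <= max_price]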
I am trying to understand how the Google App Engine (GAE) datastore is designed and how to use it. I am having a bit of a hard time visualising the structure from the description on the getting started page.
Can somebody explain the datastore using figures, for us visually oriented people? Or point to a good tutorial, again with visual learning in mind?
I am specifically looking for answers with diagrams/figures that explains how GAE is used.
The 2008 IO session "Under the Covers of the Google App Engine Datastore" has a good visual overview of the datastore.
https://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore
http://snarfed.org/datastore_talk.html
For more IO talks go to:
https://developers.google.com/appengine/docs/videoresources
Very simplified, I've understood that the GAE datastore can be viewed as a hashmap of hashmaps.
That said, you could view it like this:
I guess there's no correct answer here, just different mental models. Depending on your programming background you may find mine enlightening, disturbing or both.

I picture the datastore as a single huge distributed key-value collection of buckets that comprises all entity data of any kind, in any namespace, across all GAE apps of all users. A single bucket is called an entity group. It has a root key which (under the hood) consists of your appID, a namespace, a kind, and an entity ID or name. In an entity group reside one or more entities whose keys extend the root key. The entity belonging to the root key itself may or may not exist. Operations within a single entity group are atomic (transactional). An entity is a simple map-like data structure.

The two built-in indexes (ascending and descending) are again two giant sorted collections of index entries. Each index entry is a data structure of appID, namespace, kind, property name, property type, property value, entity key, in that order. Each (auto-)indexed value of each property of each entity creates two such index entries. There's another index with just entity keys in it.

Custom indexes, however, go into yet another sorted collection, with entries containing appID, namespace, index type, combined index value, entity key. That's the only part of the whole datastore that uses metadata: it stores an index definition which tells the store how the combined index value is formed from the entity. This is the picture that's burnt into my mind and from which I know how to make the datastore happy.
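To make that mental model concrete, here is a toy illustration (not actual Datastore internals; the app, kind and property names are invented) of index entries as tuples in one big sorted collection, and of a query as a range scan over it:

    # Each indexed property value becomes one entry in a giant sorted collection:
    # (appID, namespace, kind, property name, property type, value, entity key)
    ascending_index = sorted([
        ("my-app", "", "Person", "age", "INT64", 31, ("Person", 1001)),
        ("my-app", "", "Person", "age", "INT64", 42, ("Person", 1002)),
        ("my-app", "", "Person", "name", "STRING", "Alice", ("Person", 1001)),
    ])

    # A query like "Person where age >= 40" is a contiguous scan over the
    # entries for (kind="Person", property="age"), yielding entity keys.
    hits = [entry[-1] for entry in ascending_index
            if entry[2] == "Person" and entry[3] == "age" and entry[5] >= 40]
    print(hits)  # [("Person", 1002)]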