Appengine search index and namespaces - google-app-engine

I am writing a multi-tenant application using appengine namespaces. We need a separate index per tenant for searching employees (to avoid the 10GB limit of a search index).
If I create a search index "employees" (in Go, search.Open("employees")) and index the following docs (using the Go API, index.Put(ctx, id, doc)):
doc1 from tenant 1 with namespace "abc" and
doc2 from tenant 2 with namespace "xyz"
do these docs go into a single index or two different indexes in two different namespaces? I want to make sure that I am not hitting the 10GB limit.
thanks

The data is partitioned by namespace.
If you put documents into "employees" while one namespace is set, they are stored in the "employees" index under that namespace.
If you then set a different namespace, the data is stored in that namespace only.
I think that is what you are asking.

Related

When should I use ObjectId vs UUID in MongoDB

I'm making a simple CRUD application with MongoDB so I can learn more about it.
The application is a simple blog, I have a collection named "articles" which stores various documents, each one representing a post for my blog.
When I display the list of all blog posts, I can do a db.collection.find(), and list all of them.
But the question lies when I need to show a single post individually, when I need to query the collection for a single, specific document.
The logical solution would be to use an RDBMS and an auto-increment feature, but MongoDB is NoSQL and does not have auto-increment.
I'm using the auto-generated _id field of the document, which stores an ObjectId by default, which means that my URLs look like this:
http://localhost/blog/article.php?_id=5d41f6e5fc1a2f3d80645185
I saw in the documentation that the ObjectId contains a unique identifier for the server, together with a timestamp and a counter. Isn't exposing these things a security risk?
As a solution, I stumbled into UUID https://docs.mongodb.com/manual/reference/method/UUID/ which is an auto-generated unique ID, that doesn't expose timestamp and machine info in it. It seems like a logical solution to use this instead of the _id that contains my ObjectId for querying and finding a document.
So I can make my URLs look like this:
http://localhost/blog/article.php?_id=23829651-26f7-4092-99d0-5be8658c966e
But still, should I keep the _id property? Should I add another one called "id" that stores the UUID? Should I even use UUIDs at all?
Here's what I would consider before choosing an identifier:
Collision
Risk of collision is very low for both UUIDs and ObjectIDs. This has been discussed in detail in another question.
Nature
Random (version 4) UUIDs have no ordering, whereas ObjectID values increase over time. This makes ObjectIDs a poor choice for a range-based shard key, because new inserts all land on the same shard.
Other uses
ObjectIDs contain the creation timestamp and can be used as a substitute for the commonly used createdAt field. A sort by ObjectID is a sort by creation time.
Insecure object references (OWASP)
Short definition: an attacker who has the ID of one object should not be able to deduce the ID of another object. You can read more about this here. Neither UUIDs nor ObjectIDs are vulnerable to this.
Link to another question that discusses the security of ObjectIDs (thanks zbee).
Ease of use
Note: This is subjective
Using ObjectIDs is a lot easier in the Mongo ecosystem. The special aggregation operators for dealing with ObjectIDs, plus library support, add to this.
Portability
UUIDs are more portable than ObjectIDs. I do not know of any other system that uses ObjectIDs internally except for Mongo. Whereas there are other DBs such as Postgres that have a special data type for UUIDs + extensions for random generation etc.
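To make the "Other uses" and security points concrete, here is a minimal sketch using only the Python standard library. It assumes the documented ObjectId layout, in which the first 4 bytes are a big-endian Unix timestamp in seconds; the helper name objectid_timestamp is hypothetical, and the ObjectId is the one from the question's URL.

```python
import uuid
from datetime import datetime, timezone


def objectid_timestamp(oid_hex: str) -> datetime:
    """Return the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) hold a big-endian Unix timestamp in seconds.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Anyone who sees the URL can recover when the article was created.
created = objectid_timestamp("5d41f6e5fc1a2f3d80645185")
print(created.isoformat())

# A version 4 UUID is built from random bits and carries no timestamp.
random_id = uuid.uuid4()
print(random_id, random_id.version)
```

Whether leaking the creation time matters depends on the application; for a public blog post it is usually shown on the page anyway.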

Google App Engine (datastore) - will a deleted key regenerate?

I've got a simple question about datastore keys. If I delete an entity, is there any possibility that the same key will be created again? Or is each key unique, so that it can be generated only once?
Thanks.
It is definitely possible to re-use keys.
Easy to test, for example using the datastore admin page:
create an entity for one of your entity models using a custom/specified key name and some property values
delete the entity
create another one using the same key name and different property values...
As for the keys with auto-generated IDs it is theoretically possible, but I guess rather unlikely due to the high number of possibilities. From Assigning identifiers:
Cloud Datastore can be configured to generate auto IDs using two different auto id policies:
The default policy generates a random sequence of unused IDs that are approximately uniformly distributed. Each ID can be up to 16 decimal digits long.
The legacy policy creates a sequence of non-consecutive smaller integer IDs.

Issues understanding Google App Engine key

I'm looking at the GAE example for datastoring here, and among other things this confused me a bit.
def guestbook_key(guestbook_name=DEFAULT_GUESTBOOK_NAME):
    """Constructs a Datastore key for a Guestbook entity with guestbook_name."""
    return ndb.Key('Guestbook', guestbook_name)
I understand why we need the key, but why is 'Guestbook' necessary? Is it so you can query for all 'Guestbook' objects in the datastore? But if you need to search a datastore for a type of object, why isn't there a query(type(Greeting)), considering that Greeting is the ndb.Model that you are putting in?
Additionally, if you are feeling generous: why do you have to set parent when creating the object you are storing?
greeting = Greeting(parent=guestbook_key(guestbook_name))
First: GAE Datastore is one big distributed database used by all GAE apps concurrently. To distinguish entities GAE uses system-wide keys. A key is composed of:
Your application name (implicitly set, not visible via API)
Namespace, set via Namespace API (if not set in code, then an empty namespace is used).
Kind of entity. This is just a string and has nothing to do with types at database level. Datastore is schema-less so there are no types. However, language based APIs (Java JDO/JPA/objectify, Python NDB) map this to classes/objects.
Parent keys (afaik, serialised inside key). This is used to establish entity groups (defining scope of transactions).
A particular entity identifier: name (string) or ID (long). They are unique within namespace and kind (and parent key if defined) - see this for more info on ID uniqueness.
See Key methods (java) to see what data is actually stored within the key.
Second: It seems that GAE Python API does not allow you to query Datastore without defining classes that map to entity kind (I don't use GAE Python, so I might be wrong). Java does have a low-level API that you can use without mapping to classes.
Third: You are not required to define a parent to an entity. Defining a parent is a way to define entity groups, which are important when using transactions. See ancestor paths and transactions.
That's what a key is: a path consisting of pairs of kind and ID. The key is what identifies what kind it is.
I don't understand your second question. You don't have to set a parent, but if you want to set one, you can only do it when creating the entity.
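The "path of kind and ID pairs" idea can be sketched in plain Python without the ndb library. ToyKey and its methods are hypothetical names made up for this illustration, not the ndb API, and the guestbook name and ID are invented values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

Identifier = Union[str, int]  # a key name (string) or a numeric ID


@dataclass(frozen=True)
class ToyKey:
    """Hypothetical stand-in for a Datastore key: a path of (kind, id) pairs."""
    path: Tuple[Tuple[str, Identifier], ...]

    def child(self, kind: str, ident: Identifier) -> "ToyKey":
        # A child key is the parent's path plus one more (kind, id) pair.
        return ToyKey(self.path + ((kind, ident),))

    @property
    def kind(self) -> str:
        # The kind of the entity is the kind in the last pair of the path.
        return self.path[-1][0]

    @property
    def parent(self) -> Optional["ToyKey"]:
        return ToyKey(self.path[:-1]) if len(self.path) > 1 else None

    @property
    def root(self) -> "ToyKey":
        # Entities sharing a root key belong to the same entity group.
        return ToyKey(self.path[:1])


guestbook = ToyKey((("Guestbook", "default_guestbook"),))
greeting = guestbook.child("Greeting", 42)
print(greeting.kind, greeting.parent == guestbook, greeting.root == guestbook)
```

In this model the 'Guestbook' string is simply the kind in the first pair of the path; no Guestbook entity has to exist for the key to serve as a parent.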

Avoid default index but keep explicitly defined index in AppEngine?

I have some properties that are only referenced in queries that require composite indices. AppEngine writes all indexed properties to their own special indexes, which requires 2 extra write operations per property.
Is there any way to specify that a property NOT be indexed in its own index, but still be used for my composite index?
For example, my entity might be a Person with properties name and group. The only query in my code is select * from Person where group = <group> and name > <name>, so the only index I really need is with group ascending and name ascending. But right now AppEngine is also creating an index on name and an index on group, which triples the number of write operations required to write each entity!
I can see from the documentation how to prevent a property from being used for indexing at all, but I want to turn off indexing only for a few indexes (the default ones).
From what I understand currently, you can either disable indexing on a property altogether (which includes composite indexes) or you are stuck with having all indexes (automatic indexes + your composite indexes from index.yaml).
There was some discussion about this in the GAE google group and a feature request to do exactly what you are suggesting, but I wasn't able to find it. Will update the answer when I get home and search more.

How to design a database schema for a search engine?

I'm writing a small search engine in C with curl, libxml2, and mysql. The basic plan is to grab pages with curl, parse them with libxml2, then iterate over the DOM and find all the links. Then traverse each of those, and repeat, all while updating a SQL database that maintains the relationship between URLs.
My question is: how can I best represent the relationship between URLs?
Why not have a table of base URLs (i.e. www.google.com/) and a table of connections, with these example columns:
starting page id (from url table)
ending page id (from url table)
the trailing directory of the urls as strings in two more columns
This will allow you to join on certain urls and pick out information you want.
Your solution seems like it would be better suited to a non relational datastore, such as a column store.
Most search engine indices aren't stored in relational databases, but are kept in memory so as to minimize retrieval time.
Add two fields to table - 'id' and 'parent_id'.
id - unique identifier for URL
parent_id - link between URLs
If you want to have a single entry for each URL then you should create another table that maps the relationships.
You then lookup the URL table to see if it exists. If not create it.
The relationship table would have
SourceUrlId,
UrlId
Where the SourceUrlId is the page and the UrlId is the url it points to. That way you can have multiple relationships for the same URL and you won't need to have a new entry in the Url table for every link to that url. Will also mean only 1 copy of any other info you are storing.
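The two-table layout described above could look roughly like the following sketch. It uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained; the table names, column names, helper functions and example URLs are assumptions made for illustration. In MySQL the "INSERT OR IGNORE" statement would be written "INSERT IGNORE".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url (
    id  INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE          -- one row per distinct URL
);
CREATE TABLE link (
    source_url_id INTEGER NOT NULL REFERENCES url(id),  -- page containing the link
    url_id        INTEGER NOT NULL REFERENCES url(id),  -- page being linked to
    PRIMARY KEY (source_url_id, url_id)
);
""")


def get_or_create_url(conn, url):
    """Look the URL up; create its row only if it does not exist yet."""
    row = conn.execute("SELECT id FROM url WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    return conn.execute("INSERT INTO url (url) VALUES (?)", (url,)).lastrowid


def add_link(conn, source, target):
    """Record that page `source` links to page `target` (duplicates are ignored)."""
    conn.execute(
        "INSERT OR IGNORE INTO link (source_url_id, url_id) VALUES (?, ?)",
        (get_or_create_url(conn, source), get_or_create_url(conn, target)),
    )


add_link(conn, "http://a.example/", "http://b.example/")
add_link(conn, "http://c.example/", "http://b.example/")
add_link(conn, "http://a.example/", "http://b.example/")  # same link seen again

url_count = conn.execute("SELECT COUNT(*) FROM url").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM link").fetchone()[0]
print(url_count, link_count)
```

Two pages link to the same target here, yet the target has a single row in the url table, which is the point of keeping the relationships in their own table.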
Why are you interested in representing the page graph?
If you want to compute the ranking, then it's better to have a more succinct and efficient representation (e.g., matrix form if you want to compute something similar to PageRank).
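As a rough illustration of what "something similar to PageRank" needs from the representation, here is a simplified power-iteration sketch over an adjacency list, which is a sparse form of the link matrix. The function name, the damping value of 0.85, the iteration count and the three-page graph are all assumptions for this example, not a production ranking algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

The only input the iteration needs is "which pages does each page link to", so a relational link table can be exported into this form before ranking.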
