Google Search API Wildcard - google-app-engine

I have a Python project running on Google App Engine, with a set of data currently stored in the datastore. On the client side, I fetch the records from my API and show them to the user in a Google Visualization table with client-side search. Because of the limitations, I can only fetch 1000 records per query, but I want my users to be able to search across all the records I have. I could fetch them with multiple queries before showing them, but fetching 1000 records already takes 5-6 seconds, so the whole process could exceed the 30-second timeout, and I don't think putting around 20,000 records in a table is a good idea.
So I decided to put my records into the Google Search API and wrote a script to sync the important data between the datastore and a Search API index. When performing a search, though, I couldn't find anything like a wildcard character. For example, let's say I have a user field that stores the string "Ilhan". When a user searches for "Ilha", that record doesn't show up. I want to show records containing the value "Ilhan" even if it is only partially typed, so the SQL equivalent of my search would be something like "select * from users where user like '%ilh%'".
I wonder if there is a way to do that, or is this not how the Search API works?

I set up similar functionality purely within the datastore: a repeated computed property that contains all the search substrings that can be formed for a given object.
class User(ndb.Model):
    # ... other fields
    search_strings = ndb.ComputedProperty(
        lambda self: [i.lower() for i in all_substrings(strings=[
            self.email,
            self.first_name,
            self.last_name,
        ])],
        repeated=True)
Your search query would then look like this:
User.query(User.search_strings == search_text.strip().lower()).fetch_page(20)
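Note that all_substrings is not an NDB built-in; it's a helper the answer assumes. A minimal sketch, assuming it should yield every contiguous substring of each input at or above some minimum length:

def all_substrings(strings, min_length=3):
    # min_length is an assumption: it trades index size against how
    # few characters a user must type before matches appear.
    for s in strings:
        if not s:
            continue
        for start in range(len(s)):
            for end in range(start + min_length, len(s) + 1):
                yield s[start:end]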
If you don't need the other features of the Google Search API, and if the number of substrings per entity won't put you at risk of hitting the 900 properties limit, then I'd recommend doing this instead, as it's simple and straightforward.
As for taking 5-6 seconds to fetch 1000 records: do you need to fetch that many? Why not fetch only 100, or even 20, and use a query cursor so the user pulls the next page only if they need it.
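For example, a cursor-based page fetch with NDB might look like this (a sketch; the page size and parameter names are illustrative):

from google.appengine.ext import ndb

def get_users_page(cursor_websafe=None, page_size=20):
    # Resume from the cursor the client sent back, if any.
    cursor = ndb.Cursor(urlsafe=cursor_websafe) if cursor_websafe else None
    users, next_cursor, more = User.query().fetch_page(
        page_size, start_cursor=cursor)
    # Hand next_cursor.urlsafe() to the client so it can request
    # the following page only when the user asks for it.
    return users, next_cursor.urlsafe() if (more and next_cursor) else None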

Related

Limit the number of rows counted in Google Data Studio

I have a scorecard that looks at the number of URL clicks driven by all queries, which works as expected. I am now trying to display the number of clicks driven by just the top 10 queries. I was able to limit the number of rows in my table by disabling pagination to show only the top 10 queries, but now I'm looking to sum the clicks in a scorecard to provide a quick summary rather than a table.
I don't think what you want to do is possible dynamically via just the Search Console connector. Google Data Studio does not provide any way to calculate rankings via calculated fields, so there's no way for you to know which query is in the top 10 without looking at a sorted table. A few imperfect alternatives (roughly in order of increasing complexity):
You could apply a filter so that the scorecard only aggregates values above a certain threshold. This would be hardcoded, so you would be filtering on Clicks (i.e. aggregate all URL clicks above 100).
You could apply a filter to the scorecard so that it only aggregates clicks from the top 10 URLs. This would not be a dynamically updating filter, so you'd have to look at the table to see which URLs are in the top 10, and that set would change as time goes on. The filter would end up being something like: "Include URLs Contains www.google.com,www.stackoverflow.com"
If you don't mind using Google Sheets as an intermediary, you could dump your Search Console data into a spreadsheet so that you can manipulate it however you like, and then use the spreadsheet as the data source for Data Studio (as opposed to the Search Console connector). There appear to be some add-ons that work out of the box, although I haven't used them myself, so I'm not sure how difficult they are. Alternatively, you can build something yourself with Google Apps Script and the Search Console API (see the sketch after this list).
You could build a custom Data Studio Community Visualization. (Just because they are called 'Community Visualizations' does not mean you have to make them publicly available.) Essentially, you would be building a scorecard-like component that aggregates the data according to your own rules, although this requires more coding experience. (Before you build one, check whether something like what you need already exists in the gallery; at a quick glance, I don't see anything that would meet your needs.)
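For the API route, a hedged sketch in Python using the google-api-python-client (the site URL, date range, and credential setup are placeholders; the Search Analytics API returns rows sorted by clicks in descending order):

from googleapiclient.discovery import build

# Assumes `creds` holds OAuth credentials authorized for Search Console.
service = build('webmasters', 'v3', credentials=creds)

response = service.searchanalytics().query(
    siteUrl='https://www.example.com/',   # placeholder property
    body={
        'startDate': '2019-01-01',        # illustrative date range
        'endDate': '2019-01-31',
        'dimensions': ['query'],
        'rowLimit': 10,                   # top 10 queries by clicks
    }).execute()

top10_clicks = sum(row['clicks'] for row in response.get('rows', []))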

Google App Engine - help in query optimization

I have run into a scenario while running a query in App Engine which is increasing my cost considerably.
I am writing the query below to fetch book names:
Iterable<Entity> entities =
    datastore.prepare(query).asIterable(DEFAULT_FETCH_OPTIONS);
After that I run a loop to match the names against the name the user has requested. This causes data reads for all the books in the datastore, and with book details increasing day by day, it impacts the cost further, since the entire list is read every time.
Is there an alternative that fetches data only for the book the user requested, so that I don't have to read the complete datastore? Will SQL help, or filters? I would appreciate it if someone could provide the query.
You have two options:
If you match the title exactly, make it an indexed field and use a filter to fetch only books with exactly the same title.
If you search within titles too:
a. You can use Search API to index all titles and use it to find the books your users are looking for.
b. A less optimal but quick solution is to create a projection query that reads only the book titles.
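The question's snippet uses the Java low-level API, but here is a sketch of option 1 and option b in Python NDB (the model and requested_name are illustrative):

from google.appengine.ext import ndb

class Book(ndb.Model):
    name = ndb.StringProperty()  # indexed by default

requested_name = 'Some Book Title'  # placeholder for the user's input

# Option 1: exact-match filter; the index does the matching, so only
# matching entities are read.
books = Book.query(Book.name == requested_name).fetch(20)

# Option b: a projection query that reads only the titles rather than
# full entities.
titles = Book.query().fetch(projection=[Book.name])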

Clarify the usage of AppEngine Search API

I have started trying out the new Search API, and the demo runs smoothly. However, there are some points I am still confused about, being an outsider to the search world.
First of all: how do you build a document? Obviously you can't hard-code each line into a document, but what else can I do? Say I have a User class (I'm using Java, but I guess Python makes no difference here), and I want to add the user's information to a document so that I can do a full-text search against the address field.
class User {
    String username;
    String password;
    String address;
}
In my datastore I have 10000 instances of this entity, and if I need to build the document index, do I have to:
Step 1: retrieve the 10000 instances from the datastore
Step 2: iterate through each user entity and create 10000 documents
Step 3: add all 10000 docs to an index, after which I will be able to search
Please correct me if the three steps I mentioned above are wrong.
If that is the case, does it mean that later, each time a new user registers, we need to create a new document and add it to the index?
Unfortunately I haven't played around with it that much, but I have learned a few things.
When first implementing it, I had to create a lot of documents as well (as you describe), but I kept running into deadline exceptions, so I ended up using the task queue to build documents for all my old records.
Remember to create a cross-reference between the search document and your datastore entity, so that you can easily update the document record and, from a search result, get the matching entity.
For the cross-reference, add a new property on your datastore model called something like search_document_id where you store the doc_id (I prefixed all my doc_ids with the datastore model name), and add a text field on your document containing the entity key as a string.
But in a nutshell, I would say you are correct.
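The question is in Java, but a Python sketch of the cross-reference pattern described above (the index name, field names, and doc_id scheme are illustrative) might look like:

from google.appengine.api import search

def index_user(user):
    # doc_id derived from the entity key, prefixed with the model
    # name as suggested above.
    doc_id = 'User_%s' % user.key.id()
    document = search.Document(
        doc_id=doc_id,
        fields=[
            search.TextField(name='address', value=user.address),
            # Store the entity key on the document so a search hit
            # can be resolved back to the datastore entity.
            search.TextField(name='entity_key', value=user.key.urlsafe()),
        ])
    search.Index(name='users').put(document)
    # Store the doc_id on the entity for the reverse lookup
    # (assumes the model defines search_document_id as suggested).
    user.search_document_id = doc_id
    user.put()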

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query [1]), but that still seems strange to me, because I would have to get the entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains, the first string being the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/stackoverflow.com and will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency, so I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped. The problem is that these are pretty big quantities, and I have to figure out a way to keep costs down while adding/removing 2 * ~100000 domains per day.
Note: there is hardly any computation going on, as an availability request simply checks whether the domain exists in the datastore!
[1]: 'A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then, when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the datastore first:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete(db.Key.from_path('UnavailableDomain', name) for name in free_domains)
I would still recommend batching the deletes into something like 200 per RPC if your free_domains list is really big.
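A minimal sketch of that batching (200 is the suggestion above, not a hard limit):

from google.appengine.ext import db

def delete_in_batches(names, batch_size=200):
    # Build the keys directly (no fetch needed) and delete in chunks.
    keys = [db.Key.from_path('UnavailableDomain', name) for name in names]
    for i in range(0, len(keys), batch_size):
        db.delete(keys[i:i + batch_size])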
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could utilise both to:
Create a pipeline for the overall task that you run via cron every 24 hours.
The 'overall' pipeline would start a mapper that filters your entities and yields the delete operations.
After the delete mapper completes, the 'overall' pipeline could call an 'import' pipeline to start the entity-creation part.
The pipeline API can then send you an email to report on its status.
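A hedged sketch of such a delete mapper (the is_dropped predicate is hypothetical, and the mapper would be registered over UnavailableDomain in mapreduce.yaml):

from mapreduce import operation as op

def delete_dropped_domain(entity):
    # Yield a delete operation for each matching entity; the
    # framework batches and executes the actual deletes.
    if is_dropped(entity):  # hypothetical predicate
        yield op.db.Delete(entity)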

Using search server with cakephp

I am trying to implement a customized search in my application. The table structure is given below.
main table:
teacher
sub tables:
skills
skill_values
cities
city_values
The search will be triggered by location, which is stored in the table city_values with reference fields user_id and city_id. The name of the city and its latitude and longitude are found in the table cities.
The search also includes skills; the table relations are similar to those for city. The users table and the skill_values table are related by the field user_id in skill_values, and the skills table and the skill_values table are related by the field skill_id in skill_values.
We need to find the location of the user who performs the search and filter the results to within a 20-mile radius. There are a few other filters as well.
My problem is that I need to filter these results without a page reload, so I am using AJAX, but as the number of records increases, my AJAX request will take a lot of time to get a response.
Would it be a good idea to use an open-source search server like Sphinx or Solr for fetching results from the server?
I am using CakePHP for development and my application is hosted on a cloud server.
... but as the number of records increases, my AJAX request will take a lot of time to get a response.
Regardless of the search technology, there should be a pagination mechanism of some kind.
You should therefore be able to set the limit or maximum number of results returned per page.
When a user performs a search query, you can use Javascript to request the first page of results.
You can then simply increment the page number to request the second, third, fourth page, etc.
This should mean that the top N results always appear in roughly the same amount of time.
It's then up to you to decide whether you want to request each page of search results sequentially (i.e. as the callback for each successful response), or wait for some kind of user input (i.e. clicking a 'more' link or scrolling to the end of the results).
The timeline/newsfeed pages on Twitter or Facebook are a good example of this technique.
