This is a question about the limit on putting a large number of documents into the Search API. I intend to put 2057 documents (paragraphs from a text file). When I parse each paragraph from the text file, create a document for it, and put it into the index one at a time, the app seems to run forever and stops responding. What could be the reason for this behavior?
With regards
I researched the documentation and found the following method:
put(java.lang.Iterable<Document> documents)
My way of importing is like this:
1. I collect the documents to be put into the index in a collector (an ArrayList or other List) until it holds 200 documents (the per-call limit imposed by GAE).
2. Pass that collector to this method, as sketched below.
In my case, this made the puts roughly 100 times faster.
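A minimal sketch of that batching approach, assuming an index named "paragraphs" and a single text field named "content" (both names are placeholders for whatever your app actually uses):

import com.google.appengine.api.search.Document;
import com.google.appengine.api.search.Field;
import com.google.appengine.api.search.Index;
import com.google.appengine.api.search.IndexSpec;
import com.google.appengine.api.search.SearchServiceFactory;
import java.util.ArrayList;
import java.util.List;

public class ParagraphIndexer {
    private static final int BATCH_LIMIT = 200;   // GAE accepts at most 200 documents per put()

    public static void indexParagraphs(List<String> paragraphs) {
        Index index = SearchServiceFactory.getSearchService()
                .getIndex(IndexSpec.newBuilder().setName("paragraphs").build());
        List<Document> batch = new ArrayList<>();
        for (String paragraph : paragraphs) {
            batch.add(Document.newBuilder()
                    .addField(Field.newBuilder().setName("content").setText(paragraph))
                    .build());
            if (batch.size() == BATCH_LIMIT) {
                index.put(batch);                  // one RPC for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            index.put(batch);                      // flush the final partial batch
        }
    }
}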
I'm trying to implement an Azure suggester feature in our pilot Azure search app and running into issues. The content I'm indexing is PDF files, so my suggester definition is based on the content field itself, which can be thousands of lines of text. Following examples online, when I implement the suggester I'm returned the entire body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type "Dum" into my search field; I'd like to see suggested results like "Dumbledore", "Dementor", etc., rather than the whole book. Is this possible?
Tks
If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, the Suggest API helps users find the documents containing words with that prefix; it returns text snippets containing those words.
If you still believe the Suggest API does not behave as expected and Autocomplete is not suitable, let me know your source document, query, and expected results.
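For reference, a minimal sketch of an Autocomplete call using the plain JDK HTTP client; the service name, index name, suggester name "sg", and api-version here are placeholders to swap for your own values and key:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AutocompleteDemo {
    public static void main(String[] args) throws Exception {
        // Ask the suggester for terms starting with "dum" instead of whole documents.
        String body = "{\"search\": \"dum\", \"suggesterName\": \"sg\", \"autocompleteMode\": \"oneTerm\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://my-service.search.windows.net/indexes/books/docs/autocomplete?api-version=2020-06-30"))
                .header("Content-Type", "application/json")
                .header("api-key", "<query-key>")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The "value" array in the response holds completed terms such as "dumbledore".
        System.out.println(response.body());
    }
}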
I have a Solr/Lucene setup where I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However, I would like to return a snippet from within the content of the document showing where the matching line is (+/- 5 words from the match term). I have tried to follow a range of Google hits, but my index does not seem to have direct access to the "content".
Can anyone give me some basic and simple pointers on where I might have made an error here? I have based all my work so far on the guidance and examples in the Solr Reference Guide, so I am not sure whether the issue is in the search parameters or in the original index.
I am doing this to create a clear set of user requirements for building an end solution rather than creating the end solution myself, so I am no expert on the tools and do not need to become one, just need to evidence what is possible with this tool set.
As MatsLindh noted above, the issue was that the config was not carrying the actual content of the Tika parse across into a specific field, so there was no full text content to display and highlight.
To resolve this I followed the link (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) to the guidance documents, reviewed the part on fmap, and used the example given for Last Modified Date as a guide on what to apply.
I then went to my solrconfig.xml file in the relevant core folder and added in the following line in the code beneath an already present fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field for my core via the Solr web interface. I then re-ran my indexing line via a command prompt, and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis.
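For anyone following along, a minimal sketch of the kind of query that returns the highlighted snippets once the content lands in the field; the core name "mycore" and the query term are placeholders for my setup, and hl.fragsize roughly controls how much surrounding text comes back with each snippet:

http://localhost:8983/solr/mycore/select?q=testcontent:report&hl=true&hl.fl=testcontent&hl.snippets=1&hl.fragsize=80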
Thanks all for the input on this - there is still a lot more I want to test to help develop a clear requirement set, but this really helps prove some of the basics are not complicated.
I'm trying to find a Wikipedia dump containing page IDs and titles. I don't want to request them at runtime or fetch 2000 per request; I want it ALL. I want to make a long list of all the page IDs and the titles belonging to them and put them into my own database, so that I can use it in an application that requests the data from my own database.
Does anybody know which dumps contain that information? It doesn't matter if they also contain more information than what I need - I can just write an app that picks out the info I need.
I did try to request it ... it would have taken 140 days, and they put up some limit of 2700 requests ... so it would take forever to get the whole thing. Instead I want to download a dump file, clean the data, and upload a file containing only the info I need to my own database.
OK, I found it myself after getting multiple dumps. In short, the answer is:
enwiki-latest-page.sql.gz
It contains page IDs and titles.
Entries look like this:
(1217768,0,'Black_River_(South_Carolina)','',0,0,0,0.6285160577990001,'20161001141146','20161001142916',738899573,1654,'wikitext')
The first number is the pageId. The third entry is the title.
The rest I don't know what it is - but no matter :D Thanks to myself, I solved this issue and will close it :D Big pat on the back
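In case it helps someone else, a rough sketch of pulling the pageId/title pairs out of the dump. It assumes each INSERT line contains tuples shaped like the entry above, with the id first, the namespace second, and the title third, and it keeps only namespace 0 (regular articles):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;

public class PageDumpScan {
    // Rough pattern for the start of each tuple: (pageId, namespace, 'title', ...
    private static final Pattern TUPLE = Pattern.compile("\\((\\d+),(\\d+),'((?:[^'\\\\]|\\\\.)*)'");

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("enwiki-latest-page.sql.gz")),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.startsWith("INSERT INTO")) continue;   // skip DDL and comments
                Matcher m = TUPLE.matcher(line);
                while (m.find()) {
                    long pageId = Long.parseLong(m.group(1));
                    String namespace = m.group(2);
                    String title = m.group(3).replace("\\'", "'");
                    if ("0".equals(namespace)) {                 // keep only main-namespace pages
                        System.out.println(pageId + "\t" + title);
                    }
                }
            }
        }
    }
}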
I want to implement auto suggest functionality in Google App Engine (GAE/GWT).
The client side of the implementation works fine with GWT SuggestBox and RPC.
My main issue is the server side of the implementation. I tried the Google Search API, but it seems there is a limit of 250 MB of total indexed data, and searches match complete words rather than parts of each word!
How should I approach this? I read that Lucene and Solr are not supported on GAE.
I would appreciate your thoughts on this.
You can achieve a basic text search using the techniques described here: http://googlecode.blogspot.com.br/2010/05/google-app-engine-basic-text-search.html
In short:
Build a query using content >= yourQuery && content < yourQuery + "\ufffd", where the content property of your entity can be a String or a List of Strings.
I've taken this approach and it works fine for me:
Split the text into separate words. Get rid of duplicates, special characters, and short words (in, of, and, etc.).
Add this list of words to the entity as a list property.
Search via a text range query: listProperty >= wordPart && listProperty < wordPart + "\ufffd" (see the sketch below).
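A minimal sketch of that range query with the low-level Datastore API; the kind name "Suggestion" and the list property name "keywords" are placeholders for whatever your entities actually use:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import java.util.List;

public class SuggestQuery {
    public static List<Entity> suggestionsFor(String prefix) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        // Matches entities whose "keywords" list holds at least one word starting with the prefix.
        Query query = new Query("Suggestion").setFilter(CompositeFilterOperator.and(
                new FilterPredicate("keywords", FilterOperator.GREATER_THAN_OR_EQUAL, prefix),
                new FilterPredicate("keywords", FilterOperator.LESS_THAN, prefix + "\ufffd")));
        PreparedQuery pq = datastore.prepare(query);
        return pq.asList(FetchOptions.Builder.withLimit(10));
    }
}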
I run IMDbAPI.com and have been using Bing's Search API for finding IMDb IDs from title searches. Bing is currently changing their API over to the Azure Marketplace (August 1st), and it is no longer available for free. I started testing my API using Freebase to resolve these IDs and hit their 100k limit in the first 8 hours (my site currently gets about 3 million requests a day, but only 200-300k are title searches).
This is exactly why they offer the data dump files.
I downloaded most of the files in the Film folder but cannot find where they are storing the "/authority/imdb/title" IMDb ID namespace data.
https://www.googleapis.com/freebase/v1/mqlread?query={"type":"/film/film","name":"True%20Grit","imdb_id":null,"initial_release_date>=":"1969-01","limit":1}
This is how I'm currently accessing the ID.
Does anyone know which file contains this information, and how to link back to it from the film title/ID?
That imdb_id property is backed by a key in the /authority/imdb/title namespace, so you're looking for the line:
/m/015gxt /type/object/key /authority/imdb/title tt0065126
in the file http://download.freebase.com/datadumps/latest/freebase-datadump-quadruples.tsv.bz2
That's a 4 GB file, so be prepared to wait a little while for the download. Note that everything is keyed by MID, so you'll need to figure that out first if you don't have it in your database.
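If you'd rather not grep the 4 GB file by hand, a rough sketch of scanning it for those keys; it assumes the dump has been decompressed to a local TSV and that the columns run subject, predicate, namespace, key, as in the line above:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ImdbKeyScan {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("freebase-datadump-quadruples.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (parts.length >= 4
                        && "/type/object/key".equals(parts[1])
                        && "/authority/imdb/title".equals(parts[2])) {
                    // parts[0] is the film's MID, parts[3] is the IMDb id (e.g. tt0065126)
                    System.out.println(parts[0] + "\t" + parts[3]);
                }
            }
        }
    }
}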
The equivalent query using MQL instead of the data dumps is https://www.googleapis.com/freebase/v1/mqlread?query=%7B%22type%22%3a%22/film/film%22,%22name%22%3a%22True%20Grit%22,%22imdb_id%22%3anull,%22initial_release_date%3E=%22%3a%221969-01%22,%22mid%22:null,%22key%22:[{%22namespace%22:%22/authority/imdb/title%22}],%22limit%22:1%7D&indent=1
EDIT: p.s. I'm pretty sure the files in the Browse directory are going away, so I wouldn't depend on them even if you could find the info there.
The previous answer works fine; it's just that a snappier version of such a query could be:
query = [{
'type': '/film/film',
'name': 'prometheus',
'imdb_id': null,
...
}];
The rest of the MQL request isn't shown, as it doesn't differ from the one above. Hope that helps.