Suggestion for choice of database/design - database

Okay, I'm building a search engine based on URLs stored in a database.
| link_id || link_url || link_tags | <== schema
link_tags for a site, say w3schools.com, holds tags such as [web-design, html, php, js].
The database (MySQL) has 1,000,000+ rows.
Now I need them to be searchable by a search engine that also takes link_tags into account when processing queries such as "best html tutorial", so that it returns optimal results. The entire web content of each URL would also need to be stored, to give the engine additional keyword-based input.
Which open-source search engine or previous implementation should I be looking at to achieve this?

There is a small open-source search engine here. It is written in PHP and uses MySQL. It may be possible to stretch it to fit your needs.
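To make the "take link_tags into consideration" requirement concrete, here is a minimal sketch of tag-weighted scoring in Python. It is not taken from any of the engines mentioned; the sample rows, the `content` field, and the `TAG_WEIGHT` / `BODY_WEIGHT` values are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical rows mirroring the link_id / link_url / link_tags schema,
# plus an assumed "content" field holding the stored page text.
LINKS = [
    {"link_id": 1, "link_url": "w3schools.com",
     "link_tags": ["web-design", "html", "php", "js"],
     "content": "html tutorial css tutorial javascript reference"},
    {"link_id": 2, "link_url": "example.org/cooking",
     "link_tags": ["recipes"],
     "content": "best pasta recipes and a short html snippet"},
]

TAG_WEIGHT = 3.0   # assumed boost when a query term matches a tag
BODY_WEIGHT = 1.0  # assumed weight per occurrence in the page content


def score(link, query):
    """Score one link: tag matches count more than body occurrences."""
    tags = set(link["link_tags"])
    body = Counter(link["content"].lower().split())
    total = 0.0
    for term in query.lower().split():
        if term in tags:
            total += TAG_WEIGHT
        total += BODY_WEIGHT * body[term]
    return total


def search(query, links=LINKS):
    """Return URLs with a non-zero score, best match first."""
    scored = [(score(link, query), link["link_url"]) for link in links]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for s, url in scored if s > 0]
```

A dedicated search engine would do the same kind of thing with per-field boosts at query time; at a million rows you would not want to scan every row in application code like this.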

Related

How to do FTS within Google Cloud Platform

Does Google Cloud Platform have a product to do full-text search via an API with non-web data (such as json or xml documents)? This may seem like a pretty silly question, but the only options I have come across are:
Search inside of Google App Engine (only available for python2, not python3) -- https://cloud.google.com/appengine/training/fts_intro/.
Related to web search only: https://developers.google.com/custom-search/docs/tutorial/introduction
Using a managed Elasticsearch: https://console.cloud.google.com/marketplace/details/google/elasticsearch.
Cloud Firestore explicitly states it doesn't offer that and suggests using Algolia (and gives details on integrating): https://cloud.google.com/firestore/docs/solutions/search
Is there something I'm missing? I'm basically looking to index and search about a million documents in a sort of free-form type of search. Is this offered as a product from Google outside of App Engine? If so, how can I access it?
You have pretty much covered it there. There is currently no specific Google service for full-text search. As you mentioned, App Engine Search API is available for Python 2.7, which will stop being maintained after January 2020, and not Python 3.
There is one more option you could consider, which is using Lucene for GAE. I found this blog where several possibilities are studied; it could be interesting reading for you.
To sum up, I would recommend Elasticsearch or Algolia, but for the latter you need a Firebase project.

Using Azure Search for PDFs in Azure Blob Storage

We are trying to enable full-text search. The application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of this works fine; however, the indexer is not able to extract text from a couple of PDFs. Are there any specific kinds of PDFs that the Azure Search indexer can extract? If yes, what are they?
Any information, help, or support in this regard is greatly appreciated.
Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or tables is not yet integrated in Azure Search, but it is on the roadmap.
If your PDFs contain images and you want to extract text from those as well, then you can try following the steps here.
Are there any specific kinds of PDFs that Azure Search Indexer can extract?
Based on my experience, there are no specific kinds of PDFs that the Azure Search indexer can't extract. From your description, I assume those documents are hitting an Azure Search limit. For more detailed information please refer to Indexing Documents in Azure Blob Storage with Azure Search.
Azure Search limits how much text it extracts depending on the pricing tier: 32,000 characters for Free tier, 64,000 for Basic, and 4 million for Standard, Standard S2 and Standard S3 tiers. A warning is included in the indexer status response for truncated documents.
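As a quick way to check whether a given PDF is likely to be truncated, the limits quoted above can be put in a small lookup. This is only a sketch: the tier keys and the `would_truncate` helper are made-up names, and the numbers are the ones from the quote, which may have changed since.

```python
# Character extraction limits per pricing tier, as quoted above.
# The dictionary keys are hypothetical labels, not official tier identifiers.
EXTRACTION_LIMITS = {
    "free": 32_000,
    "basic": 64_000,
    "standard": 4_000_000,
    "standard_s2": 4_000_000,
    "standard_s3": 4_000_000,
}


def would_truncate(text_length, tier):
    """Return True if a document with text_length characters exceeds the tier's limit."""
    return text_length > EXTRACTION_LIMITS[tier]
```

For example, a PDF with 50,000 characters of text would be cut off on the Free tier but not on Basic.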
I recently wrote a blog post about my experience with this. I ended up using a Python-based script running in a Docker container within Azure. Somewhat complicated, but the blog lays it out pretty clearly (and the results have been very good as far as OCR/searchability).
http://martyice.github.io/docker-in-azure/

Does GAE Datastore support 'partial text search'?

I'm a complete beginner.
I want to build an information management system in Go on Google App Engine.
Users will create, edit, delete, and search entities.
I have looked through the GAE site but could not find 'partial text search' for Datastore.
By partial text search I mean finding entities that contain the partial text entered.
Or can you give me a tip on how to build such a system (for free)?
Very sorry for the low-grade question.
You can't do this with the datastore, you need to use the full-text search API. Unfortunately, that is not yet available with Go: apparently the best way to use it is to set up a module in your app that uses Python2.7, and exposes the search functionality.

Developing a web directory search engine for enterprise information: what's better, a database or files?

I want to develop a web app for storing enterprises' information so that it can be searched by keyword as well as by category, but mainly by keyword, because the interface is going to be as simple as Google. My question is: is it better to store this info in a database or in text files?
If you want full text search, probably neither. You should look into a search index such as Elasticsearch (http://www.elasticsearch.org/overview/). A search index stores data in a way that is optimized for searching.
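To illustrate what "optimized for searching" means, here is a toy inverted index in Python: each word maps to the set of documents that contain it, so a keyword query becomes a few set lookups instead of a scan over every record. This is a sketch of the general idea only, not how Elasticsearch is actually implemented; `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every word of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical enterprise records keyed by id.
docs = {
    1: "Acme Plumbing services",
    2: "Acme Bakery",
    3: "City plumbing supplies",
}
index = build_index(docs)
```

With this data, `lookup(index, "acme plumbing")` matches only document 1. A real search index adds tokenization, stemming, relevance ranking, and on-disk storage on top of this structure.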

How can one perform full text search in Google App Engine?

It's a simple question, but I haven't found the answer anywhere. Thoughts and input appreciated.
I'm using Django, too, for what it's worth. :)
Cheers.
The Search API is now available as experimental for Java and Python.
With Java GAE, you could use Compass, but that won't help with Django. For Python, Bill Katz offers one solution -- open source -- and these guys offer a Django-specific approach which, however, is free only for non-commercial applications (i.e. if your app makes money they want you to pay for their full-text search). I have no real-world experience with either of these solutions so I can't really give well-grounded recommendations, but from what one can see with just a little playing around they seem quite useful.
An overview of the Python App Engine searches that I am aware of:
Google did add a cut-down search using SearchableModel, although that has limitations (5000 indexed word limit; String properties only, not Text):
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
Or, as other posters have pointed out, there are these options:
The Quick and simple text search:
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine
This product which has a fairly comprehensive free version and a more extensive commercial version:
http://gae-full-text-search.appspot.com/customers/download/
I've read that Google does have a project to bring full-text search to App Engine, although this is not scheduled to happen any time soon.
I'd really like to see a comparison of the various searching frameworks and see how they stack up to each other. Does anyone know of any report like this?
Edit:
Google Search API now available (although still experimental)
For now, the real answer is that there is no real full-text search on Google App Engine. The solutions provided by the other answers here are fine for toy data sets, but do not scale to anything more than O(10000) documents or so. Google will have to provide search as an infrastructural feature of GAE. See the feature request for (mostly superfluous) discussion.
# The Quick and simple text search:
http://www.billkatz.com/2009/6/Simple-Full-Text-Search-for-App-Engine
This solution did not work for me, and looking at the limitations below, it is unlikely to be useful for real use cases.
It uses StringListProperty to store phrases which has a limitation of 500 characters.
It does not work with the standard query filters.
Regarding issue 217: Bill Katz released a package to deal with it, and http://gae-full-text-search.appspot.com/ is available as an alternative. Levenshtein distance is another match measure.
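For reference, Levenshtein distance (mentioned above as a match measure) is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal Python sketch using the standard dynamic-programming approach follows; it is not taken from any of the packages mentioned here.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # Keep the shorter string in b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]
```

A small distance relative to the word length can be treated as a fuzzy match, which helps with misspelled queries. Computing it against every indexed term is expensive, so it is usually applied only to a shortlist of candidates.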
You should be able to adapt Whoosh! to write in the datastore instead of on disk. It's a pure python full-text search engine. It's not as fast or full-featured as Lucene, but it should run on GAE without too many modifications.
