ESI (Edge Side Includes) and hinclude content on search engines?

First of all, I'm very sorry if this question doesn't belong here; if so, please let me know where I should post it.
I have recently discovered ESI and hinclude as ways to boost my site's performance, but I have searched up and down and could not find any documentation on how search engines will index pages that use them. I would think that since hinclude uses JavaScript, it would not be possible for search engines to index the content it includes dynamically. I'm not so sure about ESI: does the content get indexed, or will it be left out as well? I hope someone can shed some light on this topic.
Regards

ESI is processed "edge-side", by a cache such as Akamai or Varnish, so the client never sees the ESI include tags; it receives the fully assembled page. A search engine spider is just another client, so the included content gets indexed like any other HTML.
hinclude processes includes client-side, which means the search engine's spider needs to be able to execute JavaScript. Search engines seem to be moving in this direction, but not all of them may support it yet.
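For illustration, here is a minimal sketch of the two kinds of include tags (the fragment URLs are placeholders). The ESI tag is resolved at the edge before the response leaves the cache, while the hinclude tag travels to the browser as-is and is resolved by JavaScript:

```html
<!-- ESI: replaced by the edge cache (e.g. Varnish, Akamai) before the
     response is sent, so clients and spiders only see the final HTML -->
<esi:include src="/fragments/sidebar.html" />

<!-- hinclude: sent to the browser unchanged and filled in by hinclude.js,
     so a spider must run JavaScript to see the included content -->
<script src="/js/hinclude.js"></script>
<hx:include src="/fragments/sidebar.html"></hx:include>
```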

Related

Matching article text against pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use for this.
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple of each). What I want to do is "guess" the right categories/tags for each article based on its content, and then present them to the editor as suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are currently correctly tagged/categorized, so I'd like to use them as a data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a step in the right direction. Azure Cognitive Services also seems to have a Text Analytics API that does something similar. I'm a bit confused about why these are two different things (or are they not?).
This also seems like an entirely common problem (matching text against pre-defined categories based on other text that is categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.
I think the Azure Cognitive Text Analytics API is your best bet, as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
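A minimal sketch of that flow with the Python SDK for Text Analytics (azure-ai-textanalytics); the endpoint, key, sample text, and tag list are placeholders, and the naive substring matching is only for illustration:

```python
# pip install azure-ai-textanalytics
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Endpoint and key are placeholders for your Cognitive Services resource.
client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

article = ["Regular exercise and a balanced diet improve long-term health."]

entities = client.recognize_entities(article)[0].entities
phrases = client.extract_key_phrases(article)[0].key_phrases

# Pool of candidate terms extracted from the article text.
candidates = {e.text.lower() for e in entities} | {p.lower() for p in phrases}

# Suggest only tags that already exist in the CMS (hypothetical list).
available_tags = {"health", "well-being", "exercise", "diet"}
suggestions = [tag for tag in available_tags
               if any(tag in c or c in tag for c in candidates)]
print("looks like this article is about:", ", ".join(suggestions))
```

In practice you would probably want fuzzier matching (stemming, synonyms) between the extracted terms and your tag vocabulary rather than plain substring checks.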
Azure Cognitive Search, by contrast, requires an indexer and a skillset to process the target text, with the end result stored in an index built specifically for searching.

How to implement Solr into Sitecore

I have to implement a Solr index in Sitecore and I would like to know the best approach.
I looked at the following approaches:
Capture the publish:end event (or other events) and then push the item to the Solr index
Implement a custom database crawler and get all changes from the history table, then push the data to Solr using a custom index
The second approach sounds like the way to go (in my opinion). In that case, do I need to create a new search index, or a search manager?
If anyone's done it before, can you point me in the right direction? Also, could you post some links to articles about Sitecore-Solr implementations?
UPDATE
Ok, after reading the Sitecore documentation, this is what I came up with:
Create a custom SolrConfiguration class where you can set properties like the Solr service URL and add indexes and their definitions (custom Solr indexes)
Create a SolrIndex and add it (in the config file) to your SolrConfiguration. When instantiated, the SolrIndex should subscribe to the AddEntry event of the Sitecore HistoryManager and communicate with the Solr crawlers
Create a custom processor and hook it into the Sitecore initialization pipeline. The processor should initialize the SolrConfiguration (from step 1)
Since everything in your config file will be built using reflection, you can get an instance of your configuration based on your config file
How does that sound? Can I have any comments, please?
We've done this on a few sites and tend to have a separate "published" Solr index and an "unpublished" index.
We hook into the following events:
OnItemSaving
We use this event to push things into the unpublished index (you may not need this; it depends on whether you want items available in preview mode)
OnPublishItemProcessed
We process additions and updates to the published index here. I'm not sure what we do about deletions at this point without digging right into the code, but we certainly deal with deletions in OnItemDelete (mentioned below)
OnItemDelete
We hook in here to remove things from both the published and unpublished indexes (I think we remove from the published index here because Sitecore makes you publish the parent node in order to push deletions out to the web database)
I hope that helps. I'd post the code if I could (but I'd be scowled at).
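In general terms, though, the add/remove operations in those handlers come down to calls against Solr's HTTP update API. Here is a rough sketch in Python of the equivalent requests (the core name, URL, and field names are invented; a real Sitecore implementation would be C#, typically via SolrNet):

```python
import requests

# Invented core name and URL for illustration only.
SOLR_UPDATE = "http://localhost:8983/solr/published/update?commit=true"

def index_item(item_id, title, body):
    # Add or replace a document; Solr upserts on the uniqueKey field (id).
    doc = {"id": item_id, "title_t": title, "body_t": body}
    requests.post(SOLR_UPDATE, json={"add": {"doc": doc}}).raise_for_status()

def delete_item(item_id):
    # Called from the delete handler to drop the document from the index.
    requests.post(SOLR_UPDATE, json={"delete": {"id": item_id}}).raise_for_status()
```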
In addition to the already posted answer (which I think is a good way to do things) I'll share how we do it.
We basically just took a look at the Sitecore database crawler and decided to do things much the way it does them.
We utilize a significantly modified version of the Custom Item Generator to facilitate mapping between strongly typed objects and an object whose properties correspond to our Solr schema. For the actual communication with Solr we use SolrNet.
The general idea is that we loop through all the items recursively (starting at the site root) and map each one to the appropriate type based on its template. Then we run an indexing process for that item (some items need to index multiple documents into Solr in our implementation).
This approach is working very well for us, except that, because we index everything at once, there is a slight lag between a publish and the site reflecting any changes made to the index. One oversight we made in the beginning, which we will be working to fix soon, is that we don't have an "unpublished" index (meaning we need to publish the site to see updates). It doesn't impact our solution much, but I can definitely see where it would affect others, so keep that in mind.
We didn't particularly want to get into deleting items from the index, so we run the indexing as a publish:end event.
I hope this additional insight helps you. As far as I know there's not a whole lot of information out there about this specific combination of products, but I can tell you it's definitely possible and quite useful.
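To make the crawl-and-map idea above concrete, here is a toy sketch of the pattern (in Python for brevity; our real implementation is C# with SolrNet, and all names here are invented):

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """Stand-in for a Sitecore content item."""
    template_name: str
    fields: dict
    children: list = field(default_factory=list)

def index_tree(item, mappers, docs):
    """Walk the tree from the site root, mapping each item by template.

    One item may yield several Solr documents, so mappers return lists.
    """
    mapper = mappers.get(item.template_name)
    if mapper:
        docs.extend(mapper(item))
    for child in item.children:
        index_tree(child, mappers, docs)

# Usage: one mapper per template, keyed by template name.
root = Item("Site Root", {}, [Item("Article", {"title": "Hello, Solr"})])
mappers = {"Article": lambda i: [{"id": "article-1", "title_t": i.fields["title"]}]}
docs = []
index_tree(root, mappers, docs)
print(docs)  # the documents you would then push to Solr on publish:end
```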

How does Google Docs store documents (on the backend)?

I half imagine there being these great .docs in the sky... but another part of me doubts that my documents are even being stored in anything we'd traditionally call a "file." Does Google have its own document format? I feel like it must. Some branch of some existing format like ODF, maybe? Any idea what it's like, what's special about it (if anything), and/or why it is the way it is?
As far as I'm aware, Google Docs originally generated RTF files. Now, however, with the recent push for HTML5 and the adoption of contentEditable, they may very well just store documents as plain HTML within their database.
I would guess that Google definitely extracts some information from the file for indexing. For editing purposes, however, I do not think the internal format will be much different from ODF, MS Office, or other file formats. But those are only guesses; maybe someone else knows more.

Best Way to automatically find links to your content?

So, here is the task I've found myself thinking about. Pretend for a moment that I have a large body of content. I want to see which websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about sites that aren't using tools capable of dealing with those?
It would seem that some form of web crawler that looks for pages linking to the original document might be useful. My question to the greater community: what would be the best way to get started here? Do TrackBack and PingBack do more than I assume? Are there services or tools out there that already do what I'm thinking of?
Google is your friend!
Use the link: prefix:
link:whatsite.com
And yes, trackbacks do more.
If you have HTTP referrers set up in your logs, you can mine them.
You can even discover linking pages that Google does not know about.
Otherwise, there is the paid Linkscape from SEOmoz or the free MajesticSEO (if you confirm ownership of the domain).
MajesticSEO has a bigger backlink index and an API (login required!).
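If you go the log-mining route, here is a minimal sketch (assuming the Apache/nginx "combined" log format; the file path and domain are placeholders):

```python
import re
from collections import Counter

# In the "combined" log format, the Referer header is the second-to-last
# quoted string on each line, followed by the User-Agent.
LOG_LINE = re.compile(r'"(?P<referrer>[^"]*)" "[^"]*"$')

def external_referrers(log_path, own_domain):
    """Count external pages that link to us, based on the Referer header."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.search(line)
            if not match:
                continue
            ref = match.group("referrer")
            if ref.startswith("http") and own_domain not in ref:
                counts[ref] += 1
    return counts

if __name__ == "__main__":
    for ref, hits in external_referrers("access.log", "example.com").most_common(20):
        print(hits, ref)
```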

Webapps: Storing and searching through user submitted blocks of text

Background:
I'm building a poetry site with user submitted content. The relevant user actions for my questions are that users can:
a. Go to fancysitename.com/view to see all poems so far
b. Go to fancysitename.com/submit to submit your own poem.
c. Go to fancysitename.com/apoemid to view a particular poem you've bookmarked before.
d. Go to fancysitename.com/search to enter a word to search for in all the poems.
All the poems are stored as text fields in a database and referenced by a poem ID. So the "apoemid" in step c will be the primary key of the tuple, and I'll just pull up the text after getting the key from the URL.
Question:
The poems exist nowhere except in a database. My webapp is literally 4 HTML files. Will this approach affect my search engine rankings?
Is there a more efficient way to do 'd' than doing a SELECT * on the db and manually parsing the text on the server? Each poem will be at most 10 lines long, so I would imagine a full-text search engine like Lucene would be overkill.
Caveat
I'm running this on Google App Engine for now, so my database customization options are pretty limited. So while I'd certainly be interested in hearing about the ideal way to do this, this is a pet side project, so my budget is limited :(
Thanks!
Edit: Apparently I don't google so well at 7am. I've since found a solution for question 2 here, so please disregard question 2.
App Engine currently doesn't support full-text indexing, but they do have a better-than-nothing SearchableModel.
Some details of SearchableModel can be found here:
http://groups.google.com/group/google-appengine/browse_thread/thread/f64eacbd31629668/8dac5499bd58a6b7?lnk=gst&q=searchablemodel
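A minimal sketch of SearchableModel with the old Python App Engine SDK (the model and field names are placeholders):

```python
from google.appengine.ext import db, search

class Poem(search.SearchableModel):
    """SearchableModel indexes the entity's text properties on put()."""
    title = db.StringProperty()
    text = db.TextProperty()

# Keyword search over the indexed properties; no SELECT * plus manual
# parsing on the server needed.
results = Poem.all().search("autumn").fetch(20)
for poem in results:
    print(poem.title)
```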
Regarding search engine ranking: yes, having all your poems in the datastore can affect your ranking. This is generally overcome through the use of a sitemap. Here is an article about how Stack Overflow uses a sitemap to help its search ranking:
http://www.codinghorror.com/blog/archives/001174.html
Most database engines can accomplish this kind of searching; for example, MySQL has full-text search. I am not sure how App Engine works, but you could always have a stored procedure do this search.
Where you store your data will not affect your site's ranking, only how you serve it up (on what URLs, etc.). There's absolutely no way for an arbitrary search spider to tell where you store your data, and no reason for it to care, either.
Regardless of the length of your text, you will need full-text searching if you want to search inside a string. As Sam points out, SearchableModel ought to work just fine for that.