How does Solr choose labels when using the STC algorithm - solr

I am currently trying to use Solr to do clustering. I am using the STC algorithm. However, I do not know how the cluster labels are generated. I know that the labels of the nodes in the suffix tree are used, but in what way? Which suffixes (terms) will be chosen? Thank you.

STC is the implementation of Oren Zamir's Suffix Tree Clustering algorithm. For an in-depth description of the algorithm, take a look at Zamir's PhD dissertation.

Related

best-first Vs. breadth-first

What is the difference between best-first search and breadth-first search? And which one do we call "BFS"?
To answer your second question first:
which one do we call "BFS" ?
Typically, when we refer to BFS, we are talking about breadth-first search.
What is the difference between best-first-search and the breadth-first-search
The analogy that I like to consult when comparing such algorithms is robots digging for gold.
Given a hill, our goal is to simply find gold.
Breadth-first search has no prior knowledge of the whereabouts of the gold, so the robot simply digs 1 foot deep along the whole 10-foot strip. If it doesn't find any gold, it digs 1 foot deeper along the strip, and so on.
Best-first search, however, has a built-in metal detector, meaning it has prior knowledge. There is, of course, a cost in having a metal detector, and a cost in turning it on and seeing which place would be the best to start digging.
Best-first search is informed whereas breadth-first search is uninformed: one has a metal detector and the other doesn't!
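Breadth-first search is complete, meaning it will find a solution if one exists (given a finite branching factor), and it finds the optimal solution when every step has the same cost.
Best-first search in its plain greedy form is not guaranteed to be complete or optimal. A*, the best-known best-first variant, is complete and optimal provided the heuristic (the estimate of the remaining cost, i.e. the prior knowledge) is admissible, meaning it never overestimates the cost of getting to the solution.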
I got the BFS image from http://slideplayer.com/slide/9063462/; the best-first search image is my failed attempt at Photoshop!
Those are two algorithms for searching a graph (or tree).
Breadth-first looks at all elements (nodes) at a certain depth, trying to find a solution (the searched value or whatever), then continues one level deeper, looks at every node there, and so on.
Best-first looks at the "best" node, usually as defined by a heuristic, then checks the best subnode of that node, and so on.
A* is an example of a heuristic (best-first) search, and with a good heuristic it usually expands far fewer nodes. But you need a heuristic, which you don't need for breadth-first search.
Creating a heuristic takes some effort of your own. Breadth-first works out of the box.
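To make the difference concrete, here is a minimal Python sketch of both searches on a toy graph. The graph `GRAPH`, the heuristic table `H`, and all function names are made up for illustration; the best-first variant shown is the greedy one (ordered by the heuristic only), not A*.

```python
import heapq
from collections import deque

# Hypothetical toy graph: node -> list of neighbours.
GRAPH = {
    "S": ["A", "B"],
    "A": ["C", "D"],
    "B": ["E"],
    "C": [],
    "D": [],
    "E": ["G"],
    "G": [],
}

# Hypothetical heuristic: guessed distance from each node to the goal "G".
H = {"S": 3, "A": 3, "B": 2, "C": 4, "D": 4, "E": 1, "G": 0}


def build_path(parent, node):
    """Walk the parent links back from node to the start."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def breadth_first(graph, start, goal):
    """Uninformed: expand nodes level by level (FIFO queue)."""
    frontier = deque([start])
    parent = {start: None}
    expanded = []
    while frontier:
        node = frontier.popleft()
        expanded.append(node)
        if node == goal:
            return build_path(parent, goal), expanded
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                frontier.append(nxt)
    return None, expanded


def best_first(graph, start, goal, h):
    """Informed (greedy): always expand the node with the lowest h value."""
    frontier = [(h[start], start)]
    parent = {start: None}
    expanded = []
    while frontier:
        _, node = heapq.heappop(frontier)
        expanded.append(node)
        if node == goal:
            return build_path(parent, goal), expanded
        for nxt in graph[node]:
            if nxt not in parent:
                parent[nxt] = node
                heapq.heappush(frontier, (h[nxt], nxt))
    return None, expanded


bfs_path, bfs_expanded = breadth_first(GRAPH, "S", "G")
best_path, best_expanded = best_first(GRAPH, "S", "G", H)
print(bfs_path, bfs_expanded)
print(best_path, best_expanded)
```

On this particular graph both searches return the same path, but breadth-first expands every node on the way while the heuristic steers best-first straight down the promising branch. With a misleading heuristic the greedy version can do worse.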

SOLR: Create term vector (like data returned from TermVectorComponent) from raw text

Using http://wiki.apache.org/solr/TermVectorComponent I can get indexed terms and their frequencies for any document stored in my index. How can I get the same information for a text, without storing the text in my index? I just want SOLR to process the text and return the information, but without having to store the document in my index.
AFAIK this isn't possible without storing data in SOLR.
If you are looking to do text analysis (I understand this is broader than what you asked for), I would recommend the alternatives below:
MAUI - does keyphrase and terminology extraction.
Gensim - does topic modelling
Kea - keyword extraction
I've also come across some Python scripts that do term frequency analysis. Have a look at Mincemeat, particularly the example, which does term frequency calculation.
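If plain term counts are enough, a few lines of Python will do it without any index. This is only a rough sketch: the tokenizer here (lowercase, split on non-alphanumerics) is my own assumption and does not reproduce whatever analyzer chain your Solr field uses (stemming, stop words, etc.), so the terms may differ from what TermVectorComponent returns.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count terms in raw text using a naive lowercase/alphanumeric tokenizer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical sample text.
tf = term_frequencies("Solr indexes text. Solr analyzes text before indexing.")
for term, count in tf.most_common():
    print(term, count)
```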
From what you ask for I conclude that you actually need a search library, not a full search engine (service). That library is Lucene. Perhaps, this will help for starters: How to extract Document Term Vector in Lucene 3.5.0. You could store the index in RAM for the sake of computing necessary bits and then get rid of the index.
I wrote an application in Java several years ago that did heavy text analysis based on Lucene. I had to custom-write the search functions to find words within a certain distance of each other. You can import your text documents into the software and have it count the term frequencies, or you can take the code and tailor it to your needs.
Free download:
http://www.minoesoftware.com/download.php
Source:
https://github.com/danspiteri/MINOE/blob/master/src/minoe/SearchFiles.java
If you are using Solr 4 and you are not storing the text, you can use a Solr pivot facet on the text field. But then, obviously, you will get the terms as they are after analyzer processing:
http://192.168.0.202:8080/solr/fr_00_0425_sem/select?q=renault&wt=xml&facet=true&facet.pivot=uniqueKey,yourText
This is a pretty heavy query, I hope you don't have too many documents that match...

Dataset help for TF-IDF and Vector Model

I want to compare TF-IDF, Vector model and some optimization of TF-IDF algorithm.
For that I need a dataset (at least 100 documents of English text). I am not able to find one. Any suggestions?
It depends on the application you are using TF-IDF for. For example, if you want to find keywords you could use the "Mendeley" dataset, or for tagging you could use "Delicious" data.
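Whichever dataset you pick, it helps to have a small baseline to compare your optimizations against. Below is a minimal pure-Python sketch of TF-IDF weighting with cosine similarity in the vector space model. The three documents are made up, the tokenizer is a plain whitespace split, and the weighting is one common variant (relative term frequency times log(N/df)); your variant may well differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini-corpus.
DOCS = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",
]

vectors = tf_idf_vectors(DOCS)
print(cosine(vectors[0], vectors[1]))  # shares "the" and "cat"
print(cosine(vectors[0], vectors[2]))  # no shared terms
```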

Finding a handwritten dataset with already extracted features

I want to test my clustering algorithms on handwritten text data, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
Thanks.
There is a dataset of images of handwritten digits: http://yann.lecun.com/exdb/mnist/.
Texmex has 128d SIFT vectors "to evaluate the quality of approximate nearest neighbors search algorithm on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.

how to find a path to go home - algorithm

(source: blogcu.com)
Assume there is a rabbit at position (1,1), and its home is at position (7,7). How can it reach that position?
The home position is not a fixed place.
The real question: I am trying to solve a problem from a book for practicing C. What algorithm should I apply to find a solution?
Should I use a linked list to store the data?
The data is (1,1), (1,2), ..., (3,3), ..., (7,7).
The cells marked in black are walls.
Use A*. It is the classic go-to algorithm for path-finding (that article lists many other algorithms you can consider too).
By using A* you learn an algorithm that you might actually need in your normal programming career later ;)
An example evaluation of a maze similar to that in the question using A*:
There are a bunch of search algorithms you can use. The easiest to implement will be either breadth-first search or depth-first search.
Algorithms like A* are likely to be more efficient but are a little harder to code.
Check out the Wikipedia "Search algorithms" page. It has links to a number of well-known algorithms.
Breadth-first search is always a good one.
http://www.codeproject.com/KB/recipes/mazesolver.aspx
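For reference, here is a rough sketch of A* on a grid like the one in the question. It is written in Python for brevity rather than C, and the wall layout in `MAZE` is invented since the original picture is not reproduced here. Cells are addressed as 1-based (row, column) pairs like in the question, moves are up/down/left/right with cost 1, and the heuristic is Manhattan distance, which never overestimates on such a grid.

```python
import heapq

# Hypothetical 7x7 maze: '#' is a wall, '.' is open.
MAZE = [
    ".......",
    ".###...",
    "...#.#.",
    ".#.#.#.",
    ".#...#.",
    ".#####.",
    ".......",
]


def manhattan(a, b):
    """Admissible heuristic for 4-directional moves of cost 1."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star(maze, start, goal):
    """Return a shortest path as a list of 1-based (row, col) cells, or None."""
    rows, cols = len(maze), len(maze[0])

    def is_open(cell):
        r, c = cell
        return 1 <= r <= rows and 1 <= c <= cols and maze[r - 1][c - 1] != "#"

    # Heap entries: (estimated total cost, cost so far, cell).
    open_heap = [(manhattan(start, goal), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale entry; a cheaper route to this cell was found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            new_cost = cost + 1
            if is_open(nxt) and new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(
                    open_heap, (new_cost + manhattan(nxt, goal), new_cost, nxt)
                )
    return None


path = a_star(MAZE, (1, 1), (7, 7))
print(path)
```

Since the home position is not fixed, just pass a different `goal`. As for the linked-list question: in this sketch the maze is a plain 2D array and the frontier is a priority queue; a linked list is not needed for either.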
