I am using mahout to cluster text documents indexed using solr.
I have used the "text" field in the document to form vectors. Then I used the k-means driver in mahout for clustering and then the clusterdumper utility to dump the results.
I am having difficulty understanding the output from the dumper. I can see the clusters that were formed, along with the term vectors in those clusters.
But how do I extract the documents from these clusters? I want the result to show which input documents ended up in which cluster.
I also had this problem. The idea is that the cluster dumper dumps all of your cluster data, points included. You have two choices:
1. Modify the ClusterDumper.printClusters() method so it does not print all the terms and weights. I have some code like:
String clusterInfo = String.format("Cluster %d (%d) with %d points.\n",
        value.getId(), clusterCount, value.getNumPoints());
writer.write(clusterInfo);
writer.write('\n');
// list all top terms
if (dictionary != null) {
    String topTerms = getTopFeatures(value.getCenter(), dictionary, numTopFeatures);
    writer.write("\tTop Terms: ");
    writer.write(topTerms);
    writer.write('\n');
}
// list all the points in the cluster
List<WeightedVectorWritable> points = clusterIdToPoints.get(value.getId());
if (points != null) {
    writer.write("\tCluster points:\n\t");
    for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
        WeightedVectorWritable point = iterator.next();
        writer.write(String.valueOf(point.getWeight()));
        writer.write(": ");
        // the document name is only available if the vectors were built as NamedVectors
        if (point.getVector() instanceof NamedVector) {
            writer.write(((NamedVector) point.getVector()).getName() + " ");
        }
    }
    writer.write('\n');
}
2. Do some grep magic on the dump output, if possible, and strip out all the info about terms and weights.
I am trying to set up a search system for a database where each element (a code) in one table has tags mapped by a many-to-many relationship. I am trying to write a controller, "search", where I can search a set of tags which basically act like keywords, giving me a list of elements that all have the specified tags. My current function is incredibly naive: it retrieves all the codes mapped to each tag, adds those to a set, then sorts the codes by how many of their tags are found in the query string.
public List<Code> naiveSearch(String queryText) {
    String[] tagMatchers = queryText.split(" ");
    Set<Code> retained = new HashSet<>();
    for (int i = 0; i < Math.min(tagMatchers.length, 4); i++) {
        // copy into an effectively final variable so it can be used in the lambda
        String tagMatcher = tagMatchers[i];
        tagRepository.findAllByValueContaining(tagMatcher)
                .ifPresent(tags -> tags.forEach(tag -> retained.addAll(tag.getCodes())));
    }
    SortedMap<Integer, List<Code>> matches = new TreeMap<>();
    List<Code> c;
    for (Code code : retained) {
        int sum = 0;
        for (String tagMatcher : tagMatchers) {
            for (Tag tag : code.getTags()) {
                if (tag.getValue().contains(tagMatcher)) {
                    sum += 1;
                }
            }
        }
        c = matches.getOrDefault(sum, new ArrayList<>());
        c.add(code);
        matches.put(sum, c);
    }
    c = new ArrayList<>();
    matches.values().forEach(c::addAll);
    Collections.reverse(c); // best matches first
    return c;
}
This is quite slow and the overhead is unacceptable. My previous trick was basically a retrieval on the description field for each code in the CrudRepository:
public interface CodeRepository extends CrudRepository<Code, Long> {
    Optional<Code> findByCode(String codeId);
    Optional<Iterable<Code>> findAllByDescriptionContaining(String query);
}
However, this is brittle, since the order of the words in the description factors into whether a result will be found, e.g. I want "tall ... dog" == "dog ... tall".
So okay, I'm back several days later with how I actually solved this problem. I used Hibernate's built-in search library (Hibernate Search), which has a very easy integration with Spring: just add the required Maven coordinates to your POM.xml and it is ready to roll.
First I removed the many-to-many for the tags<->codes and just concatenated all my tags into a string field. Next I added @Field to the tags field and then wrote a basic search method. The method I wrote was a very simple search function which took a set of "key words" or tags and then performed a boolean search based on fuzzy terms against the indexed tags for each code. So far it is pretty good. My database is fairly small (100k), so I'm not sure how this will scale, but currently each search returns in about 20-50 ms, which is fast enough for my purposes.
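For reference, a minimal sketch of what such a fuzzy keyword search over the indexed tags could look like with the Hibernate Search 5.x query DSL; the Code entity, the "tags" field name, and the EntityManager wiring here are illustrative assumptions, not the actual code:
// Hedged sketch: Hibernate Search 5.x query DSL, assuming a Code entity whose
// concatenated tags live in a field annotated with @Field(name = "tags").
import java.util.List;
import javax.persistence.EntityManager;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

public class CodeSearchService {

    private final EntityManager entityManager;

    public CodeSearchService(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    @SuppressWarnings("unchecked")
    public List<Code> search(String queryText) {
        FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);
        QueryBuilder qb = ftem.getSearchFactory()
                .buildQueryBuilder()
                .forEntity(Code.class)
                .get();

        // Fuzzy keyword query over the indexed "tags" field; each whitespace-separated
        // keyword is matched with a small edit-distance tolerance.
        org.apache.lucene.search.Query luceneQuery = qb
                .keyword()
                .fuzzy()
                .withEditDistanceUpTo(1)
                .onField("tags")
                .matching(queryText)
                .createQuery();

        return ftem.createFullTextQuery(luceneQuery, Code.class).getResultList();
    }
}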
UPDATED: Added examples
We have an API on top of Lucene 4.6 that I'm trying to adapt to run under Solr 4.6. The problem is that the way we're reading a term's character offsets from the index works as expected when the index is created by Lucene, but always returns -1 when the index is created by Solr. In the latter case I can see the character offsets via Luke, and I can even get them from Solr when I access the /tvrh search handler, which uses the TermVectorComponent class.
This is roughly how I'm reading a character offset in my Lucene code:
public void showOffsets(Directory dir, Term term) throws IOException {
    IndexReader indexReader = DirectoryReader.open(dir);
    IndexReaderContext topContext = indexReader.getContext();
    for (AtomicReaderContext context : topContext.leaves()) {
        AtomicReader reader = context.reader();
        termMatches(term, reader);
    }
}
private void termMatches(Term term, AtomicReader reader) throws IOException {
    DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
    if (postings != null) {
        while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
            for (int i = 0; i < postings.freq(); i++) {
                System.out.println(
                        "term:" + term.toString() +
                        " tokpos:" + postings.nextPosition() +
                        " start:" + postings.startOffset() +
                        " end:" + postings.endOffset());
            }
        }
    }
}
Notice that I want the values for a single term. When run against an index created by Solr, the above calls to startOffset() and endOffset() return -1, although the call to nextPosition() works OK. Solr's TermVectorComponent prints the correct offsets like this (paraphrased):
IndexReader reader = searcher.getIndexReader();
final Terms vector = reader.getTermVector(docId, field);
TermsEnum termsEnum = vector.iterator(null);
DocsAndPositionsEnum dpEnum = null;
BytesRef text;
while ((text = termsEnum.next()) != null) {
    String term = text.utf8ToString();
    final int freq = (int) termsEnum.totalTermFreq();
    dpEnum = termsEnum.docsAndPositions(null, dpEnum);
    dpEnum.nextDoc();
    for (int i = 0; i < freq; i++) {
        // nextPosition() must be called to advance before reading the offsets
        final int pos = dpEnum.nextPosition();
        System.out.println("start:" + dpEnum.startOffset());
        System.out.println("end:" + dpEnum.endOffset());
    }
}
but in this case it is getting the offsets per doc ID, rather than for a single term.
Could anyone tell me:
1. why I'm not able to get the offsets using my first example, and/or
2. a better way to get the offsets for a given term?
Robert Muir on the Solr mailing list pointed out that I was confused about the indexing options in Solr. I didn't need term vectors. Instead, I needed to add storeOffsetsWithPositions="true" to my field definition in the schema. After doing so and reindexing, I now get offsets as expected.
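For anyone else hitting this, the schema change is just an extra attribute on the field definition. A minimal sketch, where the field name and type are placeholders rather than values from my actual schema:
<field name="text" type="text_general" indexed="true" stored="true" storeOffsetsWithPositions="true" />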
I am trying to store triplets inside of OrientDB as Vertex-Edge-Vertex relationships inside of a Java application that I am working on. My understanding of using OrientDB is that I can use the Tinkerpop API and instantiate a graph like this:
OrientGraph graph = new OrientGraph("local:/tmp/orient/test_db");
That is really all I do to instantiate the graph, then I try to connect vertices with edges in a loop like this: (Note that a Statement is a triplet consisting of subject-relationship-object.)
for (Statement s : statements) {
    Vertex a = graph.addVertex(null);
    Vertex b = graph.addVertex(null);
    a.setProperty("Subject", s.getSubject().toBELShortForm());
    RelationshipType r = s.getRelationshipType();
    if (s.getObject() != null) {
        b.setProperty("Object", s.getObject().toBELShortForm());
        Edge e = graph.addEdge(null, a, b, r.toString());
    } else {
        b.setProperty("Object", "null");
        Edge e = graph.addEdge(null, a, b, "no-relationship");
    }
}
I then loop through the vertices of the graph and print them out like this:
for (Vertex v : graph.getVertices()) {
    out.println("Vertex: " + v.toString());
}
It does print a lot of vertices, but when I log into the server via the command line, using server.sh, all I see are the 3 records for ORole and 4 records for OUser. What am I missing here? Because it seems like although my java program runs and completes, the data is not being put into the database.
The answer, at least for now, seems to be not to use the Tinkerpop API but rather the Orient API directly. This is the same thing I was doing with Tinkerpop, but using the OrientDB API. This actually does store my data into the database:
for (Statement s : statements) {
    ODocument sNode = db.createVertex();
    sNode.field("Subject", s.getSubject().toBELShortForm());
    sNode.save();
    ODocument oNode = db.createVertex();
    if (s.getObject() != null) {
        oNode.field("Object", s.getObject().toBELShortForm());
    } else {
        oNode.field("Object", "null");
    }
    oNode.save();
    RelationshipType r = s.getRelationshipType();
    ODocument edge = db.createEdge(sNode, oNode);
    // ODocument.field(name) with a single argument is a getter, so the relationship
    // is stored under an explicit field name here ("relationship" is a guess at the
    // intent of the original code)
    if (r != null) {
        edge.field("relationship", r.toString());
    } else {
        edge.field("relationship", "no-relationship");
    }
    edge.save();
}
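For completeness, a minimal sketch of how the db handle used above might be opened with the legacy OGraphDatabase API from OrientDB 1.x; the path and the default admin credentials are assumptions, not taken from the original setup:
// Hedged sketch: open (or create) the graph database used in the loop above.
// The path and credentials are placeholders.
OGraphDatabase db = new OGraphDatabase("local:/tmp/orient/test_db");
if (db.exists()) {
    db.open("admin", "admin");
} else {
    db.create();
}
try {
    // ... run the Statement loop shown above ...
} finally {
    db.close();
}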
Create the graph under the server's databases directory. Below is an example assuming OrientDB has been installed under "/usr/local/orient":
OrientGraph graph = new OrientGraph("local:/usr/local/orient/databases/test_db");
When you start server.sh you should find this database correctly populated.
Lvc#
I'm looking for a good graph database for finding set intersections -- taking any two nodes and looking at whether their edge endpoints "overlap." The social network analogy would be to look at two people and see whether they are connected to the same people.
I've tried to get FlockDB (from the folks at Twitter) working, because intersection functions are built in, but found there wasn't much in terms of user community/support. So any recommendations of other graph databases, especially where the kind of intersection functionality I'm looking for already exists...?
Isn't that just the shortest path between the two nodes with length == 2?
In Neo4j you can use the shortestPath() finder from the GraphAlgoFactory for that.
This would tell you if there is a connection:
Node from_node = index.get("guid", "user_a").getSingle();
Node to_node = index.get("guid", "user_b").getSingle();
if (from_node != null && to_node != null) {
    RelationshipExpander expander = Traversal.expanderForAllTypes(Direction.BOTH);
    PathFinder<Path> finder = GraphAlgoFactory.shortestPath(expander, 2);
    if (finder.findSinglePath(from_node, to_node) != null) {
        // Connected by at least 1 common friend
    } else {
        // Too far apart or not connected at all
    }
}
This would tell you who the common friends are:
Node from_node = index.get("guid", "user_a").getSingle();
Node to_node = index.get("guid", "user_b").getSingle();
if (from_node != null && to_node != null) {
    RelationshipExpander expander = Traversal.expanderForAllTypes(Direction.BOTH);
    PathFinder<Path> finder = GraphAlgoFactory.shortestPath(expander, 2);
    Iterable<Path> paths = finder.findAllPaths(from_node, to_node);
    if (paths != null) {
        for (Path path : paths) {
            Relationship relationship = path.relationships().iterator().next();
            Node friend_of_friend = relationship.getEndNode();
        }
    } else {
        // Too far apart or not connected at all
    }
}
This code is a little rough and is much easier to express in Cypher (taken from the Cheat Sheet in the Neo4j Server console, which is a great way to play with Neo4j after you populate a database):
START a = (user, name, "user_a")
MATCH (a)-[:FRIEND]->(friend)-[:FRIEND]->(friend_of_friend)
RETURN friend_of_friend
This will give you a list of the nodes shared between two otherwise disconnected nodes. You can pass this query to an embedded server through the CypherParser class.
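As a rough sketch of one way to run that query from embedded Java (this uses org.neo4j.cypher.javacompat.ExecutionEngine, available in later Neo4j 1.x releases, rather than the CypherParser mentioned above, and the store path, index name, and FRIEND relationship type are assumptions):
// Hedged sketch: executing the friend-of-friend Cypher query against an embedded
// Neo4j 1.x database (org.neo4j.kernel.EmbeddedGraphDatabase). Note that Cypher's
// START syntax changed across early releases; the index-based form below matches
// Neo4j 1.6+ and assumes user nodes are indexed under an index named "guid".
GraphDatabaseService graphDb = new EmbeddedGraphDatabase("/tmp/friends-db");
ExecutionEngine engine = new ExecutionEngine(graphDb);
ExecutionResult result = engine.execute(
        "START a = node:guid(guid = \"user_a\") " +
        "MATCH (a)-[:FRIEND]->(friend)-[:FRIEND]->(friend_of_friend) " +
        "RETURN friend_of_friend");
for (Map<String, Object> row : result) {
    System.out.println(row.get("friend_of_friend"));
}
graphDb.shutdown();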
What would be the best (i.e. most efficient) way of counting the number of objects returned by a query, without actually loading them, using Objectify on App Engine? I'm guessing the best is to fetch all the keys and count the results:
public int getEntityCount(Long v) {
    Objectify ofy = ObjectifyService.begin();
    Iterable<Key<MyEntity>> list = ofy.query(MyEntity.class)
            .filter("field", v).fetchKeys();
    int n = 0;
    for (Key<MyEntity> e : list) {
        n++;
    }
    return n;
}
There doesn't seem to be any dedicated method for doing that. Any ideas?
Found it:
int n = ofy().query(MyEntity.class)
        .filter("field", v).count();
It's that simple, though not especially efficient, because behind the scenes it still retrieves all the keys. It's often better to design your UI to handle an unknown number of results (like Google, which gives clues about the number of result pages but not the exact count).
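One way to put that advice into practice is to count only up to a cap so the query stays cheap; a minimal sketch, where the 1000 cap and the entity/field names are placeholders and which assumes Objectify's Query lets you chain limit() before count():
// Hedged sketch: cap the key-only count so the UI can show "1000+" instead of
// an exact (and expensive) total. The cap and names are placeholders.
int capped = ofy().query(MyEntity.class)
        .filter("field", v)
        .limit(1001)
        .count();
String label = capped > 1000 ? "1000+ results" : capped + " results";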