I am trying to write a spell corrector using the Lucene SpellChecker. I want to give it a single text file containing blog text. The problem is that it only works when my dictionary file has one sentence/word per line. Also, the suggest API returns results without giving any weight to the number of occurrences. The source code follows:
public class SpellCorrector {
SpellChecker spellChecker = null;
public SpellCorrector() {
try {
File file = new File("/home/ubuntu/spellCheckIndex");
Directory directory = FSDirectory.open(file);
spellChecker = new SpellChecker(directory);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
spellChecker.indexDictionary(
new PlainTextDictionary(new File("/home/ubuntu/main.dictionary")), config, true);
//Should I format this file with one sentence/word per line?
} catch (IOException e) {
}
}
public String correct(String query) {
if (spellChecker != null) {
try {
String[] suggestions = spellChecker.suggestSimilar(query, 5);
// This returns suggestions ordered not by number of occurrences but by when each word occurred
if (suggestions != null) {
if (suggestions.length != 0) {
return suggestions[0];
}
}
} catch (IOException e) {
return null;
}
}
return null;
}
}
Do I need to make some changes?
Regarding your first issue, that sounds like the expected, documented dictionary format described in the PlainTextDictionary API. If you want to pass arbitrary text in, you might want to index it and use a LuceneDictionary instead, or possibly a HighFrequencyDictionary, depending on your needs.
The SpellChecker suggests replacements based on the similarity between the words (based on Levenshtein distance), before any other concern. If you want it to only recommend more popular terms as suggestions, you should pass a SuggestMode to SpellChecker.suggestSimilar. This ensures that the suggested matches are at least as strong, popularity-wise, as the word they are intended to replace.
If you must override how Lucene decides on best matches, you can do that with SpellChecker.setComparator, creating your own Comparator on SuggestWords. Since SuggestWord exposes freq to you, it should be easy to arrange found matches by popularity.
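The ordering described above (similarity first, popularity as the tie-breaker) can be sketched without Lucene. This is only an illustration in plain Java, not Lucene's implementation: the class FrequencyRankedSuggester and its methods are hypothetical names, the word counts come from tokenizing free text such as blog content, and the distance is a textbook Levenshtein distance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ranks candidates by edit distance,
// then by how often each word occurred in the text that was fed in.
class FrequencyRankedSuggester {
    // word -> number of occurrences in the source text
    private final Map<String, Integer> freq = new HashMap<>();

    // Split free text on anything that is not a letter and count each word.
    void addText(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
    }

    // Two-row dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Closest words first; among equally close words, the more frequent one wins.
    List<String> suggest(String word, int maxDistance, int limit) {
        String w = word.toLowerCase(Locale.ROOT);
        List<String> candidates = new ArrayList<>();
        for (String known : freq.keySet()) {
            if (distance(w, known) <= maxDistance) {
                candidates.add(known);
            }
        }
        candidates.sort(Comparator
                .comparingInt((String c) -> distance(w, c))
                .thenComparing(Comparator.comparingInt((String c) -> freq.get(c)).reversed())
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }
}
```

With Lucene itself, the same effect would come from the comparator passed to SpellChecker.setComparator rather than from a hand-written class like this one.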
// Language Selection
public static void SelectLanguage() {
waitForElementToBeClickable(driver.findElement(By.xpath("//div[@class=\"lang-identifier\"]")));
driver.findElement(By.xpath("//div[@class=\"lang-identifier\"]")).click();
List<WebElement> elements = driver.findElements(By.xpath("//ul[@class=\"dropdown-menu pull-right\"]/li"));
for (WebElement e : elements) {
String text = e.getAttribute("value");
System.out.println(e.getText());
if (text.equalsIgnoreCase("English")) {
e.click();
break;
} else if (e.getText().equalsIgnoreCase("Español")) {
e.click();
break;
} else if (e.getText().equalsIgnoreCase("Italiano")) {
e.click();
break;
} else if (e.getText().equalsIgnoreCase("Pусский")) {
e.click();
break;
} else if (e.getText().equalsIgnoreCase("Français")) {
e.click();
break;
} else if (e.getText().equalsIgnoreCase("Português")) {
e.click();
break;
} else {
System.out.println("Please select appropriate language");
}
}
}
I would suggest a much simpler but more flexible version of your method.
Some suggestions:
I would change your method to take the desired language as a parameter to significantly simplify the code but also make it very flexible.
WebDriverWait, in most cases, will return the found element(s). Use that to simplify your code to a one-liner, e.g.
new WebDriverWait(...).until(ExpectedConditions.elementToBeClickable(...)).click();
You didn't provide the code of your custom method, waitForElementToBeClickable, but if you really want to keep it, have it return the element(s) waited for to make it more useful and save having to write extra code.
If you have nested double quotes, I would suggest you use a combination of double and single quotes. It's a personal preference but I think it makes it easier to read than \", e.g.
"//div[@class=\"lang-identifier\"]"
would turn into
"//div[@class='lang-identifier']"
Instead of grabbing all options and then looping through them to compare the contained text to some desired string, use an XPath that contains the desired text instead, e.g. for "English" the XPath will look like
//ul[@class='dropdown-menu pull-right']/li[text()='English']
NOTE: .getAttribute("value") gets the value of an INPUT and will not work on other elements, e.g. the LI elements in your elements variable. .getText() returns the text contained in an element but will not work on INPUTs.
After implementing these suggestions, the code turns into a two-liner and is very flexible.
public static void SelectLanguage(String language) {
new WebDriverWait(driver, Duration.ofSeconds(10)).until(ExpectedConditions.elementToBeClickable(By.cssSelector("div.lang-identifier"))).click();
driver.findElement(By.xpath("//ul[@class='dropdown-menu pull-right']/li[text()='" + language + "']")).click();
}
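One caveat with concatenating the language name straight into the XPath: XPath 1.0 string literals have no escape character, so a value containing a single quote would break the expression. A small sketch of one way to guard against that follows. The class XPathLiterals and both method names are hypothetical helpers, not part of Selenium.

```java
// Hypothetical helper (not a Selenium API): builds a safe XPath 1.0 string
// literal for arbitrary text and uses it in the locator from the answer.
class XPathLiterals {
    // XPath 1.0 has no escaping inside string literals, so pick whichever
    // quote type the value does not contain, or fall back to concat().
    static String literal(String value) {
        if (!value.contains("'")) {
            return "'" + value + "'";
        }
        if (!value.contains("\"")) {
            return "\"" + value + "\"";
        }
        // Both quote types present: split on single quotes and glue the
        // pieces back together with a double-quoted single quote.
        StringBuilder sb = new StringBuilder("concat(");
        String[] parts = value.split("'", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append(", \"'\", ");
            }
            sb.append("'").append(parts[i]).append("'");
        }
        return sb.append(")").toString();
    }

    // XPath for the dropdown item whose text equals the given language.
    static String languageItemXPath(String language) {
        return "//ul[@class='dropdown-menu pull-right']/li[text()=" + literal(language) + "]";
    }
}
```

The result of languageItemXPath(language) could then be passed to By.xpath(...) in place of the hand-built string.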
I'm trying to read a binary file, convert it into a POJO format and then output it as CSV. The unmarshalling (and marshalling) seems to be fine, but I'm having trouble optimising the conversion of the relevant records to Foo.class. The attempt below returns no results.
from(String.format("file://%s?move=%s", INPUT_DIRECTORY, MOVE_DIRECTORY))
.unmarshal(unmarshaller)
.split(bodyAs(Iterator.class), new ListAggregationStrategy())
.choice()
.when(not(predicate)).stop()
.otherwise().convertBodyTo(Foo.class)
.end()
.end()
.marshal(csv)
.to(String.format("file://%s?fileName=${header.CamelFileName}.csv", OUTPUT_DIRECTORY));
I was able to get it to work like this, but it feels like there has to be a better way. This needs to be efficient, and a 1s timeout feels like it goes against that, which is why I was attempting to use the built-in split aggregation. Alternatively, I tried using completionFromBatchConsumer, but I struggled to make that work either.
from(String.format("file://%s?move=%s", INPUT_DIRECTORY, MOVE_DIRECTORY))
.unmarshal(unmarshaller)
.split(bodyAs(Iterator.class))
.streaming()
.filter(predicate)
.convertBodyTo(Foo.class)
.aggregate(header("CamelFileName"), new ListAggregationStrategy())
.completionTimeout(1000)
.marshal(csv)
.to(String.format("file://%s?fileName=${header.CamelFileName}.csv", OUTPUT_DIRECTORY));
You could create your own AggregationStrategy in your first solution.
Instead of calling stop() in your choice statement, set a simple header such as "skipMerge" to true.
In your strategy, test whether this header is set and, if so, skip that exchange.
class ArrayListAggregationStrategy implements AggregationStrategy {
public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
Object newBody = newExchange.getIn().getBody();
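Boolean skipMerge = newExchange.getIn().getHeader("skipMerge", Boolean.class);
// Skip exchanges flagged by the choice(); the header is null on records that should be kept.
if (Boolean.TRUE.equals(skipMerge)) { return oldExchange; }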
ArrayList<Object> list = null;
if (oldExchange == null) {
list = new ArrayList<Object>();
list.add(newBody);
newExchange.getIn().setBody(list);
return newExchange;
} else {
list = oldExchange.getIn().getBody(ArrayList.class);
list.add(newBody);
return oldExchange;
}
}
}
Currently, your code never reaches marshal(csv) because the aggregator does not receive all the split parts.
I am trying to set up a search system for a database where each element (a code) in one table has tags mapped through a many-to-many relationship. I am trying to write a "search" controller that takes a set of tags, which act like keywords, and returns the list of elements that have all the specified tags. My current function is incredibly naive: it retrieves all the codes mapped to a matching tag, adds them to a set, then sorts the codes by how many of each code's tags match a term in the query string.
public List<Code> naiveSearch(String queryText) {
String[] tagMatchers = queryText.split(" ");
Set<Code> retained = new HashSet<>();
for (int i = 0; i < Math.min(tagMatchers.length, 4); i++) {
tagRepository.findAllByValueContaining(tagMatchers[i]).ifPresent((tags) -> {
tags.forEach(tag -> {
retained.addAll(tag.getCodes());
}
);
});
}
SortedMap<Integer, List<Code>> matches = new TreeMap<>();
List<Code> c;
for (Code code : retained) {
int sum = 0;
for (String tagMatcher : tagMatchers) {
for (Tag tag : code.getTags()) {
if (tag.getValue().contains(tagMatcher)) {
sum += 1;
}
}
}
c = matches.getOrDefault(sum, new ArrayList<>());
c.add(code);
matches.put(sum, c);
}
c = new ArrayList<>();
matches.values().forEach(c::addAll);
Collections.reverse(c);
return c;
}
This is quite slow and the overhead is unacceptable. My previous approach was basically a lookup on each code's description through the CrudRepository:
public interface CodeRepository extends CrudRepository<Code, Long> {
Optional<Code> findByCode(String codeId);
Optional<Iterable<Code>> findAllByDescriptionContaining(String query);
}
However, this is brittle because the order of the tags in the query determines whether a result is found, e.g. I want "tall ... dog" == "dog ... tall".
So okay, I'm back several days later with how I actually solved this problem. I used Hibernate's built-in search library, which is very easy to set up in Spring. I just pasted the required Maven coordinates into my pom.xml and it was ready to roll.
First I removed the many-to-many mapping for tags<->codes and just concatenated all my tags into a single string field. Next I added @Field to the tags field and wrote a basic search method. It takes a set of keywords (tags) and performs a boolean search based on fuzzy terms against the indexed tags of each code. So far it is pretty good. My database is fairly small (100k), so I'm not sure how this will scale, but currently each search returns in about 20-50 ms, which is fast enough for my purposes.
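The order-independence that an indexed search gives ("tall dog" == "dog tall") can be illustrated with a small in-memory inverted index. This is a sketch of the idea only, not the Hibernate-based solution described above: the TagIndex class is a hypothetical stand-in, it matches tags exactly rather than fuzzily, and it identifies codes by plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a full-text index on the tags field.
class TagIndex {
    // tag -> identifiers of the codes carrying that tag
    private final Map<String, Set<String>> codesByTag = new HashMap<>();

    // Index one code with its whitespace-separated tags.
    void add(String code, String tags) {
        for (String tag : tags.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!tag.isEmpty()) {
                codesByTag.computeIfAbsent(tag, k -> new HashSet<>()).add(code);
            }
        }
    }

    // Return codes ordered by how many distinct query terms they match.
    // Each term is looked up independently, so term order cannot matter.
    List<String> search(String query) {
        Set<String> terms = new LinkedHashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        Map<String, Integer> hits = new HashMap<>();
        for (String term : terms) {
            for (String code : codesByTag.getOrDefault(term, Collections.emptySet())) {
                hits.merge(code, 1, Integer::sum);
            }
        }
        List<String> result = new ArrayList<>(hits.keySet());
        result.sort(Comparator
                .comparingInt((String c) -> hits.get(c)).reversed()
                .thenComparing(Comparator.<String>naturalOrder()));
        return result;
    }
}
```

Each lookup is a hash-map access per query term, which avoids loading every candidate code and re-scanning its tags as the naive version does.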
I'm trying to learn D and I thought after doing the hello world stuff, I could try something I wanted to do in Java before, where it was a big pain because of the way the Regex API worked: A little template engine.
So, I started with some simple code to read through a file, character by character:
import std.stdio, std.file, std.uni, std.array;
void main(string [] args) {
File f = File("src/res/test.dtl", "r");
bool escape = false;
char [] result;
Appender!(char[]) appender = appender(result);
foreach(c; f.rawRead(new char[f.size])) {
if(c == '\\') {
escape = true;
continue;
}
if(escape) {
escape = false;
// do something special
}
if(c == '#') {
// start of scope
}
appender.put(c);
}
writeln(appender.data());
}
The contents of my file could be something like this:
<h1>#{hello}</h1>
The goal is to replace the #{hello} part with some value passed to the engine.
So, I actually have two questions:
1. Is that a good way to process characters from file in D? I hacked this together after searching through all the imported modules and picking what sounded like it might do the job.
2. Sometimes, I would want to access more than one character (to improve checking for escape sequences, find a whole scope, etc.). Should I slice the array for that? Or are D's regex functions up to that challenge? So far, I only found the matchFirst and matchAll methods, but I would like to match, replace and return to that position. How could that be done?
The D standard library does not provide what you require. What you need is called "string interpolation", and here is a very nice implementation in D that you can use the way you describe: https://github.com/Abscissa/scriptlike/blob/4350eb745531720764861c82e0c4e689861bb17e/src/scriptlike/core.d#L139
Here is a blog post about this library: https://p0nce.github.io/d-idioms/#String-interpolation-as-a-library
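The character-by-character scan from the question, including the backslash escape and the #{name} scope, is language-neutral. A minimal sketch of it follows, written in Java (the language the question mentions coming from) rather than D. TinyTemplate and render are hypothetical names, and the choices of replacing unknown names with an empty string and copying an unterminated scope literally are assumptions.

```java
import java.util.Map;

// Hypothetical sketch of the scanning approach from the question:
// a backslash escapes the next character, #{name} is replaced by a value.
class TinyTemplate {
    static String render(String template, Map<String, String> values) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            char c = template.charAt(i);
            if (c == '\\' && i + 1 < template.length()) {
                // Escape: copy the next character literally and skip both.
                out.append(template.charAt(i + 1));
                i += 2;
            } else if (c == '#' && template.startsWith("{", i + 1)) {
                // Start of scope: look ahead for the closing brace.
                int close = template.indexOf('}', i + 2);
                if (close < 0) {
                    // Unterminated scope: copy the rest unchanged (assumption).
                    out.append(template, i, template.length());
                    break;
                }
                String name = template.substring(i + 2, close);
                // Unknown names render as an empty string (assumption).
                out.append(values.getOrDefault(name, ""));
                // Resume scanning right after the closing brace.
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Working with an index into the whole string, instead of one character at a time, is what makes the look-ahead and the "return to that position" step straightforward.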
I need to search a drive (C:, D:, etc.) for a particular file type (an extension like .xml, .csv, .xls). How do I perform a recursive search that loops through all directories and subdirectories and returns the full path of each matching file? Or where can I get information on this?
VB.NET or C#
Thanks
Edit ~ I am running into some errors, like being unable to access the system volume (access denied), etc. Does anyone know where I can see some sample code for implementing a file search? I just need to search a selected drive and return the full path of every file of the given type that is found.
System.IO.Directory.GetFiles(@"c:\", "*.xml", SearchOption.AllDirectories);
How about this? It avoids the exception often thrown by the in-built recursive search (i.e. you get access-denied to a single folder, and your whole search dies), and is lazily evaluated (i.e. it returns results as soon as it finds them, rather than buffering 2000 results). The lazy behaviour lets you build responsive UIs etc, and also works well with LINQ (especially First(), Take(), etc).
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
static class Program { // formatted for vertical space
static void Main() {
foreach (string match in Search("c:\\", "*.xml")) {
Console.WriteLine(match);
}
}
static IEnumerable<string> Search(string root, string searchPattern) {
Queue<string> dirs = new Queue<string>();
dirs.Enqueue(root);
while (dirs.Count > 0) {
string dir = dirs.Dequeue();
// files
string[] paths = null;
try {
paths = Directory.GetFiles(dir, searchPattern);
} catch { } // swallow
if (paths != null && paths.Length > 0) {
foreach (string file in paths) {
yield return file;
}
}
// sub-directories
paths = null;
try {
paths = Directory.GetDirectories(dir);
} catch { } // swallow
if (paths != null && paths.Length > 0) {
foreach (string subDir in paths) {
dirs.Enqueue(subDir);
}
}
}
}
}
It looks like the recls library - stands for recursive ls - now has a pure .NET implementation. I just read about it in Dr Dobb's.
Would be used as:
using Recls;
using System;
static class Program { // formatted for vertical space
static void Main() {
foreach(IEntry e in FileSearcher.Search(@"c:\", "*.xml|*.csv|*.xls")) {
Console.WriteLine(e.Path);
}
}