TLDR
Is there a dataset with every character of Consolas Regular at size 11 as an image, matching how each character appears in a console? If not, is there a quick way to generate such a dataset?
Context
I am trying to create a neural network that generates ASCII art. The network takes an image as input and outputs a list of symbols. For the cost function to work properly, I need the network's output list of symbols to be mapped back to an image. For example, a 't' character output by the network needs to be mapped into the 6x8 pixel cell corresponding to the shape of the 't'. This happens for each character the network outputs, and then I apply the loss between this ASCII-generated image and the original image.
Unfortunately, I do not have a dataset of Consolas Regular size-11 characters to do this. I could use the Snipping Tool, but that seems too error-prone. Does a dataset of all Consolas Regular characters exist? If not, is there a faster and less error-prone way to create one than snipping each character in a Jupyter notebook?
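One less error-prone route is to render the glyphs programmatically. A minimal sketch, assuming Pillow is installed and a Consolas TTF (consola.ttf) is available locally; the 6x8 cell size is taken from the question and may clip glyphs at size 11, so adjust as needed:

from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "consola.ttf"  # assumption: path to a local Consolas font file
CELL = (6, 8)              # cell size from the question; enlarge if glyphs clip

font = ImageFont.truetype(FONT_PATH, 11)
for code in range(32, 127):                     # printable ASCII range
    ch = chr(code)
    img = Image.new("L", CELL, color=255)       # grayscale, white background
    ImageDraw.Draw(img).text((0, 0), ch, font=font, fill=0)
    img.save(f"consolas_{code:03d}.png")        # one image per character
    # font.getbbox(ch) can be used to measure a glyph if the cell clips it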
Situation
I have a Rails application using PostgreSQL.
Texts are added to the application (ranging in size from a few words to, say, 5,000 words).
The texts get parsed, first automatically and then with some manual revision, to associate each word/position in the text with specific information: part of speech (verb/noun/etc.), base word (running ==> run), definition_id, and grammar tags.
Given a lemma (base word, e.g. "run"), a part of speech (verb/noun), grammar tags, or a definition_id (or a combination of these), I need to be able to find all the other text positions in the database that contain the same information.
Conflict
I can't do a full-text search because, for example, if I click "left" in "I left Nashville", I don't want "turn left at the light" to appear. I just want "leave" as a verb, as well as other forms of "leave" as a verb.
Also, I might want just "left" with a specific definition_id (e.g. "left" used as "the political party", not as "the opposite of right").
In short, I am looking for some advice on which of the following 3 routes I should take (or if there's a 4th or 5th route that I haven't considered).
Solutions
There are three options I can think of:
Option 1: TextPosition
A TextPosition table to store each word position, with columns for each of the above attributes.
This would make searching very easy, but there would be MANY records (one for each position). Maybe that's not a problem? Is storing this many records a bad idea for some specific reason?
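For concreteness, a hedged sketch of what option 1 could look like at the SQL level, queried here from Python with psycopg2 (table and column names are illustrative, not from the original post):

import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS text_positions (
            id             bigserial PRIMARY KEY,
            text_id        bigint  NOT NULL,
            position       integer NOT NULL,
            lemma          text,
            part_of_speech text,
            definition_id  bigint,
            grammar_tags   text[]
        )
    """)
    # an index on the searched columns keeps lookups fast even at scale
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_tp_lemma_pos
        ON text_positions (lemma, part_of_speech)
    """)
    # "left" clicked in "I left Nashville" -> every position of "leave" as a verb
    cur.execute(
        "SELECT text_id, position FROM text_positions"
        " WHERE lemma = %s AND part_of_speech = %s",
        ("leave", "verb"),
    )
    matches = cur.fetchall()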
Option 2: JSON on the Text object
A JSON column on the Text object, to store all word positions in a large array of hashes, or a hash of hashes.
This would add zero records, but: a) building a query to search all texts for certain information would probably be difficult, b) that query would probably be slow, and c) it could take up more storage space than a separate table (TextPosition).
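To illustrate point a), a hedged sketch of what a search against the JSON route could look like, assuming a jsonb column named positions on texts holding an array of per-word hashes (all names here are illustrative):

import psycopg2

conn = psycopg2.connect("dbname=app")
with conn, conn.cursor() as cur:
    # unnest the array to recover which positions matched; a GIN index can
    # speed up whole-column containment, but not this per-element scan
    cur.execute("""
        SELECT t.id, pos.idx - 1 AS text_index
        FROM texts t,
             jsonb_array_elements(t.positions) WITH ORDINALITY AS pos(value, idx)
        WHERE pos.value @> '{"lemma": "leave", "pos": "verb"}'
    """)
    matches = cur.fetchall()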
Option 3: TWO JSON columns: one on the Text object, and one on each dictionary object
A JSON column on each Text object, as in option 2, but used only to render the text (not to search), containing all the information about each position in that text.
Another JSON column on each "dictionary object" (definition, base word, grammar concept, grammar tag), used only for searching (not for rendering the text). This column would track the matches of that particular object across ALL texts, as an array of hashes where each hash is {text_id: x, text_index: y}.
With this option, the search would be "easier", but it would still not be ideal: to find all the text positions that contain a certain attribute, I would have to do the following:
Find the record for that attribute
Extract the text_ids / indexes from the record
Find the texts with those IDs
Extract the matching line from each text, using the index that comes with each text_id within the JSON.
If I were looking for a combination of attributes, I would have to do those four steps for each attribute and then find the intersection between the sets of matches for each attribute (to end up with only the positions that contain all of them), as sketched below.
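In plain Python, those steps and the intersection could look roughly like this (a sketch; the record structure and field names are placeholders):

def positions_for(attribute_record):
    # steps 1-2: the dictionary object has already been found; pull the
    # {text_id, text_index} pairs out of its search JSON
    return {(m["text_id"], m["text_index"]) for m in attribute_record["matches"]}

def search(*attribute_records):
    # steps 3-4, for a combination of attributes: intersect the per-attribute
    # sets so only positions carrying every attribute remain; the texts with
    # those IDs can then be loaded and the lines extracted by index
    return set.intersection(*(positions_for(r) for r in attribute_records))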
Furthermore, when updating a position (for example, if a person indicates that an attribute is wrongly associated and that it should actually be another), I would have to update both JSONs.
Also, will storing 2 JSON columns actually bring any tangible benefit over a TextPosition table? It would probably take up MORE storage space than using a TextPosition table, and for what benefit?
Conclusion
In sum, I am looking for some advice on which of those 3 routes I should follow. I hope the answer is "option 1", but if so, I would love to know what drawbacks/obstacles could come up later when there are a ton of entries.
Thanks, Michael King
Text parsing and searching make my brain hurt. But anytime I have something with the complexity of what you are talking about, ElasticSearch is my tool of choice. You can do some amazingly complex indexing and searching with it.
So my answer is 4) ElasticSearch.
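A hedged sketch of that route with the official Python client (index and field names are illustrative): each word position becomes its own document, so combined-attribute searches are plain boolean filters.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# one document per word position, mirroring the TextPosition idea
es.index(index="positions", document={
    "text_id": 42, "position": 7,
    "lemma": "leave", "pos": "verb", "definition_id": 3,
})

# "left" as a verb: filter on any combination of attributes
hits = es.search(index="positions", query={
    "bool": {"filter": [
        {"term": {"lemma.keyword": "leave"}},
        {"term": {"pos.keyword": "verb"}},
    ]}
})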
I use TIKA to index documents. Then I want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize, but it does not work. For example, there is a document like the one below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most beautiful scenery is my hometown.
There are two paragraphs above. If I search for 'my parents', I hope I can get the whole paragraph "When I was very small, my parents ... a lot of beautiful scenery", not only part of it. I used HighlightFragsize to limit the fragment, but the result is not what I want. Please help; thanks in advance.
You haven't provided a lot of information to go off of but I'm assuming that you're using a highlighter so here are a couple of things you should check for:
1. The field that holds your parsed data: is it stored? Can you see its entire contents?
2. If (1) checks out, is the text longer than 51200 characters? The default highlighter configuration has a setting, maxAnalyzedChars, that is set to 51200. This means the highlighter will not process more than 51200 characters of a highlighted field in a matched document when looking for highlights. If this is the case, increase the value until you get the desired results.
Highlighting on extremely large fields may incur a significant performance penalty which you should be mindful of before choosing a configuration.
See this for more details.
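If that limit turns out to be the problem, you can raise it per query; for example (the value is arbitrary, and my_field stands in for your field name):
q=my+parents&hl=true&hl.fl=my_field&hl.maxAnalyzedChars=500000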
UPDATE
I don't think there's any parameter called HighlightFragsize but there's one called hl.fragsize which can do what you want when set to zero.
Try the following query and see if it works for you:
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
UPDATE 2
I don't think there's a direct way to do what you're looking for. You could possibly split up your field into a multi-valued field, with each paragraph stored as a separate value.
You can then possibly use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need.
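For example (a sketch; the field name and limits are placeholders to tune for your index):
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0&hl.preserveMulti=true&hl.maxMultiValuedToExamine=1000&hl.maxMultiValuedToMatch=1000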
I am trying to use Lucene/Solr for full-text searching of documents that are structured like books. The structure of a book looks like:
Book
- has many Chapters
- has many Lines
I can put each line of text into the system as a separate document, and text search works fine. The problem I'm running into is that sometimes the query is spread across two lines of text, so neither line matches very well and the match isn't made.
Simple Example:
It was the best of times.
It was the worst of times.
A query for "best worst" wouldn't match either line very well individually but together they would match very well.
Is there a way in Lucene (or better solution) to do this type of structured data searching?
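One workaround worth sketching (plain Python, not Lucene-specific; whether it fits depends on index size): index overlapping windows of adjacent lines, so a query that spans a line break still lands inside a single document.

lines = [
    "It was the best of times.",
    "It was the worst of times.",
]
# window i joins line i with line i+1, so cross-line matches stay together
windows = [" ".join(lines[i:i + 2]) for i in range(len(lines))]
print(windows[0])  # "It was the best of times. It was the worst of times."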
Does anyone know how to approximate lines from a grayscale image produced by a line segment detector, using OpenCV or the C language? In the attached image you can see that each finger is composed of many line segments. What I need to do is make each finger consist of exactly two parallel lines (i.e., approximate the small segments so each group fits into a single line). I would appreciate any help.
N.B. I'm new to Stack Overflow, so I'm not allowed to post images; for more clarification, here is a link to the image:
http://www.2shared.com/photo/Ff7mFtV3/Optimal.html
Grayscale image produced by the line segment detector (LSD).
What have you done so far? You might need some heuristics. First, add all the segments to a table, calculate the inclination of each segment, and sort the table by inclination. Then treat all segments whose inclinations are within, say, 5% of each other as having exactly the same inclination. This induces a partitioning of the table. You might want to draw each partition in a different color so you can find the right parameter value.
Now you need to 'merge' all segments that have the same inclination and are close together. I'd measure the distance between segments (google an algorithm for that) and sort the segments of each partition by it. Consider merging segments that are closer than, for instance, 3% of the total image height in pixels (find the threshold empirically).
The last step, actually merging the segments, should be very easy compared to the rest.
If you really want to find the fingers, you can stop earlier and compare the groups of equal inclination to check whether two of them are nearly parallel (within 7% or so). The 5 closest pairs of inclinations should be the fingers :-)
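The heuristic above could look roughly like this in Python with OpenCV and NumPy (a sketch under assumptions: an OpenCV build that ships the LSD implementation, an input file named optimal.png, and thresholds that need the empirical tuning described above):

import cv2
import numpy as np

img = cv2.imread("optimal.png", cv2.IMREAD_GRAYSCALE)
lsd = cv2.createLineSegmentDetector()
segs = lsd.detect(img)[0].reshape(-1, 4)   # each row: x1, y1, x2, y2

# step 1: inclination of each segment in degrees, folded into [0, 180)
angles = np.degrees(np.arctan2(segs[:, 3] - segs[:, 1],
                               segs[:, 2] - segs[:, 0])) % 180.0

ANGLE_TOL = 5.0                  # "same inclination" tolerance, to tune
DIST_TOL = 0.03 * img.shape[0]   # ~3% of image height, to tune

merged, used = [], np.zeros(len(segs), dtype=bool)
for i in range(len(segs)):
    if used[i]:
        continue
    used[i] = True
    cluster = [i]
    mid_i = (segs[i, :2] + segs[i, 2:]) / 2
    for j in range(i + 1, len(segs)):
        mid_j = (segs[j, :2] + segs[j, 2:]) / 2
        # same partition (ignoring the 0/180 wraparound for brevity) and nearby
        if not used[j] and abs(angles[i] - angles[j]) < ANGLE_TOL \
                and np.linalg.norm(mid_i - mid_j) < DIST_TOL:
            cluster.append(j)
            used[j] = True
    # steps 2-3: merge a cluster into the single segment spanning its
    # two most distant endpoints
    pts = segs[cluster].reshape(-1, 2)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    a, b = np.unravel_index(d.argmax(), d.shape)
    merged.append((*pts[a], *pts[b]))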