WPF RichTextBox TextChanged event - how to find deleted or inserted text?

While creating a customized editor with RichTextBox, I've faced the problem of finding the deleted/inserted text from the information provided by the TextChanged event.
The instance of TextChangedEventArgs has some useful data, but I guess it does not cover all the needs. Suppose a scenario in which multiple paragraphs are inserted and, at the same time, the selected text (which itself spanned multiple paragraphs) has been deleted.
With the instance of TextChangedEventArgs, you have a collection of text changes, and each change only provides you with the number of removed or added symbols and its position.
The only solution I have in mind is to keep a copy of the document and apply the given list of changes to it. But as the instances of TextChange only give us the number of inserted/removed symbols (and not the symbols themselves), we need to put some special symbol (for example, '?') to denote unknown symbols while we transform our copy of the document.
After applying all changes to the copy, we can then compare it with the RichTextBox's updated document, find the mappings between unknown symbols and the real ones, and finally get what we want!
Has anybody tried this before? I need your suggestions on the whole strategy and what you think about this approach.
Regards

It primarily depends on your use of the text changes. When the sequence includes both inserts and deletes, it is theoretically impossible to know the details of each insert, since some of the symbols inserted may have subsequently been deleted. Therefore you have to choose what results you really want:
For some purposes you need to know the exact sequence of changes, even if some of the inserted symbols must be left as "?".
For other purposes you need to know exactly how the new text differs from the old, but not the exact sequence in which the changes were made.
I will describe techniques to achieve each of these results. I have used both techniques in the past, so I know they are effective.
To get the exact sequence
This is more appropriate if you are implementing a history or undo log or searching for specific actions.
For these uses, the process you describe is probably best, with one possible change: Instead of "finding the mappings between the unknown symbols and the real ones", simply run the scan forward to find the text of each "Delete" then run it backward to find the text of each "Insert".
In other words:
Start with the initial text and process the changes in order. For each insert, insert '?' symbols. For each delete, remove the specified number of symbols and record them as the text deleted.
Start with the final text and process the changes in reverse order. For each delete, insert '?' symbols. For each insert, remove the specified number of symbols and record them as the text inserted.
When this is complete, all of your "Insert" and "Delete" change entries will have the associated text to the best of your knowledge, and any text that was inserted and immediately deleted will be '?' symbols.
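To make the two passes concrete, here is a minimal sketch (Python, purely illustrative; it assumes each change carries an offset plus added/removed counts, as WPF's TextChange exposes via Offset, AddedLength and RemovedLength, and that a change removes before it inserts at the same offset):

def recover_change_text(initial_text, final_text, changes):
    # Forward pass: each delete's symbols are still present in the
    # evolving text, so record them; stand in '?' for inserts.
    text = initial_text
    for ch in changes:
        o, added, removed = ch['offset'], ch['added'], ch['removed']
        ch['deleted_text'] = text[o:o + removed]   # may contain '?'
        text = text[:o] + '?' * added + text[o + removed:]
    # Backward pass: each insert's symbols are still present when
    # walking back from the final text; stand in '?' for deletes.
    text = final_text
    for ch in reversed(changes):
        o, added, removed = ch['offset'], ch['added'], ch['removed']
        ch['inserted_text'] = text[o:o + added]    # may contain '?'
        text = text[:o] + '?' * removed + text[o + added:]
    return changes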
To get the difference
This is more appropriate for revision marking or version comparison.
For these uses, simply use the text change information to compute a set of integer ranges in which changes might be found, then use a standard diff algorithm to find the actual changes. This tends to be very efficient in processing incremental changes but still gives you minimal, accurate updates.
This is particularly nice when you paste in a replacement paragraph that is almost identical to the original: the text change information alone will indicate the whole paragraph is new, but diff (i.e. this technique) will mark only those symbol runs that are actually different.
The code for computing the change range is simple: Represent the change as four integers (oldstart, oldend, newstart, newend). Run through each change:
If changestart is before newstart, reduce newstart to changestart and reduce oldstart by an equal amount
If changeend is after newend, increase newend to changeend and increase oldend by an equal amount
Once this is done, extract the range [oldstart, oldend] from the old document and the range [newstart, newend] from the new document, then use the standard diff algorithm to compare them.
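A sketch of that computation (Python, used just for illustration; it assumes at least one change, and that each change span is expressed in final-document coordinates, e.g. offset to offset + added length, so the untouched prefix and suffix can be measured from both ends):

import difflib

def diff_changed_region(old_text, new_text, change_spans):
    # change_spans: list of (start, end) in new-document coordinates.
    newstart = min(s for s, _ in change_spans)
    newend = max(e for _, e in change_spans)
    # Text before newstart and after newend is untouched, so the old
    # range is found by measuring the same distances from both ends.
    oldstart = newstart
    oldend = len(old_text) - (len(new_text) - newend)
    sm = difflib.SequenceMatcher(None,
                                 old_text[oldstart:oldend],
                                 new_text[newstart:newend])
    # Keep only the runs that actually differ, mapped back to
    # whole-document coordinates.
    return [(tag, oldstart + i1, oldstart + i2, newstart + j1, newstart + j2)
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal']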

Related

I've got a pipe that consists of 5 pieces, each having 5 properties

Inlet -> front -> middle -> rear -> outlet
Each of those five properties has a value between 4 and 40. Now I want to find a specific match such that, when a single property is summed across the five pipe pieces, each sum lands on a full 10 or 5. There might be hundreds of different pipe pieces, all with different properties.
So if I have all 5 pieces and, when summed, their properties go like 54,51,23,71,37, that is not good and not what I'm looking for.
Instead 55,50,25,70,40. That would be perfect.
My trouble is there are so many of the pieces that it would be insane to do the matching manually, and new ones come up frequently.
I have manually inserted about 100 of these already into SQLite, but they should be easy to convert into Excel or other database formats, so the answer can relate to anything like MySQL or Google Sheets.
I need a calculation that takes every piece into account and results either in "no match" or tells me the id of each piece that is required for a match; if multiple matches are available, it lists them separately.
Edit: Even just the math needed to do this kind of calculation would be a lot of help here, as I'm not much of a math guy myself. I guess there should be a reference piece I need to use, which then gets checked against every possible scenario.
If the value you want to verify is in A1, use: =ROUND(A1/5,0)*5
If the pipes may not be shorter than the given values (i.e. you may only round up), use =CEILING(A1,5)
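Beyond rounding a single value, the full match the question asks for amounts to a combinatorial search: pick one piece per position and check whether every property sum is divisible by 5. A brute-force sketch (Python, purely illustrative; the piece lists and 5-property tuples are assumed inputs):

from itertools import product

def find_matches(inlets, fronts, middles, rears, outlets):
    # Each argument is a list of (piece_id, (p1, p2, p3, p4, p5)).
    matches = []
    for combo in product(inlets, fronts, middles, rears, outlets):
        sums = [sum(props[i] for _, props in combo) for i in range(5)]
        if all(s % 5 == 0 for s in sums):       # each sum ends in 0 or 5
            matches.append(tuple(pid for pid, _ in combo))
    return matches

With hundreds of pieces per slot the full product explodes, so in practice you would first group pieces by their per-property remainders modulo 5 and only enumerate remainder combinations that sum to 0.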

Parsing text, then searching it: one entry per position, vs. 1 JSON column per text

Situation
I have a Rails application using PostgreSQL.
Texts are added to the application (ranging in size from a few words to, say, 5,000 words).
The texts get parsed, first automatically, and then with some manual revision, to associate each word/position in the text with specific information (verb/noun/etc, base word (running ==> run), definition_id, grammar tags)
Given a lemma (base word, ex. "run"), or a part of speech (verb/noun), or grammar tags, or a definition_id (or a combination), I need to be able to find all the other text positions in the database that contain the same information.
Conflict
I can't do a full-text search because, for example, if I click "left" in "I left Nashville", I don't want "turn left at the traffic light" to appear. I just want "left" as a form of the verb "leave", as well as the other forms of "leave" as a verb.
Also, I might want just "left" with a specific definition_id (e.g. "left" used as "the political party", not as "the opposite of the right").
In short, I am looking for some advice on which of the following 3 routes I should take (or if there's a 4th or 5th route that I haven't considered).
Solutions
There are three options I can think of:
Option 1: TextPosition
A TextPosition table to store each word position, with columns for each of the above attributes.
This would make searching very easy, but there would be MANY records (one for each position). Maybe that's not a problem? Is storing this number of records a bad idea for some specific reason?
Option 2: JSON on the Text object
A JSON column on the Text object, to store all word positions in a large array of hashes, or a hash of hashes.
This would add zero records, but, a) Building a query to search all texts with certain information would probably be difficult, b) That query would probably be slow, and c) It could take up more storage space than a separate table (TextPosition).
Option 3: TWO JSON columns: one on the Text object, and one on each dictionary object
A JSON column on each Text object, as in option 2, but only to render the text (not to search), containing all the information about each position in that same text.
Another JSON in each "dictionary object" (definition, base word, grammar concept, grammar tag), just for searching (not to render the text). This column would track the matches of this particular object across ALL texts. It would be an array of hashes, where each hash would be {text_id: x, text_index: y}.
With this option, the search would be "easier", but it would still not be ideal: to find all the text positions that contain a certain attribute, I would have to do the following:
Find the record for that attribute
Extract the text_ids / indexes from the record
Find the texts with those IDs
Extract the matching position from each text, using the index that comes with each text_id within the JSON.
If I were looking for a combination of attributes, I would have to do those four steps for each attribute, and then find the intersection between the sets of matches for each attribute (to end up only with the positions that contain all of them).
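That intersection step, at least, is cheap once each attribute's matches are gathered as a set of (text_id, index) pairs; a Python illustration with made-up values:

lemma_hits = {(1, 4), (1, 9), (2, 0)}   # positions matching the lemma
defn_hits = {(1, 9), (3, 2)}            # positions matching the definition_id
both = lemma_hits & defn_hits           # {(1, 9)}

The real cost is repeating the four lookup steps per attribute, not the intersection itself.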
Furthermore, when updating a position (for example, if a person indicates that an attribute is wrongly associated and that it should actually be another), I would have to update both JSONs.
Also, will storing 2 JSON columns actually bring any tangible benefit over a TextPosition table? It would probably take up MORE storage space than using a TextPosition table, and for what benefit?
Conclusion
In sum, I am looking for some advice on which of those 3 routes I should follow. I hope the answer is "option 1", but if so, I would love to know what drawbacks/obstacles could come up later when there are a ton of entries.
Thanks, Michael King
Text parsing and searching make my brain hurt. But anytime I have something with the complexity of what you are talking about, ElasticSearch is my tool of choice. You can do some amazingly complex indexing and searching with it.
So my answer is 4) ElasticSearch.

How to get paragraph result from solr keyword search after using tika to index some documents?

I use TIKA to index documents. Then I want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize but it does not work. For example, there is a document like below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most beautiful scenery is my hometown.
There are two paragraphs above. If I search 'my parents', I hope I can get the whole first paragraph ("When I was very small, my parents ... a lot of beautiful scenery"), not only part of it. I used HighlightFragsize to limit the sentence, but the result is not what I want. Please help. Thanks in advance.
You haven't provided a lot of information to go off of, but I'm assuming that you're using a highlighter, so here are a couple of things you should check for:
The field that holds your parsed data - is it stored? Can you see the entire contents?
If (1), is the text longer than 51200 chars? The default highlighter configuration has a setting maxAnalyzedChars that is set to 51200. This means that the highlighter will not process more than 51200 characters from a highlighted field in a matched document to look for highlights. If this is the case, increase this value until you get the desired results.
Highlighting on extremely large fields may incur a significant performance penalty which you should be mindful of before choosing a configuration.
See this for more details.
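If that limit is what you're hitting, it can also be raised per request rather than in the config; the field name and value below are placeholders:
q=my+parents&hl=true&hl.fl=my_field&hl.maxAnalyzedChars=500000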
UPDATE
I don't think there's any parameter called HighlightFragsize but there's one called hl.fragsize which can do what you want when set to zero.
Try the following query and see if it works for you:
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
UPDATE 2
I don't think there's a direct way to do what you're looking for. You could possibly split up your field into a multi-valued field, with each paragraph stored as a separate value.
You can then possibly use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need.
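For example, a query along these lines (my_field and the numeric limits are placeholders for your own values):
q=my+parents&hl=true&hl.fl=my_field&hl.fragsize=0&hl.preserveMulti=true&hl.maxMultiValuedToExamine=1000&hl.maxMultiValuedToMatch=1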

Most efficient way to pull values that may or may not change?

I am not a trained programmer, but I assist in developing/maintaining macros within our VBA-based systems to expedite various tasks our employees do manually. For instance, copying data from one screen to another. By hand, any instance of this could take 30 seconds to 2 minutes, but with a macro, it could take 2-3 seconds.
Most of the macros we develop rely on the ability to accurately pull data as displayed (not from its relative field!) based on a row/column format for each character. As such, we employ a custom command (let's call it, say... Instance.Grab) that pulls what we need from the screen using row x/column y coordinates and the length of what we want to pull. For example, where we would normally pull an 8-character string from coordinates 1,1:
Dim PulledValue As String
PulledValue = Instance.Grab(1,1,8)
If I ran that code on my question so far, the returned value for our macro would have been "I am not"
Unfortunately, our systems are getting their displays altered to handle values of an increased character length. As such, the coordinates of the data we're pulling are getting altered significantly. Rather than go through our macros and change the coordinates and lengths manually in each one (which would need to be repeated if the screen formats change again), I'm converting our macros so that any time they need to pull the needed string, we can simply change the needed coordinate/length in a central location.
My question is, what would be the best way to handle this task? I've thought of a few ideas, but want to maximize effectiveness and minimize the time I spend developing it, given my limited programming experience. For the sake of this, let's call what I need to make happen CoorGrab, and where an array is needed, make an array called CoorArray:
1) Creating Public Function CoorGrab(ThisField As Variant) - if I did it this way, then I would simply list all the needed coordinate/length sets based on the variant I enter, then pull whichever set is needed using a 3-dimensional array. For instance: CoorGrab(situationA) would return CoorArray(5, 7, 15). This would be easy enough to edit for one of us who knows something about programming, but if we're not around for any reason, there could be issues.
2) creating all the needed coordinates in public arrays in the module. I'm not overly familiar with how to implement this, but I think I read up on something called public constants? I kinda like this idea for its simplicity, but am hesitant to use any variable or array as public.
3) Creating a .txt file that has all the needed data and a label to identify each entry, and saving it to a shared drive that any terminal can access when running these macros. This would be the easiest for a non-programmer to jump in and edit in case I or one of our other programming-savvy employees aren't available, but it seems like far more work than is needed, and I fear what could happen if the .txt file got a typo or was accidentally deleted.
Any thoughts on how I should proceed? Is one of the above options inherently better/easier than the others? Or is there another way to handle this situation that I didn't cover? Any info or advice you all can provide would be greatly appreciated!
8/2/15 Note - Should probably mention the VBA is used as part of a terminal emulator with custom applications for the needs of our department. I don't manage the emulator or its applications, nor do I have system admin access; I just create/edit macros used within it to streamline some of the ways our users handle their workloads. Of the three of us who do this, I'm the least skilled at programming, but also the only one who could be pulled to update these macros before the changes take effect.
Your way is not so bad, I would:
Use a string label as the parameter for CoorGrab
Return a Range instead of a String (because you can use a single-cell range as text, and you keep a trace of where your data is):
Public Function CoorGrab(ByVal label As String) As Range
Create an Excel sheet with 3 rows: 1 = label, 2 = x, 3 = y (you could add a 4th if you need to search in another sheet)
Have CoorGrab() find the label in the Excel sheet and return X/Y
If developers aren't available, users just have to edit the Excel sheet.
You could also create an external Excel file to read coordinates from outside the local file, or use it to update everybody's files (read the file from the server, then add/update all labels that are in the server file but not in the local file).

CollaborativeString.setText does not result in "minimum number of" edits

CollaborativeString.setText: Sets the contents of this collaborative string. Note that this method performs a text diff between the current string contents and the new contents so that the string will be modified using the minimum number of text inserts and deletes possible to change the current contents to the newly-specified contents.
This is a minor point but the documentation is technically inaccurate. The minimum number of edits to change one string to another is always at most 2: delete the whole string and insert the new string.
For example, to change baaaaaaaab to caaaaaaaac, the realtime api does the sensible thing, which is to use a delete event for each b and a corresponding insert event for each c.
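For illustration, a standard diff library such as Python's difflib reproduces exactly those two single-character replacements:

import difflib

old, new = "baaaaaaaab", "caaaaaaaac"
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    if tag != 'equal':
        print(tag, old[i1:i2], '->', new[j1:j2])   # prints: replace b -> c, twice

(difflib is just one standard algorithm, not necessarily the one the realtime api uses, which is the open question below.)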
Out of curiosity, can the exact text diff algorithm for this be made public? I have tried several diff algorithms which haven't reproduced the exact algorithm.
I guess the documentation isn't totally clear, but it's the minimum number of inserts and deletes to get to the end state without restating things that stayed the same.
I doubt we'd want to state anything more specific about the algorithm, since it's subject to change without notice :) Why do you care?
