CollaborativeString.setText does not result in "minimum number of" edits - google-drive-realtime-api

CollaborativeString.setText: Sets the contents of this collaborative string. Note that this method performs a text diff between the current string contents and the new contents so that the string will be modified using the minimum number of text inserts and deletes possible to change the current contents to the newly-specified contents.
This is a minor point but the documentation is technically inaccurate. The minimum number of edits to change one string to another is always at most 2: delete the whole string and insert the new string.
For example, to change baaaaaaaab to caaaaaaaac, the realtime api does the sensible thing, which is to use a delete event for each b and a corresponding insert event for each c.
Out of curiosity, can the exact text diff algorithm for this be made public? I have tried several diff algorithms which haven't reproduced the exact algorithm.

I guess the documentation isn't totally clear, but it's the min number of inserts and deletes to get to the end state without restating things that stayed the same.
I doubt we'd want to state anything more specific about the algorithm, since it's subject to change without notice :) Why do you care?
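To illustrate what "without restating things that stayed the same" could look like, here is a minimal Python sketch built on the standard library's difflib. It is not the Realtime API's algorithm, which is not published; edit_ops is a hypothetical helper that only shows the general idea of skipping unchanged runs.

```python
from difflib import SequenceMatcher


def edit_ops(old, new):
    """Return (op, position, text) tuples, skipping runs that did not change.

    Delete positions index into `old`; insert positions index into `new`.
    """
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            ops.append(("delete", i1, old[i1:i2]))
        if tag in ("insert", "replace"):
            ops.append(("insert", j1, new[j1:j2]))
    return ops


# The example from the question: only the two b's are touched.
ops = edit_ops("baaaaaaaab", "caaaaaaaac")
for op in ops:
    print(op)
```

For the strings in the question this produces one delete and one insert at each end and leaves the run of a's alone, which matches the behaviour described above.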


(SPSS) Assign values to remaining time point based on value on another variable, and looping for each case

I am currently working on analyzing a within-subject dataset with 8 time-ordered assessment points for each subject.
The variables of interest in this example are ID, time point, and accident.
I want to create two variables: accident_intercept and accident_slope, based on the value on accident at a particular time point.
For the accident_intercept variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the values for that time point and the remaining time points to be 1.
For the accident_slope variable, once a participant indicated the occurrence of an accident (e.g., accident = 1) at a specific time point, I want the value of that time point to be 0, but count up by 1 for the remaining time points until the end time point, for each subject.
The main challenge here is that the process stated above needs to be repeated/looped for each participant, and each participant occupies 8 rows of data.
Here is what the newly created variables would look like:
I have looked into the documentation for different SPSS syntax, such as LOOP and the LAG/LEAD functions. I also tried to break my task into different components and google each one. However, I have not made any progress :)
I would be really grateful for any help and direction that you can provide.
Here is one way to do what you need using aggregate to calculate "accident time":
if accident=1 accidentTime=TimePoint.
aggregate out=* mode=addvariables overwrite=yes /break=ID/accidentTime=max(accidentTime).
if TimePoint>=accidentTime Accident_Intercept=1.
if TimePoint>=accidentTime Accident_Slope=TimePoint-accidentTime.
recode Accident_Intercept Accident_Slope accidentTime (miss=0).
Here is another approach using the lag function:
compute Accident_Intercept=0.
if accident=1 Accident_Intercept=1.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Intercept=1.
compute Accident_Slope=0.
if $casenum>1 and id=lag(id) and lag(Accident_Intercept)=1 Accident_Slope=lag(Accident_Slope) +1.
exe.
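For readers who want to check the expected values outside SPSS, the same logic can be sketched in plain Python. This is only an illustration under assumptions: the rows are dictionaries with hypothetical keys id, time, and accident, and the first reported accident per subject is the one that counts.

```python
def add_accident_vars(rows):
    """Add accident_intercept and accident_slope to each row.

    rows: list of dicts with keys 'id', 'time', 'accident' (hypothetical names).
    """
    # Earliest time point at which each subject reported an accident.
    first_accident = {}
    for r in rows:
        if r["accident"] == 1:
            prev = first_accident.get(r["id"], r["time"])
            first_accident[r["id"]] = min(prev, r["time"])

    out = []
    for r in rows:
        t0 = first_accident.get(r["id"])
        hit = t0 is not None and r["time"] >= t0
        out.append({
            **r,
            "accident_intercept": 1 if hit else 0,
            "accident_slope": r["time"] - t0 if hit else 0,
        })
    return out


# Subject 1 reports an accident at time point 3; subject 2 never does.
rows = [{"id": 1, "time": t, "accident": int(t == 3)} for t in range(1, 9)]
rows += [{"id": 2, "time": t, "accident": 0} for t in range(1, 9)]
result = add_accident_vars(rows)
```

With this input, subject 1 gets an intercept of 0, 0, 1, 1, 1, 1, 1, 1 and a slope of 0, 0, 0, 1, 2, 3, 4, 5, while subject 2 stays at 0 throughout.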

Sort by constant number

I need to randomize Solr (6.6.2) search results, but the order needs to be consistent given a specific seed. This is for a paginated search that returns a limited result set from a much larger one, so I must do the ordering at the query level and not at the application level once the data has been fetched.
Initially I tried this:
https://localhost:8984/solr/some_index/select?q=*:*&sort=random_999+ASC
Where 999 is a constant that is fed in when constructing the query prior to sending it to Solr. The constant value changes for each new search.
This solution works. However, when I run the query a few times, or run it on different Solr instances, the ordering is different.
After doing some reading, random_ generates a number via:
fieldName.hashCode() + context.docBase + (int)top.getVersion()
This means that when the random number is generated, it takes the index version into account. This becomes problematic when using a distributed architecture or when indexes are updated, as is well explained here.
There are various recommended solutions online, but I am trying to avoid writing a custom random override. Is there some type of trick where I can feed in some type of function or equation to the sort param?
For example:
min(999,random_999)
However, this always results in the same order, even when either of the values changes.
This question is somewhat similar to this other question, but not quite.
I searched for answers on SO containing solr.RandomSortField, and while they point out what the issue is, none of them have a solution. It seems the best way would be to override the solr.RandomSortField logic, but it's not clear how.
Prior Research
https://lucene.472066.n3.nabble.com/Random-sorting-and-result-consistency-across-successive-calls-based-on-seed-td4170508.html
Solr: Random sort order after index version change
https://mail-archives.apache.org/mod_mbox/lucene-dev/201811.mbox/%3CJIRA.13196983.1541639245000.300557.1541639520069#Atlassian.JIRA%3E
Solr - Return random results (Sort by Random)
https://realize.be/blog/random-results-apache-solr-and-drupal
https://lucene.472066.n3.nabble.com/Sorting-with-customized-function-of-score-td3987281.html
Even after implementing a custom random sort field, the results still differed across instances of Solr.
I ended up adding a new field, populated at index time, that holds a 32-bit hash of an ID field that already existed in the document.
I then built a "stateless" linear congruential generator to produce a set of acceptably random numbers to use for sorting:
?sort=mod(product(hash_int_id,{seedConstant},982451653), 104395301) asc
Since this function technically passes a new seed for each row, and because it does not store state (like rand.Next() would), this solution is admittedly inferior and it is not a true PRNG; however, it does seem to get me most of the way there. Note that you will have to tune your values depending on the size of your data set and the size of the values in your hash_int_id equivalent field.
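The idea behind that sort expression can be sketched in Python. This is an illustration only, not what Solr executes: sort_key and seeded_page are hypothetical names, Python integers are exact whereas Solr may evaluate function queries in floating point, and the sample hash values are made up.

```python
MULTIPLIER = 982451653   # constant taken from the sort expression above
MODULUS = 104395301      # constant taken from the sort expression above


def sort_key(hash_int_id, seed):
    """Stateless pseudo-random key: depends only on the document hash and the seed."""
    return (hash_int_id * seed * MULTIPLIER) % MODULUS


def seeded_page(hashes, seed, start, rows):
    """Return one page of hashes in the seeded order (like start/rows in Solr)."""
    ordered = sorted(hashes, key=lambda h: sort_key(h, seed))
    return ordered[start:start + rows]


hashes = [1, 2, 3]  # stand-ins for hash_int_id values
print(seeded_page(hashes, seed=1, start=0, rows=3))
print(seeded_page(hashes, seed=2, start=0, rows=3))
```

Because the key depends only on the stored hash and the seed, the order does not change with the index version, and consecutive pages requested with the same seed line up.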

Most efficient way to pull values that may or may not change?

I am not a trained programmer, but I assist in developing/maintaining macros within our VBA-based systems to expedite various tasks our employees do manually. For instance, copying data from one screen to another. By hand, any instance of this could take 30 seconds to 2 minutes, but with a macro, it could take 2-3 seconds.
Most of the macros we develop rely on the ability to accurately pull data as displayed (not from its relative field!) based on a row/column format for each character. As such, we use a custom command (let's call it, say... Instance.Grab) that pulls what we need from the screen using row x/column y coordinates and the length of what we want to pull. For example, here is how we would normally pull an 8 character string from coordinates 1,1:
dim PulledValue as String
PulledValue = Instance.Grab(1,1,8)
If I ran that code on my question so far, the returned value for our macro would have been "I am not"
Unfortunately, our systems are getting their displays altered to handle values of an increased character length. As such, the coordinates of the data we're pulling are being altered significantly. Rather than go through our macros and change the coordinates and length manually in each macro (which would need to be repeated if the screen formats change again), I'm converting our macros so that any time they need to pull a string, we can simply change the coordinates/length in a central location.
My question is, what would be the best way to handle this task? I've thought of a few ideas, but want to maximize effectiveness and minimize the time I spend developing it, given my limited programming experience. For the sake of this, let's call what I need to make happen CoorGrab, and where an array is needed, make an array called CoorArray:
1) creating Public Function CoorGrab(ThisField As Variant) - if I did it this way, then I would simply list all the needed coordinate/length sets based on the variant I enter, then pull whichever set is needed using a 3 dimensional array. For instance: CoorGrab(situationA) would return CoorArray(5, 7, 15). This would be easy enough to edit for one of us who knows something about programming, but if we're not around for any reason, there could be issues.
2) creating all the needed coordinates in public arrays in the module. I'm not overly familiar with how to implement this, but I think I read up on something called public constants? I kinda like this idea for its simplicity, but am hesitant to use any variable or array as public.
3) creating a .txt file that has all the needed data and a label to identify each entry, and saving it to a shared drive that any terminal can access when running these macros. This would be the easiest for a non-programmer to jump in and edit in case I or one of our other programming-savvy employees isn't available, but it seems like far more work than is needed, and I fear what could happen if the .txt file got a typo or was accidentally deleted.
Any thoughts on how I should proceed? Is one of the above options inherently better/easier than the others? Or is there another way to handle this situation that I didn't cover? Any info or advice you all can provide would be greatly appreciated!
8/2/15 Note - I should probably mention that the VBA is used as part of a terminal emulator with custom applications for the needs of our department. I don't manage the emulator or its applications, nor do I have system admin access; I just create/edit macros used within it to streamline some of the ways our users handle their workloads. Of the three of us who do this, I'm the least skilled at programming, but also the only one who could be pulled in to update them before the changes take effect.
Your way is not so bad. I would:
Use a string label as the parameter for CoorGrab
Return a range instead of a string (because you can use a single-cell range as text, and you keep a trace of where your data is)
Public Function CoorGrab(ByVal label As String) As Range
Create an Excel sheet with 3 rows: 1 = label, 2 = x, 3 = y (you could add a 4th if you need to search in another sheet)
Have CoorGrab() find the label in the Excel sheet and return X / Y
If developers aren't available, someone else just has to edit the Excel sheet.
You could also keep the coordinates in an external Excel file rather than the local one, or use a server copy to update everybody's files (read the file from the server, then add or update every label that is in the server file but not in the local file).
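The label lookup itself is language-independent, so here is a minimal Python sketch of it rather than VBA. Everything in it is an assumption made for illustration: the table is a small CSV with hypothetical columns label, row, col, and length, the screen is a list of text lines, and coor_grab stands in for a wrapper around Instance.Grab.

```python
import csv
import io


def load_coords(text):
    """Parse a 'label,row,col,length' table into a dict keyed by label."""
    table = {}
    for rec in csv.DictReader(io.StringIO(text)):
        table[rec["label"]] = (int(rec["row"]), int(rec["col"]), int(rec["length"]))
    return table


def coor_grab(screen, coords, label):
    """Pull the text for `label` from the screen using 1-based row/column."""
    row, col, length = coords[label]
    return screen[row - 1][col - 1:col - 1 + length]


# The central table: the only place to edit when the screen layout changes.
COORD_TABLE = "label,row,col,length\nintro,1,1,8\n"

screen = ["I am not a trained programmer"]
coords = load_coords(COORD_TABLE)
print(coor_grab(screen, coords, "intro"))
```

Macros then refer only to labels such as intro, so a layout change means editing one table instead of every macro.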

What is a good approach to check if an item is in a very big hashset?

I have a hashset that cannot be loaded into memory in its entirety. Let's say it is split into parts A, B, and C; each part can be loaded into memory, but not all of them at the same time.
I also have random entries coming in from time to time, and I can hardly tell which part each one might belong to. So one approach would be to load A first and check, then B, then C. But the next entry could belong to B, so I would have to unload C, then load A, then B... Hopefully that makes sense.
This would clearly be very slow, so I wonder whether there is a better way to do it (assuming a db is not an option).
I assume you currently don't use any criterion to decide whether an entry goes into A, B, or C; in other words, A, B, and C are just the result of dividing the whole data set into 3 equal parts. Am I right? If so, I recommend applying a criterion when you add a new entry to your set. For example, if your entries are numbers, put those that start with 0-3 into A, those that start with 4-6 into B, and those that start with 7-9 into C. When you search for something, you know in advance whether to look in A, B, or C. If your entries are words, the same solution applies, but the criterion is the first letter. In that case it may be better to use not 3 sets but 26, the size of the English alphabet. Note that you still have to keep one of the sets in memory. The advantage is that you do at most 1 load/unload operation per lookup and don't need to check all the sets, because you know which one can actually hold your value. This idea is widely used in databases, where it is called partitioning. If your sets store neither numbers nor words but some complex objects, you can still come up with a simple criterion.
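A minimal Python sketch of this partitioning idea follows. It swaps the first-digit or first-letter rule for a stable hash of the entry, which tends to spread entries more evenly; PartitionedSet, partition_of, and the load_partition callback are hypothetical names, and the in-memory dict stands in for whatever storage holds each part.

```python
import hashlib


def partition_of(item, n):
    """Map an item to one of n partitions using a stable hash."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


class PartitionedSet:
    """Keeps at most one partition in memory at a time."""

    def __init__(self, n, load_partition):
        self.n = n
        self.load_partition = load_partition  # callback: index -> set
        self.loaded_index = None
        self.loaded = set()
        self.loads = 0  # how many times a partition had to be loaded

    def __contains__(self, item):
        idx = partition_of(item, self.n)
        if idx != self.loaded_index:
            # Only the one partition that could hold the item is loaded.
            self.loaded = self.load_partition(idx)
            self.loaded_index = idx
            self.loads += 1
        return item in self.loaded


# Build three partitions with the same rule that lookups will use.
words = ["apple", "banana", "cherry", "date", "elderberry"]
parts = {i: set() for i in range(3)}
for w in words:
    parts[partition_of(w, 3)].add(w)

big_set = PartitionedSet(3, lambda i: parts[i])
print("banana" in big_set, "zucchini" in big_set)
```

Each lookup touches exactly one partition, so the worst case is one load per entry instead of cycling through A, B, and C.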

WPF RichTextBox TextChanged event - how to find deleted or inserted text?

While creating a customized editor with RichTextBox, I've faced the problem of finding the deleted/inserted text using the information provided with the TextChanged event.
The instance of TextChangedEventArgs has some useful data, but I guess it does not cover all the needs. Suppose a scenario in which multiple paragraphs are inserted and, at the same time, the selected text (which itself spanned multiple paragraphs) is deleted.
With the instance of TextChangedEventArgs, you have a collection of text changes, and each change only provides you with the number of removed or added symbols and the position of it.
The only solution I have in mind is to keep a copy of the document and apply the given list of changes to it. But since the instances of TextChange only give us the number of inserted/removed symbols (and not the symbols themselves), we need to put in some special symbol (for example, '?') to denote unknown symbols while we transform our original copy of the document.
After applying all changes to the original copy of the document, we can then compare it with the RichTextBox's updated document and find the mappings between the unknown symbols and the real ones. And finally, get what we want!
Has anybody tried this before? I need your suggestions on the whole strategy, and what you think about this approach.
Regards
It primarily depends on your use of the text changes. When the sequence includes both inserts and deletes it is theoretically impossible to know the details of each insert, since some of the symbols inserted may have subsequently been deleted. Therefore you have to choose what results you really want:
For some purposes you must know the exact sequence of changes even if some of the inserted symbols must be left as "?".
For other purposes you must know exactly how the new text differs from the old but not the exact sequence in which the changes were made.
I will describe techniques to achieve each of these results. I have used both techniques in the past, so I know they are effective.
To get the exact sequence
This is more appropriate if you are implementing a history or undo log or searching for specific actions.
For these uses, the process you describe is probably best, with one possible change: Instead of "finding the mappings between the unknown symbols and the real ones", simply run the scan forward to find the text of each "Delete" then run it backward to find the text of each "Insert".
In other words:
Start with the initial text and process the changes in order. For each insert, insert '?' symbols. For each delete, remove the specified number of symbols and record them as the text deleted.
Start with the final text and process the changes in reverse order. For each delete, insert '?' symbols. For each insert, remove the specified number of symbols and record them as the text inserted.
When this is complete, all of your "Insert" and "Delete" change entries will have the associated text to the best of our knowledge, and any text that was inserted and immediately deleted will be '?' symbols.
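The two passes can be sketched in Python, independent of WPF. The sketch assumes each change is a hypothetical (offset, removed, added) tuple, that a change removes its symbols and then inserts new ones at the same offset, and that each offset refers to the document as it stood after the previous changes; recover_change_text is a made-up name.

```python
def recover_change_text(old, new, changes):
    """Return a (deleted_text, inserted_text) pair for each change.

    changes: list of (offset, removed, added) tuples in the order applied.
    Text that cannot be known is reported as '?' symbols.
    """
    # Forward pass: start from the old text and record what each delete removed.
    deleted = []
    doc = old
    for offset, removed, added in changes:
        deleted.append(doc[offset:offset + removed])
        doc = doc[:offset] + "?" * added + doc[offset + removed:]

    # Backward pass: start from the new text and record what each insert added.
    inserted = [""] * len(changes)
    doc = new
    for k in range(len(changes) - 1, -1, -1):
        offset, removed, added = changes[k]
        inserted[k] = doc[offset:offset + added]
        doc = doc[:offset] + "?" * removed + doc[offset + added:]

    return list(zip(deleted, inserted))


# "XY" is inserted at the start, then the "Y" is deleted again.
pairs = recover_change_text("hello", "Xhello", [(0, 0, 2), (1, 1, 0)])
print(pairs)
```

In the example the first change is reported as inserting "X?" and the second as deleting "?", because the symbol that was inserted and then removed never appears in either the old or the new text.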
To get the difference
This is more appropriate for revision marking or version comparison.
For these uses, simply use the text change information to compute a set of integer ranges in which changes might be found, then use a standard diff algorithm to find the actual changes. This tends to be very efficient in processing incremental changes but still gives you the best updates.
This is particularly nice when you paste in a replacement paragraph that is almost identical to the original: the text change information alone will indicate that the whole paragraph is new, but using diff (i.e. this technique) will mark only those symbol runs that are actually different.
The code for computing the change range is simple: Represent the change as four integers (oldstart, oldend, newstart, newend). Run through each change:
If changestart is before newstart, reduce newstart to changestart and reduce oldstart an equal amount
If changeend is after newend, increase newend to changeend and increase oldend an equal amount
Once this is done, extract range [oldstart, oldend] from the old document and the range [newstart, newend] from the new document, then use the standard diff algorithm to compare them.
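Here is one way the range computation could look in Python. It is an interpretation of the steps above, not a verified port: changes are hypothetical (offset, removed, added) tuples whose offsets refer to the document after the previous changes, the ranges are half-open rather than inclusive, and the length adjustment after each change is a step this sketch adds so that later offsets stay aligned.

```python
def changed_ranges(changes):
    """Return (old_start, old_end, new_start, new_end), or None if no changes.

    The old range is old[old_start:old_end]; the new range is
    new[new_start:new_end]. Everything outside them is unchanged.
    """
    if not changes:
        return None
    first_offset = changes[0][0]
    old_start = old_end = new_start = new_end = first_offset
    for offset, removed, added in changes:
        # Change starts before the tracked range: grow both ranges to the left.
        if offset < new_start:
            shift = new_start - offset
            new_start -= shift
            old_start -= shift
        # Change ends after the tracked range: grow both ranges to the right.
        change_end = offset + removed
        if change_end > new_end:
            shift = change_end - new_end
            new_end += shift
            old_end += shift
        # The change itself alters only the length of the new range.
        new_end += added - removed
    return old_start, old_end, new_start, new_end


old = "hello world foo"
new = "hi hello there foo"
o1, o2, n1, n2 = changed_ranges([(6, 5, 5), (0, 0, 3)])
print(repr(old[o1:o2]), repr(new[n1:n2]))
```

The two slices are then the only text that needs to be handed to the diff algorithm; the trailing " foo" lies outside both ranges and is never compared.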
