Regex in C - pattern matches too early and drops trailing character(s)

I'm working on an interpolation module in C. I have access to both POSIX and PCRE2 regex parsers. Essentially, I have a block of text which contains tokens. These tokens are substituted with values from a lookup table. Each token has the ability to specify alternate text if said token is not found in the lookup table.
Nominally, a token is formatted as ?{token}. To handle situations where a token may not be defined and you want alternate text, you suffix the token with a ':' (colon) and the alternate, thus: ?{token:alternate}. The alternate can be plain text or another token.
The module can process nested tokens, including a token used as alternate text. Such a definition looks like ?{token:?{alt_token}}, and herein lies the problem: when I parse this, I get ?{token:?{alt_token} as the match. Notice that the closing '}' is missing. When I remove the non-greedy modifier from the part that captures the alternate text, the regex over-grabs.
A pathological example would be ?{some:?{deep:?{replacement:?{inside:all this}}}} Ideally, this would grab and resolve (on each pass) the following:
?{inside:all this}
?{replacement:?{inside:all this}}
?{deep:?{replacement:?{inside:all this}}}
?{some:?{deep:?{replacement:?{inside:all this}}}}
... but what I get is each one missing its closing '}'.
I'm using several capture groups in my regex to group/grab the token and its optional alternate text. Here's my regex:
`"\\Q?{\\E([\\Q:=\\E]{0,1}[A-Za-z_-][0-9A-Za-z_\\.-]*?)(\\:(.*?)){0,1}\\Q}\\E"`
match[0] is the whole RE; match[1] is the token; match[3] is the alternate text
Any suggestions how I might modify my regex to retain the closing '}' when the token to parse ends with more than one '}' ... ?
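One possible way to get the innermost-first passes described above is to forbid braces inside the alternate text, so that a match can only be a token with no nested token left inside it. The sketch below uses Python's re module purely for illustration; the pattern itself should port to PCRE2 (with backslashes doubled in a C string literal), but that is untested here. The names interpolate, table and max_passes are hypothetical and not part of the original module, and the sketch assumes that an unknown token without an alternate resolves to an empty string.

```python
import re

# Alternate text may not contain '{' or '}', so only the innermost token matches.
TOKEN = re.compile(r"\?\{([:=]?[A-Za-z_-][0-9A-Za-z_.-]*)(?::([^{}]*))?\}")

def interpolate(text, table, max_passes=32):
    """Resolve tokens innermost-first, one nesting level per pass."""
    def resolve(match):
        name, alternate = match.group(1), match.group(2)
        if name in table:
            return table[name]
        # Assumption: an unknown token without an alternate becomes "".
        return alternate if alternate is not None else ""

    for _ in range(max_passes):  # bound guards against self-referential values
        text, count = TOKEN.subn(resolve, text)
        if count == 0:
            break
    return text
```

With an empty table, interpolate("?{some:?{deep:?{replacement:?{inside:all this}}}}", {}) works its way outward and ends with "all this". The trade-off of this approach is that alternate text can no longer contain a literal '{' or '}'.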

Related

Azure Search Define Custom Analyzer

I'm defining the index schema. One of the fields is "InvoiceNumber", which can be something like "459", "00459" or "P00459".
When indexing, I want the text "00459" to be tokenized into 2 tokens: "459" and the original "00459".
And I want the text "P00459" to be tokenized into 3 tokens: "459", "00459" and the original "P00459".
Is there a way to define a custom analyzer for this?
Configuring a pattern_capture token filter with an appropriate regex can produce multiple tokens from the same text while preserving the original text.
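To see what such a regex would need to capture, here is a rough Python simulation of the idea (emit the original text plus every capture group). It is not the Lucene or Azure Search implementation: the pattern_capture function, the INVOICE_PATTERNS list and the de-duplication of repeated tokens are assumptions made for this sketch.

```python
import re

def pattern_capture(text, patterns, preserve_original=True):
    """Emit the original text plus every capture group of every match."""
    tokens = [text] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            for group in match.groups():
                # Skip unmatched groups and tokens already emitted.
                if group and group not in tokens:
                    tokens.append(group)
    return tokens

# Hypothetical pattern: the outer group keeps leading zeros, the inner one drops them.
INVOICE_PATTERNS = [r"(0*(\d+))"]

tokens_p = pattern_capture("P00459", INVOICE_PATTERNS)
tokens_z = pattern_capture("00459", INVOICE_PATTERNS)
```

Here tokens_p is ["P00459", "00459", "459"] and tokens_z is ["00459", "459"], which are the token sets asked for in the question.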
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers
https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html
This is the example from the latter link:
"(https?://([a-zA-Z-_0-9.]+))" when matched against the string "http://www.foo.com/index" would return the tokens "http://www.foo.com" and "www.foo.com".

regex with OR condition not working in angularjs [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set; () denotes a logical grouping.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?
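The difference between the character set and the group can be checked quickly. The sketch below uses Python's re module for illustration (the same constructs exist in JavaScript); the sample URLs are made up for the test.

```python
import re

# Character class: matches ONE of the characters w, d, |, o, r, q before '='.
char_class = re.compile(r".*baidu\.com.*[/?].*[wd|word|qw]=")
# Group: matches one of the whole words wd, word, or qw before '='.
group = re.compile(r".*baidu\.com.*[/?].*(?:wd|word|qw)=")

urls = [
    "http://www.baidu.com/s?wd=regex",
    "http://www.baidu.com/s?word=regex",
    "http://www.baidu.com/s?qw=regex",
    "http://www.baidu.com/s?foo=regex",  # should NOT match
]
class_hits = [bool(char_class.match(u)) for u in urls]
group_hits = [bool(group.match(u)) for u in urls]
```

The character-class version accepts all four URLs, including the foo= one, because the single character 'o' is in the set. The grouped version accepts only the first three.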

apache camel <simple trim="true" is not working

I'm reading simple content from a file, say "80631". I validate it against a regex ("^\d+$") to check that it's just digits, but the validation fails. When I inspect the content read from the file, it's something like "80631 ". I tried to trim the whitespace with <simple trim="true">, but it didn't work. Is there another way to trim the whitespace?
<camel:setProperty propertyName="messageId">
<simple trim="true">${body}</simple>
</camel:setProperty>
You should probably show the code to get better help. But <simple trim="true"> ... </simple> trims the output of the expression.
It's not for trimming the message body.
You need to use a message transformation beforehand to trim the message body, or write a regular expression that ignores leading/trailing whitespace.
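As a sketch of the second option, a pattern that tolerates surrounding whitespace could look like the one below. It is written in Python only to show the regex behaviour, not as Camel route code; the body value is an assumed example of what the file might contain.

```python
import re

strict = re.compile(r"\d+")            # digits only, nothing else allowed
lenient = re.compile(r"\s*(\d+)\s*")   # digits with optional surrounding whitespace

body = "80631 \n"  # assumed file content: the digits plus trailing whitespace

strict_ok = strict.fullmatch(body) is not None
lenient_match = lenient.fullmatch(body)
# Group 1 holds the digits without the surrounding whitespace.
message_id = lenient_match.group(1) if lenient_match else None
```

The strict pattern rejects the body, while the lenient one accepts it and its first group contains just the digits.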

Issues with searching special characters in Solr

I'm using Solr 6.1.0
When I use defType=edismax with debug mode enabled (debug=True), I find that the search for "r&d" is actually performed on just the character "r".
http://localhost:8983/solr/collection1/highlight?q="r&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r",
"querystring":"\"r",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)"
Even if I search with escape character, it is of no help.
http://localhost:8983/solr/collection1/highlight?q="r\&d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r\\",
"querystring":"\"r\\",
"parsedquery":"(+DisjunctionMaxQuery((text:r)))/no_coord",
"parsedquery_toString":"+(text:r)",
But if I'm using other symbols like "r*d", then the search is ok.
http://localhost:8983/solr/collection1/highlight?q="r*d"&debugQuery=true&defType=edismax
"debug":{
"rawquerystring":"\"r*d\"",
"querystring":"\"r*d\"",
"parsedquery":"(+DisjunctionMaxQuery((text:\"r d\")))/no_coord",
"parsedquery_toString":"+(text:\"r d\")",
What could be the reason behind this?
Regards,
Edwin
First - if you're using the URL as you've pasted it, & is the separator between the different arguments in the URL, so it has to be properly URL-encoded when it belongs to an argument rather than separating arguments.
q=text:"foo&bar"&fl=..
is parsed as
q=text:"foo
bar"
fl=..
Your Solr library usually handles this for you transparently. text%3A%22r%26d%22 is the urlencoded version of text:"r&d".
Secondly, any further parsing will depend on the analysis chain and tokenizer for the field you're searching. This determines which characters are kept and how the text is tokenized (split into separate tokens) before the tokens are matched between the querying text and the indexed text.
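The URL-encoding point can be reproduced with Python's standard urllib.parse module. This is only an illustration of the splitting and encoding; it does not talk to Solr, and the parameter values are taken from the examples above.

```python
from urllib.parse import parse_qsl, quote, urlencode

# Unencoded: the '&' inside the quoted phrase is read as a parameter separator.
raw = 'q=text:"foo&bar"&fl=..'
split_params = parse_qsl(raw, keep_blank_values=True)

# Encoded: '&' becomes %26, so it stays inside the q parameter.
encoded_value = quote('text:"r&d"', safe="")
query_string = urlencode({"q": '"r&d"', "defType": "edismax", "debugQuery": "true"})
```

split_params comes back as three separate parameters, with the query cut off after "foo", while encoded_value is text%3A%22r%26d%22 and can be passed to Solr as a single q value.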
Which analyzer are you using for your field? Try an analyzer that doesn't tokenize the field much, such as one based on KeywordTokenizerFactory.

Sublime Text Snippet: Create camelcased string from the hyphenated file name

I am trying to create a Sublime Text snippet for AngularJs. This snippet should expand to AngularJs controller (or service, etc or any ng component). In the resulting code, it should construct the controller name in camelCase from the hyphenated file name.
For example:
When I type the snippet trigger, say ngctrl, in an empty file called employee-benefits-controller.js, it should expand as shown below:
angular.module('').controller('EmployeeBenefitsController', ['', function(){
}]);
I am trying to use the $TM_FILENAME variable by applying a regex on it to achieve this conversion. If anyone has already done this, please let us know.
You could use something like this:
<snippet>
<content><![CDATA[
angular.module('${1:moduleName}').controller('${TM_FILENAME/(^|-|\.js)(.?)/\U\2\E/g}', ['', function(){
${2://functionCode}
}]);
]]></content>
<tabTrigger>ngctrl</tabTrigger>
</snippet>
Notes:
Note 1: You may want to change the scope so that the snippet is only triggered in a JavaScript context.
Note 2: I'm not familiar with AngularJS, so I don't know its naming conventions. I have assumed that an uppercase letter is needed after a hyphen (-) and at the beginning of the name, but I don't know whether one is needed after a dot, for example. So you'll probably have to adapt the snippet.
Note 3: The expression explained:
${TM_FILENAME/(^|-|\.js)(.?)/\U\2\E/g}
TM_FILENAME is the var_name item.
(^|-|\.js)(.?) is the regex (the parts of the variable we select).
\U\2\E is the format_string (how we format what we have selected).
g is the options (g means globally, so the format is applied every time something is selected).
TM_FILENAME: the file name with the extension included.
\U => start uppercase conversion. \E => end uppercase conversion. \2 => the second group, i.e. the second parenthesis, (.?), which is a single character or an empty string.
(^|-|\.js)(.?) First we look for the beginning of the word (^), or for a hyphen character (-), or for the extension (.js).
(.?) Then we select in a parenthesis group (the second group) the character (if any) after that hyphen (or at the beginning of the word or after the extension).
Finally we apply the uppercase conversion to that selected character as explained. Note that as there is no character after the extension, we are simply removing the extension from the output.
Note 4: As you probably know, using ${1:moduleName} and ${2://functionCode} allows you to quickly move (using tab) and edit the important parts of the snippet once it has been triggered, such as the module or the function code.
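To check what the substitution does to a given file name without opening Sublime Text, the same regex can be tried in Python. This is only an emulation: Python has no \U...\E format string, so the sketch upper-cases the second group in a callback, and controller_name is a hypothetical helper name.

```python
import re

def controller_name(filename):
    # Mimics the snippet substitution: group 1 (start, hyphen, or ".js") is
    # dropped and group 2 (the next character, if any) is upper-cased.
    return re.sub(r"(^|-|\.js)(.?)", lambda m: m.group(2).upper(), filename)

name = controller_name("employee-benefits-controller.js")
```

For the file name from the question, name is "EmployeeBenefitsController".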

Resources