Fuzzy regex match using TRE - c

I'm trying to use the TRE library in my C program to perform a fuzzy regex search. I've managed to piece together this code from reading the docs:
regex_t rx;
regcomp(&rx, "(January|February)", REG_EXTENDED);
int result = regexec(&rx, "January", 0, 0, 0);
However, this matches only exact text (i.e. no spelling errors are allowed). I don't see any parameter in these functions that sets the fuzziness:
int regcomp(regex_t *preg, const char *regex, int cflags);
int regexec(const regex_t *preg, const char *string, size_t nmatch,
regmatch_t pmatch[], int eflags);
How can I set the level of fuzziness (i.e. maximum Levenshtein distance), and how do I get the Levenshtein distance of the match?
Edit: I forgot to mention I'm using the Windows binaries from GnuWin32, which are available only for version 0.7.5. Binaries for 0.8.0 are available only for Linux.

Thanks to @Wiktor Stribiżew, I found out which function I need to use, and I've successfully compiled a working example:
#include <stdio.h>
#include "regex.h"
int main() {
    regex_t rx;
    if (regcomp(&rx, "(January|February)", REG_EXTENDED)) {
        printf("Failed to compile regex\n");
        return 1;
    }
    regaparams_t params = { 0 };
    /* every edit operation costs 1, so the total cost is the Levenshtein distance */
    params.cost_ins = 1;
    params.cost_del = 1;
    params.cost_subst = 1;
    /* allow at most 2 edits in total and at most 2 of each kind */
    params.max_cost = 2;
    params.max_del = 2;
    params.max_ins = 2;
    params.max_subst = 2;
    params.max_err = 2;
    regamatch_t match;
    match.nmatch = 0; /* no submatch offsets requested */
    match.pmatch = 0;
    if (!regaexec(&rx, "Janvary", &match, params, 0)) {
        printf("Levenshtein distance: %d\n", match.cost);
    } else {
        printf("Failed to match\n");
    }
    regfree(&rx);
    return 0;
}

Related

How do I get C to successfully match a regex?

So, I am trying to check the format of a key using the regex.h library in C. This is my code:
#include <stdio.h>
#include <regex.h>
int match(char *reg, char *string)
{
    regex_t regex;
    int res;
    res = regcomp(&regex, reg, 0);
    if (res)
    {
        fprintf(stderr, "Could not compile regex\n");
        return 1;
    }
    res = regexec(&regex, string, 0, NULL, 0);
    return res;
}
int main(void)
{
    char *regex = "[\\w-]{24}\\.[\\w-]{6}\\.[\\w-]{27}|mfa\\.[\\w-]{84}";
    char *key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
    if (match(regex, key) == 0) printf("Valid key!\n");
    else printf("Invalid key!\n");
    return 0;
}
When I run this code, I get the output:
Invalid key!
Why is this happening? If I test the same key with the same regex in Node.js, the key does match:
> const regex = new RegExp("[\\w-]{24}\\.[\\w-]{6}\\.[\\w-]{27}|mfa\\.[\\w-]{84}");
undefined
> const key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
undefined
> regex.test(key)
true
How could I get the right result using C?
Thanks in advance,
Robin
There are at least two issues here, plus one potential problem:
1. The limiting quantifiers ({24} etc.) only work as such in the POSIX ERE flavor, so, as pointed out in the comments, you need to compile the pattern with the REG_EXTENDED option, i.e. res = regcomp(&regex, reg, REG_EXTENDED).
2. The \w shorthand character class does not work as a word-character pattern inside bracket expressions. Replace it with [:alnum:]_, i.e. replace [\w-] with [[:alnum:]_-]. The solution will be:
char *regex = "[[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84}";
3. Besides, if the whole string must match one of the two alternatives exactly, wrap the whole pattern in a group and anchor it with ^ at the start and $ at the end. The solution will be:
char *regex = "^([[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84})$";
See this C demo:
#include <stdio.h>
#include <regex.h>
int match(char *reg, char *string)
{
    regex_t regex;
    int res;
    res = regcomp(&regex, reg, REG_EXTENDED);
    if (res)
    {
        fprintf(stderr, "Could not compile regex\n");
        return 1;
    }
    res = regexec(&regex, string, 0, NULL, 0);
    regfree(&regex); /* release the compiled pattern */
    return res;
}
int main(void)
{
    char *regex = "^([[:alnum:]_-]{24}\\.[[:alnum:]_-]{6}\\.[[:alnum:]_-]{27}|mfa\\.[[:alnum:]_-]{84})$";
    char *key = "xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx";
    if (match(regex, key) == 0) printf("Valid key!\n");
    else printf("Invalid key!\n");
    return 0;
}
// => Valid key!

Can Ghidra re-compile and run a short function?

I've picked out a short and "self-contained" function from the Ghidra decompiler. Can Ghidra itself compile the function again so I can try to run it for a couple different values, or would I need to compile it myself with e.g. gcc?
Attaching the function for context:
undefined8 FUN_140041010(char *param_1,longlong param_2,uint param_3)
{
    char *pcVar1;
    uint uVar2;
    ulonglong uVar3;
    uVar3 = 0;
    if (param_3 != 0) {
        pcVar1 = param_1;
        do {
            if (pcVar1[param_2 - (longlong)param_1] == '\0') {
                if ((uint)uVar3 < param_3) {
                    param_1[uVar3] = '\0';
                    return 0;
                }
                break;
            }
            *pcVar1 = pcVar1[param_2 - (longlong)param_1];
            uVar2 = (uint)uVar3 + 1;
            uVar3 = (ulonglong)uVar2;
            pcVar1 = pcVar1 + 1;
        } while (uVar2 < param_3);
    }
    param_1[param_3 - 1] = '\0';
    return 0;
}
"Can Ghidra itself compile the function again so I can try to run it for a couple different values?"
The P-Code emulator of Ghidra is intended for this kind of scenario.
If it is just a short function that doesn't use other libraries, syscalls, etc., like your example, then the emulator can handle it without any extra effort on your side to emulate library functions. Ghidra knows the semantics of each instruction and converts them to the standardized P-Code format, e.g. for decompilation, but this can also be combined with a "P-Code virtual machine".
It will most likely still involve a bit of scripting, though plugins like TheRomanXpl0it/ghidra-emu-fun exist to make this easier. There are also more general tutorials if you want to understand the basic idea and usage of the Emulator API (which is not exposed in the GUI in any way).
If you run into issues while scripting the emulator, I would recommend asking specific questions about the emulator API at the dedicated Reverse Engineering Stack Exchange.
You can compile it yourself, but you'll have to change some of the types to standard C, or just add typedefs like so:
#include <stdint.h>
typedef uint64_t undefined8; /* undefined8 is Ghidra's name for an 8-byte value of unknown type */
typedef long long int longlong;
typedef unsigned long long int ulonglong;
typedef unsigned int uint;
undefined8 FUN_140041010(char *param_1,longlong param_2,uint param_3)
{
    char *pcVar1;
    uint uVar2;
    ulonglong uVar3;
    uVar3 = 0;
    if (param_3 != 0) {
        pcVar1 = param_1;
        do {
            if (pcVar1[param_2 - (longlong)param_1] == '\0') {
                if ((uint)uVar3 < param_3) {
                    param_1[uVar3] = '\0';
                    return 0;
                }
                break;
            }
            *pcVar1 = pcVar1[param_2 - (longlong)param_1];
            uVar2 = (uint)uVar3 + 1;
            uVar3 = (ulonglong)uVar2;
            pcVar1 = pcVar1 + 1;
        } while (uVar2 < param_3);
    }
    param_1[param_3 - 1] = '\0';
    return 0;
}
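Then you can call it like any other function. Judging by the decompiled body, param_1 is a writable destination buffer, param_2 is the address of the source string, and param_3 is the size of the destination:
#include <stdio.h>
int main(void)
{
    const char *src = "hello";
    char dst[16]; /* must be writable; a string literal would not be */
    unsigned long long ret = (unsigned long long)FUN_140041010(dst, (longlong)(intptr_t)src, sizeof dst);
    printf("%llu \"%s\"\n", ret, dst);
    return 0;
}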

How mbtowc uses locale?

I have a hard time using mbtowc, which keeps returning wrong results. It also puzzles me why the function uses the locale at all: Unicode code points are locale-independent. I implemented a custom conversion function that converts it correctly; see the code below.
I use GCC 4.8.1 on Windows (where sizeof wchar_t is 2), using Czech locale (cs_CZ). The OEM codepage is windows-1250, console by default uses CP852. These are my results so far:
#include <stdio.h>
#include <stdlib.h>
// my custom conversion function
int u8toint(const char* str) {
    if(!(*str&128)) return *str;
    unsigned char c = *str, bytes = 0;
    while((c<<=1)&128) ++bytes;
    int result = 0;
    for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
    int mask = 1;
    for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
    result|= (*str&mask)<<(6*bytes);
    return result;
}
// data inspecting type for the tests in main()
union data {
    wchar_t w;
    struct {
        unsigned char b1, b2;
    } bytes;
} a,b,c;
int main() {
    // I tried setlocale here
    mbtowc(NULL, 0, 0); // reset internal mb_state
    mbtowc(&(a.w),"ř",6); // apply mbtowc
    b.w = u8toint("ř"); // apply custom function
    c.w = L'ř'; // compare to wchar
    printf("\na = %hhx%hhx", a.bytes.b2, a.bytes.b1); // a = 0c5 wrong
    printf("\nb = %hhx%hhx", b.bytes.b2, b.bytes.b1); // b = 159 right
    printf("\nc = %hhx%hhx", c.bytes.b2, c.bytes.b1); // c = 159 right
    getchar();
}
Here are setlocale settings and the results for a:
setlocale(LC_CTYPE,"Czech_Czech Republic.1250"); // a = 139 wrong
setlocale(LC_CTYPE,"Czech_Czech Republic.852"); // a = 253c wrong
Why doesn't mbtowc give 0x159, the Unicode code point of ř?

regular expressions in C match and print

I have lines from a file like this:
{123} {12.3.2015 moday} {THIS IS A TEST}
Is it possible to get every value between the braces {} and insert it into an array?
I would also like to know if there is some other solution to this problem.
The result should look like this:
array( 123,
       '12.3.2015 moday',
       'THIS IS A TEST'
)
My attempt:
int r;
regex_t reg;
regmatch_t match[2];
char *line = "{123} {12.3.2015 moday} {THIS IS A TEST}";
regcomp(&reg, "[{](.*?)*[}]", REG_ICASE | REG_EXTENDED);
r = regexec(&reg, line, 2, match, 0);
if (r == 0) {
    printf("Match!\n");
    printf("0: [%.*s]\n", match[0].rm_eo - match[0].rm_so, line + match[0].rm_so);
    printf("1: %.*s\n", match[1].rm_eo - match[1].rm_so, line + match[1].rm_so);
} else {
    printf("NO match!\n");
}
This results in:
123} {12.3.2015 moday} {THIS IS A TEST
Does anyone know how to improve this?
The regex101 website is really useful for experimenting with patterns.
I suggest this regex:
/(?<=\{).*?(?=\})/g
Or any of these ones:
/\{\K.*?(?=\})/g
/\{\K[^\}]+/g
/\{(.*?)\}/g
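The first one is also available here:
https://regex101.com/r/bB6sE8/1
Note that lookarounds, \K, and lazy quantifiers such as .*? are PCRE features; standard POSIX regular expressions support none of them, so in C use a negated bracket expression like [^}]* instead. You could start with this, which is adapted from an existing example: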
#include <stdio.h>
#include <regex.h>
#define MAX_GROUPS 2
int main(void)
{
    const char *source = "{123} {12.3.2015 moday} {THIS IS A TEST}";
    /* bracket expressions keep the braces literal in POSIX ERE */
    const char *regexString = "[{]([^}]*)[}]";
    regex_t regexCompiled;
    regmatch_t groupArray[MAX_GROUPS];
    const char *cursor = source;
    if (regcomp(&regexCompiled, regexString, REG_EXTENDED))
    {
        printf("Could not compile regular expression.\n");
        return 1;
    }
    /* each successful regexec() finds the next {...} at or after cursor */
    while (!regexec(&regexCompiled, cursor, MAX_GROUPS, groupArray, 0))
    {
        /* group 1 is the text between the braces */
        int len = (int)(groupArray[1].rm_eo - groupArray[1].rm_so);
        printf("%.*s\n", len, cursor + groupArray[1].rm_so);
        /* continue after the whole match, i.e. past the closing brace */
        cursor += groupArray[0].rm_eo;
    }
    regfree(&regexCompiled);
    return 0;
}

Working example of substitution using PCRS

I need to do a substitution in a string in C. One of the answers to "How to do regex string replacements in pure C?" recommended the PCRS library. I downloaded PCRS from ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib/ but I'm confused as to how to use it. Below is my code (taken from another SE post):
const char *error;
int erroffset;
pcre *re;
int rc;
int i;
int ovector[100];
char *regex = "From:([^#]+).*";
char str[] = "From:regular.expressions#example.com\r\n";
char stringToBeSubstituted[] = "gmail.com";
re = pcre_compile (regex,          /* the pattern */
                   PCRE_MULTILINE,
                   &error,         /* for error message */
                   &erroffset,     /* for error offset */
                   0);             /* use default character tables */
if (!re)
{
    printf("pcre_compile failed (offset: %d), %s\n", erroffset, error);
    return -1;
}
unsigned int offset = 0;
unsigned int len = strlen(str);
/* the last argument is the number of ints in ovector, not its size in bytes */
while (offset < len && (rc = pcre_exec(re, 0, str, len, offset, 0, ovector, sizeof(ovector) / sizeof(ovector[0]))) >= 0)
{
    for(int i = 0; i < rc; ++i)
    {
        printf("%2d: %.*s\n", i, ovector[2*i+1] - ovector[2*i], str + ovector[2*i]);
    }
    offset = ovector[1];
}
Which PCRS functions do I need to use in place of 'pcre_compile' and 'pcre_exec'?
Thanks.
Simply follow the instructions in the INSTALL file:
To build PCRS, you will need pcre 3.0 or later and gcc.
Installation is easy: ./configure && make && make install
Debug mode can be enabled with --enable-debug.
There is a simple demo application (pcrsed) included.
PCRS provides the following functions documented in the man page pcrs.3:
pcrs_compile
pcrs_compile_command
pcrs_execute
pcrs_execute_list
pcrs_free_job
pcrs_free_joblist
pcrs_strerror
Here's an online version of the man page. To use these functions, include the header file pcrs.h and link your program against the PCRS library using the linker flag -lpcrs.
