C Regex capturing group

I'm having trouble understanding how regexes work in C.
Basically I have an XML file (I can't use an XML parser) containing lines like this:
<Node Bla="blabla" Name="this is my name" .... />
<Node Name="this is my name" Bla="blabla" .... />
What I would like to do is extract the name part of each line. So far I have been using the following regex:
char *regex_str = "Name=\"([^\"]*)\"";
But this gives me Name="this is my name", and I'm only looking for the this is my name part.
What am I doing wrong?

You may not need a capturing group.
Assuming your library has lookbehind (which it definitely does if it's PCRE), you can use this regex to match the name:
(?<=[Nn]ame=")[^"]+
See regex demo.
Explanation
the lookbehind (?<=[Nn]ame=") asserts that what precedes is Name=" or name="
[^"]+ matches one or more chars that are not a "
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind

Just use a lookbehind to match the characters that come right after the string Name=" up to the next " symbol:
(?<=Name=\")([^\"]*)
Explanation:
(?<=Name=\") sets the match position just after the string Name="
([^\"]*) captures zero or more characters that are not ".
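The question does not say which regex library is in use. If it is POSIX <regex.h> rather than PCRE, lookbehind is not available, but the original pattern already works: the whole match is reported in element 0 of the match array and the capturing group in element 1. The following is a minimal sketch under that assumption; the function name extract_name is made up for illustration.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copies the value of the first Name="..." attribute on `line` into `out`.
 * Returns 0 on success, -1 if there is no match or `out` is too small.
 * extract_name is a hypothetical helper, not part of any library. */
int extract_name(const char *line, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = first capturing group */
    int rc = -1;

    if (regcomp(&re, "Name=\"([^\"]*)\"", REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0 && m[1].rm_so != -1) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < out_size) {
            memcpy(out, line + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }

    regfree(&re);
    return rc;
}
```

Reading m[0] instead of m[1] reproduces the behaviour described in the question, where the result includes the Name=" prefix and the closing quote.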

Related

Understand the word sense disambiguation data set format

I am trying to evaluate a WSD model using well-known WSD data sets (SemEval, SensEval), but I don't understand the format of the gold key text file.
senseval3.gold.key.txt
d000.s000.t000 man%1:18:00::
d000.s000.t001 say%2:32:01::
d000.s001.t000 peer%2:39:00::
d000.s001.t001 companion%1:18:00::
d000.s001.t002 bleary%5:00:00:indistinct:00
d000.s001.t003 eye%1:08:00::
d000.s002.t000 have%2:40:00::
d000.s002.t001 ready%5:00:01:available:00
d000.s002.t002 answer%1:04:00::
d000.s002.t003 much%3:00:00::
d000.s002.t004 surprise%1:12:00::
d000.s002.t005 fit%1:26:00::
d000.s002.t006 coughing%1:26:00::
d000.s003.t000 man%1:18:00::
d000.s003.t001 drunk%3:00:00::
d000.s003.t002 crazy%5:00:00:insane:00
d000.s004.t000 newfound%5:00:00:new:00
By looking at the data file, I know that d000.s000.t000 in the first line refers to document #0, sentence #0, token #0.
senseval3.data.xml
<sentence id="d000.s000">
<wf lemma="that" pos="DET">That</wf>
<wf lemma="&apos;" pos="VERB">&apos;s</wf>
<wf lemma="what" pos="PRON">what</wf>
<wf lemma="the" pos="DET">the</wf>
<instance id="d000.s000.t000" lemma="man" pos="NOUN">man</instance>
<wf lemma="have" pos="VERB">had</wf>
<instance id="d000.s000.t001" lemma="say" pos="VERB">said</instance>
<wf lemma="." pos=".">.</wf>
</sentence>
But I don't know what the part after % means, for example 1:18:00:: for the lemma man.

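This answer is based on a comment on this SO post.
The sequence that follows the % is the lex_sense field of a WordNet sense key (lemma%lex_sense). It is composed as follows:
ss_type:lex_filenum:lex_id:head_word:head_id
Here ss_type is the synset type (1 noun, 2 verb, 3 adjective, 4 adverb, 5 adjective satellite), lex_filenum identifies the lexicographer file, and lex_id distinguishes senses within that file. head_word and head_id are filled in only for adjective satellites (ss_type 5), which is why most keys end in ::.
More information is in the WordNet documentation.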
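To make the field layout concrete, here is a minimal C sketch that splits one key from the gold file into its parts. The SenseKey struct and the parse_sense_key function are names invented for this example, and the fixed buffer sizes are an arbitrary assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the fields of a key such as
 * "bleary%5:00:00:indistinct:00". */
typedef struct {
    char lemma[64];
    int  ss_type;       /* 1 noun, 2 verb, 3 adj, 4 adv, 5 adj satellite */
    int  lex_filenum;
    int  lex_id;
    char head_word[64]; /* empty unless ss_type is 5 */
    int  head_id;       /* -1 when absent */
} SenseKey;

/* Returns 0 on success, -1 if the key does not have the expected shape. */
int parse_sense_key(const char *key, SenseKey *out)
{
    const char *pct = strchr(key, '%');
    if (pct == NULL)
        return -1;

    size_t lemma_len = (size_t)(pct - key);
    if (lemma_len == 0 || lemma_len >= sizeof out->lemma)
        return -1;
    memcpy(out->lemma, key, lemma_len);
    out->lemma[lemma_len] = '\0';

    /* Read the three numeric fields; %n records how far we got. */
    int consumed = 0;
    if (sscanf(pct + 1, "%d:%d:%d:%n", &out->ss_type, &out->lex_filenum,
               &out->lex_id, &consumed) != 3 || consumed == 0)
        return -1;

    /* What remains is "head_word:head_id", both possibly empty. */
    const char *rest = pct + 1 + consumed;
    const char *colon = strchr(rest, ':');
    if (colon == NULL)
        return -1;

    size_t head_len = (size_t)(colon - rest);
    if (head_len >= sizeof out->head_word)
        return -1;
    memcpy(out->head_word, rest, head_len);
    out->head_word[head_len] = '\0';

    out->head_id = (colon[1] != '\0') ? atoi(colon + 1) : -1;
    return 0;
}
```

For man%1:18:00:: this yields ss_type 1 (noun), lex_filenum 18, lex_id 0 and an empty head word, while bleary%5:00:00:indistinct:00 yields ss_type 5 with head word indistinct and head_id 0.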

Solr Data Config Error, Open quote is expected for attribute "driver"

I have a postgres data-config file.
<dataConfig>
<dataSource driver=”org.postgresql.Driver” url=”jdbc:postgresql://127.0.0.1:5432/mydb” user=”user” password=”pw” />
...
</dataConfig>
But when I run it, I get this error:
Data Config problem: Open quote is expected for attribute "driver" associated with an element type "dataSource".
What's the problem here? Is the driver information I entered wrong?

Your quotes are wrong.
” and " are not the same character (note how differently they render). Only the straight " is a valid double quote in an XML file (and in most, if not all, programming contexts).
The example in your config file seems to have been mangled by a blog or a text editor along the way.
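The difference is easy to see at the byte level: in UTF-8 the straight quote is the single byte 0x22, while the curly quotes U+201C and U+201D are the three-byte sequences E2 80 9C and E2 80 9D. As a rough sketch, assuming the file is UTF-8, a mangled attribute string could be repaired in place like this (straighten_double_quotes is a name made up for this example):

```c
#include <stddef.h>

/* Replaces UTF-8 encoded curly double quotes (U+201C, U+201D) in `s`
 * with a straight '"', in place. Returns the number of replacements.
 * Hypothetical helper; assumes `s` is a NUL-terminated UTF-8 string. */
size_t straighten_double_quotes(char *s)
{
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = in;
    size_t replaced = 0;

    while (*in != '\0') {
        if (in[0] == 0xE2 && in[1] == 0x80 &&
            (in[2] == 0x9C || in[2] == 0x9D)) {
            *out++ = '"';   /* three bytes collapse into one */
            in += 3;
            replaced++;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
    return replaced;
}
```

In practice, retyping the quotes in a plain-text editor is simpler; the sketch is mainly meant to show why the XML parser does not recognize ” as an attribute delimiter.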

How to escape LDAP dn for lookup?

I have a small script (Spring/Groovy/LDAP) that finds, in Active Directory, the 'management tree' under a person,
i.e. starting from a 'root person', the script finds that person's direct reports and then recurses: for each direct report it finds their direct reports, and so on.
The directReports user attribute holds a list of DNs in the form:
CN=Simpson\, Homer,OU=OU_0731DevOps,OU=OU_0100Monitor Services,OU=OU_0001U*Nuclear Energy Corporation,OU=OU_UNuclearUsers,DC=corp,DC=unucleargrp,DC=com
The script does an "ldap lookup" for each direct report by DN, e.g.:
obj = ldapTemplate.lookup(pDn, new UserAttributesMapper())
Problem
The LDAP lookup throws an InvalidNameException:
[LDAP: error code 34 - 0000208F: LdapErr: DSID-0C090787
I've tried various combinations of escaping but still get the error.
What am I missing?
More Info
This URL, https://social.technet.microsoft.com/wiki/contents/articles/5312.active-directory-characters-to-escape.aspx, shows which characters to escape:
Active Directory requires that the following ten characters be escaped
with the backslash "\" escape character if they appear in any of the
individual components of a distinguished name:
Comma ,
Backslash character \
Pound sign (hash sign) #
Plus sign +
Less than symbol <
Greater than symbol >
Semicolon ;
Double quote (quotation mark) "
Equal sign =
Leading or trailing spaces
Tools
Groovy
Spring Boot
JVM
thanks!

I found the answer by poking around with LDAPNameBuilder.
TL;DR:
ldapTemplate.lookup requires stripping the "DC=..." portion off the DN.
If you know a cleaner/more-official solution, please post!
The LDAP lookup fails with a DN like this, which includes the "DC=..." components:
CN=Simpson\, Homer,OU=OU_0731DevOps,OU=OU_0100Monitor Services,OU=OU_0001U*Nuclear Energy Corporation,OU=OU_UNuclearUsers,DC=corp,DC=unucleargrp,DC=com
The LDAP lookup succeeds with this DN, which has no "DC=" components (the Spring LDAP template supplies the base DN):
CN=Simpson\, Homer,OU=OU_0731DevOps,OU=OU_0100Monitor Services,OU=OU_0001U*Nuclear Energy Corporation
Context Reminder
This application traverses a 'management tree'. It gets a person's direct reports from the 'directReports' attribute (which lists the full DN of each direct report), and it needs to look up each of those users by DN.
Tweak/Example
This tweak got the ldap lookup to work:
User lookupUserByDn(String pDn) {
    // Strip the configured base DN; the LdapTemplate's context source already supplies it
    String dn = pDn.replace(",${ldapConfig.base}", "")
    ldapTemplate.lookup(dn, new UserAttributesMapper())
}
For the record, the LDAP portion of my application.yml looked like this:
spring:
  ldap:
    urls: ldap://dc.corp.unucleargrp.com:389
    base: DC=corp,DC=unucleargrp,DC=com
    username: username_val
    password: password_val
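The Groovy tweak uses String.replace, which is case-sensitive and would also remove a match in the middle of the DN. As a language-neutral illustration of the same idea restricted to a trailing match, here is a small C sketch. strip_base_dn is a hypothetical helper, and it compares plain strings rather than parsing the DN, so it assumes the base appears with the same spacing and escaping as in the configuration.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Copies `dn` into `out`, dropping a trailing ",<base>" if present
 * (compared case-insensitively). Returns 0 on success, -1 if `out`
 * is too small. Hypothetical helper for illustration only. */
int strip_base_dn(const char *dn, const char *base,
                  char *out, size_t out_size)
{
    size_t dn_len = strlen(dn);
    size_t base_len = strlen(base);
    size_t keep = dn_len;

    if (base_len > 0 && dn_len > base_len + 1 &&
        dn[dn_len - base_len - 1] == ',' &&
        strncasecmp(dn + dn_len - base_len, base, base_len) == 0) {
        keep = dn_len - base_len - 1;   /* also drop the joining comma */
    }

    if (keep >= out_size)
        return -1;
    memcpy(out, dn, keep);
    out[keep] = '\0';
    return 0;
}
```

A DN that does not end in the base is copied unchanged, so the helper can be applied to every directReports entry without checking first.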

According to https://docs.spring.io/spring-ldap/docs/2.3.1.RELEASE/reference/#contextsource-configuration, if you remove the base attribute, all operations going back and forth will use full DNs.

How to get right attribute which contain "&" through libxml2

When I try to get the URL attribute in a test XML document:
<Test> <Item URL="http://127.0.0.1?a=1&b=2"/>
</Test>
After I call attr = xmlGetProp(cur, BAD_CAST "URL");
libxml2 gives the message: Entity: line 1: parser error : EntityRef: expecting ';'
and the returned value of attr is "http://127.0.0.1?a=1=2"
How can I get the complete URL attribute? Thanks

You cannot get the “correct” URL here because the XML file is not well-formed. The & should have been written as &amp;. You have to ask the creator of the XML file to produce a syntactically valid, well-formed XML file.
XML is not created by just putting strings together; the strings also have to be encoded properly.
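On the producing side, "encoded properly" means replacing the characters that are special in XML before a value is placed inside an attribute. The following is a minimal sketch of such an escaping step; xml_escape_attr is a name invented for this example, and a real producer would normally let an XML library do the serialization.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of `s` with the five predefined XML
 * entities substituted, or NULL if allocation fails. The caller frees
 * the result. Hypothetical helper for illustration only. */
char *xml_escape_attr(const char *s)
{
    /* Worst case: every input byte becomes a 6-byte entity. */
    char *out = malloc(strlen(s) * 6 + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    for (; *s != '\0'; s++) {
        const char *rep = NULL;
        switch (*s) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&apos;"; break;
        }
        if (rep != NULL) {
            size_t len = strlen(rep);
            memcpy(p, rep, len);
            p += len;
        } else {
            *p++ = *s;
        }
    }
    *p = '\0';
    return out;
}
```

With the attribute written as URL="http://127.0.0.1?a=1&amp;b=2", the document is well-formed, and xmlGetProp returns the value with the entity already substituted back to a plain &.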

libxml2 SAX query

I am trying to parse an XML file using the SAX interface of libxml2 in C.
My problem is that whitespace characters between the end of one tag and the start of the next cause the "characters" callback to be executed, i.e.
<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
</doc>
produces these events:
start document
start element: doc
start element: para
characters: Hello, world!
end element: para
characters:
end element: doc
characters:
end document
It would be really nice if this whitespace were not reported as "characters".
Does anybody have any idea why this is happening or how it can be prevented?

This is happening because whitespace between elements is part of the document's character data in XML, and a parser has to pass it on to the application. So libxml2 is just operating according to the specification.
See, for instance, this discussion.
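If the application knows that whitespace-only text between elements is irrelevant for its document type, one common workaround is to filter it inside the callback. The sketch below assumes a handler shaped like libxml2's SAX characters callback (context pointer, buffer, length), with unsigned char standing in for xmlChar so the example compiles without libxml2. ParseState, on_characters and is_xml_whitespace_only are names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the buffer holds only XML whitespace (space, tab, CR, LF).
 * The buffer passed to a SAX characters callback is not
 * NUL-terminated, so the length must be honoured. */
bool is_xml_whitespace_only(const unsigned char *ch, int len)
{
    for (int i = 0; i < len; i++) {
        if (ch[i] != ' ' && ch[i] != '\t' && ch[i] != '\r' && ch[i] != '\n')
            return false;
    }
    return true;
}

/* Hypothetical per-parse state; here it just counts text events. */
typedef struct {
    int text_events;
} ParseState;

/* Same parameter layout as a SAX characters handler. */
void on_characters(void *ctx, const unsigned char *ch, int len)
{
    ParseState *state = ctx;

    if (is_xml_whitespace_only(ch, len))
        return;              /* ignore indentation and line breaks */

    state->text_events++;    /* real code would process ch[0..len) here */
}
```

Two caveats apply: a parser may deliver one text node in several chunks, so a chunk that is only whitespace can still belong to meaningful text, and in mixed content (text interleaved with inline elements) whitespace between elements is usually meant to be kept.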
