Understanding the word sense disambiguation data set format

I am trying to evaluate a WSD model using the well-known WSD data sets (SemEval, SensEval), but I don't understand the format of the gold key text file.
senseval3.gold.key.txt
d000.s000.t000 man%1:18:00::
d000.s000.t001 say%2:32:01::
d000.s001.t000 peer%2:39:00::
d000.s001.t001 companion%1:18:00::
d000.s001.t002 bleary%5:00:00:indistinct:00
d000.s001.t003 eye%1:08:00::
d000.s002.t000 have%2:40:00::
d000.s002.t001 ready%5:00:01:available:00
d000.s002.t002 answer%1:04:00::
d000.s002.t003 much%3:00:00::
d000.s002.t004 surprise%1:12:00::
d000.s002.t005 fit%1:26:00::
d000.s002.t006 coughing%1:26:00::
d000.s003.t000 man%1:18:00::
d000.s003.t001 drunk%3:00:00::
d000.s003.t002 crazy%5:00:00:insane:00
d000.s004.t000 newfound%5:00:00:new:00
By looking at the data file, I can tell that d000.s000.t000 in the first line refers to document #0, sentence #0, token #0.
senseval3.data.xml
<sentence id="d000.s000">
<wf lemma="that" pos="DET">That</wf>
<wf lemma="&apos;" pos="VERB">&apos;s</wf>
<wf lemma="what" pos="PRON">what</wf>
<wf lemma="the" pos="DET">the</wf>
<instance id="d000.s000.t000" lemma="man" pos="NOUN">man</instance>
<wf lemma="have" pos="VERB">had</wf>
<instance id="d000.s000.t001" lemma="say" pos="VERB">said</instance>
<wf lemma="." pos=".">.</wf>
</sentence>
But I don't know what the part after % means, for example 1:18:00:: for the lemma man.

This answer is based on a comment on this SO post.
The number sequence following % is the lex_sense part of a WordNet sense key. It is composed as follows:
ss_type:lex_filenum:lex_id:head_word:head_id
More information is in the WordNet documentation.
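For example, in man%1:18:00:: the ss_type is 1 (noun), the lex_filenum is 18 (the noun.person lexicographer file), and the lex_id is 00; head_word and head_id are only filled in when ss_type is 5 (adjective satellite), which is why bleary%5:00:00:indistinct:00 carries the head word indistinct. Here is a minimal sketch of splitting a gold-key line into these fields (class and variable names are just illustrative, and a line may list more than one gold key; this takes the first):

// A minimal sketch (illustrative names only): split a gold-key line and its
// WordNet sense key into the instance id, the lemma, and the lex_sense fields.
public class SenseKeyParser {
    public static void main(String[] args) {
        String line = "d000.s000.t000 man%1:18:00::";
        String[] parts = line.split("\\s+");
        String instanceId = parts[0];     // d000.s000.t000
        String senseKey = parts[1];       // first gold key on the line
        String lemma = senseKey.substring(0, senseKey.indexOf('%'));
        String lexSense = senseKey.substring(senseKey.indexOf('%') + 1);
        // ss_type:lex_filenum:lex_id:head_word:head_id
        // (head_word and head_id are empty unless ss_type is 5, an adjective satellite)
        String[] f = lexSense.split(":", -1);
        System.out.printf("id=%s lemma=%s ss_type=%s lex_filenum=%s lex_id=%s head_word=%s head_id=%s%n",
                instanceId, lemma, f[0], f[1], f[2], f[3], f[4]);
    }
}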

Related

Anybody know if OrcTableSource supports the S3 file system?

I'm running into some trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS); the code fragment is provided below:
OrcTableSource soORCTableSource = OrcTableSource.builder()
    .path("s3://orders/so.orc")          // path to ORC (cf. s3://orders/so.csv)
    .forOrcSchema(OrderHeaderORCSchema)  // schema of ORC files
    .withConfiguration(orcconfig)
    .build();
It seems this path is incorrect; can anyone help out? Much appreciated!
Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
By the way, I've already set up flink-s3-fs-presto-1.6.2 and have the following code running correctly, so the question is limited to OrcTableSource only.
DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
       .types(String.class, String.class, String.class,
              String.class, String.class);
The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to add the following snippet to Hadoop's core-site.xml:
<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
See this link for more information about setting up S3 for Hadoop.
This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
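For reference, here is a sketch of what that amounts to if you build the Hadoop Configuration in code rather than editing core-site.xml. Whether OrcRowInputFormat picks the property up this way is an assumption on my part (the core-site.xml route above is the confirmed fix), and the schema string is just an illustrative stand-in for the OrderHeaderORCSchema from the question:

import org.apache.flink.orc.OrcTableSource;
import org.apache.hadoop.conf.Configuration;

public class OrcS3Example {
    static OrcTableSource buildOrcSource(String orcSchema) {
        // Same property as in core-site.xml, set programmatically on the
        // Configuration that is handed to the ORC source builder.
        Configuration orcconfig = new Configuration();
        orcconfig.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

        return OrcTableSource.builder()
            .path("s3://orders/so.orc")
            .forOrcSchema(orcSchema)       // e.g. the OrderHeaderORCSchema from the question
            .withConfiguration(orcconfig)
            .build();
    }
}

Note that the S3AFileSystem class itself also has to be on the classpath (it ships with the hadoop-aws module) for either variant to work.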

How to view the YAGO2s ontology like the DBpedia OWL file?

Hello, I'm a total newbie to ontologies.
I downloaded the DBpedia ontology .owl file and opened it using TopBraid Composer.
TopBraid Composer shows the DBpedia classes (owl:Thing -> Activity, Agent, etc.), and each class also has its own instances.
However, YAGO2s only provides many .ttl files (yagoSchema.ttl, yagoFacts.ttl, etc.).
Since I think these .ttl files are similar to an .owl file, I opened them with TopBraid Composer as well. I expected to see a structure like the DBpedia .owl file, but it wasn't similar.
They provide a schema .ttl file, an instances .ttl file, and so on, but I want to see the whole thing at once.
Should I get a YAGO2s .owl file, or is there any way to view the YAGO .ttl files like the DBpedia .owl file?
Thanks in advance.
The error message when I tried to open the yagoTypes.ttl file is:
java.lang.reflect.InvocationTargetException
at org.eclipse.jface.operation.ModalContext.run(ModalContext.java:421)
at org.eclipse.jface.dialogs.ProgressMonitorDialog.run(ProgressMonitorDialog.java:507)
at org.eclipse.ui.internal.progress.ProgressMonitorJobsDialog.run(ProgressMonitorJobsDialog.java:275)
at org.eclipse.ui.internal.progress.ProgressManager$3.run(ProgressManager.java:960)
at org.eclipse.swt.custom.BusyIndicator.showWhile(BusyIndicator.java:70)
at org.eclipse.ui.internal.progress.ProgressManager.busyCursorWhile(ProgressManager.java:995)
at org.eclipse.ui.internal.progress.ProgressManager.busyCursorWhile(ProgressManager.java:970)
at org.topbraidcomposer.core.io.TBCIO$3.run(TBCIO.java:501)
at org.eclipse.swt.widgets.RunnableLock.run(RunnableLock.java:35)
at org.eclipse.swt.widgets.Synchronizer.runAsyncMessages(Synchronizer.java:135)
at org.eclipse.swt.widgets.Display.runAsyncMessages(Display.java:4145)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3762)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine$9.run(PartRenderingEngine.java:1113)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at org.eclipse.e4.ui.internal.workbench.swt.PartRenderingEngine.run(PartRenderingEngine.java:997)
at org.eclipse.e4.ui.internal.workbench.E4Workbench.createAndRunUI(E4Workbench.java:140)
at org.eclipse.ui.internal.Workbench$5.run(Workbench.java:611)
at org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:567)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:150)
at org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:124)
at org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
at org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:354)
at org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:636)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:591)
at org.eclipse.equinox.launcher.Main.run(Main.java:1450)
at org.eclipse.equinox.launcher.Main.main(Main.java:1426)
Caused by: java.lang.NullPointerException
at org.topbraid.core.model.Classes.getMetaClasses(Classes.java:548)
at org.topbraid.core.model.Classes.computeMetaClasses(Classes.java:45)
at org.topbraidcomposer.core.session.AbstractSessionWithCache.getCachedMetaClasses(AbstractSessionWithCache.java:67)
at org.topbraid.core.model.Classes.getMetaClasses(Classes.java:166)
at org.topbraidcomposer.editors.ResourceEditorLauncher.checkVisibility(ResourceEditorLauncher.java:270)
at org.topbraidcomposer.editors.ResourceEditorLauncher.access$4(ResourceEditorLauncher.java:269)
at org.topbraidcomposer.editors.ResourceEditorLauncher$5.run(ResourceEditorLauncher.java:577)
at org.topbraidcomposer.core.io.TBCIO$2.run(TBCIO.java:482)
at org.eclipse.jface.operation.ModalContext$ModalContextThread.run(ModalContext.java:121)
The same error occurs when I concatenate yagoTypes.ttl and yagoFacts.ttl using the cat command and then try to open the concatenated file.
Where to get the data
If you got the data from the YAGO2s Downloads page, it says right at the beginning:
You can download the entire YAGO2s ontology in one piece. (Extracted from the 2012-12-01 version of Wikipedia.) Download YAGO2s ontology in .ttl format! (2.2 GB compressed, 18.5 GB uncompressed)
That sounds like what you want. If you just want to see the class hierarchy, though, then you might want the yagoTaxonomy files:
yagoTaxonomy The entire YAGO taxonomy. These are all rdfs:subClassOf facts derived from Wikipedia and from WordNet.
The format of the data
OWL is an ontology language with an abstract structure that can be serialized in a number of different ways, including OWL/XML, the OWL Functional Syntax, the Manchester Syntax, and an encoding as RDF. RDF is itself an abstract data model that can be serialized in a number of ways, including N-Triples, N3, Turtle (.ttl), and RDF/XML. Most .owl files that you find are actually RDF/XML files that are serializations of the RDF encoding of an OWL ontology; that's probably what your .owl file is. The .ttl files you're seeing are the Turtle serialization of the RDF encoding of an OWL ontology, and standard RDF processing tools should be able to process them.
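As an illustration that standard RDF tooling can read these files, here is a minimal sketch using Apache Jena (the file name is just an example; the full YAGO2s dump is far too large to load into memory this way, so for the complete ontology you would want a persistent store such as Jena TDB):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.vocabulary.RDFS;

public class LoadTaxonomy {
    public static void main(String[] args) {
        // Parse the Turtle file into an in-memory RDF model (format inferred from the extension).
        Model model = RDFDataMgr.loadModel("yagoTaxonomy.ttl");

        // The class hierarchy is just rdfs:subClassOf statements; print them.
        StmtIterator it = model.listStatements(null, RDFS.subClassOf, (RDFNode) null);
        while (it.hasNext()) {
            System.out.println(it.nextStatement());
        }
    }
}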

Solr 4: disable compression on stored fields: how to actually configure custom codec?

The short question is:
I want to disable stored-field compression on a Solr 4.3.0 index. After reading:
http://blog.jpountz.net/post/35667727458/stored-fields-compression-in-lucene-4-1
http://wiki.apache.org/solr/SimpleTextCodecExample
http://www.opensourceconnections.com/2013/06/05/build-your-own-lucene-codec/
I've decided to follow the path described there and make my own codec. I'm pretty sure I've followed all the steps; however, when I actually try to use my codec (affectionately named "UncompressedStorageCodec"), I get the following error in the Solr log:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
From the output I gather that Solr is not picking up the jar with my custom codec, and I don't understand why.
Here are all the horrific details:
I've created a class like this:
public class UncompressedStorageCodec extends FilterCodec {
    private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();

    protected UncompressedStorageCodec() {
        super("UncompressedStorageCodec", new Lucene42Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return fieldsFormat;
    }
}
in the package "fr.company.project.solr.transformers.utils".
The fully qualified name of FilterCodec is "org.apache.lucene.codecs.FilterCodec".
I've created a basic jar file out of this (exported it as jar from Eclipse).
The Solr installation I'm using to test this is the basic Solr 4.3.0 unzipped, started via its embedded Jetty server and using the example core.
I've placed my jar with the codec in [solrDir]\dist
In:
[solrDir]\example\solr\myCore\conf\solrconfig.xml
I've added the line:
<lib dir="../../../dist/" regex="myJarWithCodec-1.10.1.jar" />
Then in the schema.xml file, I've declared some fieldTypes that should use this codec like so:
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" postingsFormat="UncompressedStorageCodec"/>
<fieldType name="string_lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true" postingsFormat="UncompressedStorageCodec">
<!--...-->
</fieldType>
Now, if I use the DataImportHandler component to import some data into Solr, at commit time it tells me:
java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'UncompressedStorageCodec' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.
The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, Lucene40, Lucene41]
at org.apache.lucene.util.NamedSPILoader.lookup(NamedSPILoader.java:109)
What I find strange is that the above-mentioned codec jar also contains some Transformers for the DataImportHandler component, and those are picked up fine. Also, other jars placed in the dist folder (and declared in the same way in solrconfig.xml), like the JDBC driver, are picked up fine. I'm guessing that for the codec there's this SPI mechanism which loads things differently, and something it needs is missing...
I've also tried placing the codec jar in:
[solrDir]\example\solr-webapp\webapp\WEB-INF\lib\
as well as inside the WEB-INF\lib folder of the solr.war file, which is found in:
[solrDir]\example\webapps\
but I'm still getting the same error.
So basically, my question is: what's missing so that my codec jar is picked up by Solr?
Thanks
I'm going to answer this question myself, since it has sort of become moot due to some benchmarks I've made. Long story short, I had arrived at the (wrong) conclusion that for really large stored fields, Solr 3.x and 4.0 (without field compression) is faster than Solr 4.1 and above (with field compression). However, that was mostly due to some errors in my benchmarks. After repeating them, I've found that going from non-compressed to compressed fields, even for very large stored fields, makes indexing only 0% to 15% slower, which is really not bad at all, considering that queries on the compressed-field indexes are afterwards 10-20% faster (the document fetching part).
Also, here's some remarks on how to speed up indexing:
Use the DataImportHandler plugin. It bypasses the Solr Rest (HTTP based) API and writes directly to the Lucene index.
Check out said plugin's sources to see how it accomplishes this, and write your own plugin if the DataImportHandler doesn't meet your needs.
If for whatever reason you want to stick to the Solr Rest API, use ConcurrentUpdateSolrServer and play around with the queue size and number of threads parameters. It will normally be a lot faster (up to 200% in my case) than the basic HttpSolrServer.
Don't forget to enable the javabin data serialization like this:
ConcurrentUpdateSolrServer solrServer = new ConcurrentUpdateSolrServer("http://some.solr.host:8983/solr", 100, 4);
solrServer.setRequestWriter(new BinaryRequestWriter());
I'm explicitly showing the code because I believe there might be a small bug here:
If you look at the ConcurrentUpdateSolrServer constructor, you'll see that by default it already sets the request writer to binary:
// The ConcurrentUpdateSolrServer initializes HttpSolrServer objects using this constructor:
public HttpSolrServer(String baseURL, HttpClient client) {
    this(baseURL, client, new BinaryResponseParser());
}
However, after debugging I've noticed that if you don't explicitly call setRequestWriter with the binary writer argument, it will still use the XML serializer.
Going from XML to binary serialization reduces the size of my documents by about a factor of 3 as they are sent to the server. This makes my index times for this case about 150-200% faster.
I have recently tried, and succeeded, in getting something very similar to work. The only difference is that I want to enable the best compression instead of no compression, while Solr defaults to the fastest compression. I also got the "SPI class [...] does not exist" error at some point, and here is what I have found out from various articles, including the ones you have linked to.
Lucene uses SPI to find the codec classes to load. Lucene requires that the list of codec classes be declared in a file named "org.apache.lucene.codecs.Codec", and that file must be on the classpath. To get Solr to load the file: when you create your JAR file "myJarWithCodec-1.10.1.jar", make sure that it contains a file at "META-INF/services/org.apache.lucene.codecs.Codec". The file should have one fully qualified class name per line, like this:
org.apache.lucene.codecs.lucene3x.Lucene3xCodec
org.apache.lucene.codecs.lucene40.Lucene40Codec
org.apache.lucene.codecs.lucene41.Lucene41Codec
org.apache.lucene.codecs.lucene42.Lucene42Codec
fr.company.project.solr.transformers.utils.UncompressedStorageCodec
And in solrconfig.xml, replace:
<codecFactory class="solr.SchemaCodecFactory" />
with:
<codecFactory class="fr.company.project.solr.transformers.utils.UncompressedStorageCodec" />
You might also need to remove postingsFormat="UncompressedStorageCodec" from schema.xml if Solr complains. I think this particular parameter is for specifying the postings format, not the codec. Hope it helps.
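As a quick sanity check (a minimal sketch; it simply exercises the same SPI lookup that produced the error above), you can print the codec names Lucene can see from a given classpath and try to resolve your own codec by name:

import org.apache.lucene.codecs.Codec;

public class ListCodecs {
    public static void main(String[] args) {
        // Names registered via META-INF/services/org.apache.lucene.codecs.Codec
        System.out.println(Codec.availableCodecs());
        // Throws IllegalArgumentException if the SPI registration is missing.
        System.out.println(Codec.forName("UncompressedStorageCodec"));
    }
}

If your codec name shows up here but Solr still complains, the jar is at least registered correctly and the problem lies elsewhere (e.g. in solrconfig.xml).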

libxml2 SAX query

I am trying to parse an XML file using the SAX interface of libxml2 in C.
My problem is that whitespace characters between the end of one tag and the start of the next cause the "characters" callback to be executed. For example, this document:
<?xml version="1.0"?>
<doc>
<para>Hello, world!</para>
</doc>
produces these events:
start document
start element: doc
start element: para
characters: Hello, world!
end element: para
characters:
end element: doc
characters:
end document
It would be really nice if this whitespace were somehow not reported as "characters".
Does anybody have any idea why this is happening, or how it can be prevented?
This is, of course, happening since whitespace between elements is significant in XML. So it's just operating according to specification.
See, for instance, this discussion.
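The usual workaround is simply to ignore character runs that consist of nothing but whitespace in your own handler (with a DTD and a validating parse, such runs may instead be reported via the ignorableWhitespace callback). Here is a minimal sketch, shown with Java's built-in SAX parser rather than libxml2 purely to illustrate that the behaviour is common to conforming SAX parsers:

import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SkipWhitespace {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n<para>Hello, world!</para>\n</doc>";
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length);
                // Ignore runs that are nothing but whitespace between elements.
                if (!text.trim().isEmpty()) {
                    System.out.println("characters: " + text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}

The same trim-and-skip check can be applied inside the libxml2 characters handler.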

Apple iWork Mime Types

I was wondering what the MIME type for iWork's Pages is, and also what the MIME types are for the rest of the software in the iWork suite. I looked around online and didn't see them anywhere.
I recently needed this for work and ended up just uploading some files and querying the MIME types. I found the following:
keynote: application/x-iwork-keynote-sffkey
pages: application/x-iwork-pages-sffpages
numbers: application/x-iwork-numbers-sffnumbers
2021 Update
Please note that this answer is now outdated and the following content types have been approved by IANA:
application/vnd.apple.pages
application/vnd.apple.keynote
application/vnd.apple.numbers
It looks like Apple doesn't much care, since installing iWork does not add any MIME type information to any of the system's MIME-type registries (in /etc/cups and /etc/apache2), "Get Info" on an iWork file shows no MIME type, etc. The only hint I've found is in Pages' Info.plist (a copy is online here), which mentions:
<key>public.filename-extension</key>
<array>
<string>pages</string>
</array>
<key>public.mime-type</key>
<array>
<string>application/x-iwork-pages-sffpages</string>
</array>
and a similar one for filename-extension "template", with "-sfftemplate" as the suffix instead of "-sffpages".
application/vnd.apple.keynote
application/vnd.apple.pages
application/vnd.apple.numbers
These have just been approved with IANA; you will find the list at the link below:
https://www.iana.org/assignments/media-types/media-types.xhtml
You can use mime-db (https://github.com/jshttp/mime-db) to validate them using JavaScript.
This URL shows some other types in case new readers need it:
Apache Jira Issue TIKA-588
application/vnd.apple.keynote, application/vnd.apple.pages, application/vnd.apple.numbers
Actually, those files are all masked zip files, so some systems might report their MIME type simply as application/zip.
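As a quick illustration (a minimal sketch; the file name is hypothetical, and it assumes the document was saved as a single file rather than a package), you can open a .pages document with an ordinary zip library and list its entries:

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class InspectPages {
    public static void main(String[] args) throws Exception {
        // A single-file .pages document is a zip archive under the hood; list what it contains.
        try (ZipFile zip = new ZipFile("example.pages")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName());
            }
        }
    }
}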
