I'm relatively new to Jackrabbit. In our application we never turned on the SearchIndex section in the repository.xml (or workspace.xml) files, because we always go directly to a given document using its JCR UUID reference. We are using Jackrabbit v2.2.1 with Oracle as the repository. Now our requirements are expanding: we would like to use the document metadata feature to store contextual info about a document so that we can use the metadata to retrieve a selected set of documents.
As the first step, I added the default SearchIndex section to the workspace.xml file and restarted the JCR repository.
I saw a bunch of lines like this in my log file, and then I saw that it had created the index folder under the workspace area:
2011-07-05 15:04:01.724 INFO [WebContainer : 0] MultiIndex.java:1204 indexing... /vfs:metaData/21ee130e-978e-415f-bfd1-7aa03d91608c/vfs:attributes (3500)
When I create a document in JCR, I specify the metadata info as part of the document, which is defined by a complex XSD type with tags like docType, uploadedBy, contextValue, etc. My folder structure looks like this:
/ (root)
/MyApp (sub-folder)
/documents/ (sub-folder)
/document-1.pdf (file)
/document-2.pdf (file)
/accounts/ (sub-folder)
/account.txt (file)
etc...
The following XPath expression works.
//jcr:root/vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
If I give a wrong value, for example 'TAX' instead of 'TAX_DOCS', it returns no documents, as expected, which is great. This proves that the metadata is stored correctly and is used in the filtering as expected.
The problem with this query is that it starts searching from the root folder, but I want to search only from the /MyApp/documents sub-folder. So I tried this:
//jcr:root/MyApp/documents//vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
It returns nothing. Then I tried this too, but with no success.
//jcr:root/MyApp/documents//*[vfs:metaData/vfs:attributes/vfs:docType='TAX_DOCS']
So what am I doing wrong? Is there anything in the workspace.xml configuration that we need to set, or that is missing?
Any help is appreciated.
Thanks, Jack
Drop the double slashes from anything but the last path component and use the @ notation for the attribute value, resulting in:
/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']
The // construct looks for the whole subtree instead of just the immediate children like / does. The JCR specification only requires implementations to support the // construct as the last step of the XPath query.
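For completeness, here is a minimal sketch (not from the original post) of running the corrected query through the JCR API; it assumes you already have an open javax.jcr.Session for the workspace:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

public class TaxDocQuery {
    public static void printTaxDocs(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        // Query.XPATH is deprecated in JCR 2.0 but still supported by Jackrabbit 2.x
        Query query = qm.createQuery(
            "/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']",
            Query.XPATH);
        QueryResult result = query.execute();
        for (NodeIterator it = result.getNodes(); it.hasNext(); ) {
            Node node = it.nextNode();
            System.out.println(node.getPath());  // path of each matching node
        }
    }
}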
I am attempting to understand the 2020q1 data set found here: https://www.sec.gov/dera/data/financial-statement-data-sets.html, and am using the reference documentation inside the 2020q1 folder as a "readme" file. The reference documentation specifies that within the Presentation (pre) data set, the "report" field is a numeric (integer) whose "value refers to the 'R file' as posted on the EDGAR Web site." I have found no such file after an extensive search, and am left with no method of interpreting the "report" field and all of its associated data. Please link to the appropriate R file or guide me in the right direction for assistance if possible. Thanks!
A point of clarification up front, because this confused me as well: the "R file" in question is not a script file in the R language. Instead, it simply seems to be a report file that holds the formatted data.
After digging deeper into the readme, I found the following detail in the description of the SUB.txt data:
Note: To access the complete submission files for a given filing, please see the SEC EDGAR website. The SEC website folder HTTP(s)://www.sec.gov/Archives/edgar/data/{cik}/{accession}/ will always contain all the data sets for a given submission. To assemble the folder address to any filing referenced in the SUB data set, simply substitute {cik} with the cik field and replace {accession} with the adsh field (after removing the dash character). The following sample SQL Query provides an example of how to generate a list of addresses for filings contained in the SUB data set:
· select name,form,period, 'http(s)://www.sec.gov/Archives/edgar/data/' + ltrim(str(cik,10))+'/' + replace(adsh,'-','')+'/'+instance as url from SUBM subm order by period desc, name
Therefore, it looks like we have to correlate each "adsh" submission ID with the "cik" company ID in order to get the link we are looking for.
Doing this for the first entry of pre.txt, we get an adsh value of "0001032208-20-000006". I simply searched through sub.txt with Notepad and found its associated cik of "1032208", which belongs to "SEMPRA ENERGY". Therefore, we generate the following link: http://www.sec.gov/Archives/edgar/data/1032208/000103220820000006
From there, we find a directory of files associated with the given submission. Inside is a collection of files with the prefix "R". Simply clicking on them will open them in your browser, and using the "report" and "line" fields we can then work out which file we want. Notice that we can append "/R{number}.htm" to the link we generated in order to open a given report number directly.
If you know what you are looking for, doing this by hand with "ctrl+f" find functionality should be fine. Otherwise, you may want to open these docs in Excel to generate the links for you.
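If you would rather script it, here is a minimal sketch (not part of the readme; the cik and adsh values are the ones from the example above, and the report number is just a placeholder) that builds the folder and report URLs:

public class EdgarUrl {
    public static void main(String[] args) {
        String cik = "1032208";               // cik field from sub.txt
        String adsh = "0001032208-20-000006"; // adsh field from pre.txt / sub.txt
        int report = 2;                       // "report" field from pre.txt (placeholder)

        // Folder holding the complete submission, as described in the readme note above
        String folder = "https://www.sec.gov/Archives/edgar/data/"
                + cik + "/" + adsh.replace("-", "") + "/";
        // Individual formatted report within that folder
        String reportFile = folder + "R" + report + ".htm";

        System.out.println(folder);
        System.out.println(reportFile);
    }
}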
I have a Solr/Lucene setup in which I have indexed a set of documents (MS Word files) and can happily search the content of these documents. However, I would also like to return a snippet from within the content of the document that shows where the matching line is (+/- 5 words around the match term). I have tried to follow a range of Google hits, but my index does not seem to give me direct access to the "content".
Can anyone give me some basic and simple pointers as to where I might have made an error? I have based all my work so far on the guidance and examples of the Solr Reference Guide, so I am not sure whether the issue is in the search parameters or in the original index.
I am doing this to create a clear set of user requirements for building an end solution rather than creating the end solution myself, so I am no expert on the tools and do not need to become one; I just need to evidence what is possible with this tool set.
As MatsLindh noted above, the issue was that the config was not copying the actual content of the Tika parse across into a specific field, so there was no full text content to display and highlight.
To resolve this I followed the link (https://lucene.apache.org/solr/guide/7_1/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler) to the guidance documents, reviewed the part on fmap, and used the example given for Last Modified Date as a guide on what to apply.
I then went to my solrconfig.xml file in the relevant core folder and added the following line beneath an already present fmap entry:
<str name="fmap.content">testcontent</str>
I had previously set up the testcontent field under the Solr web interface in my core. I then re-ran my indexing line via a command prompt, and that seemed to do the trick in terms of pulling out the basic content and wrapping it with a basic emphasis.
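For reference, once the testcontent field is populated, a select request along these lines returns the highlighted snippets (the core name and the search term are placeholders; hl, hl.fl, hl.snippets and hl.fragsize are standard Solr highlighting parameters):

http://localhost:8983/solr/mycore/select?q=testcontent:budget&hl=on&hl.fl=testcontent&hl.snippets=1&hl.fragsize=100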
Thanks all for the input on this - still a lot more I want to test to help develop a clear requirement set, but this really helps prove that some of the basics are not complicated.
I'm using Solr 6.5 to index files from multiple FTP sources into multiple cores (one core for each type of document: audio file, image, software, video and documents).
The situation is that I'm doing this to populate an app whose front end has a social networking approach, in which every user can add new tags or modify other metadata without restriction.
So when I run the Data Import Handler again to add new files to my application, it erases the index entries that were previously modified by the users and recreates them with the data-config default configuration.
My question: is there a way to tell DIH that, if an id already exists, it should skip it and only add the files whose id is not yet in the index?
If this is not possible, can I do something similar in a different way?
Thanks for everything!
Sounds like you are doing a full import with default settings. One of them is clean, which defaults to true and deletes the whole index before the import.
Try setting it to false and also look at preImportDeleteQuery and postImportDeleteQuery for even more precision.
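For example, a full import that keeps the existing documents can be triggered with a request like this (the core name is a placeholder; command, clean and commit are standard DIH parameters):

http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&commit=true

preImportDeleteQuery and postImportDeleteQuery are set as attributes on the root entity in data-config.xml, in case you only want a specific subset of documents deleted before or after the import.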
I have been able to set up and search through some documents from a database using this tutorial: http://www.ibm.com/developerworks/opensource/library/os-xapianomega/index.html?cmp=dw&cpb=dwope&ct=dwnew&cr=dwnen&ccy=zz&csr=110410
The data field is added to every document in the indexing process started with this bash call:
$ omindex --db info --url information /mnt/data0/Information
The call indexes all the files in the directory at /mnt/data0/Information and saves them in a database named info. According to the last section of the documentation here:
http://xapian.org/docs/omega/overview.html
you can set the fields that go into the data field of a document by editing the OmegaScript template, but I have not been able to find this template anywhere. I am hoping I can get some guidance from someone who is familiar with editing an OmegaScript template to set up the data field.
I ultimately want data to have the following fields:
sample
caption
type
The standard ones without the url field.
OmegaScript templates are used by omega to render search results (in its web interface), and are stored in the template_dir as mentioned in the IBM tutorial section on the Omega web interface. omindex will have created the fields you require — that documentation also mentions that the OmegaScript command you want to extract those fields is $field{}, which is documented along with all the OmegaScript commands.
So to just display the three fields you would want a fragment of OmegaScript something like:
$hitlist{
Sample: $field{sample}
Caption: $field{caption}
MIME type: $field{type}
}
(which isn't formatted as HTML, but has the advantage of being hopefully clearer as to what is happening).
I am using the Drupal 7 Migrate module to create a series of nodes from JPG and EPS files. I can get them to import just fine. But I notice that when I am done importing them, if I look at the nodes it creates, none of the attached filefield and thumbnail files contain filename information.
Upon inspecting the file_managed table I see that both the filename and filemime fields are empty for ONLY the files that I attached via the migrate module. This also creates an issue with downloading the files.
Now I think the problem has to do with the fact that I am using "file_link" instead of "file_copy" as the file operation I specify. The problem is I am importing around 2TB (that's terabytes) of image files. We had to put in a special request with Rackspace just to get access to that much disk space on our server. So I can't go around copying from one directory to the next because of space issues. So "file_link" seems like the obvious choice.
Now you probably want to see how I am doing this exactly, so here is the code snippet:
$jpg_arguments = MigrateFileFieldHandler::arguments(NULL,
'file_link', FILE_EXISTS_RENAME, 'en', array('source_field' => 'jpg_name'),
array('source_field' => 'jpg_filename'), array('source_field' => 'jpg_filename'));
$this->addFieldMapping('field_image', 'jpg_uri')
->arguments($jpg_arguments);
As you can see I am specifying no base path (just like the beer.inc example file does). I have set file_link, the language, and the source fields for the description, title, and alt.
It is able to generate thumbnails from the JPGs, but those columns of data are still missing in the db table. I traced through the functions as best I could, but I don't see what is causing this. I tried running the uri in the table through the functions that generate the filename and the filemime, and they output just fine. It is like something is removing just those segments of data.
Does anyone have any idea what this could be? I am using the Drupal 7 Migrate module version 2.2. It is running on Drupal 7.8.
Thanks,
Patrick
OK, so I have found the answer to yet another question of mine. This is actually an issue with the Migrate module itself. The issue is documented here. I will be withdrawing this bounty (as soon as I figure out how).