How can I keep changes in the index when I use DIH full-import? - solr

I'm using Solr 6.5 to index files from multiple FTP sources into multiple cores (one core for each type of document, such as audio file, image, software, video and documents).
I'm doing this to populate an app whose front end takes a social-networking approach: every user can add new tags or modify other metadata without restriction.
So when I run the data import handler again to add new files to my application, it erases the index that users previously modified and rebuilds it with the default values from data-config.
My question: is there a way to tell DIH that, if an id already exists, it should skip it and only import the files whose id is not yet in the index?
If this is not possible, can I do something similar in a different way?
Thanks for everything!

Sounds like you are doing a full import with default settings. One of them is clean, which defaults to true and deletes the whole index before the import.
Try setting it to false and also look at preImportDeleteQuery and postImportDeleteQuery for even more precision.
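For example, here is a minimal sketch of triggering the import with clean disabled, using Python and the third-party requests library; the host and core name below are placeholders for your setup, and the same parameters can just as well be passed from a browser or curl:

import requests

# Placeholder host and core name; adjust for your installation.
DIH_URL = "http://localhost:8983/solr/documents/dataimport"

# clean=false stops DIH from deleting the whole index before importing;
# commit=true makes the imported documents visible once the import finishes.
response = requests.get(DIH_URL, params={
    "command": "full-import",
    "clean": "false",
    "commit": "true",
})
print(response.status_code, response.text)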

Related

Alfresco content archiving?

I have a requirement where I need to archive a content item after a particular period, but I should still be able to access the archived item at any time.
I also want to know what Alfresco's out-of-the-box content archiving strategy is and how I can use it.
Can anybody help?
Regards.
Finally, after some googling, I can say that content archiving in Alfresco depends on stores and can be achieved in two ways: common or dedicated.
If there is no specific need for a dedicated archival store, archiving can be implemented in the same content store (say workspace://SpacesStore) using an aspect such as the effectivity aspect or any custom aspect. Adding it tells the system that the content in question is eligible for archival within the same store, and with rules/actions the content can be moved or copied into another space of that store and its permissions updated as and when needed.
If there is a need for a dedicated archival store, say in large repositories or compliance-based systems, a new content store can be added using the Content Store Selector (http://wiki.alfresco.com/wiki/Content_Store_Selector). By applying the cm:storeSelector aspect and setting its cm:storeName property, content can be moved/copied between stores and its permissions managed.
The first approach is mostly configuration, while the second involves more customization.
Happy archiving.
Regards.
P.S. Feel free to add any more approaches.

Difficulty with filename and filemime when using Migrate module

I am using the Drupal 7 Migrate module to create a series of nodes from JPG and EPS files. They import just fine, but when I look at the nodes the migration creates, none of the attached filefield and thumbnail files contain filename information.
Upon inspecting the file_managed table I see that both the filename and filemime fields are empty for ONLY the files that I attached via the Migrate module. This also creates an issue with downloading the files.
Now I think the problem has to do with the fact that I am using "file_link" instead of "file_copy" as the file operation I specify. The problem is that I am importing around 2 TB (that's terabytes) of image files. We had to put in a special request with Rackspace just to get access to that much disk space on our server, so I can't go around copying files from one directory to the next because of space issues. "file_link" therefore seems like the obvious choice.
Now you probably want to see how I am doing this exactly, so here is the code snippet:
$jpg_arguments = MigrateFileFieldHandler::arguments(
  NULL,             // no base path (as in the beer.inc example)
  'file_link',      // file operation: link to the file instead of copying it
  FILE_EXISTS_RENAME,
  'en',             // language
  // source fields for the description, title and alt
  array('source_field' => 'jpg_name'),
  array('source_field' => 'jpg_filename'),
  array('source_field' => 'jpg_filename')
);
$this->addFieldMapping('field_image', 'jpg_uri')
  ->arguments($jpg_arguments);
As you can see, I am specifying no base path (just like the beer.inc example file does). I have set file_link, the language, and the source fields for the description, title, and alt.
It is able to generate thumbnails from the JPGs, but those columns of data are still missing from the db table. I traced through the functions as best I could, but I don't see what is causing this. I tried running the URI from the table through the functions that generate the filename and the filemime, and they output just fine. It is as if something is removing just those segments of data.
Does anyone have any idea what this could be? I am using the Drupal 7 Migrate module version 2.2. It is running on Drupal 7.8.
Thanks,
Patrick
OK, so I have found the answer to yet another of my questions. This is actually an issue with the Migrate module itself; the issue is documented here. I will be retracting this bounty (as soon as I figure out how).

Torrent file protocol - custom field

I am wondering if there is any available field in .torrent files that could be used for custom functionality in someone's implementation of a torrent client? For example, one might want to encode a URL to the file owner's website; someone else, a custom message to be displayed when opening the files; etc. Is something like this feasible in the current implementation of .torrent files?
Yes. .torrent files are just bencoded dictionaries and can hold arbitrary key-value pairs.
The main consideration when adding a custom field is to determine whether it should go into the root of the .torrent or inside the info dictionary.
If it goes into the root, it will not affect the info hash (which is the unique identifier of the torrent), and it will also not be available when downloading magnet links.
If it goes into the info dictionary, it is sort of locked down to the info-hash, in the sense that the info-hash depends on it. It will be transferred as part of the metadata when downloading magnet links and it cannot be changed (without changing the info-hash and thus creating a separate swarm).
So, if it's something you want third parties to be able to change after the torrent has been created, it should go in the root; if you want it to be set once when the torrent is created and never change, it should go in the info dict.
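To make the root-versus-info distinction concrete, here is a small self-contained Python sketch; the key names x-owner-url and x-note are made up for illustration and the torrent dictionary is a toy. It bencodes the dictionary and shows that a root-level custom key leaves the info-hash alone while a key inside the info dict changes it:

import hashlib

def bencode(value):
    # Minimal bencoder covering ints, strings/bytes, lists and dicts.
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be sorted.
        items = sorted((k.encode("utf-8"), v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

def info_hash(torrent):
    # The info-hash is the SHA-1 of the bencoded info dictionary.
    return hashlib.sha1(bencode(torrent["info"])).hexdigest()

torrent = {
    "announce": "http://tracker.example.com/announce",
    "info": {"name": "example.bin", "piece length": 262144, "pieces": b"", "length": 0},
}
original = info_hash(torrent)

torrent["x-owner-url"] = "https://example.com/owner"   # custom key in the root
assert info_hash(torrent) == original                  # info-hash unchanged, same swarm

torrent["info"]["x-note"] = "shown when opening"       # custom key inside info
assert info_hash(torrent) != original                  # info-hash changes, i.e. a different torrent

Clients that don't recognize a custom key will generally just ignore it, which is what makes this kind of extension safe.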

Jackrabbit XPath Issue

I'm relatively new to Jackrabbit. In our application we never turned on the SearchIndex section within repository.xml (and likewise workspace.xml) because we always go directly to a given document using its JCR UUID reference. We are using Jackrabbit v2.2.1 with Oracle as the repository. Now our requirements are expanding: we would like to use the document metadata feature to store contextual info about a document, so that we can use the metadata to retrieve a selected set of documents.
As the first step, I added the default SearchIndex section in workspace.xml file and restarted the JCR.
I saw a bunch of lines like this in my log file, and then I saw that it had created the index folder under the workspace area.
2011-07-05 15:04:01.724 INFO [WebContainer : 0] MultiIndex.java:1204 indexing... /vfs:metaData/21ee130e-978e-415f-bfd1-7aa03d91608c/vfs:attributes (3500)
I have a folder structure like the one below. When I create a document in JCR, I specify the metadata as part of the document; it is defined by a complex XSD type with tags like docType, uploadedBy, contextValue, etc.
/ (root)
  /MyApp (sub-folder)
    /documents/ (sub-folder)
      /document-1.pdf (file)
      /document-2.pdf (file)
    /accounts/ (sub-folder)
      /account.txt (file)
    etc...
The following XPath expression works.
//jcr:root/vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
If I give a wrong value, for example 'TAX' instead of 'TAX_DOCS', it returns no documents, as expected, which is great. This proves that the metadata is stored correctly and is used in the filtering as expected.
The problem with this query is that it starts searching from the root folder but I want to search from /MyApp/documents sub-folder only. So I tried this:
//jcr:root/MyApp/documents//vfs:metaData//*[vfs:attributes/vfs:docType='TAX_DOCS']
It returns nothing. Then I tried this too, but with no success.
//jcr:root/MyApp/documents//*[vfs:metaData/vfs:attributes/vfs:docType='TAX_DOCS']
So what am I doing wrong? Is there anything in the workspace.xml configuration that we need to set, or that is missing?
Any help is appreciated.
Thanks, Jack
Drop the double slashes from everything but the last path step and use the @ notation for the property value, resulting in:
/jcr:root/MyApp/documents//*[vfs:attributes/@vfs:docType='TAX_DOCS']
The // construct looks for the whole subtree instead of just the immediate children like / does. The JCR specification only requires implementations to support the // construct as the last step of the XPath query.

How to export text from all pages of a MediaWiki?

I have a MediaWiki running which represents a dictionary of German terms and their translation to a local dialect. Each page holds one term, its translation and a number of additional infos.
Now, for a printable version of the dictionary, I need a full export of all terms and their translations. Since this is an extract of each page's content, I guess I need a complete export of all pages in their newest version in a parsable format, e.g. XML or CSV.
Has anyone done that or can point me to a tool?
I should mention that I don't have full access to the server (e.g. no command line), but I am able to add MediaWiki extensions or access the MySQL database.
You can export the page content directly from the database. It will be the raw wiki markup, as when using Special:Export. But it will be easier to script the export, and you don't need to make sure all your pages are in some special category.
Here is an example:
SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;
If your wiki uses Postgresql, the table "text" is named "pagecontent", and you may need to specify the schema. In that case, the same query would be:
SET search_path TO mediawiki,public;
SELECT page_title, page_touched, old_text
FROM revision,page,pagecontent
WHERE revision.rev_id=page.page_latest
AND pagecontent.old_id=revision.rev_text_id;
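If you want to script this from outside the wiki (the question mentions having MySQL access but no command line), a rough sketch along the following lines could run the MySQL variant of the query and dump the result to CSV; it assumes Python with the third-party pymysql package and uses placeholder connection settings:

import csv
import pymysql  # third-party MySQL client, assumed to be installed

# Placeholder connection settings; use your wiki's database credentials.
conn = pymysql.connect(host="localhost", user="wikiuser",
                       password="secret", database="wikidb", charset="utf8")

QUERY = """
    SELECT page_title, page_touched, old_text
    FROM revision, page, text
    WHERE revision.rev_id = page.page_latest
      AND text.old_id = revision.rev_text_id
"""

try:
    with conn.cursor() as cur, open("dictionary_dump.csv", "w", newline="") as out:
        cur.execute(QUERY)
        writer = csv.writer(out)
        writer.writerow(["page_title", "page_touched", "wikitext"])
        for title, touched, text in cur:
            # MediaWiki stores these columns as binary, so decode bytes to text.
            row = [v.decode("utf-8", "replace") if isinstance(v, bytes) else v
                   for v in (title, touched, text)]
            writer.writerow(row)
finally:
    conn.close()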
This worked very well for me. Notice I redirected the output to the file backup.xml. From a Windows Command Processor (CMD.exe) prompt:
cd \PATH_TO_YOUR_WIKI_INSTALLATION\maintenance
\PATH_OF_PHP.EXE\php dumpBackup.php --full > backup.xml
I'm not completely satisfied with this solution, but I ended up specifying a common category for all pages; then I can enter that category, and all of the page names it contains, in the Special:Export box. It seems to work, although I'm not sure if it will still work once I reach a few thousand pages.
Export
cd maintenance
php5 ./dumpBackup.php --current > /path/wiki_dump.xml
Import
cd maintenance
php5 ./importDump.php < /path/wiki_dump.xml
It looks less than simple. http://meta.wikimedia.org/wiki/Help:Export might help, but probably not.
If the pages are all structured in the same way, you might be able to write a web scraper with something like Scrapy.
You can use the special page, Special:Export to export to XML; here is Wikipedia's version.
You might also consider Extension:Collection if you eventually want it in a human-readable (e.g. PDF) form.
You can set https://www.mediawiki.org/wiki/Manual:$wgExportAllowAll to true, then export all pages from Special:Export.
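If you would rather drive the export with a script than through the Special:Export form, a hedged sketch using Python's requests library and the standard MediaWiki action API could look like the following; the endpoint URL and page limit are placeholders for your wiki:

import requests  # third-party HTTP client, assumed to be available

API_URL = "https://your.wiki.example/api.php"  # placeholder endpoint

def iter_pages():
    # Walk every page with its latest wikitext via the action API,
    # following the continuation token until all pages have been returned.
    params = {
        "action": "query",
        "generator": "allpages",
        "gaplimit": "50",          # pages per request
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
        "continue": "",
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data.get("query", {}).get("pages", {}).values():
            revs = page.get("revisions")
            if revs:
                yield page["title"], revs[0]["*"]  # title and latest wikitext
        if "continue" not in data:
            break
        params.update(data["continue"])

for title, wikitext in iter_pages():
    print(title, len(wikitext))

From there the wikitext for each term can be parsed into whatever format the printable dictionary needs.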
