In our company there are many projects, each containing various information, e.g. source code, project documentation, bug reports or emails. This information is not kept in a central place, so if you want to look up how a problem was solved in a past project, you have to search for it yourself.
The idea now is to build a searchable project archive. We want to use Apache Solr to create a web app with which you can search this information.
Indexing PDF, Word or Java files is not the problem in this case. The question is what the best way is to gather all the files from the different systems. The documents live in systems like MS SharePoint, Atlassian Confluence, Jira, SVN or Git.
What is the best strategy to export all the information from these systems and gather it in a central place where the indexing can be done easily, ideally automatically?
I am working on implementing a research web application or portal that integrates different research portals or websites using an open source platform called search kit. The web application will act as a central point of access to research publications on the different research portals. To do this, I also need to implement a third-party system that does the following:
Searches for documents on the different research portals based on a user query and displays the results to users on my web application.
Indexes the documents.
Can be used by system administrators to configure the web application, whereby they can add, remove or modify the URLs of the websites Solr pulls documents from.
Displays the results to the user in one standard format.
My question is: can Apache Solr be used to implement this third-party system? If not, what open source platform or approach would you recommend I use to implement it?
In general, Solr seems like a good fit here, but you might need some custom code (apart from configuration) here and there. To go through the points:
Querying is one of the main features of Solr, so this is definitely possible.
Indexing is handled by Solr.
There was a component for Solr called "Data Import Handler" that supported indexing from URLs (see the docs). However, this was removed from the main Solr distribution, and was moved to a separate package. This package doesn't seem to be actively maintained though, so you will probably run into some problems if you decide to use it. The alternative is to develop your document-pulling code yourself.
Solr can return results in multiple formats, but it still might not support the exact format you need. In that case, you need to build your own transformation on top of the Solr response; see the sketch after this list.
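Below is a minimal sketch of what that custom code could look like, written in Python against Solr's HTTP API. The core name "portal", the dynamic field names and the output shape are all assumptions for illustration, not something Solr or the question prescribes; it just shows pulling a document from a configured URL, indexing it, and reshaping the query response for display.

```python
# A rough sketch, assuming a local Solr with a core named "portal" and the
# default managed schema's dynamic fields (*_s, *_txt). Names are illustrative.
import requests

SOLR = "http://localhost:8983/solr/portal"   # assumed core name


def index_page(url):
    """Fetch a page from one of the configured portals and add it to Solr."""
    html = requests.get(url, timeout=30).text
    doc = {"id": url, "url_s": url, "content_txt": html}
    # Posting a JSON array of documents to /update adds them to the index.
    resp = requests.post(f"{SOLR}/update?commit=true", json=[doc], timeout=30)
    resp.raise_for_status()


def search(user_query):
    """Query Solr and reshape the response into one flat display format."""
    params = {"q": user_query, "wt": "json", "rows": 20}
    data = requests.get(f"{SOLR}/select", params=params, timeout=30).json()
    # Transform Solr's response structure into whatever the web app expects.
    return [{"title": d.get("id"), "link": d.get("url_s")}
            for d in data["response"]["docs"]]
```

The administrator-facing part would then mostly consist of maintaining the list of URLs that gets fed to index_page, e.g. in a small database table.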
We have an experimentation tool that uses YAML config files to run experiments and deploy models. We made this choice some time ago to integrate better with Kubernetes orchestration.
Right now, we have hundreds of historical experiments, and we are stuck trying to index them for querying. I have seen several questions about converting YAML files to JSON for indexing, but we would like to keep them as YAML. I previously found YAMLDB from another question; however, it has no support for querying and isn't tied to Python, which we'd like for interoperability.
Would anyone have pointers to any repos, packages or libraries that do this (or perhaps Mongo extensions, if they exist)? In-progress/alpha code is also okay.
Thank you.
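In case it helps frame the problem, here is a rough sketch of one possible approach (my own assumption, not an existing package): parse each YAML file with PyYAML, store the parsed document in MongoDB next to the untouched YAML text, and query on the parsed fields. Paths and names are made up for illustration.

```python
# Sketch: index YAML experiment files into MongoDB while keeping the raw YAML.
import pathlib

import yaml                      # PyYAML
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # assumed local MongoDB
experiments = client["experiments_db"]["experiments"]    # assumed db/collection

for path in pathlib.Path("experiments").glob("**/*.yaml"):
    raw = path.read_text()
    doc = yaml.safe_load(raw)
    if not isinstance(doc, dict):
        doc = {"parsed": doc}    # handle YAML files whose top level isn't a mapping
    doc["_source_path"] = str(path)
    doc["_raw_yaml"] = raw       # keep the original YAML for round-tripping
    experiments.replace_one({"_source_path": str(path)}, doc, upsert=True)

# Queries then work on any parsed field, e.g.:
# experiments.find({"model.learning_rate": {"$lt": 0.01}})
```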
I'm feeling stupid, but I want to know how GitHub and Dropbox store user files, because I have a similar problem and need to store users' project files.
Is it just a matter of storing the project files somewhere on the server and referring to the location in a database field, or are there better methods?
Thanks.
GitHub uses Git to store repositories, and accesses those repos from their Ruby application. They used to do this with Grit, a Ruby library. Grit was written to implement Git in Ruby but has been replaced with rugged. There are Git reimplementations in other languages like JGit for Java and Dulwich for Python. This presentation gives some details about how GitHub has changed over the years and is worth watching/browsing the slides.
If you wanted to store Git repositories, what you'd do is store them on a filesystem (or a cluster thereof), keep a pointer in your database to where the repository is located, and then use a library like Rugged, JGit or Dulwich to read from the Git repository.
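For example, a minimal sketch of that pattern using Dulwich (the repository path would come from your own database lookup; the path here is just an assumption):

```python
# Sketch: read basic information out of a stored Git repository with Dulwich.
from dulwich.repo import Repo

repo_path = "/srv/repos/user42/project.git"   # looked up from your DB in practice
repo = Repo(repo_path)

head = repo[b"HEAD"]                          # commit object at HEAD
print("HEAD commit:", head.id.decode())
print("Message:", head.message.decode(errors="replace"))

# List the top-level entries of the tree for that commit.
tree = repo[head.tree]
for entry in tree.items():                    # TreeEntry(path, mode, sha)
    obj = repo[entry.sha]
    kind = "dir " if obj.type_name == b"tree" else "file"
    print(kind, entry.path.decode())
```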
Dropbox stores files on Amazon's S3 service and then implements some wrappers around that for security and so on. This paper describes the protocol that Dropbox uses.
The actual question you've asked is how do you store user files. The simple answer is... on the filesystem. There are plugins for a lot of popular web frameworks for doing user file uploads and file management. Django has Django-Filer for instance. The difficulty you'll encounter in rolling your own file upload management system is building a sensible way to do permissions (so users can only download the files they are entitled to download), so it is worth looking into how the various framework plugins do it.
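As a point of reference, this is roughly what the "file on the filesystem, path in the database" approach looks like with a plain Django FileField (a sketch, not Django-Filer itself; the model and field names are made up):

```python
# Sketch: the uploaded file lands under MEDIA_ROOT, the database row only
# stores its relative path plus whatever you need for permissions.
from django.conf import settings
from django.db import models


class ProjectFile(models.Model):
    owner = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    # FileField writes the file to MEDIA_ROOT/user_uploads/ and keeps only
    # the relative path in this column.
    upload = models.FileField(upload_to="user_uploads/")
    created = models.DateTimeField(auto_now_add=True)

    def can_download(self, user):
        # Simplest possible permission rule: only the owner may download.
        return user == self.owner
```

The hard part, as noted above, is the permission logic around can_download, which the framework plugins already handle in more sophisticated ways.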
I am running Sitecore 6.5
I have two installations of Sitecore and want to transfer a whole site from one installation to another.
I have found a few articles that cover serialization and creating a package, although they don't go into detail about how the two fit together.
How do I transfer a site from one installation to another?
Thanks.
Create a package with the package designer.
Include these items and their children with the "Items statically" button. If you have placed your solution-specific items in folders, you only need to include those folders:
/sitecore/content
/sitecore/layout
/sitecore/media library
/sitecore/templates/ (only take the templates you have created, e.g. the "User Defined" folder)
Using the "Files statically" button, include the folders you have made solution-specific changes to, such as:
/bin
/layouts
/App_Config/Include (only take the files changed in the solution, compared to a default Sitecore installation)
web.config (if you have made changes to this, compared to the default Sitecore web.config)
If you have any user accounts you want to transfer, you can include them with "Security accounts".
Then generate the zip file, install it on the empty Sitecore installation, and do a full publish :)
If your systems are similar enough, you may want to consider moving the Sitecore DBs via backup/restore (in SQL) and copying over filesystem assets. Generally I find this faster and less prone to user error than creating/installing very large packages. (Just remember to take back-ups first.)
Large packages have a tendency to break; one option would be to look into this:
http://www.hhogdev.com/Products/Team-Development-for-Sitecore/Overview.aspx
TDS can sync all your items to XML on your dev box, and from that you can create a different sort of installation package which is significantly more robust than a regular package created through the Sitecore desktop. It's the same sort of package that Sitecore uses when you upgrade versions.
I believe there is a 60-day trial of this product, so there's plenty of time to try it out.
Note: when transferring user accounts, passwords will not be migrated when using either packages or serialization.
Solution is here - cowboy-aspx from Sitecore :)
https://kb.sitecore.net/articles/242631
I'm currently working on a web archiving project. Basically, what we're trying to do is archive a collection of websites (using the Heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search across the archives. Currently, the index is generated using NutchWAX (a customised version of Apache Nutch, tailored to index .warc files as generated by Heritrix). NutchWAX dumps out a Lucene index, and to use it in Solr, all that has to be done is to generate a matching schema.
This is all done and it's running as it should; however, the archive is not static and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one and import it back into Solr. However, to do that, Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first thought was to generate .xml files from the Lucene index and post them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably leverage multiple cores to accomplish what you need. See the Solr Wiki - CoreAdmin for more details. I think you could use the MERGEINDEXES capability or the ability to SWAP cores for a better experience in your scenario.
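For illustration, here is a rough sketch of both CoreAdmin options over plain HTTP (the core names and the index path are assumptions): MERGEINDEXES folds a freshly generated Lucene index into a live core, and SWAP lets you build a new core offline and switch it in atomically, in both cases without restarting Solr.

```python
# Sketch: update the archive index "on the fly" via Solr's CoreAdmin API.
import requests

SOLR = "http://localhost:8983/solr"
ADMIN = f"{SOLR}/admin/cores"

# Option 1: merge a new on-disk Lucene index into the running "archive" core.
requests.get(ADMIN, params={
    "action": "MERGEINDEXES",
    "core": "archive",
    "indexDir": "/data/nutchwax/new-index",   # assumed path to the new index
}).raise_for_status()
# The merged documents become visible after a commit on the target core.
requests.get(f"{SOLR}/archive/update", params={"commit": "true"}).raise_for_status()

# Option 2: build the full index into a standby core, then swap it with the
# live one so searches switch over without downtime.
requests.get(ADMIN, params={
    "action": "SWAP",
    "core": "archive",
    "other": "archive_staging",
}).raise_for_status()
```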