I need a sitemap that can help both people and Google discover the site's pages.
I've tried the WebSphinx application.
I realize if I put wikipedia.org as the starting URL, it will not crawl further.
So how do I actually crawl the whole of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?
Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?
Crawling Wikipedia is a bad idea: it is hundreds of TBs of data uncompressed. I would suggest working offline with the various dumps provided by Wikipedia instead. You can find them here: https://dumps.wikimedia.org/
You can create a sitemap for Wikipedia using the page meta information, external links, interwiki links and redirects databases, to name a few.
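As a rough illustration, here is a minimal Python sketch that turns the article-titles dump into sitemap entries. The dump file name (enwiki-latest-all-titles-in-ns0.gz) and its one-title-per-line layout are assumptions based on the current dump naming, and real sitemaps are limited to 50,000 URLs per file, so a full Wikipedia sitemap would need to be split into a sitemap index:

import gzip
from urllib.parse import quote

# Assumed input: enwiki-latest-all-titles-in-ns0.gz from
# https://dumps.wikimedia.org/enwiki/latest/ (one article title per line).
with gzip.open("enwiki-latest-all-titles-in-ns0.gz", "rt", encoding="utf-8") as titles, \
        open("sitemap.xml", "w", encoding="utf-8") as sitemap:
    sitemap.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    sitemap.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for line in titles:
        title = line.strip()
        if not title or title == "page_title":  # skip blanks and a possible header row
            continue
        # Titles in the dump use underscores; escape anything else unsafe in a URL.
        sitemap.write("  <url><loc>https://en.wikipedia.org/wiki/%s</loc></url>\n"
                      % quote(title, safe="_()"))
    sitemap.write("</urlset>\n")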
We have to extract almost 1,000 documents for a divestiture. Doing it by clicking is going to take a long time. I know Jive has an API, but I can't find anything that would let us download multiple files from multiple groups.
Any ideas are appreciated.
Thanks!
Sure. Use /contents/{contentID} to grab a document.
There's more detail in the Document Entity Section of the Jive REST API Documentation.
You might find your list of documents to retrieve by using the Search methods of the API. Here's a curl example:
curl -v -u <username:password> "https://<url>/api/core/v3/search/contents?filter=search(<search term>)"
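If you want to script the whole extraction, here is a minimal Python sketch built around the two endpoints above. It uses the requests library; the "list" and "contentID" field names and the handling of Jive's JSON security prefix are assumptions based on typical v3 responses, so verify them against your instance (binary file attachments may need an extra download step):

import json
import requests

BASE = "https://<url>/api/core/v3"      # placeholder instance URL
AUTH = ("<username>", "<password>")     # placeholder credentials

def jive_json(response):
    # Jive commonly prefixes JSON with a "throw ...;" guard line; strip it if present.
    text = response.text
    if text.startswith("throw"):
        text = text.split("\n", 1)[1]
    return json.loads(text)

# Find the documents via search, then fetch each one by contentID and save it.
search = requests.get(BASE + "/search/contents",
                      params={"filter": "search(<search term>)"},
                      auth=AUTH)
for item in jive_json(search).get("list", []):
    content_id = item["contentID"]
    doc = requests.get(BASE + "/contents/" + str(content_id), auth=AUTH)
    with open("content-%s.json" % content_id, "w", encoding="utf-8") as out:
        json.dump(jive_json(doc), out, indent=2)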
Also, just so you know, there is an active Jive Developer Community where questions like this are likely to get more eyeballs. And, as a starting point for developing with Jive in general, check out https://developer.jivesoftware.com/
I have a Drupal website that has a ton of data on it. However, people can quite easily scrape the site because Drupal's classes and IDs are pretty consistent.
Is there any way to "scramble" the code to make it harder to use something like PHP Simple HTML Dom Parser to scrape the site?
Are there other techniques that could make scraping the site a little harder?
Am I fighting a lost cause?
I am not sure if "scraping" is the official term, but I am referring to the process by which people write a script that "crawls" a website and parses sections of it in order to extract data and store it in their own database.
First, I'd recommend searching for "web scraping" and "anti-scraping"; you'll find some tools for fighting web scraping.
As for Drupal, there should be some anti-scraping modules available (search for those as well).
You might also be interested in my categorized overview of anti-scraping techniques in another answer; it covers both technical and non-technical approaches.
I am not sure, but I think it is quite easy to crawl a website where all the content is public, no matter whether the IDs are sequential or not. You should take into account that if a human can read your Drupal site, a script can too.
Depending on your site's nature, if you don't want your content to be indexed by others, you should consider requiring registered-user access. Otherwise, I think you are fighting a lost cause.
I have managed to get apache nutch to index a news website and pass the results off to Apache solr.
Using this tutorial
https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup; the only difference is that I have decided to use Cassandra instead.
As a test I am trying to crawl CNN to extract the titles of articles and the dates they were published.
Question 1:
How do I parse data from the web page to extract the date and the title?
I have found this article about a plugin. It seems a bit outdated and I am not sure that it still applies. I have also read that Tika can be used, but again, most tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another SO article is this
How to extend Nutch for article crawling. I would prefer to use Nutch only because that is what I started with, but I do not really have a strong preference.
Anything would be a great help.
Norconex HTTP Collector will store with your document all possible metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.
That is likely more fields than you need. If so, you can reject the ones you do not want or, instead, be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
fields="title,pubdate,anotherone,etc"/>
You'll find how to get started quickly along with configuration options here: http://www.norconex.com/product/collector-http/configuration.html
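If you just want to sanity-check which title and date values an article page actually exposes before wiring up Nutch or Norconex, a quick standalone script can help. This is a Python sketch using only the standard library; the meta tag names it looks for (article:published_time, pubdate, date) are common conventions rather than guarantees, and the User-Agent header is there because some sites reject bare requests:

# Standalone check of the title and publication-date metadata a page exposes.
# Standard library only; the meta tag names below vary by publisher.
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TitleDateParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.pubdate = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Common conventions; adjust for the site you are crawling.
            if attrs.get("property") == "article:published_time" or \
               attrs.get("name") in ("pubdate", "date"):
                self.pubdate = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

req = Request("https://edition.cnn.com/",
              headers={"User-Agent": "Mozilla/5.0 (metadata check)"})
html = urlopen(req).read().decode("utf-8", "replace")
parser = TitleDateParser()
parser.feed(html)
print("title:", parser.title)
print("pubdate:", parser.pubdate)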
I have a website made with CakePHP 1.3.7. This website has its own login system. Now the client wants to include a forum in the website.
I've been looking at different free solutions, and phpBB and SMF seem to be what I'm looking for. The only thing I'm not so sure about is integrating those forums with the login system that I already have.
I mean, if a user has already an account for the website (or creates a new one), he/she should be able to use that same account (username) in the forum section.
Is that possible? Any clue pointing me in the right direction would be much appreciated! I mentioned both forum solutions in case one is easier to integrate than the other; that would also be good to know (or if there's any better option).
Thanks so much in advance!
It's possible to use both, but I personally prefer SMF. You have to configure CakePHP's session component to use database sessions and create a model that will use the forum's sessions table.
You can decide whether you want or need a separate users table besides the forum's users table (or it's called members; I don't recall right now).
The "hard" part is to make the Cake app read and write the sessions and cookies in the same fashion SMF does, to allow a smooth transition from the Cake app to the forum and back.
Technically you can use either forum and achieve your goal with both; it's just a matter of using the frameworks' components correctly.
I ended up using: this
It has all that I needed and integrates perfectly into Cake :)
Search engine bots crawl the web and download each page they go to for analysis, right?
How exactly do they download a page? How do they store the pages?
I am asking because I want to run an analysis on a few webpages. I could scrape the page by going to the address but wouldn't it make more sense to download the pages to my computer and work on them from there?
wget --mirror --convert-links --page-requisites https://<url>/
Try HTTrack
About the way they do it:
The indexing starts from a designated starting point (an entrance, if you prefer). From there, the spider recursively follows all hyperlinks up to a given depth.
Search engine spiders work like this as well, but there are many of them crawling simultaneously, and other factors count too. For example, a newly created post here on SO will be picked up by Google very quickly, but an update on a low-traffic website may only be picked up days later.
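For your own analysis, a minimal depth-limited crawler along those lines can be sketched in a few lines of Python (standard library only). This is just an illustrative sketch with an assumed seed URL; a real crawler would also respect robots.txt, rate-limit its requests, and deduplicate content:

# Minimal breadth-first crawler: follows links from a seed URL up to a fixed
# depth and saves each fetched page to disk for later analysis.
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, max_depth=2, out_dir="pages"):
    os.makedirs(out_dir, exist_ok=True)
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except Exception:
            continue
        # Store the raw page; search engines keep a similar stored copy for indexing.
        fname = urlparse(url).netloc + urlparse(url).path.replace("/", "_")
        with open(os.path.join(out_dir, fname or "index"), "w", encoding="utf-8") as f:
            f.write(html)
        if depth < max_depth:
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))

crawl("https://example.com/", max_depth=1)  # assumed seed URL for illustration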
You can use the debugging tools built into Firefox (or Firebug) and Chrome to examine how the page works. As far as downloading pages directly, I am not sure. You could try viewing the page source in your browser and then copying and pasting the code.