arachnode.net webpages table is huge - database

I have used the arachnode.net crawler to crawl a website, and the resulting crawl data has produced a database over 100 GB in size!
I have looked around the arachnode.net database and found the table "webpages" to be the culprit. When I crawl a website I do not download images, media, or anything alike; I only download the HTML code. However, in this case I can see that the HTML webpages contain a huge amount of hidden ViewState data and JavaScript.
So I need to do the crawling again, and this time strip out the hidden ViewState and JavaScript code before saving to the webpages table.
Does anyone have an idea of how to achieve this?
Thanks.

Yes, you can write a plugin which modifies CrawlRequest.Data and CrawlRequest.DecodedHtml before the data is inserted into the database.
Create a PostRequest CrawlAction as shown here: http://arachnode.net/Content/CreatingPlugins.aspx
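For the stripping itself, here is a minimal sketch of the idea using PHP's DOMDocument (the actual plugin would do the equivalent in .NET against CrawlRequest.DecodedHtml; the function name here is made up):

<?php
// Sketch only: remove <script> blocks and ASP.NET hidden state fields
// (__VIEWSTATE, __EVENTVALIDATION, ...) from a page before storing it.
function strip_viewstate_and_scripts($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from malformed real-world HTML
    $xpath = new DOMXPath($doc);

    // Drop every <script> element.
    foreach (iterator_to_array($xpath->query('//script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Drop hidden inputs whose names start with "__" (ViewState et al.).
    $hidden = $xpath->query("//input[@type='hidden' and starts-with(@name, '__')]");
    foreach (iterator_to_array($hidden) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}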

Related

Saving PDF content to a database with navigation preserved

Good day, all.
I am currently doing some research on the best way to save PDF content in a database (as HTML content), with the ability to navigate via the table of contents later on the front end when rendered as HTML.
The table of contents could also have nested entries. I am currently looking at saving the table of contents as JSON data in the database, and the PDF pages as separate columns in another table, but that doesn't look like a good option, as updating later wouldn't be straightforward.
Any pointers to resources or a proposed solution would be appreciated.
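For what it's worth, a minimal sketch of one shape this could take: the nested table of contents serialized as a single JSON column, and one row per page rather than one column per page, which keeps later updates simple. All table and column names here are made up for illustration:

<?php
// Nested TOC: each entry carries a title, a target page, and child entries.
$toc = [
    ['title' => 'Chapter 1', 'page' => 1, 'children' => [
        ['title' => 'Section 1.1', 'page' => 2, 'children' => []],
    ]],
    ['title' => 'Chapter 2', 'page' => 5, 'children' => []],
];

$pdo = new PDO('mysql:host=localhost;dbname=docs', 'user', 'pass');

// One row per document, with the whole TOC as JSON.
$pdo->prepare('INSERT INTO documents (title, toc_json) VALUES (?, ?)')
    ->execute(['My PDF', json_encode($toc)]);
$docId = $pdo->lastInsertId();

// One row per page of HTML, so a single page can be updated in place.
$pdo->prepare('INSERT INTO document_pages (document_id, page_number, html) VALUES (?, ?, ?)')
    ->execute([$docId, 1, '<h1>Chapter 1</h1>...']);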

Where to store an image if the image is used on two different websites

I want to upload an image to my website. The same image will be shown on another website, so this is a scenario where an image is uploaded on one website and displayed on both. The two websites are hosted on two different servers, and each has its own database.
I am using AngularJS, Entity Framework, Web API, and SQL Server 2014 as the backend for both websites. Currently I am using ngFileUpload to upload the images. Please answer the questions below:
1. Should I store the image in the database (as nvarchar(max)) or on the filesystem (FTP or the local web server's file system)? I have read many articles and learned that retrieving images from the database affects performance but is more secure, while the file system performs well but is more complex to maintain (backups and the like). So I just can't decide between the two, as both have pros and cons. Which option is more suitable for my requirement, where the same image is displayed on both websites? Please note that there can be big images, up to 5 MB, uploaded to the application, but the number of images will not be huge compared to a social networking or online shopping site.
2. How can I create different sizes of images (thumbnail, medium, large, etc.) automatically upon upload? Is there a tool or directive already available in AngularJS to achieve this?
I know my question is broad, but I need suggestions to get started with my requirements.
Please help.
I use the file system to host my images. If an image is displayed on someone's screen, they can use image-capture software to copy it anyway. And while storing images in a database may be more secure, I don't need the extra overhead in my code when a simple URL to retrieve the image will suffice.
As for resizing an image using Angular, check out these links:
https://www.scientiamobile.com/page/angular-image-resize
https://github.com/FBerthelot/angular-images-resizer
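If you would rather generate the sizes once server-side when the upload lands (your backend is Web API/.NET, so treat this PHP GD sketch purely as an illustration of the step; the widths and naming scheme are assumptions):

<?php
// Generate thumbnail/medium/large variants of an uploaded image, then
// serve both websites plain URLs to the resulting files.
function make_variants($sourcePath, $outputDir)
{
    $widths = ['thumb' => 150, 'medium' => 600, 'large' => 1200];
    $src    = imagecreatefromstring(file_get_contents($sourcePath));
    $name   = pathinfo($sourcePath, PATHINFO_FILENAME);
    $paths  = [];

    foreach ($widths as $label => $width) {
        $scaled = imagescale($src, $width); // height follows the aspect ratio
        $path   = rtrim($outputDir, '/') . "/{$name}_{$label}.jpg";
        imagejpeg($scaled, $path, 85);      // 85 = JPEG quality
        imagedestroy($scaled);
        $paths[$label] = $path;
    }

    imagedestroy($src);
    return $paths;
}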

One database, same content, two WordPress sites, different structure? Or another solution to have two websites with the same content?

I have a WordPress site that shows posts it gets from RSS using a plugin.
I want to split the "jobs" across two different websites, because in this configuration it is not working properly (there is a high number of RSS sources to be parsed).
I need to set up one site to fetch the posts and the other to show them on a template; I don't want to have the RSS plugin on the site that displays the posts.
Is this possible using the same database and the same content, but with different WordPress configurations?
Is there another solution with two different databases where one auto-updates from the other?
Any ideas other than these two?
Thanks
If I understand your problem correctly, the solution will depend on what you can do to resolve this issue.
Firstly, I think you only need one website. The issue you describe sounds like your website is displaying items directly from the many RSS feeds, which will be slow. Instead, I would recommend creating a process that imports the posts on a daily basis and writes them to your database.
This will probably take the form of a cron job on your server.
However, if you don't have access to do this, I would recommend creating a page in the back end that imports the RSS content when you click Import.
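A minimal sketch of that daily import, assuming WP-Cron (the hook name and $feeds list are placeholders, and duplicate detection is left out):

<?php
// Runs on a schedule: pull each feed and write the items in as posts.
add_action('my_rss_import', function () {
    $feeds = ['https://example.com/feed']; // your RSS sources go here
    foreach ($feeds as $url) {
        $rss = fetch_feed($url); // WordPress' built-in SimplePie wrapper
        if (is_wp_error($rss)) {
            continue; // skip feeds that fail to load
        }
        foreach ($rss->get_items(0, 20) as $item) {
            wp_insert_post([
                'post_title'   => $item->get_title(),
                'post_content' => $item->get_content(),
                'post_status'  => 'publish',
            ]);
        }
    }
});

// Schedule the hook once, e.g. on plugin or theme activation.
if (!wp_next_scheduled('my_rss_import')) {
    wp_schedule_event(time(), 'daily', 'my_rss_import');
}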

Database search field in WordPress

I'm looking for a way to create a search box in WordPress where visitors can search for a number in the database. Is this possible? I have several package numbers in my database, and I want to give my visitors the ability to search for their package number and retrieve the information that comes with it.
What you want to do can be done.
I suggest a different approach than using wp-exec. (I just looked at the wp-exec website, and that plugin was created for WordPress 1.5, which means it hasn't been updated in about five years.)
The content you want to display exists entirely outside of WordPress. I suggest you use a custom page template - see
http://codex.wordpress.org/Pages#Creating_Your_Own_Page_Templates
In this case you would not use WordPress posts, pages, or custom post types. On the custom page template you would write (or have written, if you don't have the know-how to do it yourself) PHP code to extract the info from the database and display it on a page.
For pages like that, you would be using WordPress only as a container within which to display the results - the custom page would appear in the site nav, and the page of results would use the site's theme, so it looks like the rest of the site.
But the code to display data from the database would not use the WordPress loop. It would be plain PHP/MySQL data retrieval and display code.
I really doubt you will find a plugin that lets you display results from an external database formatted the way you want them to appear. The reason is that every external database is different, with different tables and table structures, and no two sites will want the external data displayed in the same way. So there is little generalization to encapsulate in a plugin, as everyone wants something different.
I've created pages along these lines on some sites, so I know it can be done. But it requires writing custom code.
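To make that concrete, a bare-bones sketch of such a custom page template (the external database credentials and the packages table/columns are invented for illustration):

<?php
/* Template Name: Package Lookup */
// A search form plus a lookup against an external (non-WordPress)
// database via a second wpdb connection.
get_header();

$number = isset($_GET['package_number']) ? trim($_GET['package_number']) : '';
?>
<form method="get">
    <input type="text" name="package_number" value="<?php echo esc_attr($number); ?>">
    <button type="submit">Search</button>
</form>
<?php
if ($number !== '') {
    $external = new wpdb('db_user', 'db_pass', 'packages_db', 'localhost');
    $row = $external->get_row(
        $external->prepare('SELECT * FROM packages WHERE package_number = %s', $number)
    );
    if ($row) {
        printf('<p>Package %s: %s</p>', esc_html($row->package_number), esc_html($row->status));
    } else {
        echo '<p>No package found with that number.</p>';
    }
}
get_footer();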

Save web data into PostGIS

Is it possible to store data harvested from a website (Nestoria) via their API using PHP?
I am able to extract the data using PHP and display the results in a web browser, but I need to dump or save them into my PostGIS database. (I am using XAMPP and PostGIS on Windows 7.)
Most companies wouldn't have a problem with you doing that - eBay's API, for instance. However, as Mapperz pointed out, Nestoria's terms require that you not compete with them on the originality of the content. So any data that comes from their API and is stored in your database should not be indexable by search engines.
This isn't as difficult to comply with as you might think. You could load the content through an iframe that uses the "NoIndex, NoFollow" meta tag in the HTML of the page being loaded, or pull the content from the database into your page's DOM using AJAX/JavaScript after the page has loaded.
I personally would go with the second option (AJAX).
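As for the saving itself, a minimal sketch of writing one harvested listing into PostGIS from PHP (the table, columns, and credentials are assumptions; the lon/lat fields come from whatever you parsed out of the API response):

<?php
// Insert one listing with a WGS84 (SRID 4326) point geometry.
$db = pg_connect('host=localhost dbname=gis user=postgres password=secret');

$listing = [
    'title'     => 'Example listing',
    'longitude' => -0.1278,
    'latitude'  => 51.5074,
];

pg_query_params(
    $db,
    'INSERT INTO listings (title, geom)
     VALUES ($1, ST_SetSRID(ST_MakePoint($2, $3), 4326))',
    [$listing['title'], $listing['longitude'], $listing['latitude']]
);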
