What is the most efficient way to automatically scrape new news articles from news sources? - screen-scraping

I want to build a news aggregator application. My problem is that I don't know how I should pick up new news articles from news web pages.
I wrote a scraper script in Python which, when run, takes all the news published that day from the source and saves it to a CSV file (I save: URL, title, date, time, image URL, category, content). When I run the script again, it checks against the CSV file whether a URL has already been processed, so it only writes new content and avoids duplicates. At the end I want to write these results to my database.
But with this approach I have to run the script periodically (let's say every 10 minutes) to check whether new content has been published.
Is this the right way to accomplish this?
Is there a better way to listen to news sources so that I pick up new content as soon as it is published?
If this is the way to do it, how can I set the script to run periodically?
I'd greatly appreciate your help.

You might add to your question:
the website address
the Python code you've already written
My suggestion: get the most recent URLs from your DB (say 100-200; the number should be comparable to the number of URLs on the web page you scrape) and check them against the URLs currently present on the page. If new URLs appear, scrape those.
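As a rough illustration of that approach, here is a minimal sketch. It assumes a SQLite database with an articles table that has a url column; the hard-coded URL list stands in for whatever your existing scraper collects from the news page.

    import sqlite3

    DB_PATH = "news.db"      # hypothetical database file
    RECENT_LIMIT = 200       # roughly the number of links shown on the source page

    def get_recent_urls(conn, limit=RECENT_LIMIT):
        """Return the most recently stored article URLs as a set."""
        rows = conn.execute(
            "SELECT url FROM articles ORDER BY id DESC LIMIT ?", (limit,))
        return {row[0] for row in rows}

    def filter_new_urls(conn, current_urls):
        """Keep only the URLs that are not already in the database."""
        seen = get_recent_urls(conn)
        return [u for u in current_urls if u not in seen]

    if __name__ == "__main__":
        conn = sqlite3.connect(DB_PATH)
        conn.execute("CREATE TABLE IF NOT EXISTS articles "
                     "(id INTEGER PRIMARY KEY, url TEXT UNIQUE)")

        # current_urls would come from your existing scraper; it is hard-coded
        # here only so the sketch runs on its own.
        current_urls = ["https://example.com/news/1", "https://example.com/news/2"]

        for url in filter_new_urls(conn, current_urls):
            # here you would scrape the full article and save the whole record
            conn.execute("INSERT OR IGNORE INTO articles (url) VALUES (?)", (url,))

        conn.commit()
        conn.close()

As for running it periodically: use the operating system's scheduler rather than a sleeping loop, e.g. a crontab entry such as */10 * * * * /usr/bin/python3 /path/to/scraper.py on Linux (or Task Scheduler on Windows). Each run then only processes URLs it has not seen before.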

Related

UWP database of music library storage

I want to know how I can achieve the following functionality in a C#/XAML UWP app.
When the app runs for the first time, it scans the whole music library: folders, files and so on.
Then it stores the results in some kind of database in the backend.
So whenever you launch the app again, it doesn't have to scan everything every time; it just runs and already has all the necessary data.
I am guessing that an SQLite database can be used in ApplicationData.LocalFolder,
but is there a better way to save the state and data of all the visual elements of all the pages of my app, so that every time the app launches it appears as if it had just been minimized and then maximized?
Thanks in advance.
Save the metadata of the songs (name, path, etc.) to a database table on first launch. On subsequent launches, compare the song count returned from the music API with the count in the database table. If they are the same, fetch from the database; otherwise you have to fetch from the API. It will be much faster on the second launch.
You have to do the first-launch operation, i.e. looping through the songs collection returned from the API and saving it to the database, in an async Task so that it won't hang the UI.
No, you can't store a StorageFile in the DB; there is no compatible type for it in SQLite. Store the path of the song instead. You actually have to use the path to play the song rather than the StorageFile: MediaPlayer's play method also takes the path as a Uri. Check that.
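The exact UWP/C# calls aren't shown here, but the cache-and-compare pattern itself is simple. A language-agnostic sketch (written in Python with sqlite3; the table and column names are made up):

    import sqlite3

    def cached_count(conn):
        """Number of songs already stored in the local metadata table."""
        return conn.execute("SELECT COUNT(*) FROM songs").fetchone()[0]

    def refresh_cache(conn, songs):
        """Replace the cached metadata (name and path only, never the file itself)."""
        conn.execute("DELETE FROM songs")
        conn.executemany("INSERT INTO songs (name, path) VALUES (?, ?)", songs)
        conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect("library.db")
        conn.execute("CREATE TABLE IF NOT EXISTS songs (name TEXT, path TEXT)")

        # In the real app this list would come from the music-library API;
        # it is hard-coded here only so the sketch runs on its own.
        library = [("Song A", "C:/Music/a.mp3"), ("Song B", "C:/Music/b.mp3")]

        if cached_count(conn) != len(library):
            refresh_cache(conn, library)   # first launch or library changed: rescan
        print(conn.execute("SELECT name, path FROM songs").fetchall())  # later launches read the cache

In the UWP app the same two steps (count comparison, then either a cached read or a full rescan in a background task) would be done with the platform's own SQLite and music-library APIs.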

One database, same content, two WordPress sites, different structure? Or another solution to have 2 websites with the same content?

I have a WordPress site that shows posts that it gets from RSS using a plugin.
I want to split the "jobs" across 2 different websites, because in the current configuration it is not working properly (there is a high number of RSS sources to be parsed).
I need to set up one site to fetch the posts and the other to show them in a template; I don't want to have the RSS plugin on the site that displays the posts.
Is this possible using the same database and the same content but a different WordPress configuration?
Is there another solution with 2 different databases where one auto-updates from the other?
Any idea other than these 2?
Thanks
If I understand your problem correctly, the solution will depend on what you are able to change to resolve this issue.
Firstly, I think you only need one website. The issue you describe sounds like your site is displaying items directly from the many RSS feeds, which will be slow. Instead, I would recommend creating a process that imports the posts on a daily basis and writes them to your database.
This will probably take the form of a cron job on your server.
However, if you don't have access to do this, I would recommend creating pages in the back end that import the RSS content when you click import.
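As a sketch of what such an import step could look like: this is not WordPress/PHP code, just a generic Python example using the third-party feedparser library, with made-up feed URLs and a plain SQLite table standing in for the real posts store. Run from a daily cron job, it fetches each feed once and never inserts the same link twice.

    import sqlite3
    import feedparser   # third-party: pip install feedparser

    FEEDS = ["https://example.com/feed/", "https://example.org/rss.xml"]  # made-up URLs

    def import_feeds(db_path="imported_posts.db"):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS posts "
                     "(link TEXT PRIMARY KEY, title TEXT, published TEXT)")
        for feed_url in FEEDS:
            feed = feedparser.parse(feed_url)
            for entry in feed.entries:
                # INSERT OR IGNORE keeps reruns from creating duplicate posts.
                conn.execute(
                    "INSERT OR IGNORE INTO posts (link, title, published) VALUES (?, ?, ?)",
                    (entry.get("link"), entry.get("title"), entry.get("published", "")))
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        import_feeds()

In a WordPress setup the insertion would of course go through WordPress's own functions rather than a raw table, but the shape of the job (fetch, de-duplicate, store, scheduled by cron) stays the same.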

Getting feeds to update in Managing News 1.2 (Drupal 7 distribution)

I am attempting to create a half-decent substitute for Google Reader but am running into a problem: when I first set up the distribution named in the title, I was able to add a bunch of feeds, which displayed as hoped (most recent posts first).
My assumption was that every time I visited the site, the RSS feeds would update and show any new content. However, the only content displayed is what was new on the day I added the feeds.
How can I address this? I notice that if I add a new feed, all the other feeds update to their newest content, if that helps put my problem in context.
Grateful for any help!
In the Managing News distribution, feeds are fetched when they are created and on cron runs.
So try setting up cron jobs properly.
Also make sure the feeds you are fetching actually have new content (by manually inspecting the feed XML).

arachnode.net webpage table is huge

I have used the arachnode.net crawler to crawl a website. The resulting crawl data has produced a database more than 100 GB in size!
I have looked around the arachnode.net database and found the table "webpages" to be the culprit. When I crawl a website I do not download images, media or anything alike; I only download the HTML code. However, in this case I can see that the HTML pages contain a huge amount of hidden view data and JavaScript.
So I need to do the crawl again, and this time strip out the hidden view data and JavaScript code before saving to the webpages table.
Does anyone have an idea of how to achieve this?
Thanks.
Yes, you can write a plugin which modifies the CrawlRequest.Data and CrawlRequest.DecodedHtml before the data is inserted into the database.
Create a PostRequest CrawlAction as shown here: http://arachnode.net/Content/CreatingPlugins.aspx
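The plugin itself is C# and specific to arachnode.net, but the clean-up it would perform is generic. As a hedged illustration, here is how dropping scripts and hidden form fields (such as ASP.NET __VIEWSTATE) could look in Python with BeautifulSoup; in the actual CrawlAction you would do the equivalent on the .NET side and assign the result back before the insert.

    from bs4 import BeautifulSoup   # third-party: pip install beautifulsoup4

    def strip_bulk(html):
        """Remove script/style blocks and hidden form fields before storing the page."""
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        for hidden in soup.find_all("input", type="hidden"):
            hidden.decompose()      # drops __VIEWSTATE and similar hidden fields
        return str(soup)

    if __name__ == "__main__":
        sample = ('<html><body><input type="hidden" name="__VIEWSTATE" value="...">'
                  '<script>var x = 1;</script><p>Actual content</p></body></html>')
        print(strip_bulk(sample))   # only the <p>Actual content</p> (and its wrappers) remains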

How to auto-generate a webpage after a user submits a form

I am looking for some initial direction on this one because I cannot seem to find my way with it. Let me explain...
I am developing a website wherein a logged-in site member (Joomla 1.6) can fill out a simple form and attach a PDF to be uploaded upon submission. The user then clicks the submit button and the page refreshes to a new and unique web page.
The user submits data on http://www.examplesite.com, and after submission a new web page is generated at http://www.examplesite.com/userSubmittedValue.
This newly generated web page would come from a template specified by the administrator and, most importantly, it would display all of the information that the user submitted. There would also be a link to download the PDF they uploaded. The user could then view a list of all the pages they have created in this manner via their profile.
I have seen this all over, but I am at a loss for how to generate this. Any help is much appreciated.
This is not something you will be able to do easily or get a detailed answer for here. If you just want a submission form with a thank-you page that shows the submitted data, you could use any number of form-wizard-type extensions - http://extensions.joomla.org/extensions/contacts-and-feedback/forms
If you just need a way for users to upload PDFs and have access to them, you could use one of the file-management extensions that offer front-end upload features - http://extensions.joomla.org/extensions/directory-a-documentation/downloads
If the additional data being submitted is simply data related to the file (title, description, etc.), then one of the file-download components should work fine for you. The choices in 1.6 are limited at this time, though, so you might have to go with 1.5 to get the extension that works best for your needs.
So this probably isn't the best way to do it if you're using Joomla, but it just might help.
I would use PHP. Inside your directory, have a file like "template.html". Then I would write some PHP to handle the task of:
Opening "template.html"
Finding the placeholders and replacing them with the values the user passed you
Saving the result under a new name (userSubmittedValue.html)
Again, I never really use Joomla. If you try this, I'd suggest checking out PHP's filesystem functions (http://us2.php.net/manual/en/ref.filesystem.php); a rough sketch of the idea follows below.
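For the open/replace/save idea itself, here is a minimal sketch, shown in Python rather than PHP; the template contents, placeholder names and output location are all made up.

    from pathlib import Path

    def generate_page(template_path, output_dir, slug, values):
        """Fill a template's {{placeholders}} with submitted values and save a new page."""
        html = Path(template_path).read_text(encoding="utf-8")
        for key, value in values.items():
            html = html.replace("{{" + key + "}}", value)
        out_file = Path(output_dir) / f"{slug}.html"
        out_file.write_text(html, encoding="utf-8")
        return out_file

    if __name__ == "__main__":
        # Write a throwaway template so the sketch runs on its own.
        Path("template.html").write_text(
            "<html><body><h1>{{title}}</h1><p>{{description}}</p>"
            "<a href='{{pdf_url}}'>Download PDF</a></body></html>", encoding="utf-8")
        page = generate_page("template.html", ".", "userSubmittedValue",
                             {"title": "My Title", "description": "My text",
                              "pdf_url": "/uploads/my.pdf"})
        print("Generated", page)

A real implementation would also need to escape the user-submitted values before writing them into HTML and keep a per-user record of the generated pages, which the question asks for as well.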
Hope this helps a bit.
