What programming language should I use if I want to scrape an RSS feed? - screen-scraping

I wasn't sure if one was better to use than another, e.g. Java, PHP, or Perl.

The best one is the one you are most comfortable working with.

It doesn't really matter, as long as you are using the right tools to do the job.
You need to consider where you are deploying your application (web versus desktop), the time you want to spend learning a new technology/language, and availability of libraries for parsing RSS and/or XML and/or HTML. The three languages that you named are all good candidates, though.

RSS files are just formatted XML that you obtain over the internet. All you need from a language is the ability to make an HTTP request and parse the XML.
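As a rough sketch of how little that takes in PHP (the feed URL is a placeholder, and this assumes `allow_url_fopen` is enabled so `simplexml_load_file` can fetch over HTTP):

    <?php
    // Fetch and parse an RSS 2.0 feed in one step:
    // simplexml_load_file() makes the HTTP request and parses the XML.
    $feedUrl = 'https://example.com/feed.xml'; // placeholder URL

    $rss = simplexml_load_file($feedUrl);
    if ($rss === false) {
        exit("Could not load or parse the feed.\n");
    }

    // Walk the channel items and print each title and link.
    foreach ($rss->channel->item as $item) {
        echo $item->title, "\n";
        echo $item->link, "\n\n";
    }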

The framework code can be in anything, but consider using XSL transforms (or XPath queries) to get the XML into a more palatable format, especially if you're looking for small subsets of the data or individual values.
It's hardly "scraping" if the source data was meant to be machine-parsed in the first place. :)
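To illustrate the XPath approach, here is a small PHP sketch (same placeholder feed URL as above, standard RSS 2.0 layout assumed) that pulls out just the values you care about instead of walking the whole tree:

    <?php
    // Use XPath to grab a small subset of the feed:
    // the titles of the five most recent items.
    $rss = simplexml_load_file('https://example.com/feed.xml'); // placeholder URL
    if ($rss === false) {
        exit("Could not load the feed.\n");
    }

    $titles = $rss->xpath('/rss/channel/item[position() <= 5]/title');
    foreach ($titles as $title) {
        echo (string) $title, "\n";
    }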

If you are stronger with one particular technology and you have a deadline (or other constraints), then go with that technology, as all three are capable of the job.
If that is not the case, then it comes down to the requirements of the project you are undertaking and whether you want to (and are able to) learn a new technology.
PHP is the most naturally web-based technology of the three. You can use a library like the Simple HTML DOM Parser (it supports XML as well) to get quick results, and PHP will also support you if you need to delve deeper into the complexities of web scraping.
Java has a nice project called Web Harvest which I have used in the past with good results (although you have to learn a non-standard XML syntax, it's similar to XSLT), and once your system is set up your web scraping can be easily modified.
Perl is the strongest when it comes to regex (Java and especially PHP can become a bit messy when working with regex, I find), and regex is a nice skill to have, so depending on what you want to do with your information this is a reasonable option as well.

If you are writing a server application that needs to run often and aggregate content across a large number of sites then performance should be a significant criteria for you. This would mean a language capable of processing a large volume of data quickly.
If you just need a program that runs occasionally and picks out bits of data from many pages, then you could consider a specialized language. The product TestPlan offers a very simple language that would let you grab RSS content quickly and expose it in a simple fashion.
I've used it in some significant scraping projects. While not blazingly fast, the scripts are extremely easy to maintain.

Related

Looking for advice on a Visual Studio/VS Code language for non-developer

In short, I am looking to create a POC (proof of concept) for an interface I have designed for aggregating data from several different sources. Currently I am just using flat tables in SQL with no relationships (although there are fields that are in the data for that purpose) to store the data, which is being gathered by various means (mostly PowerShell scripts). I also want it to be modular enough so I can add hooks to REST calls to other sources. So, if I make a call to get data about a certain asset, I don't have to know where that data is at that level. The middleware will know where to find it and use the appropriate method to get it (SQL, REST, etc.).
I have been in IT for over 31 years. I have experience with SQL (and have been a professional DBA at one point) and a few scripting languages, and I wrote a similar app in C# at one point, but I was given a template interface from a professional developer and I just ran with that and changed it as I went.
I want a high level view with colors indicating general health of a location, and the ability to search the data for assets (this could be anything from a VM, BM server, ethernet switches, FC switches, connected SAN, hypervisors, vCenters, Nutanix clusters, etc.) to find their location, health and most importantly, their connections to other assets.
I know a little about everything: enough to be dangerous in most of it, and a master of some. What direction should I go here? I know BASIC, and PowerShell well enough to call myself an expert, but more than likely I'm just a hacker. I know the basic concepts of coding well enough (I was an adjunct at a CC 20+ years ago) to teach them to others. I want to use a language I will be able to hand over to a professional to maintain once we have a new DevOps member on the team.
EDIT: I wrote a very small program earlier this year using React. That was broken into the interface and the 'middleware' as two separate projects. Java is not an intuitive language for me and I could go this direction again, but I am looking for possible alternatives.
I'm not sure there's a 'non-developer' way of doing this, unfortunately. If you really want to do this yourself, I'd suggest a C# API since you have some experience in that (even if it's minimal, it's still comfort), and luckily for you they just released minimal APIs, which should make your life much easier (https://dotnetcoretutorials.com/2021/07/16/building-minimal-apis-in-net-6/).
In terms of actually displaying the data, if it's a really simple UI I'd keep it simple and use JavaScript and HTML/CSS (i.e. no frameworks such as React). It's a POC, and if someone comes in and wants to move over to using a framework because there's more work to be done on it, it's not that hard to take your existing code and move it into something more modern. But if it's small, a framework would be over-engineering this, especially if you need to learn it all.
For smaller projects like this, you can kind of learn the code you need for the project without needing to truly learn to code. Get your components from a library like Bootstrap (it's basically premade elements like dropdowns, progress bars, etc., and their documentation is pretty good), and keep it as simple as possible. Good luck!

Ideal database for a minimalist blog engine

So I'm designing this blog engine and I'm trying to keep just my blog data, without considering comments, a membership system, or any other type of multi-user data.
The blog itself is built around 2 types of data. The first is the actual blog post entry, which consists of: title, post body, and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second type of data is the blog admin configuration and personal information. Comments and similar features will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with traffic spikes (I know you might argue this, but let's take it for granted). Since I've started this project I've been moving along well with the rest of my stack, except the data layer. I've been having this dilemma choosing the database: I've considered MongoDB, but some reviews and articles/benchmarks suggested slow reads once collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF; while Redis is good at both fast reading and writing, I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents", etc.
So is there any way I can settle this issue for good, considering that I only need to represent my data in a key/value structure, only require fast reading (not writing), and need it to be fault tolerant?
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of the blog posts you read about the downsides of a NoSQL solution are about large data sets, or about people who are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and meet your requirements, but still be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. It's worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores (see the sketch below).
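A minimal sketch of that abstraction in PHP (PHP 8 syntax; the interface, class, and method names are made up for illustration, not taken from any framework):

    <?php
    // The blog code talks to this interface only, so swapping MongoDB for
    // Redis (or plain files) just means writing another adapter.
    interface PostStore
    {
        public function save(string $slug, array $post): void;
        public function find(string $slug): ?array;
    }

    // One possible adapter: JSON documents stored as flat files.
    class FilePostStore implements PostStore
    {
        public function __construct(private string $dir)
        {
        }

        public function save(string $slug, array $post): void
        {
            file_put_contents("{$this->dir}/{$slug}.json", json_encode($post));
        }

        public function find(string $slug): ?array
        {
            $path = "{$this->dir}/{$slug}.json";
            return is_file($path)
                ? json_decode(file_get_contents($path), true)
                : null;
        }
    }

Swapping in Mongo, Redis, or Couchbase later is then just a matter of writing a second class that implements the same two methods.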
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
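By way of illustration, the static-file idea can be as small as a loop that renders every post to HTML at publish time (a rough PHP sketch; the $posts array and the output directory are placeholders, and the directory must already exist):

    <?php
    // Render every post to a static HTML file once, at publish time.
    // The web server then serves plain files; no database is touched per request.
    $posts = [
        ['slug' => 'hello-world', 'title' => 'Hello world', 'body' => 'First post.'],
    ]; // placeholder data; in practice this comes from wherever you author posts

    foreach ($posts as $post) {
        $html = sprintf(
            "<html><head><title>%s</title></head><body><h1>%s</h1><p>%s</p></body></html>",
            htmlspecialchars($post['title']),
            htmlspecialchars($post['title']),
            htmlspecialchars($post['body'])
        );
        file_put_contents(__DIR__ . '/public/' . $post['slug'] . '.html', $html);
    }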

Product management system

We are looking for a system that is able to do the following:
Browse through records containing plain text data (names/IDs/suppliers/links, etc.), but also store PDFs that go along with this information.
The system should preferably be server/web based so it is accessible over the intranet.
Any ideas?
Thanks
There are usually two opposite ways to solve IT tasks: take something that fits your requirements, or build it yourself. Your task doesn't sound very complicated, so I think one solution could be to set it up yourself. (Or ask somebody to do so.)
I would prefer Perl as the programming language, the Dancer web framework, and maybe MongoDB to store your data. You can easily run the app with Starman and use Apache in front of it as a proxy.
This environment is easy to expand to your needs and would fit your requirements exactly, nothing more, nothing less.

What are Advantages to Content Repositories (not talking about CMS's)

Given that a lot of people use content repositories, there must be a good reason. I'm building out a new web application that will need to store content. Can someone help me understand this?
What are the advantages of using a content repository like Apache Jackrabbit, as opposed to writing your own code/API to store images or text pages? Writing your own takes time, but so does implementing and learning a new framework like the content repository API. A benefit of rolling your own seems to me that you know your code and have immediate expertise if you need to enhance or fix it. Using another framework you need to learn its foibles, and it is always easier to modify code you know than code you don't; i.e. you don't know the underlying framework code as well as your own.
As I said a lot of people use them. There must be a reason. I can't see it as being just another "everyone is using them so, so should we." At least I hope it isn't that. :)
A JCR repository allows you to store all your content (from structured database-type data to large multimedia files) in a single place and with a single API, which is extremely convenient and makes your code simpler, avoiding the impedance mismatch between files and data that you usually have in content-based systems.
JCR also provides a lot of infrastructure functionality that you won't have to build or assemble yourself: search (including full-text), observation (callbacks when something changes), versioning, data types including multi-value, ordered nodes, etc.
If you allow a shameless plug, my "JCR - best of both worlds" article at http://java.dzone.com/articles/java-content-repository-best describes this in more detail and also provides a reading list for the JCR spec, which should allow you to get a good overview without reading the whole thing.
The article uses Apache Sling for its examples, which combined with a JCR repository provides a very nice (IMO, but as a Sling committer I'm biased ;-) platform for content-based applications.
My most recent projects have involved both choices: a custom-built data store (MySQL and image files) with a multi-level caching mechanism, and a JCR-based commercial repository.
A few thoughts:
In the short run, a DIY solution offers reduced complexity: you only have to build and learn what you need. And there is at least the opportunity to optimize the data store for your particular application's needs -- more than likely speed of retrieval, but possibly storage footprint, security, or reliability concerns are foremost for you.
However, in the long run, you're looking at a significant increment of work to extend the home-grown system to a new content type (video, e.g.) or to provide new functionality (maybe versioning).
Also, it's difficult to separate the choice of a data store approach from the choice of tools that content providers will use to populate and maintain the data store. You'll have to give your authors something more than an HTML form with a textarea and a submit button.
This is related to the advantages of standardization: compatibility and interchangeability. If everybody writes his own library and API, there is no compatibility and interchangeability, leading to higher cost.

I know the big picture but can't put it in place

I'm interested in web development, and by that I mean the bigger projects like Facebook or Twitter. I know the basics of Java, CSS, PHP and MySQL. I know there is a lot more out there; I read about it, but I don't know what the purpose of it all is or how to put it in place.
Things like: Scribe, Thrift, Cassandra, Unix/Linux, shell/Perl/Python scripting, PostgreSQL, MongoDB, non-relational NoSQL datastores, the JVM, nginx.
I want to know why they need these things, how they use them, and what the purpose is.
What I need is a book like "The Technical Background of Facebook for Dummies", or something along those lines.
Are there any books or websites that explain this from scratch?
Thank you!
EDIT:
Thank you for your answers! You have been very helpful. I was under the assumption that experienced programmers know almost everything about the technology that's in use today. But as I read, you can only know so much, and I need to figure out which technology to use. I'll take the encouragement to start building small, and will pick up PHP and improve my skills from there.
Thanks again!
http://highscalability.com/
This is one of the best sites out there. There are several case studies describing what many websites use and why, along with pointers to further references. I would also look at the Google Scalability Conference 2007 talks:
http://www.google.com/search?q=Google+Scalability&hl=en&client=firefox-a&hs=YUg&rls=org.mozilla:en-US:official&prmd=v&source=univ&tbs=vid:1&tbo=u&ei=fl4OTPUkorIwueCQxQw&sa=X&oi=video_result_group&ct=title&resnum=4&ved=0CDIQqwQwAw
It's all about choosing the right tool for the job in my eyes. There is so much technology out there it's impossible to learn it all. Just choose the subset that will work for you.
The best place to start is by building small, simple websites, and as you come across problems that need solving, you research the tools needed to solve them.
If you attack all of the areas at once, it's going to be overwhelming and you will not get anywhere.
For a general overview on what each of the technologies does, Wikipedia gives a good overview on most technologies.
If you are interested in database content which it seems like you are, a good place to start is reading up on normalisation.
Scribe, Thrift, Cassandra, Unix/Linux, shell/Perl/Python scripting, PostgreSQL, MongoDB, non-relational NoSQL datastores, JVM, nginx
Those I would search for on Wikipedia to get a quick overview. Facebook is written in PHP/MySQL. There are some books on the subject of creating social networking sites, and some have gotten decent reviews on Amazon.com; however, I have not read any of them myself.
If I were you, I'd start with PHP/MySQL and sit down and write a simple social network. Break the project down into components and tasks, and Google each challenge you encounter, such as sessions, database structure, security, friend structure, and processing POST and GET requests.
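As a rough idea of one such building block, here is a hedged PHP sketch of handling a POSTed status update with a session and a prepared statement (the database name, credentials, table, and columns are made up for illustration):

    <?php
    // Handle a POSTed status update: session + prepared INSERT, then redirect.
    session_start();

    if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['status'])) {
        // Connection details and schema are placeholders.
        $pdo = new PDO('mysql:host=localhost;dbname=mynetwork', 'user', 'secret');
        $stmt = $pdo->prepare('INSERT INTO posts (user_id, body) VALUES (?, ?)');
        $stmt->execute([$_SESSION['user_id'] ?? 0, $_POST['status']]);

        header('Location: /'); // redirect after POST so a refresh doesn't repost
        exit;
    }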
You'll learn a lot and you get the big picture. Once you see the big picture, you can take another look at different technologies that are available and then decide which component you could have developed better with other tools. I personally don't think that looking too much into the technology available is good for someone who is still in the beginning stages. Start doing, learn from it, and then your questions become much more specific and a lot of things will make more sense.
The problem you're having is you're looking at smaller, specialty products, and not at larger, more mature technologies. Wikipedia will actually give you a decent overview of most of the medium-and-large projects out there.
Cassandra, Hadoop, Mongo, and NoSQL are all lovely... but they're specialty tools. SQL is a general purpose solution that works for 99% of the sites on the net.
Unix/Linux isn't a specialty tool; you might want to try going to Ubuntu's website and installing Linux, and just using it day-to-day the way you'd use Windows. When you need to figure out something new, like setting up a webserver, do it on the Linux box and a Windows box, and you'll eventually learn Linux pretty darn well.
As far as scripting, O'Reilly makes a great line of books on Bash, Perl, and Python.
The JVM is the Java Virtual Machine, which is the core of running Java code. Sun's website has a great set of tutorials for learning Java.
It might be much, much easier to pick a project (or three) that you'd like to learn, and learn some of these by doing. I'd probably suggest learning some SQL before learning the newly established alternatives; that lets you learn the rest of the system, as SQL is pretty easy. Once you've got the rest of the thing solid, try swapping in a NoSQL solution at that point.
There are a lot of frameworks that do a lot of different things. You've named a lot of different things from a lot of different areas. The best way to think of these things is to group them by category. Here's an example:
Suppose you have a laptop and you want to host a website. You'll need the following at a minimum:
1) Web Server software. Two popular options are Microsoft's IIS and Apache Web Server.
That's really all you need. You can set up your www_root folder and load files into it. Assuming everything is configured properly, you can now load HTML pages into that folder and access them through your IP address. Every page you view in your web browser is in HTML format. CSS is a stylesheet language that defines how your HTML will be formatted. You can also start writing Javascript, as most modern browsers support the client-side scripting language.
Chances are you'll want the following as well:
2) Database software. Two popular options are Microsoft's SQL Server and MySQL
3) Server-side scripting. PHP is very popular, as is ASP. You'll need the runtime deployed on your server. Python, Ruby, Perl, etc all fall under this category.
4) Web Application Framework(s). This will provide you with libraries for your language of choice to help develop web applications and websites. CakePHP, Ruby on Rails, and the Google Web Toolkit are examples of web application frameworks.
Additionally, you may want to utilize:
5) Additional libraries. jQuery, for example, is quickly becoming a popular library for JavaScript that handles a lot of common tasks for you. Instead of writing complex effects code and what-not yourself, just use the pre-written code in the jQuery library.
6) Data interchange technology. If you are passing a lot of information back and forth, you will likely want to encapsulate this data in a logical format. Ideally, this format would describe the data and allow your applications to easily read/process it following a standard. This is where XML and JSON come into play.
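For instance, a hedged PHP sketch of the JSON side of that exchange (the record and its fields are invented for illustration):

    <?php
    // Encode a record as JSON for the browser, then decode it again on the way back.
    $post = ['id' => 42, 'title' => 'Hello world', 'published' => '2010-06-08'];

    $payload = json_encode($post);          // e.g. {"id":42,"title":"Hello world",...}
    $decoded = json_decode($payload, true); // back to a PHP associative array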
I can't recommend a good book for you to learn this stuff, but I feel that the collective replies to your question here should be more than enough to get you started.
Ultimately, what you need to do is determine what technologies you need, and then choose the right one for the job. Don't go building an application using Ruby on Rails just because it's what Twitter used, but rather choose it because it provides some advantage to you over the other options.
