I have a Drupal website with a ton of data on it. However, people can quite easily scrape the site, because Drupal's classes and IDs are pretty consistent.
Is there any way to "scramble" the code to make it harder to use something like PHP Simple HTML DOM Parser to scrape the site?
Are there other techniques that could make scraping the site a little harder?
Am I fighting a lost cause?
I am not sure if "scraping" is the official term, but I am referring to the process by which people write a script that "crawls" a website and parses sections of it in order to extract data and store it in their own database.
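To make it concrete, this is the sort of script I mean; a quick sketch using PHP Simple HTML DOM Parser (the selector is made up, but Drupal's consistent markup makes selectors like it trivial to write):

```php
<?php
// Sketch of the kind of scraper I am worried about, using PHP Simple
// HTML DOM Parser. The selector below is hypothetical, but Drupal's
// predictable "node"/"field-item" classes make ones like it easy to write.
include 'simple_html_dom.php';

$html = file_get_html('http://example.com/some-drupal-page');

foreach ($html->find('div.node div.field-item') as $item) {
    echo trim($item->plaintext), "\n"; // extract each field's text
}
```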
First, I'd recommend searching for "web scraping anti-scrape"; you'll find a number of tools for fighting web scraping.
As for Drupal, there should be some anti-scraping plugins available (again, search around).
You might also be interested in my categorized overview of anti-scraping techniques; it covers options for technical as well as non-technical users.
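To make one of those techniques concrete, here is a minimal per-IP rate-limiting sketch in plain PHP. Everything in it is an illustrative assumption (the 100-requests-per-minute threshold, the APCu cache); on a real Drupal site you would hook this into hook_boot() or use an existing flood-control module rather than rolling your own:

```php
<?php
// Minimal per-IP rate limiter: count requests per source address in a
// 60-second window and refuse clearly bot-like traffic. The threshold
// and the APCu backend are illustrative assumptions only.
$key = 'hits_' . $_SERVER['REMOTE_ADDR'];

$hits = apcu_fetch($key);
if ($hits === false) {
    apcu_store($key, 1, 60);          // first request in this window
} elseif ($hits >= 100) {             // ~100 req/min: likely a scraper
    header('HTTP/1.1 429 Too Many Requests');
    exit;
} else {
    apcu_inc($key);
}
```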
I am not sure, but I think it is quite easy to crawl a website where all content is public, no matter whether the markup is predictable or not. You should take into account that if a human can read your Drupal site, a script can too.
Depending on your site's nature, if you don't want your content harvested by others, you should consider restricting it to registered users. Otherwise, I think you are fighting a lost cause.
I was a little startled to learn that Nancy has its own Razor implementation, which may or may not behave like Razor. In practice, does this cause issues? What are "most people" using for a Nancy view engine? Why was the real Razor not used?
First, the easy answer: the Razor engine is, by far, the most downloaded view engine available for Nancy (https://www.nuget.org/packages?q=nancy.viewengines).
Now for the more complicated questions.
Why was the real Razor not used?
Because the "real" Razor engine (and by real I'm going to assume you mean the one used by the ASP.NET stack) is tied to the HTTP abstractions built into the ASP.NET stack (HttpContext and all its friends), so there is no straightforward way to use it with Nancy.
The slightly longer answer is that you have to understand that Razor is really a parser, and a Razor view engine is what sits between the consumer and the parser.
Nancy uses the Razor parser, but we have to have our own view engine because that is what enables Nancy to parse and execute Razor templates.
Now, it does get even more complicated. Many of the features you see in the ASP.NET Razor view engines, such as master pages, partials, various helpers, _ViewStart and so on, are not features of Razor (the parser) but an additional feature set built into the view engine (you can almost think of it as middleware).
This means that for our engine we've had to re-implement many of these features, because that's what has come to be expected from a Razor view engine.
I would like to point out that, if it were possible, we would love to ditch our own implementation and use the one built by Microsoft (less code for us to maintain, and it would mean we'd support 100% the same feature set), but unfortunately that's not our decision to make: we can't take a dependency on their abstractions, I'm afraid.
Hope this clears things up
/A
We have been using the Razor implementation from Nancy for a while now. We have run into a couple of issues that are making us either switch to SSVE or abandon Nancy (we do really love Nancy).
The first issue with Razor is that you cannot precompile views like you can in MVC, which leads to much longer startup times. We have had many complaints about this.
The second issue is that there seems to be a long-standing bug in Nancy's Razor implementation that can only be resolved by recycling the application pool. I'm not an expert, but it seems that when the project is loaded, a temporary DLL is compiled and generated on the fly (which explains the slower load times), and sometimes something goes wrong that leaves the view instance unable to be created. It seems to happen at this location: https://github.com/NancyFx/Nancy/blob/master/src/Nancy.ViewEngines.Razor/RazorViewEngine.cs#L238. Basically, viewAssembly.GetType("RazorOutput.RazorView") is NULL at various times, which causes an error message to be displayed on every page, for every user, at all times, and the only way to fix it is to reload the application (recycle the application pool).
Just my two cents, and I know this post is older, but maybe others will run into some of the problems we have. I've opened a GitHub issue, but the bug is hard for us to reproduce and it hasn't gone anywhere.
I need a sitemap that can help both people and Google discover pages.
I've tried WebSphinx application.
I've realized that if I put wikipedia.org as the starting URL, it will not crawl further.
So how do I actually crawl all of Wikipedia? Can anyone give me some guidelines? Do I need to specifically go and find those URLs and put in multiple starting URLs?
Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?
Crawling Wikipedia is a bad idea. It is hundreds of TBs of data uncompressed. I would suggest "offline crawling" instead, using the various dumps provided by Wikipedia; you can find them at https://dumps.wikimedia.org/.
You can create a sitemap for Wikipedia using the page metadata, external links, interwiki links, and redirects databases, to name a few.
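As a rough sketch of what that offline approach looks like in PHP (the file name is just whatever dump you downloaded; XMLReader streams it rather than loading hundreds of GBs into memory):

```php
<?php
// Stream a pages-articles dump from dumps.wikimedia.org with XMLReader
// and print page titles, which could then feed a sitemap. The
// compress.bzip2:// wrapper lets you read the .bz2 dump directly.
$reader = new XMLReader();
$reader->open('compress.bzip2://enwiki-latest-pages-articles.xml.bz2');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        echo $reader->readString(), "\n"; // one <title> per <page> in the dump
    }
}
$reader->close();
```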
Is there a way to have one product definition and have it publish to multiple sites? I am looking for this ability specifically in DNN or Umbraco, with either free or paid extensions. I installed both platforms, played with the free extensions, and looked for any extension offering such functionality, but did not find one. Any links or pointers are highly appreciated!
I had looked for this info in many places before reaching out to the expert pool here, hoping to get some hints.
In Umbraco there is the built-in /base extension (http://our.umbraco.org/wiki/reference/umbraco-base), which enables you to access product data maintained in Umbraco from other websites. Base is REST-ish, and the implementation is well documented: you can access the data as XML or JSON (see "Returning Json instead of XML with Umbraco Base").
Also, as the implementation is REST-ish, the other websites that consume the content maintained in the core site could be written in anything that can consume a REST feed, e.g. HTML and JavaScript.
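For example, a consuming site written in PHP might look like this. To be clear, the /base URL path and the field names here are hypothetical; they depend entirely on how the extension is registered in the core site:

```php
<?php
// Hypothetical consuming site: fetch product data from the core
// Umbraco site's /base extension as JSON. The URL path and field
// names are assumptions; use whatever the registered extension exposes.
$json = file_get_contents('http://core-site.example.com/base/ProductService/GetAll.aspx');
$products = json_decode($json, true);

foreach ($products as $product) {
    printf("%s: %s\n", $product['name'], $product['price']);
}
```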
It's not 100% clear to me what setup you're after, but if you're looking to set up a traditional authoring/delivery configuration, one of the few paid offerings Umbraco has is called Courier. It's a very reasonably priced (~$135 USD / €99) deployment manager that handles syncing content between two sites, i.e., an authoring server and a delivery server.
It's a very smart tool that manages content, configuration, and dependencies. It's neat and also supports a great open-source project!
If you're looking to set up something more like a centralized product database used by many sites, amelvin's pointer to /base is a good one. It has a nice API, and you can also set up your own web service (beyond the built-in web service functionality!).
If you need this centralized product data to notify the other sites to update their caches, I encourage you to look into the 'distributedCall' functionality.
There's a bit of documentation on distributed calls in this load-balancing tutorial that may help you understand the concept a bit better.
Hope this helps get you pointed in the right direction.
From Joomla 1.6 onwards I have been using the Joomla profile plugin to manage my users' profile data.
I have been wondering for quite a while about the pros and cons of such a table, where the data is stored as rows rather than in dedicated fields.
The pro is definitely that I can add new profile fields very easily.
The con is searching: how do you query the information when different field types are all stored in a text column? For example, the user's date of birth is stored in a text field.
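For example (a rough sketch, assuming the standard #__user_profiles table the core profile plugin uses, with user_id/profile_key/profile_value columns), a date-of-birth search ends up looking something like this:

```php
<?php
// Why text-column storage makes searching painful: a DOB query needs
// string-to-date casting. The plugin JSON-encodes stored values, hence
// stripping the surrounding quotes before casting (an assumption about
// the storage format; check your own table's contents first).
$db = JFactory::getDbo();
$db->setQuery("
    SELECT user_id
    FROM #__user_profiles
    WHERE profile_key = 'profile.dob'
      AND STR_TO_DATE(TRIM(BOTH '\"' FROM profile_value), '%Y-%m-%d') < '1990-01-01'
");
$userIds = $db->loadColumn();
```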
Perhaps this question is more database than joomla related?
But it boils down to - should I be using the joomla profile table for large numbers of users?
Thanks,
Mat
It really depends on the project you are working on.
If you use a plugin to get the work done, the cons are:
You will have to spend some time getting to know the plugin.
It will take more time to make changes, since you don't know the internal structure of the plugin.
The pros are:
The plugin does everything for you, so development is faster.
It is most probably error-free and well tested.
In my opinion, if you have a lot of data manipulation to do and the plugin does exactly what you want, use the plugin. That's the best part of Joomla: faster development. If you have any problems, please ask.
You can use the Joomla extended profile if you want to, or you can use Community Builder which also extends user profiles.
To be completely honest, Community Builder might be better for lots of users, as you might want to install other extensions in the future, such as the Kunena forum, which it integrates with fully, along with lots of other extensions.
I have a website made with CakePHP 1.3.7. This website has its own login system. Now the client wants to include a forum in the website.
I've been looking at different free solutions, and phpBB and SMF seem to be what I'm looking for. The only thing I'm not so sure about is integrating those forums with the login system that I already have.
I mean, if a user has already an account for the website (or creates a new one), he/she should be able to use that same account (username) in the forum section.
Is that possible? Any clue pointing me in the right direction would be much appreciated! I mentioned both forum solutions in case one is easier to integrate than the other one, that would be also good to know (or if there's any other better option).
Thanks so much in advance!
It's possible to use both, but I personally prefer SMF. You have to configure CakePHP's session component to use database sessions and create a model that uses the forum's session table.
You can decide whether you want or need a separate users table besides the forum's users table (it may be called members; I don't recall right now).
The "hard" part is making the Cake app read and write the sessions and cookies in the same fashion SMF does, to allow a smooth transition from the Cake app to the forum and back.
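As a rough, untested sketch of the first two steps (the smf_ table prefix is an assumption from a default install, and note that SMF's sessions table has session_id/last_update/data columns, which do not match Cake's default id/data/expires schema, so a custom session handler is what you would really end up writing):

```php
<?php
// app/config/core.php: point CakePHP 1.3 at database sessions.
// NOTE: because SMF's sessions table (session_id, last_update, data)
// does not match Cake's expected schema (id, data, expires), in
// practice you would plug in a custom session handler here rather
// than relying on the built-in 'database' option as-is.
Configure::write('Session.save', 'database');
Configure::write('Session.database', 'default');
Configure::write('Session.table', 'smf_sessions'); // assumed default SMF prefix

// app/models/forum_session.php: a model over the forum's session
// storage so the Cake side can read what SMF wrote.
class ForumSession extends AppModel {
    public $useTable   = 'smf_sessions';
    public $primaryKey = 'session_id';
}
```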
Technically you can use either forum and achieve your goal with both; it's just a matter of wiring the frameworks' components together correctly.
I ended up using: this
It has all that I needed and integrates perfectly into Cake :)