Optimisation: use local files or databases for HTML - database

This follows on from this question where I was getting a few answers assuming I was using files for storing my HTML templates.
Until recently I'd always saved 'compiled' templates as HTML files in a directory (above the web root). My templates usually have two types of variable: 'static' variables which are not replaced on every usage but are used across the site, basically for ease of maintenance if I decide to change the site name, for example; and dynamic variables that change on every page load.
I always used to save these as files on the server - but my friend pointed out something I'd overlooked: why have 5-10 filesystem calls when you can have one database call?
What I want to know is: which is more efficient? Reading several HTML files from the filesystem, or fetching several rows of templates from the database (in one query/call)?

Don't Store Editable HTML In the database
Seriously, because the maintenance overhead for even simple changes becomes exhausting once you realise you can no longer just pop open a text editor.
I've worked on many projects which had HTML content in the database, and it was a constant nightmare of "find the row the content is on", and I really would have liked to shoot whoever designed it that way.
Also, DON'T PREMATURELY OPTIMISE. If you find it's a problem that's slowing the project down, then change it; there's no point making the code drastically less maintainable to save a millisecond. But design the code well enough that, should you need to change where the content comes from later, it is easy to do.
Surely that can be resolved by having a suitable web interface for editing the templates?
Erm, really not, unless you're only trying to compete with Notepad. Syntax highlighting and the whole host of other features you get in a standard editor just make your developers suicidal when they find themselves editing web pages by hacking at an undersized text area in awful white-on-black (not to mention the extra fun of having to worry about entity encoding, etc.; for instance, try editing HTML inside a text area when the HTML content itself contains a textarea element!).
On file IO: while file IO can be a bottleneck, keep in mind that if you have a proper Linux install and plenty of memory, a handy thing known as the "disk cache" takes effect, which keeps recently used files in memory, so file IO becomes little more than a memcpy.
On the contrary, in real stress tests on code I have worked with, the biggest slowdowns have been in the database: primarily the nice slow connect, query parse time, and the extra PHP<->MySQL round trips. You're not really looking at gaining anything. A filesystem lookup is comparable to a database index lookup, and you don't have any unknowns beyond "you need to stream it from disk", and no table-locking to worry about!
You should probably try something like a caching library instead; XCache comes highly recommended, and that's more likely to give you visible performance gains.
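To illustrate the general caching idea (XCache itself is a PHP opcode/variable cache; this is only a rough, hypothetical sketch in Python of keeping compiled templates in process memory, with made-up file names):

import functools

@functools.lru_cache(maxsize=128)
def load_template(name):
    # Read the compiled template from disk once; later calls hit memory.
    with open("templates/" + name + ".html", encoding="utf-8") as fh:
        return fh.read()

def render(name, **context):
    # Only the dynamic placeholders are replaced on every request.
    html = load_template(name)
    for key, value in context.items():
        html = html.replace("{{" + key + "}}", str(value))
    return html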

I pretty much agree with the gist of Kent Fredric's answer. But if you really want to know which is more efficient/faster, you cannot reasonably expect to get the answer here. If you want that answer, there is only one way to get it: profile the application both ways.
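For illustration, a minimal sketch of that kind of measurement in Python (the question is about PHP/MySQL, but the idea carries over; the template file, SQLite database and table here are made up and assumed to already exist):

import sqlite3
import timeit

TEMPLATE_FILE = "templates/header.html"   # hypothetical compiled template
conn = sqlite3.connect("site.db")         # hypothetical DB with a 'templates' table

def from_file():
    with open(TEMPLATE_FILE, encoding="utf-8") as fh:
        return fh.read()

def from_db():
    row = conn.execute(
        "SELECT body FROM templates WHERE name = ?", ("header",)
    ).fetchone()
    return row[0]

# Measure both under conditions resembling the real workload, not just once.
print("file:", timeit.timeit(from_file, number=10000))
print("db:  ", timeit.timeit(from_db, number=10000))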

Related

How to programmatically create a PDF from a predesigned template made in InDesign

The goal is to design beautiful templates in InDesign, which are then used to programmatically generate printable PDFs from a custom application connected to a database, so I can fill data from the database into the templates.
I have no idea how to approach this. I found a lot of HTML to PDF conversion related info, but that approach has its limitations.
Did anybody face the same question and might point me in the right direction?
Yes, the scenario you described can be fully handled by InDesign scripting using ExtendScript. I have done this several times in the past and it works quite well. The key, in my opinion, is to have a designer prepare the file for you as finished as possible and make good use of the built-in InDesign automations. That means they will do the layout, but also set up all the paragraph styles, character styles, object styles and possibly GREP styles, as well as master spreads for each different page.
Then the job of the script that you run will mostly be to fill in the contents and to assign the mentioned styles and master spread as needed. If everything is set up properly, most of the layout should fall into place automatically.
Also, contrary to the comments on your question, I don't think you need InDesign Server for that, especially if you run everything locally anyway.

Recreating KaTeX emulation in La/TeX?

I'm working on a site that is using KaTeX for rendering math. However, the interface for entering the math content is (really) not ideal, so it is actually faster for me to work in an editor like Sublime Text 3 and import the work; however, an issue I run into is that when I import, I discover various functions / environments aren't supported (i.e. emulated) by KaTeX.
If it were just me working on the material, I would simply learn as I go and consult the KaTeX documentation page; however, I have several contractors working on digitizing content who do not have access to the site (and I don't have the ability to give them access), and so cannot learn by trial-and-error. Instead, I end up with piles of documents that all need to be manually adjusted, to render as desired with KaTeX.
As such, I wanted to assemble a preamble for a LaTeX document that would recreate the abilities (i.e. functions and environments) KaTeX can emulate, and was wondering if such a preamble / package already exists? I have tried a few quick searches, but because I'm looking for something that imitates an emulator, I'm finding it tricky to find the right choice of words to get relevant results.
I wasn't sure if this was best posted here or on TeX.SE - I suspect it falls in between the two - so I apologize if my guess was wrong and I should have tried there first. Any suggestions would be very much appreciated, as this is creating a substantial bottleneck in my workflow but is also just outside my ability to solve on my own.
Supported functions is one thing. To tackle that you might actually stand a fair chance of just tokenizing the input, looking for backslash name sequences and checking them against a list extracted from KaTeX sources to see which are supported.
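For what it's worth, a minimal sketch of that check in Python (the whitelist below is a tiny made-up sample; in practice you would extract the real list of supported commands from the KaTeX sources or documentation):

import re

SUPPORTED = {"frac", "sqrt", "alpha", "beta", "sum", "int", "text"}  # sample only

def unsupported_commands(tex_source):
    # Collect every \command token and report the ones not on the whitelist.
    commands = set(re.findall(r"\\([a-zA-Z]+)", tex_source))
    return commands - SUPPORTED

sample = r"\frac{\alpha}{2} + \coolnewmacro{x}"
print(unsupported_commands(sample))  # -> {'coolnewmacro'}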
I guess one could even try to remove all other functions from LaTeX. Or rather hide them, such that user input can't access them but third-party libraries can. Getting rid of language features (as opposed to macros) such as \def would probably be even harder. Better to ask on the TeX Stack Exchange for details if you really want to follow this route.
As an alternative I guess you might be able to perform the check I described above in TeX. Write a macro which reads the current file as plain text instead of TeX source, to perform this analysis. Or some such. But a separate stand-alone tool would be much easier.
If you are going for a separate tool, you might as well write it in JavaScript for Node, and have it run KaTeX on the input. That way you can at least tell whether it will get typeset to something or error out.
Whether the rendering is what you expect from LaTeX may be another question. In general KaTeX aims to reproduce LaTeX behaviour, so any difference might indicate a bug. But bugs exist, so all of this might not avoid the need for checks. How about just processing the math part of the input with KaTeX into some HTML which authors can check without access to the site?
As for existing tools or macro packages, I know of none, but tool or library questions are off topic on stack exchange anyway.

When exactly a site uses its database for retrieval

My knowledge of how a database actually works is close to zero, and I'm trying to understand when exactly a site uses its database to retrieve information. For example, does the site retrieve all the information the moment I load it (so when I choose, say, "funny pictures" it doesn't have to retrieve anything from the database), or does it retrieve information only when I make a specific choice? I hope you understand my question; I'm sorry for my bad English.
It depends on how it has been implemented and with which technology.
Most of the time, it loads a specific set of information relevant only to the current context.
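As a rough illustration (a minimal sketch using Python/Flask; the "pictures" table and category route are made up, and real sites can be built very differently), the database is typically queried per request, for just the data that page needs:

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/category/<name>")
def category(name):
    # The query runs when this URL is requested, not when the whole site "loads":
    # only the rows for the chosen category are fetched.
    conn = sqlite3.connect("site.db")
    rows = conn.execute(
        "SELECT title, url FROM pictures WHERE category = ?", (name,)
    ).fetchall()
    conn.close()
    return jsonify(category=name, pictures=[{"title": t, "url": u} for t, u in rows])

if __name__ == "__main__":
    app.run()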
It depends on the software technology used, the settings, and the site's code. Normally it is set up to fetch something only when needed, but it is possible to change this behaviour.
There is one more possibility you haven't noticed, but which is often used. For example, you have a set of pictures: you see all their small icons on the page, but the full picture only if you click an icon. Then, while you are looking at the icons, or at one of the pictures, the other pictures are downloaded in the background, to be shown quickly when needed. Sometimes even a primitive prediction system is used to guess what you'll look at next.
And all these behaviours work not only for data from databases, but for all data on the pages.
Again, the main point: it is up to the site's developer to choose.

Creating the Front End MDE

I created a database for tracking metrics, with some automation tricks (email, .doc, .ppt presentations, etc.), a very large main table, and lots of forms/GUI. This is the first time I have ever worried about an MDE/front end for the thing. So if you would be so kind as to answer a few questions, or offer any advice, it would be greatly appreciated (I would hate for all this work not to be utilized).
What is the first thing I need to do? It's the 2000 version, which must be converted to 2003 to create the MDE, but does that get done before I use the Database Splitter?
Will the number of objects in the database affect the ability to do this? I have something like 80 forms, 70 queries, 20+ macros, 12 tables, etc., but does the number of objects prevent some of this from working well once the front end is there?
When I split the database, can I continue to work/make changes on the "back end", and have those changes directly affect the front end?
These may be some basic questions, but I don't know the answer so.....Thanks!
Here is my 2¢.
Question 1 - I have never used the Database Splitter, as I feel I have more control doing it manually. If you do it manually you can do it in a version that does not have a Database Splitter. But if you do use the splitter then, yes, you will have to upgrade to a version that has one before doing it.
To do it manually here are the steps.
Backup everything.
Create a copy of your file in the same directory. So if you have MyApp.mdb, create a copy in the same directory with a new name, such as MyAppDATA.mdb.
Open the new DATA file (MyAppDATA.mdb) and delete all of the objects EXCEPT the TABLES.
Open the App file (MyApp.mdb) and delete all of the tables.
Also in MyApp.mdb...go to the File/Get External Data/Link Tables menu to link the tables in MyAppDATA.mdb to MyApp.mdb. Select All and create the links.
That should do it. And if you screw up you made a backup...right?
A couple of tips and gotchas...be sure that you go to Tools/Options and that you are NOT showing System and Hidden tables. You just don't want to delete system tables from MyApp. Another way to do it is do NOT delete tables that start with MSys or USys.
Question 2 - It does not matter how many objects you have. In fact, you don't have that many objects anyway.
Question 3 - Yes...you will make back-end changes in MyAppDATA.mdb, and when you open MyApp.mdb those changes will auto-magically be there to see and query against, etc. (in the query designer you may need to save/close/reopen to see new fields if you made the change while the query was open). The EXCEPTION to that is new tables: you will have to use the File/Get External Data/Link Tables option to create links to new tables.
One thing to remember (and that I hope you already realize) is that one downside of splitting the database is that when you deploy the front-end file, the path to the data will usually vary from machine to machine, and there is no automatic re-linking of tables in Access. If your target clients have full Access you can always use Tools/Database Utilities/Linked Table Manager to refresh the links to the right location. If you can't do that, then you will have to do one of the following:
1. Write code that does the automatic re-linking for you. Basically it will check the links...if invalid it will prompt the user for the data location (or look it up in an INI file) and re-link the tables.
2. Always deploy your app to the same location on all machines. If you have commercial visions for your application this won't work...I mention it for academic reasons. It might be doable for a limited deployment where you have a lot of control over file placement on each machine.
3. Put the data file (MyAppDATA.mdb) onto a network share and link the tables across the network using a drive mapping or a UNC path (\\myserver\mydata\ApplicationData\MyAppData.mdb). The latter is preferred, but both of them run the same risks as number two.
Seth
PS This answer assumes Access 2003.
PPS If you have commercial visions for your application then the table linking has got to be REALLY robust.
PPPS I agree with the commenter that you may want to take the plunge and do SQL if it is in your skill set.
One thing that hasn't been discussed is whether the compile to MDE could fail. Basically, if your code compiles in your front-end MDB, it will convert to an MDE. But I've noticed that lots of people never compile.
Some hints for keeping your VBA code in good shape:
in VBE options, turn off COMPILE ON DEMAND.
add the COMPILE button to your standard VBE toolbar and USE IT OFTEN.
periodically, backup your MDB and decompile/recompile it.
Also, remember that you must keep the MDB source, as the VBA code is not editable in an MDE and not recoverable by any good method.
EDIT:
Steps for a decompile:
backup your MDB.
start an instance of Access with the /decompile command-line argument. For instance, I have a shortcut on my desktop that has this as the target:
"C:\Program Files\Microsoft Office\OFFICE11\MSACCESS.EXE" /decompile
having opened that instance of Access, open the MDB you want to decompile. You will see nothing happen. DO NOTHING FURTHER IN THIS INSTANCE OF ACCESS -- close this instance of Access (the reason for this is that Michael Kaplan, who knows a thing or two about this, recommended that you never do any work in an Access instance opened with the decompile switch because he said there was no guarantee that the Access application code executed under those circumstances in a way that was fully safe for all kinds of Access work).
open the just-decompiled MDB holding down the shift key (you want to be sure that startup routines don't run because that would likely recompile the product before you've finished your cleanup) and compact the MDB (holding down the shift key again).
open the code editor and compile the project (DEBUG -> COMPILE [db name], for those who haven't done step #2 in my original compiling instructions at the top of the post, before the edit).
compact the MDB (doesn't matter if you bypass startup, since it's already fully compiled).
Why so many steps?
Because the purpose of the decompile is to get rid of the compiled p-code in order to start afresh from the canonical VBA code. Following the steps above ensures that you have completely cleared the data pages storing the compiled code before you recompile. The reason for this is that without the compact step after the decompile, under some very rare circumstances, the code can behave strangely. I can't imagine that the old discarded p-code is being used again, but there's something about the pointers between the canonical code and the compiled code that apparently doesn't get completely flushed by a decompile without a compact.
This would be a comment to Seth's answer, but my rep isn't high enough to comment yet.
Seth did a great job answering your questions, I just wanted to add a bit more to part #1 about using the Database Splitter. The Database Splitter in the Tools menu works fine. Doing it manually is alright too, but it's a whole lot faster and easier to use the Database Splitter. I've used it a dozen times and never encountered any issues after using it.
http://www.databasedev.co.uk/split_a_database.html has a decent page about some of the pros, cons of splitting your database.
http://www.accessmvp.com/TWickerath/articles/multiuser.htm also has some good info when dealing with a split database in a multi-user environment.
Seth gave you a very good answer. But I'll add a few comments.
The number of objects only becomes relevant when you get close to about 1,000 forms, reports and modules which have code; there's a limit around there. If you do get that message when trying to make an MDE, then you almost certainly have a code error and need to compile to find the error.
Another resource is "Splitting your app into a front end and back end Tips"
See the Auto FE Updater downloads page to make the process of distributing new FEs relatively painless. The utility also supports Terminal Server/Citrix quite nicely.

How do screen scrapers work? [closed]

I hear about people writing these programs all the time and I know what they do, but how do they actually do it? I'm looking for general concepts.
Technically, screen scraping is any program that grabs the display data of another program and ingests it for its own use.
Quite often, screen scraping refers to a web client that parses the HTML pages of a targeted website to extract formatted data. This is done when a website does not offer an RSS feed or a REST API for accessing the data in a programmatic way.
One example of a library used for this purpose is Hpricot for Ruby, which is one of the better-architected HTML parsers used for screen scraping.
Lots of accurate answers here.
What nobody's said is don't do it!
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
As an example, consider an RSS aggregator, then consider code that gets the same information by working through a normal human-oriented blog interface. Which one breaks when the blogger decides to change their layout?
Of course, sometimes you have no choice :(
In general, a screen scraper is a program that captures output from a server program by mimicking the actions of a person sitting in front of a workstation using a browser or terminal-access program. At certain key points the program interprets the output and then takes an action, or extracts certain pieces of information from the output.
Originally this was done with character/terminal output from mainframes, for extracting data from or updating systems that were archaic or not directly accessible to the end user. In modern terms it usually means parsing the output from an HTTP request to extract data or take some other action. With the advent of web services this sort of thing should have died away, but not all apps provide a nice API to interact with.
A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
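As a rough illustration of the parsing route (a minimal sketch using Python's standard-library html.parser; the markup and the "price" class are made up):

from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    # Collect the text of every <span class="price"> element.
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

scraper = PriceScraper()
scraper.feed('<p>Widget <span class="price">$9.99</span></p>')
print(scraper.prices)  # ['$9.99']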
In the early days of PCs, screen scrapers would emulate a terminal (e.g. an IBM 3270) and pretend to be a user in order to interactively extract and update information on the mainframe. In more recent times, the concept has been applied to any application that provides an interface via web pages.
With the emergence of SOA, screen scraping is a convenient way to service-enable applications that aren't. In those cases, web-page scraping is the more common approach taken.
Here's a tiny bit of screen scraping implemented in Javascript, using jQuery (not a common choice, mind you, since scraping is usually a client-server activity):
// Show my SO reputation score
var repval = $('span.reputation-score:first');
alert('StackOverflow User "' + repval.prev().attr('href').split('/').pop() +
      '" has (' + repval.html() + ') Reputation Points.');
If you run Firebug, copy the above code and paste it into the Console and see it in action right here on this Question page.
If SO changes the DOM structure / element class names / URI path conventions, all bets are off and it may not work any longer - that's the usual risk in screen scraping endeavors where there is no contract/understanding between parties (the scraper and the scrapee [yes I just invented a word]).
Typically, you have an HTML page that contains some data you want. What you do is write a program that will fetch that web page and attempt to extract that data. This can be done with XML parsers, but for simple applications I prefer to use regular expressions to match a specific spot in the HTML and extract the necessary data. Sometimes it can be tricky to create a good regular expression, though, because the surrounding HTML appears multiple times in the document. You always want to match a unique item as close as you can to the data you need.
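For example, a minimal sketch of that approach in Python (the URL and the pattern are made up for illustration; in practice you anchor the expression on something unique near the data you need):

import re
import urllib.request

url = "https://example.com/some-page"            # hypothetical target page
html = urllib.request.urlopen(url).read().decode("utf-8")

# Anchor the expression on a unique landmark (here, the <title> element).
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
if match:
    print(match.group(1).strip())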
Screen scraping is what you do when nobody's provided you with a reasonable machine-readable interface. It's hard to write, and brittle.
Not quite true. I don't think I'm exaggerating when I say that most developers do not have enough experience to write decent APIs. I've worked with screen-scraping companies, and often the APIs are so problematic (ranging from cryptic errors to bad results), and so often lack the full functionality the website provides, that it can be better to screen scrape (web scrape, if you will). The extranet/website portals are used by more customers/brokers than API clients, and are thus better supported. In big companies, changes to extranet portals etc. are infrequent, usually because the portal was originally outsourced and is now just maintained. I'm referring more to screen scraping where the output is tailored, e.g. a flight on a particular route and time, an insurance quote, a shipping quote, etc.
In terms of doing it, it can be as simple as using a web client to pull the page contents into a string and then using a series of regular expressions to extract the information you want.
string pageContents = new WebClient().DownloadString("https://stackoverflow.com");
int numberOfPosts = // regex match
Obviously in a large scale environment you'd be writing more robust code than the above.
A screen scraper downloads the HTML page and pulls out the data of interest, either by searching for known tokens or by parsing it as XML or some such.
That is a cleaner approach than regex... in theory. However, in practice it's not quite as easy, given that most documents need to be normalized to XHTML before you can XPath through them; in the end we found that fine-tuned regular expressions were more practical.
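For comparison, a minimal sketch of the XPath route in Python using the third-party lxml library, whose HTML parser tolerates messy real-world markup (the snippet and the "price" class are made up):

from lxml import html   # requires the third-party lxml package

# lxml repairs tag soup well enough that XPath usually works without a
# separate normalise-to-XHTML step.
doc = html.fromstring('<p>Widget <span class="price">$9.99')   # note the unclosed tags
prices = doc.xpath('//span[@class="price"]/text()')
print(prices)  # ['$9.99']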

Resources