Google App Engine PDF converter - google-app-engine

I'm looking for a good, open source, PDF generator/library that will convert html (with styling etc.) into a PDF file.
Requirements:
Must be Java or Python and run on Google App Engine.
Must be Free, open-source.
Must be easy to use/consume.
Yes, I have tried searching for this myself. I've tried many "solutions" that I've found on Google and elsewhere, but none has satisfied me yet; many seem incomplete, buggy, or don't work well on GAE. So I figured I would appeal to the Stack Overflow community for opinions or suggestions.

For HTML/image to PDF I use the Python library http://www.xhtml2pdf.com/, which uses pisa, ReportLab, pyPdf, and html5lib, running on GAE. I have been using it to generate very nice article PDFs with embedded images, and once I figured out how to get the page size correct I found it to be a very good library.
You will need the xhtml2pdf library and its dependencies:
https://github.com/chrisglass/xhtml2pdf
I threw together some example Python code and put it in this pastebin:
http://pastebin.com/FFEZjNs3
The pdf_data you get at the end is the binary PDF file data. The html_data you give to pisa is really any string containing an HTML document.
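In outline the conversion step looks roughly like this. A minimal sketch, assuming the current xhtml2pdf API in which pisa.CreatePDF renders into a file-like object:

    from io import BytesIO
    from xhtml2pdf import pisa

    def html_to_pdf(html_data):
        """Render an HTML string to PDF bytes; return None on error."""
        buf = BytesIO()
        status = pisa.CreatePDF(html_data, dest=buf)
        if status.err:
            return None
        return buf.getvalue()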
There are some recommended things to include in your HTML to get well-formatted PDF output. Here is an example HTML document similar to the base template I use. Note the author meta field and the @page CSS:
http://pastebin.com/q1wRm9nJ
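The skeleton of such a template looks something like this (a sketch; the page size and margins here are assumptions), written as the html_data string from above:

    html_data = """
    <html>
    <head>
      <meta name="author" content="Your Name" />
      <style>
        @page {
          size: letter portrait;  /* assumption; a4 etc. also works */
          margin: 2cm;
        }
      </style>
    </head>
    <body>
      <h1>Article title</h1>
      <p>Article body with embedded images...</p>
    </body>
    </html>
    """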
Here are the docs about the compatible CSS and HTML:
https://github.com/chrisglass/xhtml2pdf/blob/master/doc/usage.rst#supported-css-properties
You can include images using either the URL of an external image or a data URI; xhtml2pdf has a function for creating these, pisa.makeDataURI().
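For example, something like the following; the function name comes from the answer above, but the exact call shown here, including the argument, is an assumption:

    # pisa.makeDataURI is named above; the argument style is an assumption.
    data_uri = pisa.makeDataURI("static/logo.png")
    img_tag = '<img src="%s" />' % data_uri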
Hopefully that helps.

Related

Using Google Sheets to create a .xlsx file from SQL data

I am trying to create a .xlsx file with data I retrieve from MySQL on a Node.js server that serves an AngularJS project, but after hours of trying to find something via npm or Google I have almost given up!
The two main problems I have are:
My data is in Hebrew (i.e. RTL styling + different characters).
The Excel file that I export needs to be styled in a specific way, and it is a pain trying to style an Excel file programmatically.
And then I had an idea!
What if I could create a Google Sheets doc in my Google Drive as a template, including the styling, and then, when the user clicks to create a new doc, just duplicate this template and change the values to the new data?
But just trying to understand the Google API is a headache on its own; apparently there are three different APIs: Drive, Sheets, and auth.
So my question is as follows:
Is my idea valid? Does anyone think it could work?
Where would I start? Is there some guide or npm package that would help?
Please don't comment telling me to look in the docs; I am having a hard time understanding where to start from there.
I would suggest creating the template file locally instead of opting for Google Sheets.
There is a decent module I used some time back which does styling pretty well; it's called exceljs.
There is always the xlsx module too, which is very powerful but difficult to use.
Also, if you end up using Google Sheets, I would suggest giving the node-google-spreadsheet module a look.
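If you do end up on the Google Sheets route, the template-duplication idea from the question maps onto two documented calls: Drive's files.copy to clone the styled template, and Sheets' values.update to write the new data. A minimal sketch, shown in Python with google-api-python-client (the googleapis Node client exposes the same methods; TEMPLATE_ID, rows, and the auth setup are placeholders):

    from googleapiclient.discovery import build

    TEMPLATE_ID = "your-template-file-id"  # placeholder: Drive ID of the styled template
    rows = [["Name", "Total"]]             # placeholder: values from your MySQL query
    # creds: an authorized google-auth credentials object (OAuth flow omitted)
    drive = build("drive", "v3", credentials=creds)
    sheets = build("sheets", "v4", credentials=creds)

    # 1. Clone the hand-styled template spreadsheet.
    copy = drive.files().copy(fileId=TEMPLATE_ID, body={"name": "new-report"}).execute()

    # 2. Write the fresh values into the copy; cell formatting (including RTL
    #    text direction) carries over from the template.
    sheets.spreadsheets().values().update(
        spreadsheetId=copy["id"],
        range="Sheet1!A1",
        valueInputOption="RAW",
        body={"values": rows},
    ).execute()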

ruby on rails, file upload and database

Maybe this question has already been asked, but I did not find an answer.
Hello all.
I'm new to Ruby on Rails.
I'm using Ruby 1.9.3 and Rails 3.0.5. I want to create a page to upload a file (video) to the shared folder of the server, and when the file is uploaded I want to save the name of the file in my database (SQLite3).
Afterwards I want to use the HTML5 <video> tag to list all the videos I have on another page.
The structure looks like: Home -> click on the upload file button -> upload file page -> return home to see the video list.
If possible I want to make it by hand, with no gems.
For now I have tried something like a mix of http://french.railstutorial.org/chapters/user-microposts#top and http://www.tutorialspoint.com/ruby-on-rails/rails-file-uploading.htm, but nothing has worked so far.
Thank you for the help.
P.S.: sorry if my English is not really good.
There is a reason people use gems: they add flexibility and simplicity to an application. Gems are a blessing, and not using gems when you can is a total waste of time. As for this application, please consider the CarrierWave gem; it was specifically designed for file uploads. There is a RailsCasts video on how to use it. I'm sure you'll find it pretty simple and awesome to use.
I think you can use the Paperclip gem to upload images.

How to scrape logos from websites?

First off, this is not a question about how to scrape websites. I am fully aware of the tools available to me for scraping (css_parser, Nokogiri, etc.; I'm using Ruby to do the scraping).
This is more of an overarching question on the best possible solution to scrape the logo of a website starting with nothing but a website address.
The two solutions I've begun to create are these:
Use Google AJAX APIs to do an image search that is scoped to the site in question, with the query "logo", and grab the first result. This gets the logo, I'd say, about 30% of the time.
The problem with the above is that Google doesn't really seem to care about CSS image-replaced logos (i.e. H1 text that is image-replaced with the logo). The solution I've tentatively come up with is to pull down all CSS files, scan for url() declarations, and then look for the words "header" or "logo" in the file names.
Solution two is problematic because of the many idiosyncrasies of the people who write CSS for websites: they use "header" instead of "logo" in the file name, sometimes the file name is random and says nothing about a logo, and other times it's just the wrong image.
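For what it's worth, the core of solution two is small. A rough sketch, in Python for illustration even though the question uses Ruby:

    import re
    import requests

    URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)")

    def logo_candidates(css_url):
        """Return url() targets whose file name hints at a logo or header."""
        css = requests.get(css_url).text
        return [u for u in URL_RE.findall(css)
                if re.search(r"logo|header", u.rsplit("/", 1)[-1], re.I)]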
I realize I might be able to do something with some sort of machine learning, but I'm on a bit of a deadline for a client and need something fairly capable soon.
So with all that said, if anyone has any "out of the box" thinking on this one, I'd love to hear it. If I can create a solution that works well enough, I plan on open-sourcing the library for any other interested parties :)
Thanks!
Check this API by Clearbit. It's super simple to use:
Just send a query to:
https://logo.clearbit.com/[enter-domain-here]
For example:
https://logo.clearbit.com/www.stackoverflow.com
and get back the logo image!
More about it here
I had to find logos for ~10K websites for a previous project and tried the same technique you mentioned of extracting images with "logo" in the URL. My variation was that I loaded each webpage in WebKit, so that images referenced from CSS or JavaScript were loaded too. This technique gave me logos for ~40% of websites.
Then I considered creating an app, as Nick suggested, to manually select the logo for the remaining websites, but I realized it was more cost-effective to just hand these to someone inexpensive (whom I found via Elance) to do the work manually.
So I suggest you don't bother solving this with a fully technical solution; outsource the manual labour.
Creating an application will definitely help you, but I believe in the end there will be some manual work involved. Here's what I would do.
Have your application store in a database a link to every image on a website that is larger than a specified dimension, so that you can weed out small icons (a sketch of that filter follows below).
Then you can set up a form to review these results. You may want to set up the database table to store the website URL and the relationship between the URL and the image links.
Even if it were possible to write an application that truly figures out whether an image is a logo or not, it seems like it would take a massive amount of code. In the end it would probably weed out even more than the above, but you have to weigh how quickly a human can visually parse the results against the time it would take you to write and test that complex code.
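The size filter itself is only a few lines. A sketch in Python with Pillow for illustration; the exact thresholds are assumptions:

    from io import BytesIO

    import requests
    from PIL import Image

    MIN_WIDTH, MIN_HEIGHT = 64, 64  # assumed cut-offs for "not an icon"

    def big_enough(img_url):
        """Fetch an image and check that it exceeds the minimum dimensions."""
        data = requests.get(img_url).content
        width, height = Image.open(BytesIO(data)).size
        return width >= MIN_WIDTH and height >= MIN_HEIGHT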
Yet another simple way to solve this problem is to get all the leaf nodes and take the first
<a><img src="http://example.com/a/file.png" /></a>
You can look for projects that extract HTML leaf nodes on the net, or use regular expressions to get all HTML tags.
I used a C# console app with the HtmlAgilityPack NuGet package to scrape logos from over 600 sites.
The algorithm is to get all images that have "logo" in the URL.
The challenges you will face with such extraction are:
Relative image URLs
A base URL on a CDN, and HTTP vs. HTTPS (if you don't know the protocol before you make a request)
Images with ? or & query strings at the end
With those things in mind I got approximately 70% success, but some images were not actual logos.
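The same algorithm sketched in Python for illustration, with requests and BeautifulSoup standing in for HtmlAgilityPack; urljoin handles the relative and protocol-relative cases, and the query string is stripped before matching:

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def find_logo_candidates(page_url):
        """Return absolute image URLs whose path contains 'logo'."""
        soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
        candidates = []
        for img in soup.find_all("img", src=True):
            # urljoin resolves relative and protocol-relative ("//cdn...") sources.
            src = urljoin(page_url, img["src"])
            # Strip any ?query=string before testing the file name.
            if "logo" in src.split("?")[0].lower():
                candidates.append(src)
        return candidates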

CMSMS File Download Problems

I'm using the CMS Made Simple platform; which I'm not very familiar with!
The site has a secure frontend, which contains a document library for members. Files are stored outside the document root and links are generated by the CMS so you should only be able to get the documents if you're logged in.
At first glance the setup works fine; however, certain PDFs uploaded in this fashion are corrupt when downloaded, and line endings in text files aren't preserved.
Sorry if this is a bit vague, I'm hoping someone has come across a similar problem but any ideas would be greatly appreciated.
Regards,
Rich
You should check which module/tag is handling the download/upload.
Are you sure the files are intact when uploaded?
Check the headers and the content-size calculation, and try different browsers and different methods of forcing the download.

I plan to use jsfiddle for html demos on my site, but how can I store my actual code?

My site will feature dozens and dozens of front-end live demos (HTML pages with cross-browser bugs), but instead of just throwing them on jsfiddle.net and linking to the demos from articles, I would actually like to store them in a database or as organized, dynamically generated flat files.
Example:
http://site/css-bug/ will feature an article on a certain bug in browser X. I can have many (demos) to one (bug/article). They will contain HTML, CSS, and some JavaScript.
Another possibility I was pondering was making my own jsfiddle.net clone, and in doing so I would have to mimic the way jsfiddle stores them (however it does that). I'm thinking this is the best route to go, but would appreciate advice.
Background info:
As of now I am manually making static HTML files in directories and linking to them, and I am using Django for my application, which links to these demos (they reside on a media server).
You can use jsFiddle for that.
To get the files locally you can fetch the parts of a saved fiddle using an undocumented API (meaning it may disappear or stop being valid): add /show_js/ or /show_html/ or /show_css/ to the end of the fiddle's URL.
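For example, fetching the three parts of a fiddle could look like this (a sketch; the fiddle URL is a placeholder and the endpoints are the undocumented ones just described):

    import requests

    FIDDLE_URL = "http://jsfiddle.net/user/abc123"  # placeholder fiddle URL

    # The /show_*/ suffixes are the undocumented endpoints mentioned above.
    parts = {part: requests.get("%s/show_%s/" % (FIDDLE_URL, part)).text
             for part in ("js", "html", "css")}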
You may need to wait a while until we add export to Gists on GitHub (implementing this shouldn't take very long, but we want to hit beta first).
To speed up loading of the examples, it would be great if you'd load the embedded version on demand: display a [Show example] button which creates an iframe with the embedded fiddle. We plan to add cross-browser support for that as well.
