How does Google Docs store documents (on the backend)? - file-format

I half imagine there being these great .docs in the sky... but another part of me doubts that my documents are even being stored in anything we'd traditionally call a "file." Does Google have its own document format? I feel like it must. Some branch of some existing format like ODF, maybe? Any idea what it's like, what's special about it (if anything), and/or why it is the way it is?

As far as I'm aware, Google Docs originally generated RTF files. Now, however, with the recent push for HTML5 and browsers' contenteditable support, they may very well just store documents as plain HTML within their database.

I would guess that Google definitely extracts some information from the document for indexing. For editing purposes, however, I doubt the internal format differs much from ODF, the MS Office formats, or other document formats. But those are only guesses; maybe someone else knows more.

Related

How to store data with version control and make it easy to edit and use in applications

I want to store a set of data (like drop tables for a game) that can be edited and "forked" the way a coding project can (like an open-source project, just data, so if I stop updating it, someone else can continue with it). I also want that data to be easy to use in code (for example, the same way you can query a database from code to get your values) for people who make companion apps for said game.
What type of data storage would be best for this scenario?
EDIT: By type of data storage I mean something like XML or JSON, or a database like Access or SQL, as well as NoSQL.
It is a very general question, but I get the feeling that you're looking for something like GitHub. If you don't know what that is, you should probably look into it. GitHub hosts Git repositories (and also offers Subversion access), lets you edit your files quite easily, and lets you look back at previous versions. Hope this helps!
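For the data itself, a plain-text format such as JSON plays nicely with that kind of Git/GitHub workflow: diffs and forks stay readable, and nearly every language can load it in a couple of lines. A minimal Python sketch, assuming a hypothetical drop_tables.json committed to the repository:

```python
import json

# Load a hypothetical drop_tables.json from the repository checkout.
# Example shape: {"goblin": [{"item": "copper_coin", "chance": 0.8}]}
with open("drop_tables.json", encoding="utf-8") as f:
    drop_tables = json.load(f)

# Look up the drop table for one monster, as a companion app might do.
for entry in drop_tables.get("goblin", []):
    print(f"{entry['item']}: {entry['chance']:.0%} drop chance")
```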

How can I gather lots of files of one filetype?

I'm trying to fuzz some tools, but I need a huge number of .zip or .jpg files for that. I've tried crawlers like WebRipper, but it's not very effective (or I'm doing it wrong). Is there a better way to get lots of different files?
OK, on the off chance that someone else needs something like this:
In the end I used WebRipper, and instead of generating links to Google/Bing results with the "filetype" parameter, I just set some upload/freeware pages as targeted rip jobs with the maximum link depth.
WebRipper might crash sometimes and it takes quite a while, but it works somewhat.
A better solution would probably be to use the Google API (e.g. a C# search API client). Then extract the clean links from the results and download them asynchronously. Using the direct result links most likely won't work, because Google will block you after some files ("unusual data transfer").
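To illustrate the "extract the clean links and download them asynchronously" part, here is a rough Python sketch using only the standard library; the urls list is a placeholder for whatever links you extracted from the search results:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urlsplit
from urllib.request import urlretrieve

# Placeholder: fill this with the clean links extracted from the results.
urls = [
    "http://example.com/samples/archive1.zip",
    "http://example.com/samples/photo1.jpg",
]

def download(url, out_dir="corpus"):
    os.makedirs(out_dir, exist_ok=True)
    filename = os.path.basename(urlsplit(url).path) or "unnamed"
    path = os.path.join(out_dir, filename)
    urlretrieve(url, path)  # blocking download of a single file
    return path

# Download several files in parallel; a handful of workers is usually enough.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(download, u): u for u in urls}
    for future in as_completed(futures):
        try:
            print("saved", future.result())
        except Exception as exc:  # keep going if one URL fails
            print("failed", futures[future], exc)
```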

How to collect data from a website

Preface: I have broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab them from a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
What I understand the problem to be: some entity generates a data set (i.e. numbers) every other week, and you need to download that data set for processing (e.g. sorting).
Ideally, the web site maintaining the wiki would provide a service, such as a RESTful interface, to gather the data easily. If that were the case, I'd go with any language that makes HTTP requests and responses easy to manipulate and makes your data manipulation easy. As a previous poster said, Java would work well.
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as CSV and JSON.
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
You could check out http://web-harvest.sourceforge.net/
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
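Since Jsoup and BeautifulSoup both came up, here is a minimal Python/BeautifulSoup sketch of pulling tabular numbers off a wiki page. The URL and the assumption that the data sits in the first HTML table are made up for illustration, so adjust them to the real page structure (and check the wiki's terms of use first):

```python
import urllib.request
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

URL = "https://example.org/wiki/SomePage"  # placeholder wiki page

with urllib.request.urlopen(URL) as resp:
    html = resp.read()

soup = BeautifulSoup(html, "html.parser")

# Grab every row of the first table and print its cell text.
table = soup.find("table")
if table is not None:
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            print(cells)
```

From there you can write the values into whatever store the app uses (SQLite on Android, for instance) on your biweekly schedule.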

Best Way to automatically find links to your content?

So, here is the task I've found myself thinking about. Pretend for a moment that I have a large body of content. I want to see what websites are linking to my content. I know that I could look into TrackBack or PingBack, but what about sites that aren't using tools capable of dealing with those?
It would seem that some form of web crawler that looks for pages linking to the original document might be useful. My question to the greater community is: what would be the best way to get started here? Do TrackBack and PingBack do more than I assume? Are there services or tools out there that already do what I'm thinking of?
Google is your friend!
Use the link: prefix:
link:whatsite.com
And yes, trackbacks do more.
If you have HTTP referers set up in your logs, you can mine them (see the sketch below).
You can even discover linking pages that Google does not know about.
Otherwise, there is the paid Linkscape from SEOmoz or the free MajesticSEO (if you confirm ownership of the domain).
MajesticSEO has a bigger backlink index and an API (login required).
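As a rough illustration of the referer-mining idea: with the common Apache/Nginx "combined" log format, the referer is the first of the two trailing quoted fields on each line. A small Python sketch (the log path and domain are placeholders) that counts external referring pages:

```python
import re
from collections import Counter
from urllib.parse import urlsplit

LOG_PATH = "access.log"   # placeholder: your combined-format access log
MY_HOST = "whatsite.com"  # your own domain, to skip internal navigation

# In the "combined" format the line ends with: ... "referer" "user-agent"
line_re = re.compile(r'"([^"]*)" "[^"]*"$')

referers = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = line_re.search(line.rstrip())
        if not match:
            continue
        referer = match.group(1)
        host = urlsplit(referer).netloc
        if referer and referer != "-" and MY_HOST not in host:
            referers[referer] += 1

# Most frequent external pages linking to (or at least sending traffic to) your content.
for url, hits in referers.most_common(20):
    print(f"{hits:6d}  {url}")
```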

Apache module FORM handling in C

I'm implementing an Apache 2.0.x module in C, to interface with an existing product we have. I need to handle FORM data, most likely using POST but I want to handle the GET case as well.
Nick Kew's Apache Modules book has a section on handling form data. It provides code examples for POST and GET, which return an apr_hash_t of the key/value pairs in the form. parse_form_from_POST marshals the bucket brigade and flattens it into a buffer, while parse_form_from_GET can simply reference the URL. Both routines rely on a parse_form_from_string routine to walk through each delimited field and extract the information into the hash table.
That would be fine, but it seems like there should be an easier way to do this than adding a couple hundred lines of code to my module. Is there an existing module or routines within apache, apr, or apr-util to extract the field names and associated data from a GET or POST FORM into a structure which C code can more easily access? I cannot find anything relevant, but this seems like a common need for which there should be a solution.
I switched to G-WAN, which offers a transparent ANSI C scripting interface for GET and POST forms (and many other goodies like charts, GIF I/O, etc.).
A couple of AJAX examples are available on the G-WAN developer page.
Hope it helps!
While, on its surface, this may seem common, CGI-style content handlers written in C for Apache are pretty rare. Most people just use CGI, FastCGI, or one of the myriad frameworks such as mod_perl.
Most of the C Apache modules that I've written modify the behavior of the web server in specific, targeted ways that apply to every request.
If it's at all possible to write your handler outside of an Apache module, I would encourage you to pursue that strategy.
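To make that suggestion concrete: once the handler lives outside the C module, form parsing is nearly free in most higher-level stacks. Here is a minimal Python/WSGI sketch (my choice of stack for illustration, not something from the original answers) that collects both GET and urlencoded POST fields; in production it would sit behind Apache via mod_wsgi or FastCGI:

```python
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # GET fields come from the query string.
    fields = parse_qs(environ.get("QUERY_STRING", ""))

    # POST fields (application/x-www-form-urlencoded) come from the body.
    if environ.get("REQUEST_METHOD") == "POST":
        try:
            length = int(environ.get("CONTENT_LENGTH") or 0)
        except ValueError:
            length = 0
        body = environ["wsgi.input"].read(length).decode("utf-8", "replace")
        fields.update(parse_qs(body))  # POST values shadow same-named GET values

    text = "\n".join(f"{name} = {values}" for name, values in fields.items())
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [text.encode("utf-8")]

if __name__ == "__main__":
    # Standalone test server; run behind Apache (mod_wsgi/FastCGI) in practice.
    make_server("", 8000, app).serve_forever()
```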
I have not yet tried any solution, since I found this SO question as a result of my own frustration with the example in the "Apache Modules" book as well. But here's what I've found so far. I will update this answer when I have researched more.
Luckily, it looks like this is now a solved problem in Apache 2.4 thanks to the ap_parse_form_data function.
No idea how well this works compared to your example, but here is a much more concise read_post function.
It is also possible that mod_form could be of value.
