As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I have a dataset with millions of U.S. addresses that I would like to geocode. Yahoo had an API with the most generous rate limit (50K per day, still too low for my purposes), but it is now defunct. I don't think any API will suit my needs unless it allows over 100K requests per day.
Is there any simple-to-configure software I can download to do this from my own computer?
In particular, to those who have experience with it, will http://www.datasciencetoolkit.org/developerdocs#setup suit my needs?
Would an API that supports millions of requests per day suit your needs?
There are a few services which do this. In particular, LiveAddress by SmartyStreets can handle that kind of load and is actually built for it. You can upload files (Excel, CSV, etc., zipped up, especially if you have that many addresses) or query the API (each request can support 100 addresses).
So while the program doesn't get downloaded to your computer, it will actually be faster than a localized, in-house solution because it scales up when the load is high. LiveAddress is geo-distributed and powered by RAM-drive servers that spin up more nodes when there's a lot of work to do; it is known for handling millions of addresses quickly (in a few hours).
I work at SmartyStreets. We kind of dare you to see how fast you can legitimately query the API or upload and process all your lists. There's plenty of sample code on GitHub for the API or you can (programmatically or manually) upload your list files for batch geocoding.
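For a sense of what that batching looks like in practice, here is a minimal Python sketch that splits an address list into 100-per-request batches and posts them to a geocoding endpoint. The URL, auth parameters, and field names below are placeholders, not the provider's real contract, so check the API documentation before using it. At 100 addresses per request, a few million addresses works out to only tens of thousands of requests.

    import requests

    # Placeholder endpoint and credentials -- substitute the provider's
    # documented URL and authentication scheme.
    API_URL = "https://geocoding-provider.example/street-address"
    AUTH = {"auth-id": "YOUR_ID", "auth-token": "YOUR_TOKEN"}

    def batches(items, size=100):
        """Yield successive batches of up to `size` items."""
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def geocode(addresses):
        """Send addresses 100 at a time and collect the parsed responses."""
        results = []
        for batch in batches(addresses):
            payload = [{"street": address} for address in batch]  # one lookup per address
            resp = requests.post(API_URL, params=AUTH, json=payload, timeout=30)
            resp.raise_for_status()
            results.extend(resp.json())
        return results

    if __name__ == "__main__":
        sample = ["1600 Pennsylvania Ave NW, Washington, DC",
                  "1 Infinite Loop, Cupertino, CA"]
        for record in geocode(sample):
            print(record)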
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
We have an app that averages 30 QPS (queries per second) and costs us around $1 per hour, which gives a rough figure for our GAE costs: $1 per million requests.
About half of these requests serve content (to real clients and search engines' bots) and the other half are deferred tasks that update entities, reset caches, pre-generate some HTML, etc. We do not use backends anymore (we did, but found it too difficult to keep them in sync during deployments, moved everything to task queues, and are not looking back).
I just wonder how this compares to others. Normal? Too much? Very good?
I'm asking because we've got a new member on our team who is arguing that our costs are too high and that we need to migrate to our own stand-alone server(s).
I am a newbie on Stack Overflow and not sure if I should/can disclose the name of the website in question (I would be happy to provide it if that is allowed).
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
We are new to NoSQL and are starting a project that aims to record sensor data from many different sensors, each producing timestamp-value pairs, into a cloud-based database. The number of sensors should scale, so the solution should be able to handle hundreds of millions, or possibly even billions, of writes a year.
Each sensor has its own table of timestamp-keyed values, and sensor metadata is stored in a separate table.
The system should support queries such as the most recent values of certain sensor types (fast data retrieval) and values within a time frame from sensors in certain areas (based on the metadata).
So the question is which cloud database service would be most suited to our needs?
Thanks in advance.
Couchbase is a great option for this type of use case.
Try Apache Cassandra. DataStax provides easy-to-install packages that include some useful extras.
I wholeheartedly agree with @Ben that this isn't an answerable question; nevertheless, I would at least consider the reasons for choosing a simple k/v-type store over a typical RDBMS. It sounds like this data will likely be aggregated and counted; an RDBMS will typically answer those questions very quickly with correct indexing. 1B writes/yr (or even 30B/yr) is really not that high.
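To make the indexing point concrete, here is a minimal sketch (using SQLite in Python purely for illustration; the table and column names are invented) of a single readings table with a composite index that serves both the "most recent value" and time-window/area queries from the question.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for a real server-based RDBMS
    conn.executescript("""
        CREATE TABLE sensor (
            sensor_id   INTEGER PRIMARY KEY,
            sensor_type TEXT,
            area        TEXT            -- metadata used for area-based queries
        );
        CREATE TABLE reading (
            sensor_id INTEGER REFERENCES sensor(sensor_id),
            ts        INTEGER,          -- epoch timestamp
            value     REAL
        );
        -- This composite index is what keeps both queries below fast.
        CREATE INDEX idx_reading_sensor_ts ON reading (sensor_id, ts);
    """)

    conn.executemany("INSERT INTO sensor VALUES (?, ?, ?)",
                     [(1, "temperature", "zone-1"), (2, "temperature", "zone-2")])
    conn.executemany("INSERT INTO reading VALUES (?, ?, ?)",
                     [(1, 1609459200, 21.5), (1, 1609462800, 22.0),
                      (2, 1609459200, 19.8)])

    # Most recent value per sensor of a given type. (SQLite returns the value
    # from the row holding MAX(ts); other databases need a window function.)
    latest = conn.execute("""
        SELECT r.sensor_id, MAX(r.ts) AS ts, r.value
        FROM reading r JOIN sensor s USING (sensor_id)
        WHERE s.sensor_type = ?
        GROUP BY r.sensor_id
    """, ("temperature",)).fetchall()

    # All values in a time window from sensors in a given area.
    window = conn.execute("""
        SELECT r.sensor_id, r.ts, r.value
        FROM reading r JOIN sensor s USING (sensor_id)
        WHERE s.area = ? AND r.ts BETWEEN ? AND ?
    """, ("zone-1", 1609459200, 1612137600)).fetchall()

    print(latest)   # e.g. [(1, 1609462800, 22.0), (2, 1609459200, 19.8)]
    print(window)   # readings from zone-1 sensors within the window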
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I am doing a project at the university which requires running multiple instances (thousands) of a program I've written in C++, each of which runs for quite a while (say 2 hours). The program is very self-contained: it does not require input files, and its only dependency, I think, is Boost.
I'm currently using the university-owned cluster of computers. However, it's quite old, and the job-dispatching and monitoring services are pretty bad.
So I was wondering whether I can run my jobs elsewhere, for some money. For example, I looked a bit into Google App Engine, but since it seems every job must end within 30 seconds, it is not suitable for me. Maybe Amazon EC2?
Do you know of such options?
Amazon EC2 is the classic approach for this.
Google App Engine is great, but probably too restrictive for your use case.
EC2 is definitely a very good option, as Peter says. Since you're at a university I'm guessing that cost may be an important factor, so take a look at Rackspace's cloud service as well; depending on what kind of server resources you need, this can work out quite a bit cheaper than EC2. (I don't work for Rackspace).
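If EC2 is the route you take, the usual pattern is to bake the compiled binary and its Boost dependency into a machine image and then launch as many instances as you need from a script. Below is a rough sketch using boto3 with placeholder identifiers (the AMI, key pair, instance type, and job path are all made up); treat it as an outline, not a ready-to-run deployment.

    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Placeholder values -- use an AMI with the compiled binary and Boost
    # pre-installed, plus your own key pair and instance type.
    AMI_ID = "ami-xxxxxxxx"
    USER_DATA = ("#!/bin/bash\n"
                 "/opt/project/run_job >> /var/log/job.log 2>&1\n"
                 "shutdown -h now\n")   # shut down when the ~2-hour job finishes

    instances = ec2.create_instances(
        ImageId=AMI_ID,
        InstanceType="c5.large",
        MinCount=1,
        MaxCount=50,            # one batch of 50 parallel jobs
        KeyName="my-keypair",
        UserData=USER_DATA,
        # terminate (rather than just stop) when the job calls shutdown
        InstanceInitiatedShutdownBehavior="terminate",
    )
    print("Launched:", [i.id for i in instances])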
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
What applications/IDEs are out there to develop BASE database systems from?
BASE systems (Basically Available, Soft state, Eventually consistent) are an alternative to RDBMSs that work well with simple data models holding vast volumes of data. Google's BigTable, Dojo's Persevere, Amazon's Dynamo, and Facebook's Cassandra are some examples.
It seems like you are looking into the recently popular "NoSQL" family of databases, which also includes MongoDB, Voldemort (the one that must not be named), HBase, Tokyo Cabinet, and CouchDB. There are a lot of them. I am not sure exactly what your question is, though.
Each one has its own advantages, implementation difficulties, and performance characteristics, although they are all designed to scale. There are some good articles on highscalability.com: http://highscalability.com/blog/tag/nosql
Then there are the systems designed to enhance and scale searching on top of traditional databases (e.g. MySQL), such as Solr, which is based on Lucene. That's more geared towards full-text searching. It falls under "eventual consistency", since it synchronises with the database periodically.
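As a rough illustration of that periodic synchronisation (and why the search index is only eventually consistent), the sketch below re-indexes recently changed MySQL rows into Solr once a minute. The connection details, table, and field names are invented, and in practice Solr's DataImportHandler or a message queue often fills this role.

    import time
    import pymysql
    import pysolr

    # Invented connection details -- a Solr core plus the MySQL table it mirrors.
    solr = pysolr.Solr("http://localhost:8983/solr/products")
    db = pymysql.connect(host="localhost", user="app", password="secret", database="shop")

    last_sync = 0  # epoch seconds of the newest row indexed so far

    while True:
        with db.cursor() as cur:
            # Pull only rows modified since the last pass (assumes an updated_at column).
            cur.execute(
                "SELECT id, name, description, UNIX_TIMESTAMP(updated_at) "
                "FROM products WHERE UNIX_TIMESTAMP(updated_at) > %s",
                (last_sync,))
            rows = cur.fetchall()

        if rows:
            solr.add([{"id": str(r[0]), "name": r[1], "description": r[2]} for r in rows])
            solr.commit()
            last_sync = max(r[3] for r in rows)

        time.sleep(60)  # the search index lags the database by at most about a minute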
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Where can I get, or how can I generate, a large formatted collection of fake user data (names, email addresses, locations, etc.) that can be used for testing an application?
It can be obviously fake, since this will be limited to the development server. But I'm sure anything would be better than what I could come up with.
There are some tools built just for this. I've used http://www.generatedata.com/ before to generate data for MySQL databases. RedGate has a nice tool to fill your SQL Server database with test data, called SQL Data Generator. The RedGate tool costs about $300, but there is a free trial.
UPDATE:
Faker.js is now available. It is a project built on node.js, and looks pretty comprehensive.
ANOTHER UPDATE: Mockaroo is great!
If you'd like an HTTP API of fake user data, check out Random User Generator
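For example, here is a short Python sketch that pulls a batch of fake users from that API (the randomuser.me endpoint and the `results` parameter are taken from its public documentation, but verify the current response format before relying on it):

    import requests

    def fetch_fake_users(count=100):
        """Fetch `count` fake user records from the Random User Generator API."""
        resp = requests.get("https://randomuser.me/api/",
                            params={"results": count}, timeout=30)
        resp.raise_for_status()
        return resp.json()["results"]

    for user in fetch_fake_users(5):
        name = user["name"]
        print(f'{name["first"]} {name["last"]} <{user["email"]}>')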
This is an open-source tool for generating various types of test data: http://www.generatedata.com
http://www.fakenamegenerator.com/ is a good resource for creating test data with realistic-looking users, complete with SSN, email address, and more. They have a bulk download option too.
Check out this list of "Funny Names"; some of them are classics:
http://www.ethanwiner.com/funnames.html
Another open source test generator tool is my own http://code.google.com/p/csvtest.
For anyone looking for an updated solution to this problem...
I wrote a test data generator project for Data Synchronisation Studio. It can generate large datasets, from one row up to hundreds of millions of rows of realistic testing data (lots of "of"s there :D). Anyway, here is a blog post all about it: http://www.simego.com/Blog/2012/02/Test-Data-Generator-Download-for-Data-Sync